SSH/OpenMPI: Permission Denied with mpirun but fine with ssh

mpi, ssh

I am trying to set up a cluster of four nodes (all running Fedora 22) with OpenMPI.

On the master node, I've created a password-less key (~/.ssh/id_dsa) and copied ~/.ssh/id_dsa.pub to each of the three slave nodes' ~/.ssh/authorized_keys. So, from the master node, I can run ssh slave1, ssh slave2, or ssh slave3 and successfully get into the corresponding node, without being asked for a password. Same goes for ssh master.

However, I run into permission problems when I try to use mpirun. Here is the command I run:

/usr/lib64/openmpi/bin/mpirun -np 32 --hostfile .mpi_hostfile ./testprogram

and here is the first bit of the output:

Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ORTE was unable to reliably start one or more daemons.

When I subsequently run ssh slave3, I see the message "There were 2 failed login attempts since the last successful login." So it looks like the ssh authentication that mpirun is trying to do is failing for some reason.

Any ideas why I can do my password-less, key-based authentication just fine with ssh, but not with mpirun?

For the record, here are the contents of .mpi_hostfile:

# Host file for OpenMPI

# Master node, slots = num cores
localhost slots=8

# Slaves
slave1 slots=8
slave2 slots=8
slave3 slots=8

Best Answer

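This is likely because Open MPI defaults to a tree-based launch scheme: mpirun uses ssh from the machine where you invoke it to reach slave1, then slave1 uses ssh to reach slave2, and so on. Since your private key exists only on the master, and only the master's public key is in each slave's authorized_keys, the hops between slaves (for example slave1 to slave2) have no key to offer and are denied, even though ssh from the master to every slave works.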
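A quick way to test this theory is to check every node-to-node hop, not just the ones that start at the master. The following is only a sketch: list_hosts and check_mesh are hypothetical helper names, and it assumes the hostfile format shown in the question (host name in the first column).

```shell
#!/bin/sh
# Hypothetical helpers, not part of Open MPI.

# Print the host names from an Open MPI hostfile,
# skipping comment lines and blank lines.
list_hosts() {
    awk '!/^[[:space:]]*(#|$)/ { print $1 }' "$1"
}

# For every ordered pair of hosts, log in to the first one and try to
# ssh from there to the second one without any password prompt.
# Note: an entry such as "localhost" is resolved on the remote side,
# so it refers to whichever node the first ssh landed on.
check_mesh() {
    hosts=$(list_hosts "$1")
    for src in $hosts; do
        for dst in $hosts; do
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes makes ssh fail instead of asking for a password.
            if ssh -n -o BatchMode=yes "$src" \
                   ssh -o BatchMode=yes "$dst" true 2>/dev/null; then
                echo "ok:   $src -> $dst"
            else
                echo "FAIL: $src -> $dst"
            fi
        done
    done
}
```

Running check_mesh .mpi_hostfile from the master should then show which hops fail. If lines such as slave1 -> slave2 fail, the nodes need keys that let them reach each other, for instance by making the same key pair and authorized_keys available on every node.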

See http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi and http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2 for more details.
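If setting up keys between all the nodes is not an option, the tree-based launch can, as far as I know, be switched off with the MCA parameter plm_rsh_no_tree_spawn, so that mpirun opens every ssh connection from the node where it was invoked. Treat the following as a sketch: mpirun_flat is a hypothetical wrapper name, the MPIRUN variable is my own convention, and you should confirm with ompi_info that your Open MPI build has this parameter.

```shell
#!/bin/sh
# Path to mpirun; defaults to the location used in the question.
MPIRUN=${MPIRUN:-/usr/lib64/openmpi/bin/mpirun}

# Hypothetical wrapper: run mpirun with tree-based spawning disabled
# (assumes the plm_rsh_no_tree_spawn MCA parameter is available).
mpirun_flat() {
    "$MPIRUN" --mca plm_rsh_no_tree_spawn 1 "$@"
}
```

With that in place, the original command would become mpirun_flat -np 32 --hostfile .mpi_hostfile ./testprogram. A flat launch is slower to start on large clusters, but on four nodes the difference should not matter.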
