I am trying to set up a cluster of four nodes (all running Fedora 22) with OpenMPI.
On the master node, I've created a password-less key (~/.ssh/id_dsa) and copied ~/.ssh/id_dsa.pub to each of the three slave nodes' ~/.ssh/authorized_keys. So, from the master node, I can run ssh slave1, ssh slave2, or ssh slave3 and successfully get into the corresponding node without being asked for a password. The same goes for ssh master.
However, I run into permission problems when I try to use mpirun. Here is the command I run:
/usr/lib64/openmpi/bin/mpirun -np 32 --hostfile .mpi_hostfile ./testprogram
and here is the first bit of the output:
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ORTE was unable to reliably start one or more daemons.
When I subsequently run ssh slave3, I see the message "There were 2 failed login attempts since the last successful login." So it looks like the ssh authentication that mpirun is trying to do is failing for some reason.
Any ideas why my password-less, key-based authentication works fine with ssh but not with mpirun?
For the record, here are the contents of .mpi_hostfile:
# Host file for OpenMPI
# Master node, slots = num cores
localhost slots=8
# Slaves
slave1 slots=8
slave2 slots=8
slave3 slots=8
Best Answer
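This is likely because Open MPI defaults to a tree-based launching scheme: mpirun sshes from the machine where you invoke it to slave1, then slave1 sshes to slave2, and so on. The master's key lets you log in to each slave from the master, but under this scheme the slaves also need password-less ssh access to each other, which your setup does not provide.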
See http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi and http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2 for more details.
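One way to confirm this is to test password-less login between every pair of nodes, not just from the master outward. The sketch below only prints the probe commands, so you can read them before running anything. The host names come from the hostfile in the question, and the function name pair_checks is made up for this example. BatchMode=yes is a standard OpenSSH option that makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# Host names assumed from the question's hostfile; adjust for your cluster.
HOSTS="master slave1 slave2 slave3"

# Print one nested ssh probe per ordered pair of distinct hosts.
# Each probe logs in to $src and, from there, tries to log in to $dst.
# BatchMode=yes makes ssh fail rather than ask for a password.
pair_checks() {
    for src in $HOSTS; do
        for dst in $HOSTS; do
            [ "$src" = "$dst" ] && continue
            echo "ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true"
        done
    done
}

pair_checks
```

Running the printed lines from the master (for example, by piping the output to sh) should show which hops are refused. Note that a probe can also fail because the destination's host key has not been accepted yet on the source node, so check the error message. If giving every node a key for every other node is not an option, Open MPI also has an MCA parameter, plm_rsh_no_tree_spawn, that is meant to turn off tree-based launching (for example, mpirun --mca plm_rsh_no_tree_spawn 1 ...). Check ompi_info on your installation to confirm it is available in your version.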