Monitoring a GPU cluster

Tags: cluster, gpu, monitoring, nvidia, software-rec

I have 10 servers running on Ubuntu 14.04 x64. Each server has a few Nvidia GPUs. I am looking for a monitoring program that would allow me to view the GPU usage on all servers at a glance.

Best Answer

You can use the Ganglia monitoring software (free of charge, open source). It has a number of user-contributed gmond Python DSO metric modules, including an Nvidia GPU module (/ganglia/gmond_python_modules/gpu/nvidia/).

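Its architecture is typical of cluster monitoring software: a gmond daemon runs on each monitored machine, and a central gmetad daemon collects the metrics and feeds the web front end.

(Image: Ganglia architecture diagram, not reproduced here.)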

It's straightforward to install (about 30 minutes without rushing), except for the Nvidia GPU module, which lacks clear documentation (I am still stuck on it).
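While the Nvidia module remains undocumented, one possible workaround (not what this answer uses, only a hedged sketch) is to push GPU utilization into Ganglia with the gmetric command once gmond is running on a node, as set up in the steps that follow. The helper name gpu_csv_to_gmetric and the metric names gpu<N>_util are made up for illustration. The function only prints the gmetric commands, so you can inspect them before running anything.

```shell
#!/usr/bin/env bash
# gpu_csv_to_gmetric (hypothetical helper): read "index, utilization" lines on
# stdin, as printed by
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# and print one gmetric command per GPU. Metric names are an assumption.
gpu_csv_to_gmetric() {
  local idx util
  while IFS=', ' read -r idx util; do
    # Skip empty lines and non-numeric values such as "[Not Supported]".
    case $idx in ''|*[!0-9]*) continue ;; esac
    case $util in ''|*[!0-9]*) continue ;; esac
    printf 'gmetric --name gpu%s_util --value %s --type uint16 --units percent\n' \
      "$idx" "$util"
  done
}
```

To actually send the metrics, the output could be piped to sh, for example from a cron job on each GPU node:
nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | gpu_csv_to_gmetric | sh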


To install ganglia, you can do as follows. On the server:

sudo apt-get install -y ganglia-monitor rrdtool gmetad ganglia-webfrontend

Choose Yes each time you are asked a question about Apache.

First, we configure the Ganglia server, i.e. gmetad:

sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf

sudo nano /etc/ganglia/gmetad.conf

In gmetad.conf, make the following changes:

Replace:

data_source "my cluster" localhost

with (assuming that 192.168.10.22 is the IP of the server):

data_source "my cluster" 50 192.168.10.22:8649

This tells gmetad to poll the gmond daemon at 192.168.10.22 on port 8649 (the default Ganglia port) every 50 seconds. Make sure that this IP and port are reachable from the Ganglia clients that will run on the machines you plan to monitor.
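The gmetad.conf change above can also be scripted instead of edited by hand. This is a minimal sketch, assuming the file contains a single uncommented data_source line (as the default file does) and that the cluster name contains no "|" or "&" characters. set_data_source is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# set_data_source FILE CLUSTER_NAME INTERVAL HOST:PORT   (hypothetical helper)
# Rewrites every uncommented data_source line in FILE; commented lines
# (starting with "#") are left untouched.
set_data_source() {
  local file=$1 name=$2 interval=$3 target=$4 tmp
  tmp=$(mktemp)
  sed -E "s|^data_source .*|data_source \"$name\" $interval $target|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's owner and mode
  rm -f "$tmp"
}

# Example (run as root on the Ganglia server):
# set_data_source /etc/ganglia/gmetad.conf "my cluster" 50 192.168.10.22:8649
```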

You can now launch the Ganglia server:

sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart

You can access the web interface at http://192.168.10.22/ganglia/ (where 192.168.10.22 is the IP of the server).

Second, we configure the Ganglia client (i.e. gmond), either on the same machine or another machine.

sudo apt-get install -y ganglia-monitor

sudo nano /etc/ganglia/gmond.conf

In gmond.conf, make the following changes so that the Ganglia client (gmond) points to the server:

Replace:

cluster {
  name = "unspecified"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

with

cluster {
  name = "my cluster"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

Replace:

udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}

with

udp_send_channel {
  # mcast_join = 239.2.11.71
  host = 192.168.10.22
  port = 8649
  ttl = 1
}

Replace:

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}

with

udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  # bind = 239.2.11.71
}
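The three gmond.conf edits above can be applied in one pass with sed. This is a rough sketch rather than a tested deployment tool: it assumes each block opens at the start of a line and closes with a "}" in the first column (as in the default file), and that the cluster name contains no "/" or "&". configure_gmond is a hypothetical helper name. Instead of commenting out mcast_join in udp_send_channel and adding a host line, it rewrites that line into the host line, which has the same effect.

```shell
#!/usr/bin/env bash
# configure_gmond FILE CLUSTER_NAME SERVER_IP   (hypothetical helper)
configure_gmond() {
  local file=$1 cluster=$2 server=$3 tmp
  tmp=$(mktemp)
  sed -E \
    -e '/^cluster[[:space:]]*\{/,/^\}/ s/^([[:space:]]*name[[:space:]]*=[[:space:]]*).*/\1"'"$cluster"'"/' \
    -e '/^udp_send_channel[[:space:]]*\{/,/^\}/ s/^([[:space:]]*)mcast_join[[:space:]]*=.*/\1host = '"$server"'/' \
    -e '/^udp_recv_channel[[:space:]]*\{/,/^\}/ s/^([[:space:]]*)(mcast_join|bind)([[:space:]]*=)/\1# \2\3/' \
    "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's owner and mode
  rm -f "$tmp"
}

# Example (run as root on a client):
# configure_gmond /etc/ganglia/gmond.conf "my cluster" 192.168.10.22
```

Working on a copy first and comparing it against the original with diff is a cheap way to check the result before restarting gmond.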

You can now start the Ganglia client:

sudo /etc/init.d/ganglia-monitor restart

It should appear within 30 seconds in the Ganglia web interface on the server (i.e., http://192.168.10.22/ganglia/).
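If a client does not show up, it can help to look at what the gmond on the server has actually received. With the default configuration, gmond dumps its state as XML to anyone who connects to TCP port 8649. The sketch below assumes that default is unchanged and that nc is installed. list_gmond_hosts is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# list_gmond_hosts (hypothetical helper): read gmond XML on stdin and print
# the NAME attribute of every <HOST ...> element, one per line, deduplicated.
list_gmond_hosts() {
  grep -o '<HOST NAME="[^"]*"' \
    | sed -e 's/^<HOST NAME="//' -e 's/"$//' \
    | sort -u
}

# Example (IP taken from the answer's setup):
# nc 192.168.10.22 8649 | list_gmond_hosts
```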

Since the gmond.conf file is the same for all clients, you can add Ganglia monitoring to a new machine in a few seconds:

sudo apt-get install -y ganglia-monitor
wget http://somewebsite/gmond.conf # this gmond.conf is configured so that it points to the right ganglia server, as described above
sudo cp -f gmond.conf /etc/ganglia/gmond.conf
sudo /etc/init.d/ganglia-monitor restart

I used the following guides:


A bash script to start or restart gmond on all the servers you want to monitor:

deploy.sh:

#!/usr/bin/env bash

# Some useful resources:
# while read ip user pass; do : http://unix.stackexchange.com/questions/92664/how-to-deploy-programs-on-multiple-machines
# -o StrictHostKeyChecking=no: http://askubuntu.com/questions/180860/regarding-host-key-verification-failed
# -T: http://stackoverflow.com/questions/21659637/how-to-fix-sudo-no-tty-present-and-no-askpass-program-specified-error
# echo $pass |: http://stackoverflow.com/questions/11955298/use-sudo-with-password-as-parameter
# http://stackoverflow.com/questions/36805184/why-is-this-while-loop-not-looping

# Each line of servers.txt holds: ip user password.
# The list is read on file descriptor 3 so that ssh cannot consume it through stdin.
while read -r ip user pass <&3; do
  echo "$ip"
  # $pass is expanded locally before ssh runs, so a password containing
  # a single quote would break the remote command.
  sshpass -p "$pass" ssh -o StrictHostKeyChecking=no -T "$user@$ip" "
  echo '$pass' | sudo -S /etc/init.d/ganglia-monitor restart
  "
  echo 'done'
done 3<servers.txt

servers.txt:

53.12.45.74 my_username my_password
54.12.45.74 my_username my_password
57.12.45.74 my_username my_password

Screenshots of the main page in the web interface: (two images, not reproduced here)

https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/ch04.html gives a nice overview of the Ganglia web interface. (Image not reproduced here.)
