Monitoring a GPU cluster

cluster gpu monitoring nvidia software-rec

I have 10 servers running on Ubuntu 14.04 x64. Each server has a few Nvidia GPUs. I am looking for a monitoring program that would allow me to view the GPU usage on all servers at a glance.

Best Answer

You can use the Ganglia monitoring software (free of charge, open source). It has a number of user-contributed Gmond Python DSO metric modules, including a GPU Nvidia module (/ganglia/gmond_python_modules/gpu/nvidia/).

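Its architecture is typical of cluster monitoring software: a gmond daemon on each monitored node collects metrics, a central gmetad daemon polls gmond and stores the data in RRD files, and a web frontend displays them.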

It's straightforward to install (about 30 minutes without rushing), except for the GPU Nvidia module, which lacks clear documentation (I am still stuck on it).

To install Ganglia, proceed as follows. On the server:

sudo apt-get install -y ganglia-monitor rrdtool gmetad ganglia-webfrontend

Choose Yes each time you are asked a question about Apache.

First, we configure the Ganglia server, i.e. gmetad:

sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf

sudo nano /etc/ganglia/gmetad.conf

In gmetad.conf, make the following changes:

Replace:

data_source "my cluster" localhost

with (assuming that 192.168.10.22 is the IP of the server):

data_source "my cluster" 50 192.168.10.22:8649

This tells gmetad to poll the gmond daemon at 192.168.10.22 on port 8649 (the default Ganglia port) every 50 seconds. Make sure that this IP and port are reachable from the Ganglia clients that will run on the machines you plan to monitor.
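If you prefer not to edit the file by hand, the same change can be scripted. This is only a minimal sketch: it assumes GNU sed and that gmetad.conf still contains the stock data_source line shown above. The helper name set_data_source is my own and is not part of Ganglia.

```shell
# set_data_source FILE HOST [INTERVAL] [PORT]
# Hypothetical helper: rewrites the stock data_source line in place so that
# gmetad polls HOST:PORT every INTERVAL seconds (defaults: 50 s, port 8649).
set_data_source() {
  file=$1
  host=$2
  interval=${3:-50}
  port=${4:-8649}
  # Only the untouched default line is matched; any other line is left alone.
  sed -i "s/^data_source \"my cluster\" localhost[[:space:]]*\$/data_source \"my cluster\" $interval $host:$port/" "$file"
}

# Example (run as root, since the file lives in /etc):
#   set_data_source /etc/ganglia/gmetad.conf 192.168.10.22
```

If the line has already been edited, the function changes nothing, so it is safe to run twice.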

You can now launch the Ganglia server:

sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart

You can access the web interface at http://192.168.10.22/ganglia/ (where 192.168.10.22 is the IP of the server).

Second, we configure the Ganglia client (i.e. gmond), either on the same machine or another machine.

sudo apt-get install -y ganglia-monitor

sudo nano /etc/ganglia/gmond.conf

In gmond.conf, make the following changes so that the Ganglia client, i.e. gmond, points to the server:

Replace:

cluster {
name = "unspecified"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}

with:

cluster {
name = "my cluster"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}

Replace:

udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
ttl = 1
}

with:

udp_send_channel {
# mcast_join = 239.2.11.71
host = 192.168.10.22
port = 8649
ttl = 1
}

Replace:

udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}

with:

udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8649
# bind = 239.2.11.71
}
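The three replacements above can also be applied with sed, which is handy when preparing the shared gmond.conf. Treat this as a sketch under assumptions: it needs GNU sed, it expects the stock blocks with their closing braces at the start of a line, and it replaces the mcast_join line in udp_send_channel with the host line instead of commenting it out. The helper name make_unicast is made up for illustration.

```shell
# make_unicast FILE SERVER_IP [CLUSTER_NAME]
# Hypothetical helper: switches a stock gmond.conf from multicast to unicast.
make_unicast() {
  conf=$1
  server=$2
  cluster=${3:-"my cluster"}
  sed -i \
    -e "/^cluster/,/^}/ s/name = \"unspecified\"/name = \"$cluster\"/" \
    -e "/^udp_send_channel/,/^}/ s/mcast_join = .*/host = $server/" \
    -e '/^udp_recv_channel/,/^}/ s/^\([[:space:]]*\)mcast_join = /\1# mcast_join = /' \
    -e '/^udp_recv_channel/,/^}/ s/^\([[:space:]]*\)bind = /\1# bind = /' \
    "$conf"
}

# Example (run as root):
#   make_unicast /etc/ganglia/gmond.conf 192.168.10.22
```

Each expression is limited to one block by its address range, so the port and ttl lines and the other "unspecified" values are left untouched. A cluster name containing a slash would break the sed expression.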

You can now start the Ganglia client:

sudo /etc/init.d/ganglia-monitor restart

It should appear within 30 seconds in the Ganglia web interface on the server (i.e., http://192.168.10.22/ganglia/).
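If a client does not show up, it can help to look at what the server's gmond has actually received. As far as I know, gmond dumps its state as XML to anyone who connects to its TCP port, so the sketch below simply extracts the host names from that dump. It assumes the default tcp_accept_channel on port 8649 is still enabled and that nc is installed; list_gmond_hosts is a name I made up.

```shell
# list_gmond_hosts: reads gmond's XML dump on stdin and prints one
# host name per line (taken from the NAME attribute of each <HOST> element).
list_gmond_hosts() {
  grep -o '<HOST NAME="[^"]*"' | sed 's/^<HOST NAME="//; s/"$//'
}

# Example (192.168.10.22 is the server used throughout this answer):
#   nc 192.168.10.22 8649 | list_gmond_hosts
```

A client that is missing from this list has not reached the server's gmond, which points at the udp_send_channel settings or a firewall, not at the web frontend.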

Since the gmond.conf file is the same for all clients, you can add Ganglia monitoring to a new machine in a few seconds:

sudo apt-get install -y ganglia-monitor
wget http://somewebsite/gmond.conf # this gmond.conf is configured so that it points to the right ganglia server, as described above
sudo cp -f gmond.conf /etc/ganglia/gmond.conf
sudo /etc/init.d/ganglia-monitor restart
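Since the GPU Nvidia module is hard to set up, a possible workaround (not the Python module itself) is to push GPU values into Ganglia with the gmetric command-line tool once gmond is running on the client. The following is only a sketch: it assumes nvidia-smi and gmetric are installed on the client and that your driver's nvidia-smi supports --query-gpu. The function names and the metric names (gpu0_util, gpu0_mem_used, ...) are made up for illustration.

```shell
# is_uint VALUE: succeeds only if VALUE is a non-empty string of digits.
is_uint() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
  esac
}

# publish_gpu_metrics: reads "index, utilization, memory_used" CSV lines on
# stdin and sends up to two gmetric values per GPU. Non-numeric fields
# (e.g. "[Not Supported]" on some cards) are skipped.
publish_gpu_metrics() {
  while IFS=', ' read -r idx util mem; do
    is_uint "$idx" || continue
    if is_uint "$util"; then
      gmetric --name "gpu${idx}_util" --value "$util" --type uint16 --units '%'
    fi
    if is_uint "$mem"; then
      gmetric --name "gpu${idx}_mem_used" --value "$mem" --type uint32 --units 'MiB'
    fi
  done
}

# Example, e.g. run every minute from cron on each client:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits | publish_gpu_metrics
```

The metrics should then show up under each host in the web interface like any other gmond metric, which gives the at-a-glance GPU view without the Python module.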


A bash script to start or restart gmond on all the servers you want to monitor:

deploy.sh:

#!/usr/bin/env bash

# Some useful resources:
# while read ip user pass; do : http://unix.stackexchange.com/questions/92664/how-to-deploy-programs-on-multiple-machines
# -o StrictHostKeyChecking=no: http://askubuntu.com/questions/180860/regarding-host-key-verification-failed
# -T: http://stackoverflow.com/questions/21659637/how-to-fix-sudo-no-tty-present-and-no-askpass-program-specified-error
# echo $pass |: http://stackoverflow.com/questions/11955298/use-sudo-with-password-as-parameter
# http://stackoverflow.com/questions/36805184/why-is-this-while-loop-not-looping


while read -r ip user pass <&3; do
  echo "$ip"
  # $pass is expanded locally before the command is sent to the remote host.
  sshpass -p "$pass" ssh -o StrictHostKeyChecking=no -T "$user@$ip" "
  echo '$pass' | sudo -S /etc/init.d/ganglia-monitor restart
  "
  echo 'done'
done 3<servers.txt

servers.txt:

53.12.45.74 my_username my_password
54.12.45.74 my_username my_password
57.12.45.74 my_username my_password

https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/ch04.html gives a nice overview of the Ganglia web interface.

Related Solutions

Multi Nvidia GPU overclocking for computations (CUDA)

Changing the xorg.conf file to add virtual X servers for each of the cards (even those not connected to a monitor) solved the issue.

Basically, you want to have a server layout section with all of your real and virtual screens:

Section "ServerLayout"  
    Identifier    "Layout0"     
#   Our real monitor
    Screen 0      "Screen0" 0 0     
#   Our virtual monitors
    Screen 1      "Screen1"     
    Screen 2      "Screen2"
#    ....
    Screen 3      "Screen3"     
    InputDevice   "Keyboard0" "CoreKeyboard"
    InputDevice   "Mouse0"    "CorePointer" 
EndSection

Then, for each of your cards, you can add (almost) identical "Screen", "Monitor" and "Device" sections, differing only in their identifiers. These are written with N below, which should be replaced by the card number (0, 1, etc.). Note that at least the parameters for the real monitor should match what you currently have in your xorg.conf file; for example, I have CRT below since it's an old VGA monitor.

Section "Screen"
    Identifier     "ScreenN"
    Device         "DeviceN"
    Monitor        "MonitorN"
    DefaultDepth 24
    Option         "ConnectedMonitor" "CRT"
    Option         "Coolbits" "5"
    Option         "TwinView" "0"
    Option         "Stereo" "0"
    Option         "metamodes" "nvidia-auto-select +0+0"
    SubSection     "Display"
       Depth 24
    EndSubSection
EndSection



Section "Monitor"
    Identifier     "MonitorN"
    VendorName     "Unknown"
    ModelName      "CRT-N"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "DeviceN"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Your Card name here"
    BusID          "PCI:X:Y:Z"
EndSection
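The BusID placeholder PCI:X:Y:Z has to be filled in per card. One pitfall is that nvidia-smi and lspci print bus IDs in hexadecimal, while xorg.conf expects decimal numbers. A small conversion sketch, assuming the domain:bus:device.function format that nvidia-smi prints (e.g. 00000000:0A:00.0); the helper name busid_to_xorg is my own:

```shell
# busid_to_xorg ID
# Converts a hexadecimal PCI address of the form domain:bus:device.function
# into the decimal "PCI:bus:device:function" form used in xorg.conf.
busid_to_xorg() {
  id=${1#*:}      # drop the PCI domain
  bus=${id%%:*}
  id=${id#*:}
  dev=${id%%.*}
  fn=${id#*.}
  printf 'PCI:%d:%d:%d\n' "$((0x$bus))" "$((0x$dev))" "$((0x$fn))"
}

# Example:
#   nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader |
#     while read -r id; do busid_to_xorg "$id"; done
```

Addresses without a domain part (as printed by plain lspci) are not handled by this sketch.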

User-based GPU priority

I know a little about this from running both a Plex server and an ML instance on the same bare-metal server a few years back. CUDA 5.5 added the option to set stream priorities at the driver level, which lets an end user prioritize GPU work much like other activities in Linux. At the GUI level, Nvidia added "performance modes" to its settings around 2016. Here's an article I found detailing this: http://ubuntuhandbook.org/index.php/2016/04/switch-intel-nvidia-graphics-ubuntu-16-04/ . I don't know if this will help at all, but I figured I'd share what helped me.

Best of luck!
