I have 10 servers running on Ubuntu 14.04 x64. Each server has a few Nvidia GPUs. I am looking for a monitoring program that would allow me to view the GPU usage on all servers at a glance.
Monitoring a GPU cluster
cluster gpu monitoring nvidia software-rec
Related Solutions
Changing the xorg.conf file to add virtual X servers for each of the cards (even those not connected to a monitor) solved the issue.
Basically, you want to have a server layout section with all of your real and virtual screens:
Section "ServerLayout"
    Identifier "Layout0"
    # Our real monitor
    Screen 0 "Screen0" 0 0
    # Our virtual monitors
    Screen 1 "Screen1"
    Screen 2 "Screen2"
    # ....
    Screen 3 "Screen3"
    InputDevice "Keyboard0" "CoreKeyboard"
    InputDevice "Mouse0" "CorePointer"
EndSection
Then, for each of your cards, you can put in (almost) identical "Screen", "Monitor" and "Device" sections, differing only by their identifiers, which in the following are N but should be replaced by the card number (0, 1, etc.). Note that at least the parameters for the real monitor should correspond to what you currently have in your xorg.conf file; for example, in the following I have CRT since it's an old VGA monitor.
Section "Screen"
    Identifier "ScreenN"
    Device "DeviceN"
    Monitor "MonitorN"
    DefaultDepth 24
    Option "ConnectedMonitor" "CRT"
    Option "Coolbits" "5"
    Option "TwinView" "0"
    Option "Stereo" "0"
    Option "metamodes" "nvidia-auto-select +0+0"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection
Section "Monitor"
    Identifier "MonitorN"
    VendorName "Unknown"
    ModelName "CRT-N"
    HorizSync 28.0 - 33.0
    VertRefresh 43.0 - 72.0
    Option "DPMS"
EndSection
Section "Device"
    Identifier "DeviceN"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "Your Card name here"
    # Replace X:Y:Z with this card's PCI bus ID
    BusID "PCI:X:Y:Z"
EndSection
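Since the per-card sections differ only in the card number and the bus ID, they can be generated rather than copied by hand. The following is a minimal Python sketch of that substitution, not part of the original answer: the function names are made up for illustration, and the bus IDs in the usage note are placeholders that must be replaced with the real values for your cards.

```python
# Sketch: render the per-card "Screen", "Monitor" and "Device" sections
# by substituting the card number for N. Function names are hypothetical.

CARD_TEMPLATE = """\
Section "Screen"
    Identifier "Screen{n}"
    Device "Device{n}"
    Monitor "Monitor{n}"
    DefaultDepth 24
    Option "ConnectedMonitor" "CRT"
    Option "Coolbits" "5"
    Option "TwinView" "0"
    Option "Stereo" "0"
    Option "metamodes" "nvidia-auto-select +0+0"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Monitor"
    Identifier "Monitor{n}"
    VendorName "Unknown"
    ModelName "CRT-{n}"
    HorizSync 28.0 - 33.0
    VertRefresh 43.0 - 72.0
    Option "DPMS"
EndSection

Section "Device"
    Identifier "Device{n}"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "{board_name}"
    BusID "{bus_id}"
EndSection
"""


def render_card_sections(card_number, bus_id, board_name="Your Card name here"):
    """Return the three xorg.conf sections for one card."""
    return CARD_TEMPLATE.format(n=card_number, bus_id=bus_id, board_name=board_name)


def render_all(bus_ids):
    """Return the sections for every card; cards are numbered from 0 in list order."""
    return "\n".join(render_card_sections(n, bus_id) for n, bus_id in enumerate(bus_ids))
```

For example, `print(render_all(["PCI:1:0:0", "PCI:2:0:0"]))` prints sections for cards 0 and 1. The output still has to be merged into xorg.conf next to the "ServerLayout" section, and the monitor parameters adjusted to match your real monitor.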
I "kind of" know a little bit about this from running both a Plex server and an ML instance on the same bare-metal server a few years back. CUDA 5.5 added the option to set stream priorities at the driver level, enabling an end user to schedule priorities just like other activities in Linux. At a GUI level, Nvidia added "performance modes" in the settings around 2016. Here's an article I found detailing this: http://ubuntuhandbook.org/index.php/2016/04/switch-intel-nvidia-graphics-ubuntu-16-04/ . I don't know if this will help at all, but I figured I'd share what helped me.
Best of luck!
Best Answer
You can use the Ganglia monitoring software (free of charge, open source). It has a number of user-contributed Gmond Python DSO metric modules, including a GPU Nvidia module (/ganglia/gmond_python_modules/gpu/nvidia/). Its architecture is typical for cluster monitoring software:
[Image: Ganglia architecture diagram (source of the image)]
It's straightforward to install (~ 30 minutes without rushing), except for the GPU Nvidia module, which lacks clear documentation. (I am still stuck)
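As a possible workaround while the GPU module is not working, GPU utilisation could be read from `nvidia-smi` and pushed into Ganglia with the `gmetric` command-line tool. This is a different technique from the Python DSO module mentioned above, and the sketch below is only an illustration: the metric names (`gpu0_util`, ...) and function names are made up, and it assumes `nvidia-smi` and `gmetric` are installed on each monitored machine.

```python
# Sketch: turn nvidia-smi's CSV output into gmetric command lines.
# Metric and function names are hypothetical, not from the original answer.
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def query_gpus_csv():
    """Run nvidia-smi and return one CSV line per GPU (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.check_output(cmd, universal_newlines=True)


def parse_gpus(csv_text):
    """Parse the CSV text into a list of dicts keyed by QUERY_FIELDS."""
    gpus = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if record:
            gpus.append(dict(zip(QUERY_FIELDS, (v.strip() for v in record))))
    return gpus


def gmetric_commands(gpus):
    """Build one gmetric invocation per GPU that reports a numeric utilisation."""
    commands = []
    for gpu in gpus:
        util = gpu["utilization.gpu"]
        if not util.isdigit():
            # Some cards report "[Not Supported]"; skip them.
            continue
        commands.append([
            "gmetric",
            "--name=gpu{}_util".format(gpu["index"]),
            "--value=" + util,
            "--type=uint16",
            "--units=%",
        ])
    return commands


def publish_once():
    """Query the local GPUs and send each utilisation value to Ganglia."""
    for command in gmetric_commands(parse_gpus(query_gpus_csv())):
        subprocess.check_call(command)
```

Calling `publish_once()` from a cron job or a small loop on each GPU machine would make the values show up as custom metrics for that host. It reports less than a proper gmond module would, but it needs no module configuration.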
To install Ganglia, you can do as follows. On the server:
Choose Yes each time you are asked a question about Apache.
First, we configure the Ganglia server, i.e. gmetad:
In gmetad.conf, make the following changes:
Replace:
by (assuming that 192.168.10.22 is the IP of the server)
It means that Ganglia should listen on port 8649 (which is the default port for Ganglia). You should make sure that the IP and the port are accessible to the Ganglia clients that will run on the machines you plan to monitor.
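To check that accessibility from a machine you plan to monitor, a simple TCP connection test can help. The sketch below is an illustration only: the function name is made up, and it only verifies TCP reachability. Ganglia clients commonly send their metrics over UDP, which this test does not cover, so a firewall could still block that traffic even if the test passes.

```python
# Sketch: test whether a TCP port on the Ganglia server can be reached.
# The function name is hypothetical; 8649 is the default port named above.
import socket


def port_is_reachable(host, port=8649, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False
```

For example, `port_is_reachable("192.168.10.22")` run on a client should return `True` once the server side is up and no firewall blocks the TCP port.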
You can now launch the Ganglia server:
You can access the web interface at http://192.168.10.22/ganglia/ (where 192.168.10.22 is the IP of the server).
Second, we configure the Ganglia client (i.e. gmond), either on the same machine or another machine.
In gmond.conf, make the following changes so that the Ganglia client, i.e. gmond, points to the server:
Replace:
to
Replace
by
Replace:
to
You can now start the Ganglia client:
It should appear within 30 seconds in the Ganglia web interface on the server (i.e., http://192.168.10.22/ganglia/).
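Besides the web interface, one way to confirm that a client has shown up is to read the XML state that gmond writes to any TCP client connecting to its port. The sketch below assumes gmond's TCP accept channel is enabled on port 8649 and accepts connections from your machine; the function names and the `gpu` metric prefix are made up for illustration.

```python
# Sketch: list the hosts (and optionally only some metrics) that gmond knows about.
# Function names are hypothetical; assumes the TCP accept channel is on port 8649.
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to gmond and read the XML dump until the server closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_and_metrics(xml_data, name_prefix=""):
    """Map each HOST name to a dict of {metric name: value string}.

    Only metrics whose name starts with name_prefix are kept.
    """
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
            if metric.get("NAME", "").startswith(name_prefix)
        }
    return result
```

For example, `hosts_and_metrics(fetch_gmond_xml("192.168.10.22"), name_prefix="gpu")` would return a per-host dictionary of GPU-related metrics, provided such metrics are being collected under names starting with `gpu`.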
Since the gmond.conf file is the same for all clients, you can add the Ganglia monitoring on a new machine in a few seconds:
I used the following guides:
A bash script to start or restart gmond on all the servers you want to monitor consists of two files: deploy.sh and servers.txt.
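The contents of those two files are not reproduced here. The following Python sketch shows the same idea under stated assumptions: servers.txt is assumed to hold one hostname per line, the service is assumed to be named `ganglia-monitor` (as installed by the Ubuntu package of that name), and each server is assumed to allow key-based SSH login and passwordless `sudo`. The function names are made up for illustration.

```python
# Sketch: restart gmond on every server listed in a servers.txt-style file.
# File format, service name and function names are assumptions, not the
# original deploy.sh.
import subprocess


def read_servers(text):
    """Return hostnames, one per line, ignoring blank lines and '#' comments."""
    servers = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            servers.append(line)
    return servers


def restart_command(server, service="ganglia-monitor"):
    """Build the ssh command that restarts the service on one server."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        server,
        "sudo service {} restart".format(service),
    ]


def restart_all(servers, runner=subprocess.call):
    """Run the restart command on each server; return {server: exit code}."""
    return {server: runner(restart_command(server)) for server in servers}
```

A typical use would be `restart_all(read_servers(open("servers.txt").read()))`, then checking the returned exit codes for non-zero values. The `runner` parameter exists so the commands can be inspected or dry-run without contacting any server.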
[Screenshots of the main page in the web interface]
https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/ch04.html gives a nice overview of the Ganglia Web Interface: