Linux – Writing triggers for mcelog

hardwarelinuxmcelogmonitoringrhel

Just starting to look into mcelog for the first time (I've enabled it and seen syslog output before, but this is the first time I'm trying to do something non-default). I'm looking for information on how to write triggers for it. Specifically, I'm looking for what kinds of events mcelog can react to, how it decides which scripts to execute, and so on. Best I can make from the example trigger is that it sets a bunch of environmental variables before invoking the script. So does it just try to execute everything in the trigger directory (which is /etc/mcelog on RHEL) and let the script decide what it wants to act on?

I've seen other trigger scripts with names that look like MCE events, is that convention or does that have a special function? I created a trigger called /etc/mcelog/joel.sh which just sends a basic email to my gmail account. A few days ago apparently the trigger went off because I got an email from the script without manually running the script. I didn't think to pipe env output to the mailx command in joel.sh so I don't know what hardware event triggered the script execution or why mcelog picked joel.sh as the script to execute for it.

Basically, I'm looking for an answer that will give me a basic orientation with mcelog, it's triggering system, and how I can use it to monitor my hardware health. I'm pretty sure I can figure out the more advanced stuff once I get my bearings.

Best Answer

Looking at the sample mcelog.conf config file it looks to contain most if not all of the types of triggers it can deal with.

DIMMs

[dimm]
#
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
# Note when the hardware does not report DIMMs this might also
# be per channel
# The default of 10/24h is reasonable for server quality·
# DDR3 DIMMs as of 2009/10
#uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
#ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

Sockets

[socket]
# Threshold and trigger for uncorrected memory errors on a socket
# mem-uc-error-trigger = socket-memory-error-trigger
mem-uc-error-threshold = 100 / 24h
# Threshold and trigger for corrected memory errors on a socket
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h

Cache

[cache]
# Processing of cache error thresholds reported by Intel CPUs
cache-threshold-trigger = cache-error-trigger

Page

[page]
# Memory error accouting per 4K memory page
# Threshold for the correct memory errors trigger script
memory-ce-threshold = 10 / 24h
# Trigger script for corrected errors
# memory-ce-trigger = page-error-trigger

Triggers

Triggers can be controlled in this section.

[trigger]
# Maximum number of running triggers
children-max = 2
# execute triggers in this directory
directory = /etc/mcelog

Sample triggers

There are some sample triggers here on the mcelog github page.

Sample trigger script, dimm-error-triggers:

#!/bin/sh
#  This shell script can be executed by mcelog in daemon mode when a DIMM
#  exceeds a pre-configured error threshold
# 
# environment:
# THRESHOLD     human readable threshold status
# MESSAGE   Human readable consolidated error message
# TOTALCOUNT    total count of errors for current DIMM of CE/UC depending on
#       what triggered the event
# LOCATION  Consolidated location as a single string
# DMI_LOCATION  DIMM location from DMI/SMBIOS if available
# DMI_NAME  DIMM identifier from DMI/SMBIOS if available
# DIMM      DIMM number reported by hardware
# CHANNEL   Channel number reported by hardware
# SOCKETID  Socket ID of CPU that includes the memory controller with the DIMM
# CECOUNT   Total corrected error count for DIMM
# UCCOUNT   Total uncorrected error count for DIMM
# LASTEVENT Time stamp of event that triggered threshold (in time_t format, seconds)
# THRESHOLD_COUNT Total umber of events in current threshold time period of specific type
#
# note: will run as mcelog configured user
# this can be changed in mcelog.conf

logger -s -p daemon.err -t mcelog "$MESSAGE"
logger -s -p daemon.err -t mcelog "Location: $LOCATION"

[ -x ./dimm-error-trigger.local ] && . ./dimm-error-trigger.local

exit 0

References

Related Question