Condor Grid Monitor is not running


Running Grid Monitor is considered a best practice when running a large batch of jobs on a grid.

It is not always apparent from the client side whether Grid Monitor is running. The most reliable client-side check is to look for Grid Monitor messages in the log file /tmp/GridManagerLog.<username>.

Log message format:

If Grid Monitor starts successfully on <site>:

8/31 16:35:08 [6927] Successfully started grid_monitor for <site>

If Grid Monitor fails on <site>:

9/5 12:17:13 [17880] Error with grid_monitor for <site>, stopping.

9/5 12:17:13 [17880] Giving up on grid_monitor for site <site>. Will retry in 3600 seconds (60 minutes)

9/5 12:17:13 [17880] Stopping grid_monitor for resource <site>
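The log check described above can be automated. The following is a minimal sketch, not an official tool: the function name grid_monitor_status is hypothetical, and the two patterns are derived only from the example messages shown here, so other Condor versions may word them differently.

```python
import re

# Patterns are assumptions based on the sample log lines in this page.
STARTED = re.compile(r"Successfully started grid_monitor for (\S+)")
FAILED = re.compile(r"Error with grid_monitor for ([^\s,]+), stopping")


def grid_monitor_status(lines):
    """Return {site: "running" | "failed"} using the last matching message per site."""
    status = {}
    for line in lines:
        match = STARTED.search(line)
        if match:
            status[match.group(1)] = "running"
            continue
        match = FAILED.search(line)
        if match:
            status[match.group(1)] = "failed"
    return status
```

To use it, open /tmp/GridManagerLog.<username> for reading and pass the file object to grid_monitor_status. Later messages override earlier ones, so a site that failed and then restarted is reported as running.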

Known causes:

If jobs stream stdout or stderr, Grid Monitor cannot run. In addition, the following configuration values should evaluate as shown:

$ condor_config_val grid_monitor

$VDT_LOCATION/condor/sbin/grid_monitor.sh

$ condor_config_val enable_grid_monitor

TRUE
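The two configuration checks can also be scripted. This is a rough sketch under stated assumptions: the helper names config_val and grid_monitor_config_problems are hypothetical, condor_config_val is assumed to be on the PATH, and the script path is only checked for ending in grid_monitor.sh because $VDT_LOCATION differs per installation.

```python
import subprocess


def config_val(name):
    """Return the value printed by `condor_config_val <name>`, stripped of whitespace."""
    result = subprocess.run(
        ["condor_config_val", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def grid_monitor_config_problems(lookup=config_val):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    script = lookup("grid_monitor")
    # Assumption: only the file name is checked, not the full install path.
    if not script.endswith("grid_monitor.sh"):
        problems.append(f"grid_monitor points to an unexpected script: {script!r}")
    enabled = lookup("enable_grid_monitor")
    if enabled.strip().upper() != "TRUE":
        problems.append(f"enable_grid_monitor is {enabled!r}, expected TRUE")
    return problems
```

The lookup parameter exists so the check can be exercised without a Condor installation, for example by passing the get method of a dictionary of sample values.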

Note: even if all of the above checks out, we have occasionally seen a server-side bug that prevents Grid Monitor from running. The bug seems to occur after a site upgrades its OSG installation and leaves anomalies in jobmanager-fork (e.g. a missing newline or incorrectly set -rdn attributes). If this is the case, you will have to contact the site administrator.

Recommendations:

In all jobs that do not require streaming stdout and stderr, include the following in the submit file:

stream_output = false

stream_error = false
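Before submitting a large batch, it may help to confirm that a submit file actually carries both settings. The sketch below is illustrative only: the function names are hypothetical, keys are compared case-insensitively as an assumption, and a missing setting is treated as "not disabled" rather than relying on any particular default.

```python
def streaming_settings(submit_text):
    """Return the stream_output / stream_error values found in submit file text."""
    found = {}
    for raw in submit_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines that are not key = value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()
        if key in ("stream_output", "stream_error"):
            found[key] = value.strip().lower()
    return found


def disables_streaming(submit_text):
    """True only if both stream_output and stream_error are explicitly set to false."""
    settings = streaming_settings(submit_text)
    return (settings.get("stream_output") == "false"
            and settings.get("stream_error") == "false")
```

Reading the submit file into a string and passing it to disables_streaming gives a quick yes or no answer; streaming_settings shows which values were found when the answer is no.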

Also, make certain that GRID_MONITOR and ENABLE_GRID_MONITOR are set correctly in $VDT_LOCATION/condor/etc/condor_config (see the expected values under Known causes).
