Jobs switch between idle/running in queue

Behavior

A job submitted via Condor is added to a grid's Condor queue in the idle state. An administrator logged in to the grid can observe the job switch to the running state for some time and then return to idle. The job repeats this cycle indefinitely without ever completing.

Problem

This rogue job permanently occupies one queue slot until it is killed by the user or the admin. This behavior prevents non-faulty jobs from making use of that slot.

Known Causes

This behavior has been observed when submitting 64-bit executable binaries to a 32-bit cluster. This issue is described here.
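
One way to catch this mismatch before submitting is to inspect the executable's ELF header, whose fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit binaries. The following is a minimal sketch, not part of Condor; the function names elf_word_size and binary_word_size are hypothetical, and it assumes the executable is an ELF file, as is usual on Linux clusters.

```python
def elf_word_size(header):
    """Return 32 or 64 for the leading bytes of an ELF file, else None.

    Hypothetical helper: reads only the ELF magic number and the
    EI_CLASS byte (offset 4), which is 1 for 32-bit and 2 for 64-bit.
    """
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None  # not an ELF binary (e.g. a shell script)
    return {1: 32, 2: 64}.get(header[4])


def binary_word_size(path):
    """Read the first five bytes of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return elf_word_size(f.read(5))
```

A user could call binary_word_size on the executable and compare the result with the word size of the target cluster before submitting the job.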

Potential Other Causes

The user who submitted this Condor job did so via ssh and then logged out before the job finished running. It is possible that the disappearance of the client caused the job to toggle between running and idle.

Recommendations

Condor could count the number of times a job has run and returned to the idle state and, once a threshold has been reached, place the job on hold, giving other jobs an opportunity to use that queue slot.
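
The threshold check recommended above might look like the sketch below. It is an illustration of the logic only, not Condor's implementation: the state names, the function names count_returns_to_idle and should_hold, and the default threshold of 5 are all assumptions made for this example.

```python
def count_returns_to_idle(states):
    """Count running -> idle transitions in a sequence of observed job states."""
    return sum(
        1
        for prev, cur in zip(states, states[1:])
        if prev == "running" and cur == "idle"
    )


def should_hold(states, max_returns=5):
    """Return True once the job has fallen back to idle `max_returns` times.

    `max_returns` is a hypothetical, site-chosen threshold.
    """
    return count_returns_to_idle(states) >= max_returns


# Example: a job that has started twice and fallen back to idle twice.
history = ["idle", "running", "idle", "running", "idle"]
hold_now = should_hold(history, max_returns=2)
```

Counting only running-to-idle transitions means a job that is merely waiting in the queue is never penalized; only jobs that repeatedly start and fail to finish reach the threshold.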