We've encountered the following situation in the wild:
- A cluster of three nodes is up and running
- One of the nodes loses connection with the HSM
- The admin team does not remove that node from the cluster
- When the services are triggered, the faulty node grabs the job and fails
- Starting the job resets the timer, so the other nodes skip running the service
- Because all three nodes run on the same interval, the faulty node keeps grabbing the service and blocking the others
To fix this, I propose adding a sanity check to each service worker type. We can't stop the faulty node from resetting the timer, but we can make the timer on that node skip an interval, which should let one of the healthy nodes grab the service the next time the interval hits.
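
Here's a rough sketch of what that could look like. This is not our actual worker code: the loop structure and the names `sanityCheck`, `claimJob`, `runJob`, and the five-minute interval are placeholders to illustrate the idea of probing the HSM before claiming and, on failure, pushing this node's next run out by one extra interval.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errHSMUnavailable stands in for whatever error the HSM probe returns
// when the connection is down.
var errHSMUnavailable = errors.New("hsm connection unavailable")

// sanityCheck is a placeholder for the per-worker health probe, e.g. a
// cheap HSM round-trip performed before the node claims the job.
func sanityCheck() error {
	// Placeholder: return errHSMUnavailable when the HSM is unreachable.
	return nil
}

// claimJob is a placeholder for the existing "grab the service" step
// that resets the shared timer once a node starts the job.
func claimJob() bool { return true }

// runJob is a placeholder for the actual service work.
func runJob() error { return nil }

func workerLoop(interval time.Duration) {
	next := time.Now()
	for {
		time.Sleep(time.Until(next))

		// Proposed change: probe the HSM before claiming. On failure,
		// skip one interval on this node only, so a healthy node wins
		// the claim at the next tick instead.
		if err := sanityCheck(); err != nil {
			fmt.Println("sanity check failed, backing off one interval:", err)
			next = next.Add(2 * interval)
			continue
		}

		if claimJob() {
			if err := runJob(); err != nil {
				fmt.Println("job failed:", err)
			}
		}
		next = next.Add(interval)
	}
}

func main() {
	workerLoop(5 * time.Minute)
}
```

The key point is that only the faulty node backs off; the healthy nodes keep their normal schedule, so one of them should win the claim at the next interval.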