Use a query to investigate Scheduler ( aka cron ) delays

Introduction

The SugarCRM Scheduler handles all time based automatons. This includes thre main types of event, events that are expected to process at a date + time, events that are expected to process in batches to keep pace with changes, and events that handle aging/cleanup of records.

Scheduled jobs are queued in a DB table. The table doubles as a log to provide a view into the recent history of Scheduler cycle timing as well as the outcome of each cycle. Examining the history of cycles can help identify delays and failures that can be addressed to avoid downtime or delays for your team and their relationship with your customers

Scheduler Flow

Schedulers create queued jobs with an execution date + time.
On Sugar Cloud the Schedulers for a production ( non-sandbox ) site are expected to trigger every 1 - 4 minutes.
The scheduler trigger is skipped when there is a run in progress.
A scheduler run is made up of a batch of queued jobs that have an execution date.that is before the current time when the schduler batch is created.
That batch of jobs is run one at a time when, if any fail the next run should pick up and attempt the remaining jobs.
The failed job is recorded and marked as done along with a result of failure; a brief annotation of the PHP error is kept.
The next batch will start when the next trigger is run after the current run ends.

Risks of the Flow

Slow running jobs can cause a delay in jobs that are expected to run in a specific time window.
Jobs that fail may go unnoticed and lead to confusing problems.
Jobs share the same resources as your team uses to interact with SugarCRM, this can cause reduced performance for the users or high user activity can lead to a job that fails in an unexpected way.

Scheduler Event History

The metrics kept are date+time the job was queued, the date+time the job was expected to run, and the date+time the batch was finished.

There are also fields for success/failure and notation of any PHP error at time of failure.

The limited history still offers an opportunity to find insights into problems.

The Query

SELECT date_modified "Run End", MAX(TIMESTAMPDIFF(SECOND, execute_time, date_modified)) "Delay (sec)", resolution, count(*) "Batch Size", GROUP_CONCAT(name) "Batch List" FROM job_queue WHERE status = "done" AND date_modified > DATE_SUB(NOW(),INTERVAL 1 DAY) GROUP BY client, date_modified ORDER BY date_modified ASC;

Query Results

The "Run End" is equal to the time when the batch ended. Useful to correlate problems with the result data.

The "Delay (sec)" is the highest number of seconds behind expected time of the jobs in the batch. This is useful to identify jobs that may not be meeting the requirements of your team.

The "resolution" will be "success" or "failure" and any that are failing are worth investigation.

The "Batch Size" is here to confirm the Batch List.

The "Batch List" is the name of the scheduler for each job in the batch.This is useful in finding the scheduler(s) that may need investigation in relation to problems. Note that some jobs may show up more than once in a batch. This is normal in some cases like Elastic Search.

The query is set to collect 24 hours of historic data. The query may time out if more days are requested.

Next Steps

Use an Advanced Report to run the query. Look for high delays, failure resolutions, or jobs that seem to repeat more than expected.

Go into Admin > Schedulers to review the settings for any that are related to undesired results.

Use this query to get the past 24 hours full results for any job based on name.

SELECT * FROM job_queue WHERE name = "name of scheduler" AND date_modified > DATE_SUB(NOW(),INTERVAL 1 DAY);

Share any questions or feedback here so we can help. Open a support case for any urgent concern.