Delays with Candidate Syncs

Incident Report for Jobvite Operational

Postmortem

Date: April 8, 2026
Duration: ~11 hours (1:00 AM ET – 12:00 PM ET)

We want to share an update on a recent automation issue, including what happened, how it was resolved, and the steps we’ve taken to prevent it from happening again.

Customer Impact

On April 8, 2026, certain background processing agents responsible for moving data between Talemetry Apply and external Applicant Tracking Systems (ATS) were not running as expected.

As a result:

  • Job and application data imports/exports were delayed.
  • Job status updates were not reflected in near real time.
  • Newly submitted applications were not immediately visible in the external ATS.

No data was lost. All affected jobs and applications were successfully processed once service was restored.

Root Cause

The issue was caused by congestion in the Mukmuk agent processing system:

  • More agents were scheduled to run than the background worker pods, which were already handling active workload, had capacity for.
  • Several non-essential agents consumed that capacity and blocked higher-priority production agents, leaving the system in a stalled state where critical agents could not execute.

In short, the system lacked sufficient safeguards to prevent inactive or non-critical agents from interfering with production workloads during peak processing conditions.
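
To illustrate the kind of safeguard described above, the following is a minimal sketch of capacity reservation, in which non-critical agents can never occupy the slots set aside for production agents. It is not the Mukmuk scheduler: the `Agent` type, the `critical` flag, the `admit` function, and the `reserved_for_critical` parameter are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    critical: bool  # hypothetical flag: True for production agents


def admit(pending, total_slots, reserved_for_critical):
    """Pick which pending agents may start now (illustrative only).

    Critical agents are admitted first. Non-critical agents may only use
    slots beyond the reserved share, so they cannot starve critical work.
    """
    critical = [a for a in pending if a.critical]
    other = [a for a in pending if not a.critical]

    admitted = critical[:total_slots]
    free = total_slots - len(admitted)
    # Cap on non-critical agents, regardless of how many slots are free.
    non_critical_cap = max(0, total_slots - reserved_for_critical)
    admitted += other[:min(free, non_critical_cap)]
    return admitted


pending = [
    Agent("dev-1", False),
    Agent("dev-2", False),
    Agent("dev-3", False),
    Agent("prod-import", True),
    Agent("prod-export", True),
]
started = admit(pending, total_slots=3, reserved_for_critical=2)
print([a.name for a in started])
```

With three slots and two reserved, both production agents start even though the development agents were queued first, and only one development agent is admitted.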

Resolution

Once the issue was identified, the following actions were taken:

  1. Non-essential development agents were disabled.
  2. Background processing capacity was temporarily increased to accelerate backlog processing.
  3. Queued jobs were monitored until all pending imports and exports completed successfully.

By approximately 12:00 PM ET, all agents had caught up, job processing had returned to normal, and customer-facing data showed up-to-date timestamps, confirming that service was restored.

Preventative Actions

To reduce the likelihood and impact of similar incidents in the future, the following improvements are underway or completed:

  • Proactive Monitoring

    • New alerts are being added to detect stalled or non-running agents in near real time.
    • Monitoring schedules are being expanded beyond limited overnight checks.
  • Improved Observability

    • Additional metrics will track agent execution health and backlog growth.
    • Engineering alerts will trigger automatically when agents fail to run as expected.
  • Operational Safeguards

    • Unused or churned agents are being cleaned up to prevent resource contention.

These actions are intended to significantly reduce detection, diagnosis, and recovery times if similar conditions arise again.
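
As a rough illustration of the stalled-agent alerting listed above, the sketch below flags any agent whose last successful run is older than a multiple of its expected interval, or that has never run. It is an assumption-laden example, not the monitoring actually deployed: `find_stalled`, the `grace` multiplier, and the agent names and intervals are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stalled(last_run, expected_interval, now, grace=2.0):
    """Return the names of agents that look stalled (illustrative only).

    last_run: dict of agent name -> datetime of last successful run
    expected_interval: dict of agent name -> timedelta between runs
    grace: hypothetical multiplier applied to the interval before alerting
    """
    stalled = []
    for name, interval in expected_interval.items():
        seen = last_run.get(name)
        # An agent with no recorded run, or one overdue past the grace
        # window, is treated as stalled.
        if seen is None or now - seen > interval * grace:
            stalled.append(name)
    return sorted(stalled)


now = datetime(2026, 4, 8, 6, 0, tzinfo=timezone.utc)
expected = {
    "import": timedelta(minutes=15),
    "export": timedelta(minutes=15),
    "status": timedelta(minutes=15),
}
last_run = {
    "import": now - timedelta(minutes=10),  # healthy
    "export": now - timedelta(hours=3),     # overdue
    # "status" has never run
}
print(find_stalled(last_run, expected, now))
```

A check like this would need to run continuously rather than only during limited overnight windows to catch a stall close to when it begins.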

Posted Apr 13, 2026 - 09:16 PDT

Resolved

Summary:
We experienced an incident affecting the API service that syncs candidate data from our CRM to downstream systems.

Impact:
Customers may have experienced delays in seeing updated candidate information appear in the Jobvite ATS during the incident window.

Affected Service:
Mukmuk API Engine

Timeline:
Start: April 8 at approximately 2:00 AM EDT
Duration: Approximately 8 hours
Posted Apr 07, 2026 - 23:00 PDT