Application Accessibility, API Performance

Incident Report for Dispatch

Postmortem

Summary

On May 1, 2019, starting at about 4:34 PM Eastern, Dispatch experienced a significant outage affecting portions of its systems. The most visible impact was that users could not log in to the web and mobile applications. Systems started to recover at 5:22 PM Eastern, but some users saw lingering issues until approximately 6:55 PM Eastern.

This disruption is completely unacceptable and we have to do better. We take system stability and availability very seriously and strive to maintain 100% uptime; in this case we failed to meet that goal. We are very sorry for the impact this had on our users, and we are committed to doing better.

Root Cause and Follow Up

A server restart during a normal service upgrade led to a bottleneck in one component of our system, which in turn triggered a series of cascading error conditions that made the problem difficult to diagnose and recover from. The engineering team was aware of the issue within minutes and was “all hands on deck” addressing the problem. Multiple steps were required, first to diagnose and then to eliminate the bottleneck and the cascading error conditions. By 5:22 PM Eastern the system had started to recover, but cleaning up the condition that led to the bottleneck caused some lingering effects until we were able to eliminate that condition. In the week since the outage, we have made several improvements that prevent the bottleneck and cascading effects from happening again. We have also implemented improved monitoring that would have helped us diagnose the issue faster, along with process improvements that would allow for much faster recovery.

As we move forward, we will keep working to ensure that our system is robust and that we maximize uptime. We understand that Dispatch is a crucial part of your daily workflows, and we appreciate your patience and understanding.

Posted May 08, 2019 - 10:11 EDT

Resolved

We've updated the status of this incident to resolved.

Posted May 01, 2019 - 20:17 EDT

Monitoring

Services are showing signs of normal operation across end-users and Enterprise partners. Access to Mobile and Web applications, including User Login, Job Updates, Job Creation and more, is reporting as expected. We're monitoring the health of the infrastructure until we are 100% confident that services are fully restored.

Posted May 01, 2019 - 19:12 EDT

Update

Our team continues to investigate service stability and accessibility. A resolution has not been identified, but we are "all hands on deck" to restore operations.

Posted May 01, 2019 - 18:20 EDT

Investigating

Active investigation continues with our Engineering team. Users and Enterprise partners attempting to interact with our infrastructure may experience intermittent accessibility.

Posted May 01, 2019 - 18:05 EDT

Identified

Users may begin to see accessibility returning to normal operation. Our team is actively continuing to check system monitors and metrics. Access to web apps, mobile apps, and APIs may still be intermittent.

Posted May 01, 2019 - 17:49 EDT

Update

As our team continues to investigate, we recognize that Enterprise partners may be experiencing error responses from integrations, job creation attempts, and other related requests.

Posted May 01, 2019 - 17:31 EDT

Update

Our Engineering team is continuing its work to restore accessibility to normal operation. We'll provide another update as soon as possible.

Posted May 01, 2019 - 17:11 EDT

Update

We are continuing to investigate this issue.

Posted May 01, 2019 - 16:59 EDT

Investigating

Our team is actively investigating a problem affecting user access to our applications. We'll provide an update as soon as possible.

Posted May 01, 2019 - 16:44 EDT

This incident affected: Web and Mobile Apps (Mobile Applications, Web Applications).