Summary
On May 1, 2019, starting at about 4:34 PM Eastern, Dispatch experienced a significant outage affecting portions of its systems. The most visible impact was that users were unable to log in to the web and mobile applications. Systems started to recover at 5:22 PM Eastern, but some users saw lingering issues until approximately 6:55 PM Eastern.
This disruption is completely unacceptable. We take system stability and availability very seriously and strive to maintain 100% uptime; in this case we failed to meet that goal. We are very sorry for the impact this had on our users, and we are committed to doing better.
Root Cause and Follow Up
A server restart during a normal service upgrade led to a bottleneck in one component of our system, which in turn triggered a series of cascading error conditions that made the problem difficult to diagnose and recover from. The engineering team was aware of the issue within minutes and was “all hands on deck” addressing it. Multiple steps were required to diagnose and then eliminate the bottleneck and the cascading error conditions. By 5:22 PM Eastern the system had started to recover, but cleaning up the condition that led to the bottleneck caused some lingering effects until we were able to eliminate that condition. In the week since the outage, we have made several improvements that prevent the bottleneck and its cascading effects from happening again. We have also implemented improved monitoring that would have helped us diagnose the issue faster, as well as process improvements that would allow for much faster recovery.
As we move forward, we will keep working to make our system robust and to maximize uptime. We understand that Dispatch is a crucial part of your daily workflows, and we appreciate your patience and understanding.