Notifications Delay
Incident Report for Dispatch
Postmortem

DETAILED DESCRIPTION OF IMPACT

On Wednesday, April 10, 2019, Dispatch experienced a backlog of notifications from 4:00 PM ET until 12:30 AM ET the following morning. We began processing the backlog at 12:30 AM ET, and all messages had been processed by 1:45 AM ET. As a result, customers received some notifications late.

TIMELINE AND ROOT CAUSE

At 4:00 PM ET, we received an alert that our url-shortener service was having issues. The url-shortener service creates the shortened URL links that are included in our notifications to customers. This service relies on MongoDB, which is hosted by a third-party provider. The provider had planned maintenance underway, but all of their backup servers failed as well. Because of these failures, our url-shortener instances could not connect to either the primary or the backup servers that the provider had allotted to us. We opened multiple support tickets to learn more about the root cause, but we did not receive a response from the provider until two hours after the incident began.

As a result, nearly all notifications that were supposed to be sent between 4:00 PM and 12:30 AM ET were queued instead. A very small portion of notifications was still sent during this period because we retained an intermittent connection to MongoDB.

At 12:30 AM ET, the provider reported that the remaining cluster hosts had been restored and were back online. We then processed all of the backed-up notifications, finishing by 1:45 AM ET. No notifications were lost during this process.

DETECTION, REMEDIATION, AND PREVENTION

We detected this issue through an alert that notified us of problems with url-shortener. For remediation and prevention, we have started initiatives to change our url-shortener service so that it no longer relies on MongoDB at this particular provider. We will either host our own MongoDB or find another solution that removes our dependency on this provider. We also plan to implement immediacy logic for specific events, so that time-sensitive notifications such as "On My Way" are not sent once their window of relevance has passed.
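
The immediacy logic described above could look roughly like the following. This is only a minimal sketch under stated assumptions, not our actual implementation: the `Notification` type, the `RELEVANCE_WINDOWS` table, the event type names other than "On My Way", and the 30-minute window are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical relevance windows per event type. Event types that are
# not listed here are treated as always relevant and are never dropped.
RELEVANCE_WINDOWS = {
    "on_my_way": timedelta(minutes=30),  # assumed value, for illustration only
}


@dataclass
class Notification:
    event_type: str
    created_at: datetime
    body: str


def is_still_relevant(notification: Notification, now: datetime) -> bool:
    """Return True if the notification should still be sent at `now`."""
    window = RELEVANCE_WINDOWS.get(notification.event_type)
    if window is None:
        return True  # no immediacy requirement for this event type
    return now - notification.created_at <= window


def split_backlog(backlog, now):
    """Split a queued backlog into notifications to send and expired ones."""
    to_send, expired = [], []
    for notification in backlog:
        if is_still_relevant(notification, now):
            to_send.append(notification)
        else:
            expired.append(notification)
    return to_send, expired
```

With a check like this in place, an "On My Way" message queued at 4:05 PM would be set aside when the backlog is processed at 12:30 AM, while notifications without a relevance window would still be delivered late rather than dropped.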

Posted Apr 12, 2019 - 11:51 EDT

Resolved
This incident has been resolved.
Posted Apr 11, 2019 - 08:19 EDT
Monitoring
Delivery of notifications has been restored to normal operation. Our team has confirmed that a small backlog of SMS & Email notifications is currently being processed and is expected to finish sending shortly.
Posted Apr 10, 2019 - 18:12 EDT
Identified
Our team has been actively working on addressing the issue affecting notification deliverability. The root problem has been identified.
Posted Apr 10, 2019 - 18:02 EDT
Investigating
An issue has been discovered affecting SMS & Email notification delivery. Our team is working on mitigating the delay and we'll provide an update shortly.
Posted Apr 10, 2019 - 16:50 EDT
This incident affected: Notifications (Email, SMS).