On Wednesday, April 11, 2019, Dispatch accumulated a backlog of notifications from 4:00 PM until 12:30 AM ET. We began processing the backlog at 12:30 AM ET, and all messages were processed by 1:45 AM ET. Unfortunately, this led to customers receiving notifications late.
At 4:00 PM ET, we received an alert that our url-shortener service was having issues. This service creates the shortened URLs that are included in our notifications to customers. It relies on MongoDB, which is hosted through a third-party provider. The provider was performing planned maintenance, and all of its backup servers failed as well. As a result, our url-shortener service could not connect to either the primary or the backup servers that the provider had allotted to us. We opened multiple support tickets to learn more about the root cause, but we did not receive a response from the provider until two hours after the incident began. All notifications that were supposed to be sent between 4:00 PM and 12:30 AM ET were therefore queued. A very small portion of them still went out during this period because we had an intermittent connection to MongoDB. At 12:30 AM ET, the provider reported that the remaining cluster hosts had been restored and were back online, and we processed all of the backed-up notifications by 1:45 AM ET. No notifications were lost during this process.
We detected this issue through an alert on the url-shortener service. For remediation and prevention, we have started initiatives to change our url-shortener service so that it no longer relies on MongoDB at this particular provider. We will either host our own MongoDB or adopt a solution that removes our dependency on this provider. We also plan to implement immediacy logic for specific events, so that notifications such as "On My Way" are not sent once their window of time is no longer relevant.
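As an illustration of the immediacy logic described above, the following is a minimal sketch, not our actual implementation. It assumes each queued notification carries an event type and a creation timestamp. The names `Notification`, `RELEVANCE_WINDOWS`, and `should_send`, as well as the 30-minute window, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical relevance windows per event type. The value is illustrative only.
# Event types not listed here are treated as always relevant.
RELEVANCE_WINDOWS = {
    "on_my_way": timedelta(minutes=30),
}


@dataclass
class Notification:
    event_type: str
    created_at: datetime  # timezone-aware time the notification was queued


def should_send(notification: Notification, now: Optional[datetime] = None) -> bool:
    """Return False if a time-sensitive notification has outlived its window."""
    if now is None:
        now = datetime.now(timezone.utc)
    window = RELEVANCE_WINDOWS.get(notification.event_type)
    if window is None:
        # No window configured: deliver regardless of how long it was queued.
        return True
    return now - notification.created_at <= window
```

With a check like this applied when the queue is drained, an "On My Way" notification that sat in the backlog for several hours would be skipped, while notifications without a configured window would still be delivered.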