In an effort to provide better overall service, stability, and performance for customers, LogDNA scheduled an infrastructure migration onto a more robust platform on the week of September 4, 2019. During this migration effort, our engineering teams worked around the clock to prepare, test, and perform the migration to make it as seamless as possible.
The following week post-migration on Monday, September 9th, 2019, around 9 am PDT, our production infrastructure began to experience large amounts of live tail, indexing, and alerting delays. Our infrastructure team responded quickly to the automated alerts we received in an attempt to mitigate the delay. During the initial few hours of the incident, our infrastructure team identified two major causes to the delay:
As a direct result, the following services were affected:
To keep the production environment from suffering data loss and expedite the reduction in the delay, our infrastructure team paused all non-critical infrastructure and slowly re-added components as the backlog began to clear up. Engineers also updated and patched our jobs to reduce the potential of a future, similar impact. Finally, the infrastructure team added more hardware to our systems to increase capacity. In all, the worst of the delays cleared on Tuesday, September 10, 2019, at 11:30pm PDT, and the last of the non-critical infrastructure was returned to production by Friday, September 13, 2019, at 01:00am PDT. The additional hardware capacity was ready and deployed by Monday, September 17, 2019 at 4:30pm PDT.
We know that you have a lot of options for logging providers, and we understand that this incident caused a significant negative impact on our customers. Ensuring that there are no disruptions in your business operations are the highest priority for us. We will not sugar-coat this: We messed up. We have since taken the necessary actions to correct our processes to prevent such incidents from happening in the future.