On January 27th, 2020, we experienced a partial outage in our production deployment that caused issues with searching, alerting, graphing, live tail, and overall usability of our product. This postmortem covers the root cause, the impact of the problem, the remediation steps that were taken, and the action items that will prevent this from happening in the future.
A group of nodes in our production environment dropped out of the cluster due to a kernel panic in the OS, which also took down portions of our ingestion and indexing pipeline. This left our infrastructure in a degraded state, and customers experienced delays in searching, live tail, alerting, and graphing, as well as overall degraded performance.
Our infrastructure team was notified immediately and began procedures to mitigate the delays while working to bring the failed nodes back into the cluster. After rebooting the nodes and verifying that pods and containers were in a healthy state, they performed a rolling restart of the Kubernetes pipeline to work through the backlog of logs that had accumulated due to the lack of resources.
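As a rough illustration of those recovery steps (a sketch, not our exact runbook), the following uses the official Kubernetes Python client to find nodes that have fallen out of the cluster and to trigger a rolling restart via the same annotation mechanism `kubectl rollout restart` uses; the deployment and namespace names are hypothetical.

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run inside the cluster
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Report nodes whose Ready condition is not True (e.g. after a kernel panic).
for node in core.list_node().items:
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    if ready.status != "True":
        print(f"node {node.metadata.name} is NotReady: {ready.reason}")

def rollout_restart(name: str, namespace: str) -> None:
    """Patch the pod template with a timestamp annotation, the same
    mechanism `kubectl rollout restart` uses, so the deployment's pods
    are replaced gradually rather than all at once."""
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt":
                            datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name, namespace, patch)

# Hypothetical deployment name and namespace for the ingestion pipeline.
rollout_restart("ingestion-pipeline", "production")
```

Rolling the pipeline this way keeps some replicas serving throughout, while fresh pods come up and begin chewing through the delayed logs.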
At this time, our team is continuing to gather data about the kernel panic in an attempt to pinpoint its root cause. They have also taken the necessary steps to make our infrastructure more resilient to nodes dropping out of the cluster, regardless of what initially causes it.
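The specific hardening changes are not detailed here, but one common measure of this kind (shown purely as an illustrative assumption, with hypothetical names) is to spread a pipeline's replicas across distinct nodes with pod anti-affinity, so that losing a group of nodes takes out only a bounded share of the replicas:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Prefer scheduling ingestion replicas onto distinct nodes so a kernel
# panic on any one group of nodes leaves most replicas running.
anti_affinity_patch = {
    "spec": {
        "template": {
            "spec": {
                "affinity": {
                    "podAntiAffinity": {
                        "preferredDuringSchedulingIgnoredDuringExecution": [{
                            "weight": 100,
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {"app": "ingestion"}
                                },
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }]
                    }
                }
            }
        }
    }
}
apps.patch_namespaced_deployment("ingestion-pipeline", "production",
                                 anti_affinity_patch)
```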