Indexing lag due to multiple node failures
Incident Report for Mezmo Status Page
Postmortem

On January 27th, 2020, we experienced a partial outage in our production deployment that caused issues with searching, alerting, graphing, live tail, and overall usability of our product. This postmortem addresses the root cause, the impact of the problem, the remediation steps that were taken, and the action items to prevent this from happening in the future.

A group of nodes within our production environment dropped out of our cluster due to a kernel panic in the OS, which also took down portions of our ingestion and indexing pipeline. This left our infrastructure in a degraded state, causing customers to experience delays in searching, live tail, alerting, and graphing, as well as overall degraded performance.

Our infrastructure team was notified immediately and began procedures to mitigate the delay while attempting to bring up the nodes that had fallen out of the cluster. After rebooting the nodes and verifying that pods/containers were in a healthy state, they began to roll the Kubernetes pipeline in order to work through the backlog of logs that had been delayed by the lack of resources.
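
For context on what "rolling" the pipeline means here, the sketch below shows the programmatic equivalent of a rolling restart using the official Kubernetes Python client, which replaces a Deployment's pods one at a time. The deployment and namespace names are illustrative assumptions, not the actual names of our workloads.

from datetime import datetime, timezone

from kubernetes import client, config

def rolling_restart(name: str, namespace: str) -> None:
    # Patching the pod template annotation prompts Kubernetes to replace the
    # Deployment's pods one by one (the same mechanism used by
    # `kubectl rollout restart`).
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt":
                            datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=name, namespace=namespace, body=patch)

# Hypothetical names for illustration; the report does not name the actual workloads.
rolling_restart("indexing-pipeline", namespace="production")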

At this time, our team is continuing to gather data around the kernel panic in an attempt to better understand its root cause. They have also taken the necessary steps to make our infrastructure more resilient to nodes dropping out of the cluster, regardless of the initial cause.

Posted Feb 04, 2020 - 23:02 UTC

Resolved
The incident has been resolved. We will provide a postmortem within 7 business days regarding our findings and, if required, action items to prevent this from happening again.
Posted Jan 28, 2020 - 20:36 UTC
Monitoring
Services have stabilized and we will continue to investigate the cause of the node failures.
Posted Jan 28, 2020 - 03:02 UTC
Update
We are continuing to investigate this issue.
Posted Jan 28, 2020 - 01:54 UTC
Update
Indexing delays have been eliminated and the service is current. The team is continuing to monitor the incident.
Posted Jan 28, 2020 - 01:54 UTC
Update
We are continuing to investigate this issue. The current indexing delay is about 50 minutes.
Posted Jan 28, 2020 - 00:56 UTC
Investigating
The indexing delay is currently fluctuating. As before, the delay will result in delayed search results for a small subset of customers. Our infra team is continuing to triage and investigate the issue. We will provide you with an update within the hour.
Posted Jan 28, 2020 - 00:45 UTC
Monitoring
All services have returned to normal.
Posted Jan 27, 2020 - 22:37 UTC
Update
Indexing delays have been eliminated and the service is current. The team is continuing to investigate the root cause of node failures.
Posted Jan 27, 2020 - 22:28 UTC
Update
LogDNA teams are continuing to investigate the root cause.
Posted Jan 27, 2020 - 21:10 UTC
Investigating
We are currently experiencing indexing lag due to multiple node failures. Our infra team is investigating and taking steps to mitigate the indexing lag. Customers will experience delays when searching for data ingested in the past few hours. We will update this message within the hour with more information.
Posted Jan 27, 2020 - 20:10 UTC
This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Search).