Ingestion Delay
Incident Report for Mezmo Status Page
Postmortem

In an effort to provide better overall service, stability, and performance for customers, LogDNA scheduled an infrastructure migration onto a more robust platform during the week of September 4, 2019. During this effort, our engineering teams worked around the clock to prepare, test, and perform the migration to make it as seamless as possible.

The following week, post-migration, on Monday, September 9, 2019, around 9 am PDT, our production infrastructure began to experience significant live tail, indexing, and alerting delays. Our infrastructure team responded quickly to the automated alerts we received in an attempt to mitigate the delay. During the initial few hours of the incident, our infrastructure team identified two major causes of the delay:

  1. A massive spike in overall traffic to the cluster, to around four times the normal rate during business hours
  2. A lack of on-demand resource availability to handle the large spike in traffic

As a direct result, the following services were affected:

  1. Ingestion: The cluster was ingesting logs, but users were unable to search them or view them in live tail. Our system has built-in mechanisms that queue incoming batches of logs; with the large spike, that queue backed up, causing a delay in logs appearing in the cluster. To the end user, this behavior manifests as logs appearing to be dropped, when in fact the logs are arriving in the cluster but are not yet viewable in live tail or searchable (see the sketch after this list).
  2. Alerting: Our alerting services run off our ingestion services. The alerting service continuously checks the queue to determine whether logs are coming in (presence or absence alerting) and triggers based on that criterion. Due to the delay in the ingestion of log batches, alerts misfired or failed to fire until the delays were resolved.
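
To make the two failure modes above concrete, here is a minimal sketch of a queue-backed ingestion path and a presence/absence check, written in Python. It is illustrative only: the names, structure, and threshold are assumptions for the sketch and do not describe LogDNA's actual implementation.

    # Hypothetical sketch, not LogDNA's actual code: a queue-backed ingestion
    # path with a presence/absence check that reads the indexed (drained) side.
    # When the queue backs up, fresh logs sit unindexed, so they are not yet
    # searchable, and the absence check can fire even though nothing was dropped.
    import time
    from collections import deque

    ingest_queue = deque()      # incoming batches wait here before indexing
    newest_indexed_ts = 0.0     # timestamp of the newest batch made searchable

    def ingest(batch_ts):
        """Accept a batch immediately; indexing happens later."""
        ingest_queue.append(batch_ts)   # nothing is dropped, only delayed

    def index_one():
        """Drain one batch into the search index / live tail.
        During a traffic spike this falls behind the ingest rate."""
        global newest_indexed_ts
        if ingest_queue:
            newest_indexed_ts = ingest_queue.popleft()  # ...write to index...

    def absence_alert(threshold_s=300):
        """Fires when no indexed logs are newer than the threshold.
        With a deep backlog this misfires: logs exist, but only in the queue."""
        return (time.time() - newest_indexed_ts) > threshold_s
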

To keep the production environment from suffering data loss and to clear the delay more quickly, our infrastructure team paused all non-critical infrastructure and slowly re-added components as the backlog began to clear. Engineers also updated and patched our jobs to reduce the potential for a similar impact in the future. Finally, the infrastructure team added more hardware to our systems to increase capacity. In all, the worst of the delays cleared on Tuesday, September 10, 2019, at 11:30 pm PDT, and the last of the non-critical infrastructure was returned to production by Friday, September 13, 2019, at 1:00 am PDT. The additional hardware capacity was ready and deployed by Tuesday, September 17, 2019, at 4:30 pm PDT.
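
For readers unfamiliar with the "pause non-critical work, then re-add it as the backlog clears" pattern described above, the sketch below shows one way such a loop might look. The function names, the backlog_size callable, and the threshold are hypothetical; this is not a description of LogDNA's tooling.

    # Hypothetical illustration of re-adding paused, non-critical components
    # only while the ingestion backlog stays below a safe depth.
    def reenable_components(paused, backlog_size, safe_backlog=10_000):
        """Re-add paused, non-critical components one at a time, but only
        while the ingestion backlog stays below a safe threshold."""
        for component in list(paused):
            if backlog_size() > safe_backlog:
                break                  # backlog still too deep; stay paused
            component.resume()         # bring one component back online
            paused.remove(component)
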

We know that you have a lot of options for logging providers, and we understand that this incident caused a significant negative impact on our customers. Ensuring that there are no disruptions to your business operations is our highest priority. We will not sugar-coat this: we messed up. We have since taken the necessary actions to correct our processes and prevent such incidents from happening in the future.

Posted Sep 19, 2019 - 18:48 UTC

Resolved
The incident has been resolved.
Posted Sep 13, 2019 - 14:49 UTC
Update
We are still closely monitoring our infrastructure. Service has been restored and the indexing delay is gone. Currently, filters are still not being updated; we expect to resolve the filter issue shortly.
Posted Sep 12, 2019 - 17:54 UTC
Monitoring
Our infra team has implemented a fix to restore service. Live tail, indexing, and alerting delays are resolved. We are closely monitoring the state of the infrastructure at this time to prevent any further delays. Currently, filters within the top menu might be delayed.
Posted Sep 11, 2019 - 06:42 UTC
Update
Unexpected traffic, along with scaling issues in handling the increased traffic, caused delays with indexing, alerting, and live tail. The start time was around 17:00 UTC on September 9, 2019. All customers are experiencing these delays. Our infrastructure team and all other stakeholders are working diligently to resolve the scaling issue as soon as possible and return the application to being fully operational.

The issue has been identified and this status will be updated soon with more information as our teams work on the incident.
Posted Sep 11, 2019 - 00:53 UTC
Update
We are continuing to work through the ingestion issues. It is likely that you are not getting alert notifications at this time.
Posted Sep 10, 2019 - 22:08 UTC
Identified
The issue has been identified. Indexing and live tail appear to be slightly delayed; alerting is still delayed. We will move this incident to monitoring as soon as the delay has cleared.
Posted Sep 10, 2019 - 14:41 UTC
Investigating
We are currently experiencing a delay in ingestion, including live tail, alerting, and indexing. Logs are being ingested but could be delayed by an hour or more. Our infra team is actively working on this issue.
Posted Sep 10, 2019 - 05:50 UTC
This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog)).