Indexing Delay
Incident Report for Mezmo Status Page
Postmortem

Dates:

Start Time: Friday, February 26, 2021, at 06:43 UTC
End Time: Friday, February 26, 2021, at 20:42 UTC

What happened:

The insertion of newly submitted logs stopped entirely for all accounts for about 3 hours. During that time, logs were still available in Live Tail but not for searching, graphing, or timelines. The ingestion of logs from clients was not interrupted and no data was lost.

For more than 95% of newly submitted logs, log processing returned to normal speeds within 3 hours. All logs submitted during the 3-hour pause were available again about 30 minutes later.

For less than 5% of newly submitted logs, log processing returned to normal speeds gradually. Logs submitted during the 3-hour pause also became available gradually. This impact was limited to about 12% of accounts.

The incident was closed when logs from all time periods for all accounts were entirely available.

Why it happened:

Our service ran out of a set of resources that manage pre-sharding on the clusters that store logs. Pre-sharding is an operation that ensures new logs are promptly inserted into the clusters. This happened because several simultaneous changes to our infrastructure didn’t account for the need for more resources, particularly on clusters with a large number of shards relative to their overall storage capacity. As a result, the insertion of new logs slowed down and the backlog of unprocessed logs grew. Eventually, the portion of our service that processes new logs was unable to keep up with demand.

How we fixed it:

We restarted the portion of our service that processes newly submitted logs. During the recovery, we prioritized restoring logs submitted in the last day. 95% of accounts were fully recovered after 3.5 hours.

What we are doing to prevent it from happening again:

We’ve increased the scale of the set of resources that ensure logs are processed promptly by adding more servers for these resources to run on. We’ve also added alerting that fires when these resources approach their limit.
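
A limit-based alert of this kind can be sketched as follows. This is a minimal illustration only, not the actual monitoring in use: the PoolUsage type, the check_pool function, and the 75% and 90% thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PoolUsage:
    """Snapshot of a resource pool (hypothetical shape, for illustration only)."""
    name: str
    in_use: int
    capacity: int

    @property
    def utilization(self) -> float:
        # Treat a pool with no capacity as fully used so that it always alerts.
        if self.capacity <= 0:
            return 1.0
        return self.in_use / self.capacity


def check_pool(usage: PoolUsage, warn_at: float = 0.75, page_at: float = 0.90) -> str:
    """Return 'ok', 'warn', or 'page' based on how close the pool is to its limit.

    The thresholds are assumed values, not figures from the incident.
    """
    if usage.utilization >= page_at:
        return "page"
    if usage.utilization >= warn_at:
        return "warn"
    return "ok"
```

Using two thresholds leaves room to add capacity at the warning level, before the pool is exhausted and log insertion starts to slow down.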

Posted Mar 11, 2021 - 19:22 UTC

Resolved
We resolved the issue and all services are operational.
Posted Feb 26, 2021 - 20:42 UTC
Monitoring
We resolved the issue and the service has returned to normal. We are closely monitoring the environment at this time.
Posted Feb 26, 2021 - 12:50 UTC
Update
We are continuing to work on a fix for this issue.
Posted Feb 26, 2021 - 09:36 UTC
Update
We are continuing to work towards restoring search for recently ingested logs. At this time, users may find that searches, boards, and screens do not return results for recently ingested logs.
Posted Feb 26, 2021 - 09:12 UTC
Identified
Customers may experience delays with newly ingested logs and searching. The issue has been identified and a fix is being implemented.
Posted Feb 26, 2021 - 07:08 UTC
Investigating
We are currently investigating this issue.
Posted Feb 26, 2021 - 06:43 UTC
This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Livetail).