Alerting, Searching, Live Tail, Graphing, and Timelines Delays
Incident Report for Mezmo Status Page
Postmortem

Dates:

Start Time: Tuesday, November 23, 2021, at 16:42 UTC

End Time: Wednesday, November 24, 2021, at 17:00 UTC

Duration: 24:18:00

What happened:

Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines.  Some accounts (about 25%) were affected more than others. For all accounts, the ingestion of logs was not interrupted and no data was lost.

Why it happened:

Upon investigation, we discovered that the service which parses all incoming log lines was running far slower than normal.  This service is upstream of all our other services, such as alerting, live tail, archiving, and searching; consequently, all those services were also delayed.

We isolated the slow parsing to the specific content of certain log lines.  These log lines exposed an inefficiency in our line parsing service that caused the time needed to parse them to grow exponentially; this in turn created a bottleneck that delayed the parsing of other log lines.  The inefficiency had been present for some time, but went undetected until one account started sending a large volume of these problematic lines.
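The report does not say which parsing technique was involved. As a general illustration of how specific line content can trigger exponential parse time, the sketch below assumes a backtracking regular expression with nested quantifiers. The pattern, the input lines, and the function name `seconds_to_reject` are all hypothetical and are not taken from the parsing service.

```python
import re
import time

# Hypothetical pattern, NOT the one used by the actual parsing service.
# The nested quantifier lets a backtracking engine split a run of "a"
# characters in many different ways before it gives up.
NESTED = re.compile(r"^(a+)+$")


def seconds_to_reject(n: int) -> float:
    """Time how long the engine takes to reject the line 'a' * n + 'b'."""
    line = "a" * n + "b"  # the trailing "b" forces the match to fail
    start = time.perf_counter()
    result = NESTED.match(line)
    elapsed = time.perf_counter() - start
    assert result is None  # the line never matches; only the cost changes
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the work, so keep n small.
    for n in (10, 14, 18):
        print(f"n={n:2d}  {seconds_to_reject(n):.4f}s")
```

A parser with this kind of behavior handles ordinary lines quickly, so the problem can stay hidden until lines of the triggering shape arrive in volume.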

How we fixed it:

The line parsing service was updated to use a new algorithm that avoids the worst-case behaviors of the original and also improves line parsing performance in general.
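The report does not describe the new algorithm. As a general illustration of removing worst-case behavior, the sketch below assumes the cost came from a backtracking pattern with nested quantifiers and rewrites it as an equivalent pattern with a single quantifier. Both patterns and the helper `same_verdict` are hypothetical.

```python
import re
import time

# Both patterns are hypothetical. They accept exactly the same lines:
# one or more "a" characters and nothing else.
NESTED = re.compile(r"^(a+)+$")  # nested quantifier: many ways to split a run
FLAT = re.compile(r"^a+$")       # single quantifier: at most linear retries


def same_verdict(line: str) -> bool:
    """Return True when both patterns agree on whether the line matches."""
    return (NESTED.match(line) is None) == (FLAT.match(line) is None)


if __name__ == "__main__":
    # Short samples only: NESTED is slow on long non-matching lines.
    samples = ["a", "aaaa", "", "b", "aab", "aaaaaaaaab"]
    print("patterns agree:", all(same_verdict(s) for s in samples))

    # FLAT rejects a very long bad line without exponential blow-up.
    long_line = "a" * 200_000 + "b"
    start = time.perf_counter()
    matched = FLAT.match(long_line) is not None
    elapsed = time.perf_counter() - start
    print(f"matched={matched} in {elapsed:.4f}s")
```

The design point is that the rewrite changes only the cost of a failed match, not which lines are accepted.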

From then on, the parsing service just needed time to process the backlog of logs sent to us by customers.  Likewise, the downstream services – alerting, live tail, archiving, searching – needed time to process the logs now being sent to them by the parsing service.  The recovery was quicker for about 75% of our customers and slower for the other 25%.
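A rough way to reason about recovery time is that a backlog shrinks only at the rate by which processing outpaces new arrivals. The sketch below is a simplified model with made-up numbers; the function `drain_seconds` and every rate in it are illustrative assumptions, not figures from the incident.

```python
def drain_seconds(backlog: int, arrival_rate: float, service_rate: float) -> float:
    """Estimate seconds needed to clear a backlog of queued lines.

    backlog       lines already waiting
    arrival_rate  new lines arriving per second
    service_rate  lines processed per second
    """
    if service_rate <= arrival_rate:
        # Processing never gets ahead of arrivals, so the backlog never clears.
        return float("inf")
    return backlog / (service_rate - arrival_rate)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 queued lines, 50 arriving per second,
    # 150 processed per second.
    print(drain_seconds(1_000, 50.0, 150.0))
```

The report does not state why recovery speed differed between accounts; in this simplified model, a larger backlog or a smaller gap between the two rates would each lengthen recovery.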

What we are doing to prevent it from happening again:

The new parsing methodology has improved our overall performance significantly.  We are also actively pursuing further optimizations.

Posted Nov 30, 2021 - 20:45 UTC

Resolved
This incident has been resolved. All services are fully operational.
Posted Nov 24, 2021 - 17:00 UTC
Update
Our services are recovering and there may be some delays in Alerting, Searching, Live Tail, Graphing, and Timelines. We are monitoring.
Posted Nov 24, 2021 - 06:57 UTC
Update
Delays are still being experienced by some customers. We continue to work towards a solution.
Posted Nov 23, 2021 - 20:03 UTC
Investigating
Some customers are experiencing delays in Alerting, Searching, Live Tail, Graphing, and Timelines. We are investigating and working to mitigate the issue.
Posted Nov 23, 2021 - 16:42 UTC
This incident affected: Log Analysis (Search, Alerting, Livetail).