Search performance degraded.

Incident Report for Mezmo Status Page

Postmortem

Start Time: April 28, 2021 at 12:37 UTC
End Time: May 3, 2021 at 00:54 UTC

What happened:

Newly submitted log lines from all customers were significantly delayed before being available in our WebUI for searching, graphing, and timelines. Alerting, Live Tail, and the uploading of archives to their destinations were significantly delayed as well. The incident was opened on April 28, 12:37 UTC.

Typical mitigation steps were taken but were unsuccessful. Live Tail and alerting, which were also significantly degraded, were halted about 14 hours after the start of the incident. This step was taken to keep other services, such as Search, functioning and to give more resources to processing log lines. Logs submitted before the incident continued to be searchable.

By May 1, 19:17 UTC, about 99% of newly submitted logs were again available in our WebUI at normal rates. Other essential services needed more time and manual intervention to recover. The incident was closed on May 3, 00:54 UTC.

Ingestion of new log lines from clients continued normally throughout the incident.

Why it happened:

We deployed an updated version of our proprietary messaging bus / parsing pipeline. This version had been tested in staging and multiple production regions beforehand and worked as expected. It was deployed and worked normally in production for four days. The cumulative traffic to our service over those four days revealed a performance issue that affected the processing of new log lines: logs were processed, but at a very slow rate. We’ve identified the cause of the slow performance as an update to Node.js (version 14) that was part of the new version of our messaging bus.

How we fixed it:

Once the source of the failure had been identified, we reverted our messaging bus to its last stable version, which kept the delays in processing from degrading further. Our services still needed to process logs ingested up to that point, which required time, manual intervention, and more resources. We temporarily increased the number of servers dedicated to processing logs by about 60%. We also halted Live Tail and alerting, which were degraded almost to the point of being non-functional.
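To illustrate why the rollback alone could not clear the delay, here is a minimal sketch of the backlog arithmetic. All figures are hypothetical, not measurements from this incident, and the sketch assumes that processing capacity scales linearly with the number of servers. The function name `hours_to_drain` is ours, for illustration only.

```python
from typing import Optional


def hours_to_drain(backlog: float, ingest_rate: float, process_rate: float) -> Optional[float]:
    """Return the hours needed to clear the backlog, or None if it never clears.

    The backlog only shrinks when logs are processed faster than they arrive.
    """
    net_rate = process_rate - ingest_rate
    if net_rate <= 0:
        return None  # backlog holds steady or keeps growing
    return backlog / net_rate


# Hypothetical figures (arbitrary units of log volume per hour).
INGEST_RATE = 100.0        # new logs arriving; ingestion continued normally
BASELINE_CAPACITY = 110.0  # processing capacity of the stable version
BACKLOG = 2400.0           # unprocessed volume accumulated before the rollback

# While processing is slower than ingestion, the backlog never drains.
degraded = hours_to_drain(BACKLOG, INGEST_RATE, 0.5 * BASELINE_CAPACITY)

# After the rollback, only the small headroom above ingestion drains the backlog.
baseline = hours_to_drain(BACKLOG, INGEST_RATE, BASELINE_CAPACITY)

# With about 60% more servers (assumed linear scaling), the headroom is far larger.
scaled = hours_to_drain(BACKLOG, INGEST_RATE, 1.6 * BASELINE_CAPACITY)

print(degraded, baseline, round(scaled, 1))
```

With these made-up numbers, normal capacity leaves little headroom above the incoming rate, so added servers shorten the recovery far more than their share of total capacity would suggest.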

Through the combination of these efforts, all logs were eventually processed and our service was again fully operational.

What we are doing to prevent it from happening again:

During the incident, the new version of our messaging bus was reverted to its previous version. The version in production today does not contain the upgrade to Node.js 14, which caused the performance degradation. We’ve excluded Node.js 14 from future upgrades until we’ve had time to carefully examine its performance issues.

Posted May 04, 2021 - 18:20 UTC

Resolved

Alerting and Live Tail are working normally and newly submitted logs are available in the UI. Archives are currently being uploaded to their destinations. All services are fully operational.

Posted May 03, 2021 - 00:54 UTC

Monitoring

New logs are being made available in the UI at normal rates. Live Tail and Alerting have been re-enabled and returned to normal operation. We are working on processing the backlog of archives and monitoring the system.

Posted May 02, 2021 - 22:20 UTC

Update

New logs are appearing in the UI quickly. Live Tail and alerting continue to be halted. Archived logs may appear with delays for some customers.

Posted May 02, 2021 - 12:58 UTC

Update

New logs are appearing in the UI much more quickly than before. Live Tail and alerting are still halted while we are backfilling logs. Archived logs may appear with delays for some customers.

Posted May 02, 2021 - 05:57 UTC

Update

New logs are now being made available quickly in our UI, with a small number of exceptions (~1%). Live Tail and alerting continue to be halted. We are working on sending archives to their proper destinations.

Posted May 01, 2021 - 19:17 UTC

Update

New logs are still appearing in the UI with delays. Live Tail and alerting continue to be halted while we are backfilling logs. Some customers may experience delays in excess of 72 hours for archived logs to appear in their archive destination.

Posted May 01, 2021 - 17:53 UTC

Update

New logs are still appearing in the UI with delays for some customers. Live Tail and alerting have been halted while we are backfilling logs.

Posted May 01, 2021 - 05:50 UTC

Update

Our investigations suggest this incident began with an update to our messaging bus / parsing pipeline. We've successfully rolled back to an earlier version and the process of backfilling logs for real-time search has begun. To speed up this process, we are continuing to keep Live Tail and alerting halted. Newly submitted logs are appearing, but still with significant delays. Log ingestion continues to work normally and no log lines have been lost.

Posted Apr 30, 2021 - 21:12 UTC

Update

Live Tail and alerting remain stopped as we resolve this incident. New logs are still appearing in the UI but with a significant delay for all customers.

Posted Apr 30, 2021 - 12:38 UTC

Update

New logs are still appearing in the UI with delays for some customers. Live Tail and alerting have been stopped as we continue to investigate.

Posted Apr 30, 2021 - 05:39 UTC

Update

As we continue working on this issue, we have stopped Live Tail and alerting. New logs are still appearing in the UI, but with delays.

Posted Apr 29, 2021 - 18:42 UTC

Update

We continue to work on restoring the service to full operation. Users will continue to experience delays when searching and when using views, boards and screens. Some users may not see Live Tail, receive new alerts, or see new logs.

Posted Apr 29, 2021 - 09:01 UTC

Update

As we continue working on this issue, we have temporarily scaled down Live Tail and alerting. Some customers may not see Live Tail or receive new alerts. New logs still appear in the UI with delays.

Posted Apr 29, 2021 - 02:43 UTC

Update

We continue to work on restoring the search service to full functionality. All users will continue to see delays of greater than 30 minutes when searching, using views, boards and screens.

Posted Apr 28, 2021 - 22:31 UTC

Update

We are still completing the aforementioned procedure, which continues to affect a small percentage of customers. All users will continue to see delays of greater than 30 minutes when searching, using views, boards and screens.

Posted Apr 28, 2021 - 19:24 UTC

Update

We’ll shortly perform a procedure that will cause all logs to be unavailable to a small percentage of customers for a short period of time. All users will continue to see delays of greater than 30 minutes when searching, using views, boards and screens.

Posted Apr 28, 2021 - 16:36 UTC

Update

We continue to work on restoring the service to full operation. At this time, users will continue to experience delays of greater than 30 minutes when searching views, boards and screens.

Posted Apr 28, 2021 - 15:00 UTC

Identified

We have identified the issue and are working towards restoring the service. At this time, users will continue to experience delays of greater than 30 minutes when searching views, boards and screens.

Posted Apr 28, 2021 - 13:39 UTC

Investigating

We are currently experiencing delays with search. Users will experience delays of greater than 30 minutes when searching views, boards and screens.

Posted Apr 28, 2021 - 12:37 UTC

This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Search, Alerting, Livetail, Archiving).