Alerts to Slack

Incident Report for Mezmo Status Page

Postmortem

Discovery

This morning, customers began reporting that they were no longer receiving alerts from LogDNA. On initial inspection, we found that our alert sending application had stopped sending alerts, even though it otherwise appeared to be running and functioning normally. The good news was that a quick restart of the application immediately restored service. The bad news was that we had no idea why it had stopped working in the first place. It was time for a thorough investigation.

Investigation

After digging in, we realized that this incident started just after a Redis cache server failover event. Even though our alert sending application reconnected to the new Redis cache master within seconds (as expected), it stopped receiving the events from Redis that trigger outgoing alerts. To make matters worse, we have been dogfooding an entirely separate pipeline for our own account, so we continued to receive our own alerts even though customers were not receiving theirs. This made it much harder for us to detect that anything was wrong. To address this monitoring blind spot, we plan to add anomaly detection, and in particular absence reporting, to our alerting repertoire.
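Absence reporting of the kind mentioned above can be sketched roughly as follows. This is a minimal illustration in Python, not our actual monitoring code: the `AbsenceDetector` class, its method names, and the silence threshold are all hypothetical. The idea is simply to raise a flag when no alert has been sent for longer than expected.

```python
import time
from typing import Optional


class AbsenceDetector:
    """Flags when no event has been recorded for too long (hypothetical sketch)."""

    def __init__(self, max_silence_seconds: float, now: Optional[float] = None) -> None:
        # Longest gap between events that is still considered healthy.
        self.max_silence_seconds = max_silence_seconds
        # Start the clock at construction so a pipeline that never sends is caught too.
        self._last_seen = time.monotonic() if now is None else now

    def record_event(self, now: Optional[float] = None) -> None:
        """Call this each time an alert is actually sent."""
        self._last_seen = time.monotonic() if now is None else now

    def is_silent(self, now: Optional[float] = None) -> bool:
        """Return True if the gap since the last event exceeds the threshold."""
        current = time.monotonic() if now is None else now
        return current - self._last_seen > self.max_silence_seconds
```

A periodic check would call `is_silent()` and notify an operator when it returns `True`. The `now` parameter exists only so the clock can be supplied explicitly in tests.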

Analysis

One attribute that sets our alert sending application apart from our other applications is that it subscribes to a pattern on our Redis cache. As it turns out, whenever a Redis cache failover event occurs, a new connection is established to the newly elected Redis cache master server. Subscriptions belong to the connection they were created on, so when this happens, every subscription that was attached to the old Redis connection needs to be re-created on the new connection. We did not have logic in place to handle this resubscription process.
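The resubscription requirement described above can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not our production code: `FakeConnection` stands in for a real pub/sub connection, and `ResubscribingClient`, `on_failover`, and the `alerts:*` pattern in the usage note are hypothetical names. The client remembers every pattern it has subscribed to and replays them whenever a new connection replaces the old one.

```python
from typing import Callable, Dict, List

# Handler signature is an assumption: (channel, message) -> None.
Handler = Callable[[str, str], None]


class FakeConnection:
    """Stand-in for a pub/sub connection; subscriptions live on the connection."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.patterns: List[str] = []

    def psubscribe(self, pattern: str) -> None:
        self.patterns.append(pattern)


class ResubscribingClient:
    """Tracks pattern subscriptions so they can be replayed after a failover."""

    def __init__(self, connect: Callable[[], FakeConnection]) -> None:
        self._connect = connect
        self._handlers: Dict[str, Handler] = {}
        self.conn = connect()

    def psubscribe(self, pattern: str, handler: Handler) -> None:
        # Remember the subscription locally, then register it on the live connection.
        self._handlers[pattern] = handler
        self.conn.psubscribe(pattern)

    def on_failover(self) -> None:
        # A failover yields a brand-new connection with no subscriptions,
        # so every remembered pattern has to be registered again.
        self.conn = self._connect()
        for pattern in self._handlers:
            self.conn.psubscribe(pattern)
```

Without the loop in `on_failover`, the client would reconnect successfully and look healthy while receiving nothing, which matches the behavior we observed. For example, after `client.psubscribe("alerts:*", handler)` and a later `client.on_failover()`, the new connection is subscribed to `alerts:*` again.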

Solution

While we have tested Redis failover events for other applications in a testing environment, we had not specifically tested whether our alert sending application still receives subscription events after a successful failover reconnect. To prevent this issue from happening again, we have added resubscription logic to our alert sending application, so that after another Redis cache failover it will properly resubscribe to alert events.

Reflection

Although we try to avoid outages of any kind, even partial ones, this was a valuable learning experience for us. We now understand how the failover mechanism works with high availability Redis, we are aware of the monitoring blind spot that dogfooding a separate pipeline can create, and we have mitigation strategies in place for both. As we move forward, we strive to improve our product and process, and we are grateful for all the helpful feedback from our customers, which has helped make us who we are today.

Posted Jul 03, 2018 - 22:06 UTC

Resolved

This incident has been resolved.

Posted Jul 03, 2018 - 22:01 UTC

Update

The underlying mechanism of the issue has been identified and a patch has been deployed. Read more about what happened in our postmortem.

Posted Jul 03, 2018 - 22:01 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 03, 2018 - 15:32 UTC

Identified

We have identified the issue with our Alert tool and are currently working to get it back to normal operations.

Posted Jul 03, 2018 - 15:19 UTC

Investigating

We are currently experiencing an issue with Alerts to Slack. We are investigating the issue and will provide more details as they become available.

Posted Jul 03, 2018 - 14:58 UTC