We began receiving customer reports this morning that they were no longer receiving alerts from LogDNA. On initial inspection, we found that our alert sending application was no longer sending alerts, although it otherwise appeared to be running and functioning normally. The good news was that a quick restart of this application immediately restored service; the bad news was that we had no idea why it had stopped working in the first place. It was time for a thorough investigation.
After digging in, we realized that the incident started just after a Redis cache server failover event. Even though our alert sending application reconnected to the new Redis cache master within seconds (as expected), it stopped receiving the events from Redis that it uses to send out alerts. To make matters worse, we have been dogfooding an entirely separate pipeline for our own account, so we continued to receive our own alerts while customers were not receiving theirs, which made it much harder for us to detect that anything was wrong. To address this monitoring blind spot, we plan to add anomaly detection, and in particular absence reporting, to our alerting repertoire.
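To illustrate what absence reporting means here, the following is a minimal sketch, not the production implementation. It flags a stream as silent when no event has been recorded within a configured window. The class name `AbsenceDetector`, the `max_silence_seconds` parameter, and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time
from typing import Callable


class AbsenceDetector:
    """Flags a stream as silent when no event arrives within a time window.

    Hypothetical helper for illustration only.
    """

    def __init__(
        self,
        max_silence_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        # Start the window at construction time so a stream that never
        # produces a single event is still reported as silent.
        self._last_seen = clock()

    def record_event(self) -> None:
        # Call this every time an alert event is received.
        self._last_seen = self._clock()

    def is_silent(self) -> bool:
        # True once the gap since the last event exceeds the allowed window.
        return self._clock() - self._last_seen > self._max_silence
```

A periodic check of `is_silent()` on the customer pipeline, rather than only on our own dogfooding pipeline, is the kind of signal that could have surfaced this incident sooner.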
One unusual attribute of our alert sending application, compared to our other applications, is that it subscribes to a pattern on our Redis cache. As it turns out, whenever a Redis cache failover event occurs, a new connection is established to the newly elected Redis cache master server. When this happens, all of the subscriptions that were attached to the old Redis connection need to be re-created on the new connection. We did not have logic in place to handle this resubscription process.
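The resubscription process described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our actual code: `FakeConnection` is an in-memory stand-in for a Redis connection, and `ResubscribingClient`, `on_reconnect`, and the `alerts:*` pattern are hypothetical names. A real client library would expose its own reconnect hook.

```python
from typing import List, Protocol, Set


class PubSubConnection(Protocol):
    """Hypothetical minimal interface for a pub/sub connection."""

    def psubscribe(self, pattern: str) -> None: ...


class FakeConnection:
    """In-memory stand-in for a Redis connection, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.patterns: Set[str] = set()

    def psubscribe(self, pattern: str) -> None:
        self.patterns.add(pattern)


class ResubscribingClient:
    """Remembers pattern subscriptions and replays them on each new connection."""

    def __init__(self, connection: PubSubConnection) -> None:
        self._connection = connection
        self._patterns: List[str] = []

    def psubscribe(self, pattern: str) -> None:
        # Record the pattern first so it survives a later reconnect.
        if pattern not in self._patterns:
            self._patterns.append(pattern)
        self._connection.psubscribe(pattern)

    def on_reconnect(self, new_connection: PubSubConnection) -> None:
        # Subscriptions belong to the old connection; the new connection
        # starts with none, so every remembered pattern is replayed here.
        self._connection = new_connection
        for pattern in self._patterns:
            new_connection.psubscribe(pattern)


if __name__ == "__main__":
    old_master = FakeConnection("old-master")
    client = ResubscribingClient(old_master)
    client.psubscribe("alerts:*")  # hypothetical pattern name

    # Simulate a failover: a new master is elected and we reconnect to it.
    new_master = FakeConnection("new-master")
    client.on_reconnect(new_master)
    print(sorted(new_master.patterns))
```

The key design choice is that the client, not the connection, owns the list of patterns, so a reconnect without resubscription cannot silently succeed.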
While we had tested Redis failover events for other applications in a testing environment, we had not specifically tested whether our alert sending application still received subscription events after a successful failover reconnect. To prevent this issue from recurring, we have added resubscription logic to our alert sending application so that, in the event of another Redis cache failover, it properly resubscribes to alert events.
Although we try to avoid outages of any kind, even partial ones, this was a valuable learning experience for us. We now understand how the failover mechanism works with high availability Redis, we are aware of the monitoring blind spot that dogfooding a separate pipeline can create, and we have mitigation strategies in place. As we move forward, we strive to improve our product and process, and we are grateful for all the helpful feedback from our customers, who have helped make us who we are today.