Investigation on slow ingestions

Incident Report for Mezmo Status Page

Postmortem

Dates:

The incident was opened on October 14, 2020 - 20:55 UTC.
The impact was largely resolved by October 16, 2020 - 23:00 UTC.
We monitored usage until the incident was closed, on October 19, 2020 - 23:35 UTC.

What happened:

Attempts to send new logs to our service timed out approximately 3% to 5% of the time. This resulted in intermittent failures to ingest logs from agents, code libraries, and REST API calls.

Most customers use our agents, which resend logs that fail to be ingested. Customers using other means to submit logs had to use their own retry methods.
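For customers who submit logs without an agent, a client-side retry could look roughly like the sketch below. It is only an illustration of retrying with exponential backoff and jitter, not the logic our agents use. The function name, its parameters, and the `send` callable (which would wrap your own REST call or library call) are all assumptions made for the example.

```python
import random
import time
from typing import Callable


def send_with_retry(
    send: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `send` until it succeeds or the attempts run out.

    `send` is a hypothetical callable supplied by the caller. It should
    return True on success, and either return False or raise
    TimeoutError / ConnectionError when an attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            if send():
                return True
        except (TimeoutError, ConnectionError):
            pass  # count this as a failed attempt and fall through to retry
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return False
```

If failures were independent at a 5% rate, three consecutive failed attempts would occur about 0.0125% of the time (0.05 cubed). That independence is an assumption, so treat the figure as a rough guide only.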

Why it happened:

A node was added to our service to handle an increased need for resources. This node had previously been cordoned off because of networking failures. When it became operational, our load balancers directed a percentage of ingestion calls to it. Those calls, which amounted to about 3% to 5% of the total, would fail and eventually time out.

How we fixed it:

Monitoring revealed the failures were particular to pods running on this node. We found other means to handle the increased need for resources, then stopped the pods running on the problematic node and cordoned it off again. The rate of timeouts returned to normal levels and ingestion proceeded normally.

What we are doing to prevent it from happening again:

We’re improving how we identify nodes that have been cordoned off because of problematic behavior and should not be reintroduced to our service.
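As a purely illustrative sketch of that idea, and not a description of our actual tooling, the check below assumes a Kubernetes-style cluster in which each cordoned node carries an annotation recording why it was cordoned. The annotation key, the reason values, and the function name are hypothetical. The node dictionaries mirror the shape of node objects as returned in JSON (`spec.unschedulable`, `metadata.annotations`).

```python
from typing import Any, Dict, List

# Hypothetical annotation key; the report does not name one.
CORDON_REASON_KEY = "example.com/cordon-reason"
# Hypothetical reasons that should block reintroducing a node.
BLOCKING_REASONS = {"network-failure", "hardware-fault"}


def nodes_safe_to_uncordon(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return names of cordoned nodes whose recorded reason is not blocking.

    A cordoned node with no recorded reason is treated as unsafe, so an
    unexplained cordon is never silently undone.
    """
    safe = []
    for node in nodes:
        spec = node.get("spec") or {}
        if not spec.get("unschedulable", False):
            continue  # node is not cordoned, nothing to decide
        meta = node.get("metadata") or {}
        annotations = meta.get("annotations") or {}
        reason = annotations.get(CORDON_REASON_KEY)
        if reason is not None and reason not in BLOCKING_REASONS:
            safe.append(meta.get("name", "<unnamed>"))
    return safe
```

The design choice here is to fail closed: only nodes with an explicit, non-blocking reason are offered for reuse.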

Posted Oct 27, 2020 - 23:36 UTC

Resolved

This incident has been resolved. Logs are being ingested normally. All services are operational.

Posted Oct 19, 2020 - 23:35 UTC

Monitoring

A fix has been implemented and we are monitoring the results at this time.

Posted Oct 19, 2020 - 16:15 UTC

Identified

A networking issue is impacting log ingestion via API. Our team is working on a fix.

Posted Oct 15, 2020 - 01:33 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 14, 2020 - 21:32 UTC

Investigating

We are currently investigating an issue with ingestion timeouts and intermittent slow ingestion.

Posted Oct 14, 2020 - 20:55 UTC

This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku)).