Date: January 17, 2024
What Happened:
A subset of customers on Higher Logic Marketing Enterprise (Real Magnet) experienced a disruption in
their ability to send emails. The disruption began at about 1:07 PM on Tuesday and persisted until 9:21 AM on
Wednesday. The problem stemmed from a server issue preventing this set of customers from sending
emails from our platform.
Timeline (All times in EST):
1/16/2024
1:07 PM – Outbound mail server application (MTA) entered a faulted state and automatically
restarted. The application remained in an undetected, faulted state after restarting.
3:05 PM – Approximate time of first customer report of issue received.
3:11 PM – Issue escalated to Application Engineering team; technical investigation started.
5:36 PM – Issue escalated to Development team; investigation continued.
6:20 PM – Investigation indicated that email was being delivered; consequently, the issue was
escalated to a non-on-call resource.
1/17/2024
7:55 AM – Investigation resumed.
9:06 AM – Issue escalated to Platform team.
9:13 AM – Impacted server restarted.
9:21 AM – Inbound and outbound email traffic returned to normal; queued outbound mail was
being delivered.
Root Cause:
The MTA (sending software) application on one mail server crashed and restarted in an unstable state
on Tuesday at around 1:07 PM EST.
Details:
The sending service was restored by restarting the mail server. Once this server was rebooted, we
immediately began seeing outbound mail from that server for the impacted clients.
Troubleshooting on January 16 indicated that email was being delivered, but this was true only for the
healthy servers. The mail servers operate in parallel, each with its own queue: email routed to the
other servers continued to be delivered, while email queued on the faulted server could not be
delivered until the fault on that server was corrected. Because delivery appeared to be working, the
problem was misdiagnosed, which delayed escalation and resolution.
The faulted state had not been observed previously. Monitoring and alerting should have detected the
partially failed state and reported the condition more clearly to technical staff.
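The gap described above can be illustrated with a minimal sketch of a per-server check. It is not the
platform's actual monitoring: the server names, sample fields, and numbers are hypothetical and chosen
only to show how a fleet-wide "email is being delivered" signal can hide one server that is running
but not draining its queue.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One monitoring sample for a single mail server (hypothetical fields)."""
    name: str
    process_running: bool       # the MTA process is up
    queued: int                 # messages waiting in this server's outbound queue
    delivered_last_window: int  # messages this server delivered in the sample window


def fleet_is_delivering(samples):
    """Aggregate check: true if any server delivered mail in the window."""
    return sum(s.delivered_last_window for s in samples) > 0


def stalled_servers(samples, min_queue=1):
    """Return names of servers that hold queued mail but delivered nothing.

    A running process is deliberately not treated as proof of health.
    """
    return [
        s.name
        for s in samples
        if s.queued >= min_queue and s.delivered_last_window == 0
    ]


# Hypothetical samples: every process is running, but one server is not sending.
samples = [
    ServerSample("mta-1", True, 12, 480),
    ServerSample("mta-2", True, 9500, 0),  # restarted, but not delivering
    ServerSample("mta-3", True, 7, 510),
]

print(fleet_is_delivering(samples))  # True
print(stalled_servers(samples))      # ['mta-2']
```

In this sketch the aggregate check passes while the per-server check flags the one stalled server,
which mirrors the partially failed state that went undetected on January 16.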
Corrective Actions: