Higher Logic Thrive Marketing Enterprise (Real Magnet) - Sending Delays
Incident Report for Higher Logic Platform
Postmortem

Date: January 17, 2024

What Happened:
A subset of customers on Higher Logic Marketing Enterprise (Real Magnet) experienced a disruption in
their ability to send emails. The disruption began late Tuesday morning and persisted until 9:21 AM on
Wednesday. The problem stemmed from a server issue that prevented this set of customers from sending
emails from our platform.

Timeline (All times in EST):
1/16/2024

1:07 PM – Outbound mail server application (MTA) entered a faulted state and automatically
restarted. The application remained in an undetected, faulted state after restarting.
3:05 PM – Approximate time of first customer report of issue received.
3:11 PM – Issue escalated to Application Engineering team; technical investigation started.
5:36 PM – Issue escalated to Development team; investigation continued.
6:20 PM – Investigation indicated that email was being delivered; consequently, the issue was
escalated to a non-on-call resource.

1/17/2024

7:55 AM – Investigation resumed.
9:06 AM – Issue escalated to Platform team.
9:13 AM – Impacted server restarted.
9:21 AM – Inbound and outbound email traffic returned to normal; outbound queued traffic
being delivered.

Root Cause:
The MTA (sending software) application on one mail server crashed and restarted in an unstable state
on Tuesday at approximately 1:07 PM EST.

Details:
The sending service was restored by restarting the mail server. Once this server was rebooted, we
immediately began seeing outbound mail from that server for the impacted clients.

Troubleshooting on January 16 indicated that email was being delivered. However, the email servers
operate in parallel: email routed to the other servers was being delivered, while email queued on the
faulted server could not be delivered until the fault on that server was corrected. As a result, email for
the impacted customers was queued but not delivered while that single server remained in a faulted state.
The incorrect diagnosis delayed the escalation and resolution of the problem.

The faulted state was one that had not been previously observed. Monitoring and alerting should have
detected the partial failed state and more clearly reported the condition to technical staff.
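
The difference between a fleet-wide delivery check and a per-server check can be illustrated with a
minimal sketch. This is not our monitoring code: the server names, counters, and thresholds below are
hypothetical and the numbers are illustrative, not data from this incident. It only shows why "email is
being delivered" can be true overall while one server is delivering nothing.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server counters sampled over one monitoring window."""
    name: str
    queued: int      # messages waiting in the outbound queue at the end of the window
    delivered: int   # messages delivered during the window


def fleet_is_delivering(stats):
    """Aggregate check: true if any server delivered mail in the window."""
    return sum(s.delivered for s in stats) > 0


def stalled_servers(stats, min_queued=1):
    """Per-server check: servers that hold queued mail but delivered none."""
    return [s.name for s in stats if s.queued >= min_queued and s.delivered == 0]


# Illustrative window: one server is queuing mail without delivering it.
window = [
    ServerStats("mta-1", queued=12, delivered=4800),
    ServerStats("mta-2", queued=9500, delivered=0),  # assumed faulted server
    ServerStats("mta-3", queued=20, delivered=5100),
]

print(fleet_is_delivering(window))  # True
print(stalled_servers(window))      # ['mta-2']
```

In this sketch the aggregate check passes because the healthy servers are delivering, while the
per-server check flags the one server whose queue is growing with no deliveries.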

Corrective Actions:

  • Working with MTA software vendor to determine the cause of the fault and any remediation
    necessary to prevent future faults.
  • Further investigation and training on email delivery via the MTA to better understand the
    current condition of our email delivery.
  • Improve monitoring/alerting on the server process to better report email delivery failures.
  • Simplify notification/escalation process and educate staff with clear guidelines to facilitate
    after-hours escalations.
Posted Jan 19, 2024 - 15:32 EST

Resolved
We've been monitoring for the last few hours and all is going well. We are marking this issue resolved.

We plan to have a root cause analysis (RCA) ready for distribution in the next 3 business days and will post it here.
Posted Jan 17, 2024 - 14:58 EST
Monitoring
Our Engineering team identified and corrected the issue and we see messages are starting to catch up.

We'll continue to monitor this throughout the day before resolving.

We plan to have a root cause analysis (RCA) ready for distribution in the next 3 business days.
Posted Jan 17, 2024 - 09:56 EST
Investigating
We are experiencing message sending delays for all customers. Our Engineering team is investigating the issue.

If you have a message that should have been sent by now, please leave it as is; it will be sent once the delay has been resolved. This issue will also affect message tracking data.

We apologize for the inconvenience and appreciate your patience.
Posted Jan 17, 2024 - 08:42 EST
This incident affected: Marketing Enterprise (Real Magnet).