Runtime Error

Incident Report for Higher Logic Platform

Postmortem

What Happened 

On May 28, 2025, following a scheduled system update, our AI Search feature became temporarily unavailable for some customer accounts. Our team began working on a resolution immediately, but an initial attempt to fix the issue caused further service disruptions for a small number of accounts.

To restore service quickly, we reverted to the previous stable version of our system. This helped most accounts, but some issues persisted in specific configurations. As a temporary measure to ensure stability for everyone, we disabled the AI Search feature for the affected accounts later that day.

On May 29, our team identified the root cause of both the initial problem and the subsequent issues. A comprehensive fix was developed and deployed, and by late morning that day the AI Search feature was fully restored for all users.

Root Cause 

The disruption was caused by a software bug introduced during the system update. Our standard testing processes did not catch this bug, primarily because it appeared only under particular conditions and with specific data configurations that were not fully represented in our pre-release testing environments. The way the update was initially rolled out also meant the problem was not visible until it affected live services. Furthermore, our attempt to undo the update was not immediately successful for all parts of our system, which led to our decision to temporarily disable the AI Search feature to ensure overall stability.

Corrective Actions 

We take this incident very seriously and are implementing the following measures to strengthen our systems and processes:

  • Expanding and Improving Our Testing: We are enhancing our testing environments to better replicate the diverse ways our customers use our platform. This includes more comprehensive testing for features like AI Search across a wider variety of account configurations and data scenarios before any updates are released. 

  • Strengthening Our Release and Change Management Processes: We are reviewing and improving how we plan and deploy updates, especially those involving complex changes to both system code and database structures. This will help us identify and mitigate potential risks earlier. 

  • Ensuring Faster and More Complete System Recovery: We are refining our rollback and recovery procedures so that, if we need to revert an update in the future, we can do so more quickly, thoroughly, and reliably for all users and all parts of our system.

  • Reinforcing Quality Checks: We are strengthening our internal review processes for code changes to apply a higher level of scrutiny and to catch potential issues before they reach our testing phases.

Posted May 30, 2025 - 17:20 EDT

Resolved

This incident has been resolved.
Posted May 28, 2025 - 17:51 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 28, 2025 - 16:55 EDT

Identified

The issue has been identified and a fix is being implemented.
Posted May 28, 2025 - 15:29 EDT

Investigating

We are currently investigating this issue.
Posted May 28, 2025 - 14:48 EDT
This incident affected: Community.