RADIUS Incident

Incident Report for Visp

Postmortem

RESOLVED: Primary Database Performance Degradation and Brief RADIUS Outage [April 19, 2024]

Greetings!

A RADIUS outage occurred on April 19, 2024 and lasted approximately three hours. The cause was degraded performance in the primary database. The database's automatic scaling mechanism could not allocate additional resources because database storage usage was close to the maximum threshold.

Here’s a timeline of the incident:

10:43 AM PST: New payments or invoices were not immediately visible in the Transaction table but were still recorded in the Logs for each subscriber.
12:08 PM PST: A second incident occurred, preventing users from saving new invoices or storing payment details.
12:30 PM PST: Intermittent RADIUS authentication issues started occurring.

Root Cause Analysis (RCA)

Root Cause 1: Issue with Database Storage and Automatic Scaling Mechanism: The primary database relies on an automated system to scale its resources based on demand. However, the database storage size approached the maximum storage threshold, which prevented the system from triggering the necessary scaling steps during the incident.
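
To illustrate the failure mode described in Root Cause 1, here is a minimal sketch of a scaling decision that is blocked by a storage ceiling. It is not Visp's actual scaling logic: the StorageState fields, the plan_scaling function, and the 10% trigger and step values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageState:
    """Hypothetical snapshot of a database's storage configuration."""
    allocated_gib: float      # storage currently provisioned
    used_gib: float           # storage currently in use
    max_threshold_gib: float  # ceiling the autoscaler may not exceed


def plan_scaling(state: StorageState,
                 step_fraction: float = 0.10,
                 trigger_free_fraction: float = 0.10) -> tuple[str, float]:
    """Decide whether a scaling step is needed and whether it fits.

    Returns (status, gib), where status is one of:
      "not_needed": enough free space, gib is 0.0
      "scale":      a full step fits under the ceiling, gib is the step size
      "blocked":    scaling is needed but the step does not fit,
                    gib is the headroom that remains under the ceiling
    The percentages are assumptions, not values from the incident.
    """
    free_gib = state.allocated_gib - state.used_gib
    if free_gib / state.allocated_gib > trigger_free_fraction:
        return ("not_needed", 0.0)

    wanted_gib = state.allocated_gib * step_fraction
    headroom_gib = state.max_threshold_gib - state.allocated_gib
    if headroom_gib >= wanted_gib:
        return ("scale", wanted_gib)
    return ("blocked", headroom_gib)


if __name__ == "__main__":
    # Storage is nearly full and the ceiling is too close for a full step.
    near_ceiling = StorageState(allocated_gib=1000, used_gib=950,
                                max_threshold_gib=1050)
    print(plan_scaling(near_ceiling))
```

In this sketch, raising max_threshold_gib is what turns a "blocked" result back into "scale", which corresponds to the threshold increase described under Mitigation Steps.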

Contributing Factor 1: Unexpected Database Surge: A higher-than-anticipated surge in database activity occurred prior to the outage. The automatic scaling mechanism should have absorbed this surge, but the limitation described in Root Cause 1 prevented it from doing so, which resulted in database performance issues.

Contributing Factor 2: Synchronization delays between the Primary and Replica databases compounded the performance degradation already caused by the scaling issue and contributed to additional delays in certain billing queries and processes.

Mitigation Steps:

To mitigate the identified issues, the team took immediate steps to increase the maximum storage threshold and the allocated storage of our database. However, the optimization process required several hours to complete, prolonging the resolution of the RADIUS authentication issue.

Action Items:

The team will conduct a thorough investigation and implement solutions to ensure that the automatic scaling mechanism functions as expected. This includes reviewing and, where needed, revising the thresholds and triggers for automatic scaling so that they are adequate for anticipated traffic and database load.
The team will evaluate whether the database storage capacity can handle potential future surges in demand.
A project is already underway to migrate the RADIUS database to a separate instance and isolate it from other services.
The team will conduct a post-mortem review with relevant stakeholders to discuss lessons learned and identify opportunities for improvement in:
1. Our monitoring and alerting procedures for database performance.
2. Mass notifications of stakeholders or App Users in the event of an outage.
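
As one possible shape for the monitoring and capacity review listed above, the sketch below estimates how long it would take for storage usage to reach the maximum threshold at a given growth rate, and maps that estimate to an alert level. The function names, the 24-hour and 72-hour cutoffs, and the sample figures are assumptions for illustration only and do not describe Visp's monitoring.

```python
def hours_until_ceiling(used_gib: float,
                        max_threshold_gib: float,
                        growth_gib_per_hour: float) -> float:
    """Estimate hours until usage reaches the storage ceiling.

    Assumes linear growth. Returns infinity when usage is flat or shrinking.
    """
    if growth_gib_per_hour <= 0:
        return float("inf")
    remaining_gib = max(max_threshold_gib - used_gib, 0.0)
    return remaining_gib / growth_gib_per_hour


def alert_level(hours_left: float,
                warn_hours: float = 72.0,
                page_hours: float = 24.0) -> str:
    """Map the time remaining to a hypothetical alert level."""
    if hours_left <= page_hours:
        return "page"   # notify on-call immediately
    if hours_left <= warn_hours:
        return "warn"   # raise a ticket for capacity review
    return "ok"


if __name__ == "__main__":
    # Sample figures only: 900 GiB used, 1000 GiB ceiling, 5 GiB/hour growth.
    hours_left = hours_until_ceiling(900, 1000, 5)
    print(hours_left, alert_level(hours_left))
```

Alerting on projected time to the ceiling, rather than on a fixed usage percentage, gives earlier warning during a surge, because a higher growth rate shortens the estimate even when current usage looks acceptable.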

If you have any questions or concerns, feel free to reach out to your Visp Client Success team via success@visp.net, or call us at 541-955-6900.

Posted Apr 23, 2024 - 23:57 UTC

Resolved

We've received an update from the team that RADIUS has normalized and authentication should be working as expected. We’re watching the systems closely and have already started a post-mortem to identify and address any related contributing factors.

Thank you for your patience. Additional details will be provided as updates become available.

To stay updated on this incident and other reported issues, subscribe to our status page (https://status.visp.net).

Posted Apr 19, 2024 - 21:54 UTC

Monitoring

This has been added as a separate incident from the Transaction delays reported earlier, so that Users can monitor updates here.

Posted Apr 19, 2024 - 21:52 UTC

This incident affected: VISP.