RESOLVED: Primary Database Performance Degradation and Brief RADIUS Outage [April 19, 2024]
Greetings!
A RADIUS outage occurred on April 19, 2024, and lasted approximately three hours. The cause was degraded performance in the primary database. The database's automatic scaling mechanism could not allocate resources effectively because storage usage had neared the maximum threshold.
Here’s a timeline of the incident:
Root Cause Analysis (RCA)
Root Cause 1: Issue with Database Storage and Automatic Scaling Mechanism: The primary database relies on an automated system to scale its resources based on demand. However, the database storage size approached the maximum storage threshold, which prevented the system from triggering the necessary scaling steps during the incident.
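To illustrate the failure mode described in Root Cause 1, here is a minimal sketch of how a storage autoscaler can be blocked by its own ceiling. It is not our database provider's actual logic. The names `StorageState` and `scaling_decision`, the 10% free-space trigger, the 100 GiB step, and all figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StorageState:
    allocated_gib: float  # storage currently provisioned
    used_gib: float       # storage currently in use
    max_gib: float        # maximum storage threshold (autoscaling ceiling)


def scaling_decision(state: StorageState,
                     free_ratio_trigger: float = 0.10,  # assumed trigger
                     step_gib: float = 100.0) -> str:   # assumed step size
    """Return 'no_action', 'scale', or 'blocked_by_max' for a storage state."""
    free_ratio = (state.allocated_gib - state.used_gib) / state.allocated_gib
    if free_ratio >= free_ratio_trigger:
        return "no_action"        # enough free space, nothing to do
    if state.allocated_gib + step_gib > state.max_gib:
        return "blocked_by_max"   # the next step would exceed the ceiling
    return "scale"


# Hypothetical figures: 5% free space, but the ceiling leaves no room to grow.
before = StorageState(allocated_gib=1000, used_gib=950, max_gib=1050)
# Same usage after raising the maximum storage threshold.
after = StorageState(allocated_gib=1000, used_gib=950, max_gib=2000)

print(scaling_decision(before))  # blocked_by_max
print(scaling_decision(after))   # scale
```

In this sketch, raising `max_gib` is what turns a blocked decision into a scaling step. Monitoring the gap between allocated storage and the ceiling, not just free space, would surface this condition before it blocks scaling.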
Contributing Factor 1: Unexpected Database Surge: A higher-than-anticipated surge in database activity occurred prior to the outage. The automatic scaling mechanism should have absorbed this surge, but the limitation described in Root Cause 1 prevented it from doing so, resulting in database performance issues.
Contributing Factor 2: Primary-Replica Synchronization Delays: Synchronization delays between the primary and replica databases compounded the performance degradation already caused by the scaling issue and further delayed certain billing queries and processes.
Mitigation Steps:
To mitigate the identified issues, the team took immediate steps to increase the maximum storage threshold and the allocated storage of our database. However, the optimization process required several hours to complete, prolonging the resolution of the RADIUS authentication issue.
Action Items:
Conduct a post-mortem review with relevant stakeholders to discuss lessons learned and identify opportunities for improvement in
If you have any questions or concerns, feel free to reach out to your Visp Client Success team at success@visp.net, or call us at 541-955-6900.