Bob Mackin
For more than two hours, nobody noticed that servers had crashed at the E-Comm 9-1-1 call centre, because an alert email was missed.
The server room at the East Vancouver facility, which also houses the City of Vancouver’s primary data centre, suffered a chiller failure at 2:07 a.m. on March 16 due to a faulty temperature sensor, according to the incident report obtained by theBreaker.news under the freedom of information law.
The backup system failed to activate, exacerbating the situation, said E-Comm’s facilities manager in an email to senior executives.
“We received a critical alarm email for ‘Ecomm, Chiller1’ at 2:07 a.m.,” wrote Verin Jekkal. “Unfortunately it was missed due to its classification as a single-chime email.”
The result was lost response time to both the chiller failure and the 4 a.m. systems outage.
Not until 6:15 a.m. did anyone notice.
That is when a technical specialist from the city checked his e-mail and saw the alert notifications. He immediately called his manager, Francis Tan, and key colleagues.
“At about 6:30 a.m. Francis contacted [Jekkal] which was when E-Comm was first advised of the issue,” wrote Kyle Foster, the city’s director of infrastructure and operations, in the city’s incident report.
“None of these alerts were configured to be sent via text or phone call, nor are they monitored by a 24×7 service; they went unnoticed.”
City and E-Comm staff arrived on-site to begin recovery at 7 a.m. They opened the data centre doors and set up large fans outside the rooms to disperse the heat, which reached a sweltering 57 degrees Celsius.
Technicians from ventilation and air conditioning contractor Trane arrived at 8 a.m. and both primary and backup chillers were operational by 8:40 a.m. It was cool enough to power equipment back up by 9:30 a.m.
While 9-1-1 continued to function, emergency call-takers and dispatchers used paper instead of computer-aided dispatch systems and whiteboards instead of screens. Except for an early slowdown, their ability to take calls was not impacted. They could not, however, access the B.C. and federal police reporting databases.
The outage affected anyone trying to use the city website or Van311 app between 4 a.m. and 8 a.m. and the city’s 3-1-1 non-emergency hotline between 7 a.m. and 10:30 a.m. It also impacted the 400 staff, mainly firefighters and community centre workers, on shift that morning.
Technicians rebooted the server at 12:30 p.m. on March 17, then restarted multiple servers and monitored all applications for recovery throughout March 18.
Foster’s report called it a high-severity incident that would have been worse only if the outage had occurred on a weekday. He found deficiencies in dashboard monitoring, internal staff and external partner contact details, incident reporting, response and communication procedures.
On the positive side of the ledger, e-mail, Teams, OneDrive, and SharePoint Online remained accessible because they are cloud-based. City workers also had the benefit of a recent three-day business continuity plan exercise.
Internal E-Comm communication suggested an initial concern that the data centre could have been hacked. But a spokesperson for the company said that scenario was ruled out.
“The cooling system failure was triggered by a technical problem, a fault in an electrical component of the main cooling system, and not malicious interference,” said E-Comm communications manager Carly Paice. “Next steps include the completion of a full post-incident assessment of the outage that is still underway, and incorporating lessons learned into ongoing work to strengthen technology resiliency.”