With RootMetrics resuming its regular on-the-ground testing in cities across the United States, the benchmarking and analysis company’s testing teams watched Monday’s major T-Mobile US outage play out in real time.
RootMetrics had teams doing outdoor testing in 10 cities that day, according to Suzanth Subramaniyan, director of mobile networks at RootMetrics: Stockton, California; Washington, D.C.; Palm Bay, Florida; Youngstown, Ohio; Fort Wayne, Indiana; Augusta, Georgia; Denver, Colorado; Dallas and Houston, Texas; and Greenville, South Carolina. So the company had a broad view of the network from both coasts and from north to south.
Testing was primarily being conducted on a series of Samsung Galaxy Note 10+ devices, although the testers were also previewing some Galaxy S20 devices. Around 12:34 p.m. ET, Subramaniyan said, teams in nine out of those 10 markets started seeing call failures on the devices operating on T-Mobile US’ network, with a 100% failure rate on calls that were both placed from and made to devices on T-Mobile US’ network (as opposed to cross-carrier calls).
“The device would attempt to make a call and shut itself down. The destination device wouldn’t even know that there was a call made,” Subramaniyan said. As RootMetrics’ testers began to troubleshoot, he said, they noticed that both calls and SMS were failing, and that SMS failed only when messages were sent from or to T-Mobile US devices. Further testing once testers returned home from the field also showed that voice over Wi-Fi calls were impacted. Voice calls and messages sent via OTT data applications such as WhatsApp or Facebook Messenger were able to get through, however, which led RootMetrics to believe that the problem was with T-Mobile US’ IMS server(s).
The only market not impacted was Youngstown, Ohio, Subramaniyan added.
Interestingly, Subramaniyan said, RootMetrics’ data in some other markets does show that a few calls were able to get through — likely because they were on 2G or 3G networks, as opposed to T-Mobile US’ LTE. For example, he said that RootMetrics documented a call on T-Mobile US’ network in Palm Bay at 12:52 p.m. ET that succeeded, which was identified as a 3G call.
The impacts were confined to T-Mobile US’ network, although some crowd-sourced information sources such as DownDetector — which relies in part on social media activity by end users — indicated that multiple carrier networks were experiencing outages.
“Network outages are always frustrating, but consumers might not realize that the problems they’re experiencing are due to another network rather than their own carrier. An outage from one carrier, for instance, can cause difficulties not just for users on that network but for users of other carriers who try to contact anyone on the downed network. Without knowing which carrier is down, it’s easy to assume all carriers are having issues. Our scientific testing can help identify in real-time not just where consumers are having issues but what the root cause of the outage could be,” Subramaniyan said.
Although Sprint users are now able to use the T-Mobile US network and as many as 10 million per week are doing so, RootMetrics saw differences between the T-Mobile US and Sprint experiences of the outage on the S20 devices in New York City. The Radio Access Network and spectrum available to the devices from both carriers would have been the same, Subramaniyan said, and the data experience was largely the same, but the Sprint devices’ voice performance didn’t see an impact, probably because the devices were relying on different voice infrastructure (Sprint’s VoLTE deployments significantly lagged behind those of its competitors).
RootMetrics’ observations largely jibe with the report of the incident as outlined in a blog post by T-Mobile US CTO Neville Ray, which was provided to RCR Wireless News in answer to questions about the outage.
“Many of our customers experienced a voice and text issue yesterday, specifically with VoLTE (Voice over LTE) calling,” Ray wrote. “My team took immediate action — hundreds of our engineers worked tirelessly alongside vendors and partners throughout the day to resolve the issue starting the minute we were aware of it. Data connections continued to work, as did our non-VoLTE calling for many customers and services like FaceTime, iMessage, Google Meet, Google Duo, Zoom, Skype and others allowed our customers to stay in touch. Additionally, many customers were able to use circuit-switched voice connections and customers on the Sprint network were unaffected. VoLTE and text in all regions were fully recovered by 10 p.m. PDT last night. I’m happy to say the network is fully operational… and we’re working day in and day out to keep it that way.
“Our engineers worked through the night to understand the root cause of yesterday’s issues, address it and prevent it from happening again. The trigger event is known to be a leased fiber circuit failure from a third party provider in the Southeast. This is something that happens on every mobile network, so we’ve worked with our vendors to build redundancy and resiliency to make sure that these types of circuit failures don’t affect customers. This redundancy failed us and resulted in an overload situation that was then compounded by other factors. This overload resulted in an IP traffic storm that spread from the Southeast to create significant capacity issues across the IMS (IP Multimedia Subsystem) core network that supports VoLTE calls.
“We have worked with our IMS (IP Multimedia Subsystem) and IP vendors to add permanent additional safeguards to prevent this from happening again and we’re continuing to work on determining the cause of the initial overload failure,” Ray wrote. “So, I want to personally apologize for any inconvenience that we created yesterday and thank you for your patience as we worked through the situation toward resolution.”