5 keys to a more automated Network Operations Center (NOC)

Drew Golden, Director, Product Management

Why is automation critical to an efficient NOC?

In the IT industry, we understand that more automation and Machine Learning (ML) will take IT operations to the next level. Many providers are eager to make the leap from Service to Value, as illustrated in the Gartner chart below. Automation is the only way to get there.

The problem

The key to a healthy, efficient NOC is the seamless flow of information that leads to an automated solution—before a customer ever feels the impact of an outage.

However, many NOCs experience internal friction that trickles down to the customer and back up through tickets and angry calls. Why? There are a few common reasons:

Too many screens and tools
Siloed data (e.g., in legacy systems)
Little to no business process automation
Inefficient root cause analysis

At Federos, we understand these pains all too well (having sat in the NOC ourselves), which is why we created a holistic, unified service assurance solution, Assure1®.

Before we dive into the solution to these problems, we need to take a closer look at how we, and the industry as a whole, think about automation.

Defining terms: automation

There’s an aspirational goal in the industry when it comes to automation: a “lights-out NOC,” or a fully automated NOC. You can imagine a completely virtualized environment that runs on its own, with little to no human involvement needed.

Is this possible? The future seems to be headed in that direction, but we do know that our present and near-future state is not quite there yet.

The reality is that only 10-15% of the work can be fully automated. The other 85-90% still relies on humans to take action.

Why? Most NOCs have a mix of legacy equipment, modern equipment and tech, and virtualized systems (where everything is in the cloud). Not only are these tools separate, but they do not communicate, and as a result create a “swivel-chair” effect for NOC workers. There may be a world where nearly everything is virtualized and fully automated, but as yet, this is aspirational.

5 keys to a more automated NOC

Shift from reactive to proactive

The NOC needs processes that automatically identify and resolve service-impacting incidents in real time or, even better, prevent incidents before they happen. Reacting to negative events or customer tickets is inefficient and costly. Automation and Machine Learning can scale your ability to predict and prevent issues before they occur.

Bring data into a unified platform

Consolidating and processing information quickly is paramount to the success of any network operations team. Until now, Communication Service Providers (CSPs), Managed Service Providers (MSPs) and other enterprises have struggled to visualize their expanding networks quickly and accurately in a single view, relying instead on legacy tools and manual practices to monitor critical network functions and services. The proliferation of inventory systems, siloed applications, and fractured network infrastructures brought together through acquisitions has created significant visibility gaps for the NOC, reducing productivity and increasing costs.
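As a rough illustration of what consolidation involves, the sketch below maps records from two differently shaped sources into one common event format. Everything in it is an assumption made for the example: the `Event` fields, the 1-to-5 severity scale, and the two source formats ("legacy" and "cloud") are hypothetical and do not describe the Assure1® data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common event shape (hypothetical) shared by every source."""
    source: str
    device: str
    severity: int  # assumed scale: 1 (info) to 5 (critical)
    message: str
    timestamp: datetime


# Hypothetical mapping from a legacy tool's text severities to the common scale.
LEGACY_SEVERITY = {"minor": 2, "major": 4, "critical": 5}


def from_legacy(record: dict) -> Event:
    """Assumed legacy format: text severity, epoch-second timestamps."""
    return Event(
        source="legacy",
        device=record["host"].lower(),
        severity=LEGACY_SEVERITY.get(record["sev"].lower(), 1),
        message=record["text"],
        timestamp=datetime.fromtimestamp(record["epoch"], tz=timezone.utc),
    )


def from_cloud(record: dict) -> Event:
    """Assumed cloud format: numeric severity, ISO 8601 timestamps."""
    return Event(
        source="cloud",
        device=record["resource_id"].lower(),
        severity=int(record["level"]),
        message=record["summary"],
        timestamp=datetime.fromisoformat(record["time"]),
    )


# Merge both feeds into one time-ordered list for a single view.
events = sorted(
    [
        from_legacy({"host": "CORE-RTR-1", "sev": "Major",
                     "text": "Link down", "epoch": 1700000000}),
        from_cloud({"resource_id": "vm-42", "level": 3,
                    "summary": "High CPU", "time": "2023-11-14T22:13:00+00:00"}),
    ],
    key=lambda e: e.timestamp,
)

for e in events:
    print(e.timestamp.isoformat(), e.source, e.device, e.severity, e.message)
```

Once every source lands in the same shape, sorting, filtering, and correlation can be written once rather than per tool.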

Industry-leading root cause analysis

Once you have consolidated data in one platform, you need to quickly pinpoint, analyze and resolve the root cause of service-impacting events. A system like Assure1® helps you suppress massive amounts of noise so that your operations team can act on the incidents that typically result in impacted services.

With ML and event analytics, you can leverage industry-standard ML algorithms with special data filters to normalize data, ensuring correct patterns are fed into the ML engine.

Using these data streams, the solution helps you detect anomalies, such as temporal deviations, statistical rarities and unusual behaviors, and generate a single root-cause event. Root-cause events contain suppression patterns that filter out noise, so NOC operators can predict and resolve problems rather than respond to a storm of event alarms (again, allowing you to be proactive instead of reactive).
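To make "statistical rarity" concrete, here is a minimal sketch that flags an event count as anomalous when it sits far from its historical mean. It uses a plain z-score test, which is a stand-in chosen for illustration and is not a description of the algorithms in Assure1®. The function name, the threshold of 3.0, and the sample counts are all assumptions.

```python
from __future__ import annotations

from statistics import mean, stdev


def is_anomalous(history: list[int], current: int, threshold: float = 3.0) -> bool:
    """Return True if `current` is more than `threshold` standard deviations
    away from the mean of `history`. Works in both directions, so it catches
    sudden spikes as well as sudden drops."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat baseline: any change is unusual
    return abs(current - mu) / sigma > threshold


# Hypothetical event counts per 5-minute window for one device.
baseline = [12, 9, 11, 10, 13, 8, 11, 10]

print(is_anomalous(baseline, 11))   # typical window
print(is_anomalous(baseline, 240))  # sudden spike
print(is_anomalous(baseline, 0))    # sudden silence
```

A per-device or per-port baseline like this is deliberately simple; real traffic usually has daily and weekly cycles that a production detector would need to account for.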

Identify what is actionable

At Federos, we talk a lot about actionability because it is the key to effective automation. Operations teams must shift to an actionability mindset in order to drive automation.

ML and event analytics round out the three-pronged Assure1® strategy for providing customers with industry-leading root cause analysis (RCA). Federos delivers three types of RCA, and the final one is tied to actionability that requires a human:

Topological RCA by leveraging physical and virtual topology discovery
Unsupervised Machine Learning RCA that learns from patterns and does not require topology
Supervised RCA, where operators can flag noise fields and tie them to known root causes
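The first item, topological RCA, can be sketched in a few lines: if a device and the device upstream of it are both alarming, the upstream device is the more likely cause. The example below assumes a simple tree topology in which each device has at most one upstream parent. The function name, device names, and data shapes are hypothetical, and real topologies with redundant paths need a more careful approach.

```python
from __future__ import annotations


def find_root_causes(parent: dict[str, str | None],
                     alarming: set[str]) -> dict[str, set[str]]:
    """Group alarming devices under their most-upstream alarming ancestor.

    Returns a mapping of root-cause device -> downstream alarming devices
    whose alarms can be suppressed in favor of the root cause.
    Assumes `parent` describes a tree (no cycles).
    """
    groups: dict[str, set[str]] = {}
    for device in alarming:
        root = device
        # Climb the topology for as long as the upstream device is also alarming.
        while parent.get(root) in alarming:
            root = parent[root]
        groups.setdefault(root, set())
        if device != root:
            groups[root].add(device)
    return groups


# Hypothetical topology: core -> aggregation -> access.
parent = {
    "core-1": None,
    "agg-1": "core-1",
    "agg-2": "core-1",
    "acc-1": "agg-1",
    "acc-2": "agg-1",
    "acc-3": "agg-2",
}
alarming = {"agg-1", "acc-1", "acc-2", "acc-3"}

result = find_root_causes(parent, alarming)
for root, suppressed in sorted(result.items()):
    print(root, "explains", sorted(suppressed))
```

Here four alarms collapse into two root-cause events: `agg-1` explains the alarms on `acc-1` and `acc-2`, while `acc-3` stands alone because its upstream device is healthy.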

What you should be automating right now:

Inventory Drift: Discover when inventory is drifting and automatically open a trouble ticket (this can happen 20, 30, or 100 times a day). Assure1® Universal Topology can quickly and accurately depict topological changes in near real time. It includes a fully integrated cross-domain topology and relationship management function to handle any technology, logical or physical.
Event Storms and Dips: Driven by event storms (or sudden dips in events) that trace back to a single root cause. For example, a fiber cut that causes element management systems to disconnect.
Abnormal Behavior: Driven by learning the noise fields of every device, down to ports on switches. The abnormal behavior rule generates and escalates events based on anomalies not common to that port or device. For example, a core router port that has previously been stable but suddenly begins having issues would be flagged and escalated for analysis.
NOC Operational Performance: Looks at how different types of events are handled and learns how each kind of event is managed in the NOC. Based on this information, the solution sends an alert when an event is handled abnormally. For example, if a NOC operator typically acknowledges a downed port by adding a journal entry and then clearing the alarm, Assure1® would “learn” that sequence as normal for that type of event. If someone later cleared such an event without working on it, that action would raise an alarm.
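Of the items above, inventory drift is the easiest to picture in code. The sketch below compares two inventory snapshots and builds a ticket payload when they differ. The snapshot format, function names, and ticket fields are assumptions made for the example and are not an Assure1® or ticketing-system API.

```python
from __future__ import annotations


def inventory_drift(previous: dict[str, dict],
                    current: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two snapshots keyed by device name and report what changed."""
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(
            name for name in previous.keys() & current.keys()
            if previous[name] != current[name]
        ),
    }


def ticket_for(drift: dict[str, list[str]]) -> dict | None:
    """Build a hypothetical ticket payload, or None if nothing drifted."""
    if not any(drift.values()):
        return None
    summary = ", ".join(f"{len(names)} {kind}" for kind, names in drift.items() if names)
    return {"title": f"Inventory drift detected: {summary}", "details": drift}


# Hypothetical snapshots taken at two discovery runs.
previous = {
    "sw-1": {"model": "X", "os": "1.0"},
    "sw-2": {"model": "Y", "os": "2.3"},
}
current = {
    "sw-1": {"model": "X", "os": "1.1"},  # OS upgraded
    "sw-3": {"model": "Z", "os": "4.0"},  # new device; sw-2 has disappeared
}

drift = inventory_drift(previous, current)
ticket = ticket_for(drift)
print(ticket["title"])
```

In practice the payload would be handed to whatever ticketing integration the NOC already uses; the point is that the comparison and the ticket creation need no human in the loop.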

Simplify and automate the NOC

So, now we ask you: how much time are you spending in reactive mode or on manual, time-consuming processes? Are you being asked to do more with less information?

Unfortunately, those are typical NOC conditions—and they shouldn’t be.

Assure1® collects and normalizes fault, performance, topology, service, and other external data into a single, unified platform. Advanced correlation and analysis, including AI/Machine Learning, produce actionable insights that drive automation and improve operational efficiency while significantly lowering costs.