February’s San Francisco Metrics Meetup focused on anomaly detection. Cody Rioux, from the real-time analytics team at Netflix, gave this talk on artificial intelligence and machine learning, and specifically on how Netflix use a custom-built, in-house system called Kepler to run against telemetry data and spot outliers.
In the context of this talk, Netflix define outliers as ‘weird’ server instances putting data into the telemetry system.
The primary use of Kepler is to detect rogue server instances, which is not what it was originally developed for as a Python library. Even so, it has terminated around 1,100 server instances in the two to three months since it was implemented.
Rioux also explains that Netflix previously relied on threshold-based alerting, in which an alert is triggered whenever a metric goes beyond a certain threshold, regardless of whether there is a genuine problem. Netflix turned to artificial intelligence for what Rioux describes as the final 20% of what they needed to get out of the alerting system.
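As a rough illustration of the threshold-based approach described above, here is a minimal Python sketch. The metric, the threshold value and the `check_threshold` helper are hypothetical and not taken from the talk. The point is only that a static rule fires whenever a value crosses the line, whether or not anything is actually wrong.

```python
# Hypothetical static-threshold alert; names and values are illustrative only.
CPU_THRESHOLD = 90.0  # assumed percentage limit, not a figure from the talk


def check_threshold(samples, threshold=CPU_THRESHOLD):
    """Return the indices of samples that exceed the static threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# A one-off spike (index 1) alerts in exactly the same way as the
# sustained run at the end (indices 4 to 6).
cpu_percent = [42.0, 95.0, 44.0, 43.0, 92.0, 94.0, 96.0]
alerts = check_threshold(cpu_percent)
print(alerts)  # [1, 4, 5, 6]
```

The rule has no notion of context: it cannot tell a brief blip from a real fault, which is the kind of gap the talk says the machine learning approach was brought in to close.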
The talk delves into the algorithm they use (go to 6:45), called DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which captures high and low outliers as well as noisy servers that are just exhibiting jittery behaviour.
The algorithm also allows for banding in the data. Rioux explains with an example from a cluster of servers: “There are two distinct acceptable behaviour patterns for this group of servers and it’s exhibiting that here. The algorithm is robust against this so it does not label either of those as outliers, and that’s selectable with a parameter, and we’re actually able to automatically learn that from user input on the algorithm. The users don’t actually specify any parameters when they set these things up, at least in terms of machine learning parameters like a distance or cluster size.”
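To make the banding behaviour concrete, here is a minimal pure-Python sketch of textbook DBSCAN. It is not Kepler's code. The instance names, the single latency-style metric, and the hand-picked `eps` (distance) and `min_pts` (cluster size) values are all assumptions for illustration. According to the talk, Kepler learns such parameters from user input rather than asking users to set them.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels


# Hypothetical per-server metric: two acceptable bands plus two outliers.
servers = {
    "i-a1": (100.0,), "i-a2": (102.0,), "i-a3": (98.0,), "i-a4": (101.0,),
    "i-b1": (200.0,), "i-b2": (203.0,), "i-b3": (199.0,), "i-b4": (201.0,),
    "i-hi": (400.0,), "i-lo": (5.0,),
}
labels = dbscan(list(servers.values()), eps=5.0, min_pts=3)
outliers = [name for name, label in zip(servers, labels) if label == NOISE]
print(outliers)  # ['i-hi', 'i-lo']
```

Both bands come out as their own clusters, so neither is flagged, while the high and low stragglers are labelled as noise. With a larger `min_pts` or a smaller `eps`, the same data would be judged more strictly.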
To detect anomalies they use an algorithm called Markov chain Monte Carlo (go to 11:30). This is done in near real time, so it works with small datasets, but it has its downsides too, such as a slow runtime.
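This write-up does not spell out how the sampler feeds into anomaly detection, so the following is only a generic sketch of the technique. It uses a random-walk Metropolis sampler (one common Markov chain Monte Carlo method) to estimate a metric's typical level from a small dataset, then flags a new reading that falls far from it. The model (normal likelihood with a known spread, flat prior), the data and every name here are assumptions for illustration, not Kepler's method.

```python
import math
import random


def log_likelihood(mu, data, sigma):
    """Log-likelihood (up to a constant) of data under Normal(mu, sigma)."""
    return -sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)


def metropolis(data, sigma, steps=5000, step_size=1.0, seed=0):
    """Random-walk Metropolis samples of mu under a flat prior."""
    rng = random.Random(seed)
    mu = sum(data) / len(data)  # start the chain at the sample mean
    current = log_likelihood(mu, data, sigma)
    samples = []
    for _ in range(steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_likelihood(proposal, data, sigma)
        # Accept with probability min(1, exp(candidate - current)).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


def is_anomalous(reading, samples, sigma, z=3.0):
    """Flag a reading more than z * sigma away from the posterior mean of mu."""
    posterior_mean = sum(samples) / len(samples)
    return abs(reading - posterior_mean) > z * sigma


# Small, hypothetical window of recent readings for one metric.
recent = [101.0, 99.0, 100.0, 102.0, 98.0, 100.0]
samples = metropolis(recent, sigma=2.0)[500:]  # drop the first 500 as burn-in
posterior_mean = sum(samples) / len(samples)
print(is_anomalous(130.0, samples, sigma=2.0))  # True
print(is_anomalous(101.0, samples, sigma=2.0))  # False
```

Every step re-evaluates the likelihood over the whole dataset and thousands of steps are needed, which gives a feel for why this family of methods suits small datasets and is slow to run.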
The Kepler system means that Netflix are able to run their huge systems in the cloud and detect outlier-generating mechanisms with a minimum of human intervention, in order to detect and predict outages across the network.