[WIP] Enabling AI and Statistical Analysis Capabilities for Apache SkyWalking #8883

Superskyyy · 2022-04-15T05:49:50Z

Superskyyy
Apr 15, 2022
Collaborator

Hello Community,

In the following abstract, I'd like to propose a project idea for the upcoming OSPP 2022 event. To readers who are not familiar with the event, OSPP (https://summer-ospp.ac.cn/) provides opportunities for full-time university students to participate in established open-source ecosystems. A dedicated mentor will guide the student through a predefined project within the ecosystem.

The Background section below gives an overview of the state of AI technology in the observability (DevOps/DevSecOps) landscape, including how commercial vendors have adopted AI.

Background
With drastic improvements in machine learning capabilities in recent years, practical AI solutions have been adopted at scale in production scenarios. In our landscape, we see major commercial observability platforms (Dynatrace, AppDynamics, New Relic) offering AI-enabled functionality that has been actively evolving over the last few years.

To my understanding, the road to AI-assisted observability can be divided into three primary phases, and we must take them carefully, one step at a time (otherwise, it will be a total disaster). The first phase is reactive: sending alerts to users after an anomaly occurs. The second phase is proactive detection: we actively monitor for potential degradations and alert users before an anomaly arrives. The final phase is automatic root cause analysis using advanced methods like fault-tree analysis, so that we can quickly (even automatically) carry out actions to recover from incidents. As an established open-source ecosystem, we should first focus on precise reactive anomaly detection to build the basis for the later phases.

The aforementioned AI capabilities depend on two key challenge areas, one of which we have already largely addressed. The first is data: we need rich and precise data from multiple dimensions to enable accurate modelling of historical behaviour, which means observing production systems through more than one pillar of observability. To date, we have successfully rolled out support for metrics, traces, logs and events for all sorts of backends and even browsers. What remains in this area is automatic data cleansing and enrichment based on data understanding (traffic often does not come all day, all the time).

The second key challenge is that we need continuously trained models that fit past data and predict the future. Production-friendly machine learning faces more constraints than lab environments: no GPUs, low tolerance for false alarms, and high throughput. Last but not least, the models should fit well without complex fine-tuning by end-users. Given these requirements, deep neural networks are, in my opinion, not the optimal choice for an out-of-box solution. We can still offer DNN solutions as an alternative for users who can find a GPU server to deploy them on, and a DNN that requires minimal tuning and is relatively fast (low training time and acceptable throughput while outperforming traditional ML baselines) would possibly be a significant research outcome.

Project Overview
We aim to derive dynamic baselines and thresholds for vital operational metrics aggregated by SkyWalking; the detection coverage should match the current static alert rule capabilities. We detect anomalies in the metrics by comparing the incoming data stream with the trained baseline/ thresholds and send out alerts to end-users (based on further false-positive suppression mechanisms).

Because the project idea is ~~under preliminary research~~, the remaining sections of this proposal can now be accessed from this Google Doc. Please feel free to place comments and give suggestions.

Superskyyy · 2022-04-15T17:40:04Z

Superskyyy
Apr 15, 2022
Collaborator Author

Syncing content from Google docs for those who may face trouble opening it

I don’t expect these AI functionalities, when used out-of-box, to be widely adopted by all end-users in the near future. I do aim to break new ground in the field by open-sourcing our implementations and providing guidance to growing companies and developers who want to give AIOps a try but don’t know where to start, since practical algorithms are mostly not disclosed and are especially hard for practitioners to access.

WHY NOT JUST MANUAL ALERT RULES?
Manual rules are tedious if you want fine-grained alarms. Plus, the operations team may not be familiar with the business requirements of every service, especially today, when some ops teams are entirely decoupled from dev teams.

GOAL
The goal is to provide fine-grained baselines/thresholds for relevant entity attributes and minimize the need for manual calibration of alert rules. Dynamic thresholds and static thresholds should be able to work together.

The models should be able to start training as soon as the first hour's data arrives and periodically retrain as more data comes in. The seasonality learning part in Scenario 2 only comes in after sufficient data has been acquired (hours to weeks depending on granularity).

The following 4 scenarios cover potential sub-tasks and their individual goals.

Scenario 1: Dynamic Threshold Value for Stable Metrics
Goal: Dynamically learn and set the thresholds for incoming multi-dimensional data using a combination of model output (prediction for future values) and statistical measures. Here we compare the actual values to the derived threshold point. This goal is to target metrics that do not fluctuate in terms of seasonality and non-hardware-related factors.
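As a rough illustration of the "statistical measures" half of Scenario 1, here is a minimal sketch of a rolling threshold: the mean of a recent window plus k standard deviations. The class name, window size, k, and warm-up length are all assumptions made for this example; none of them come from the proposal, and a real implementation would combine this with model predictions.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Rolling mean + k * stdev upper threshold (illustrative sketch only)."""

    def __init__(self, window: int = 60, k: float = 3.0, min_points: int = 10):
        self.values = deque(maxlen=window)  # most recent "normal" observations
        self.k = k
        self.min_points = min_points

    def threshold(self):
        """Return the current upper threshold, or None while warming up."""
        if len(self.values) < self.min_points:
            return None
        return mean(self.values) + self.k * stdev(self.values)

    def observe(self, value: float) -> bool:
        """Compare value to the threshold, then learn from it if it looks normal."""
        limit = self.threshold()
        is_anomaly = limit is not None and value > limit
        if not is_anomaly:
            # Anomalous points stay out of the window so they do not inflate the baseline.
            self.values.append(value)
        return is_anomaly


# Hypothetical response-time samples (ms); the last one is an injected spike.
detector = DynamicThreshold(window=30, k=3.0, min_points=10)
latencies = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 350]
flags = [detector.observe(v) for v in latencies]
```

Only the final spike is flagged here. Note that a perfectly constant metric gives a standard deviation of zero, so any value above the mean would be flagged; a production version would need a minimum band width.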

Scenario 2: Dynamic Baseline for Service Load and Traffic Metrics
Goal: Learn and predict seasonal fluctuations in traffic and load metrics caused by impacts from high-level sociological and business factors. Traffic can have significant seasonalities and underlying patterns according to the unique characteristics of each business/service, while previous metrics in Scenario 1, like error rate/ Apdex, do not fluctuate as much. Traffic can and should reach extreme values throughout the days and weeks, and there is no single "threshold" value to compare to in such an anomaly detection problem.

Scenario 3: Outliers detection (Bonus task)
Goal: To put it simply, in a set of instances within the same service, one of them behaves differently than others in certain attributes. We need to point out the outlier instances.
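One simple way to approach Scenario 3 is a robust peer comparison: score each instance by how far it sits from the median of its peers, scaled by the median absolute deviation (MAD). This is only a sketch under that assumption; the function name, the 3.5 cutoff, and the sample data are hypothetical, and the input is assumed to be a non-empty mapping.

```python
from statistics import median


def outlier_instances(metric_by_instance: dict, cutoff: float = 3.5) -> list:
    """Return names of instances whose modified z-score exceeds cutoff."""
    values = list(metric_by_instance.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Most instances agree exactly; flag anything that differs from the median.
        return sorted(n for n, v in metric_by_instance.items() if v != center)
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return sorted(
        name
        for name, v in metric_by_instance.items()
        if abs(0.6745 * (v - center) / mad) > cutoff
    )


# Hypothetical per-instance latency (ms) within one service.
latency = {"a": 120, "b": 118, "c": 125, "d": 122, "e": 480}
suspects = outlier_instances(latency)
```

Median and MAD are used instead of mean and standard deviation because a single extreme instance would otherwise distort the very statistics used to judge it.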

Scenario 4: Proactive detection (Bonus task)
Goal: The near-future projection of the entity metric does not look great. We need to alert the users before bad things happen.
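For Scenario 4, the simplest possible projection is a straight line fitted to recent history and extended forward until it crosses a limit. The sketch below assumes that approach purely for illustration; the function name and the disk-usage numbers are hypothetical, and real metrics would likely need a seasonal or non-linear forecast.

```python
def steps_until_breach(history, limit, horizon):
    """Fit a least-squares line to history and return how many steps ahead the
    projection first reaches limit, or None if it does not within horizon."""
    n = len(history)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    for step in range(1, horizon + 1):
        # The last observed point has index n - 1, so step 1 is index n.
        if intercept + slope * (n - 1 + step) >= limit:
            return step
    return None


# Hypothetical disk usage (%) sampled at a fixed interval.
usage = [50, 52, 54, 56, 58, 60]
eta = steps_until_breach(usage, limit=70, horizon=12)
```

With these samples the fitted slope is 2 points per interval, so the projection reaches 70% five intervals after the last observation.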

There will always be more data types[2]; we need to evaluate carefully and support them in multiple phases, but let’s focus on the basics first.

Seasonality
We need to fit the model on at least a week of data to provide insights based on seasonalities, that is, metric patterns that repeat over a period (hourly, daily, weekly, etc.).
Suitable base algorithms for seasonality include Holt-Winters (used by New Relic as part of an ensemble): https://newrelic.com/blog/how-to-relic/baseline-alerts-algorithm
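To make the Holt-Winters idea concrete, here is a minimal pure-Python sketch of the additive variant (level + trend + seasonal component). It is not New Relic's ensemble and not a proposed SkyWalking implementation; the smoothing parameters, the initialisation scheme, and the toy series are assumptions chosen for readability.

```python
def holt_winters_additive(series, season_len, alpha=0.3, beta=0.05, gamma=0.2, horizon=1):
    """Additive Holt-Winters; returns `horizon` forecasts past the end of series."""
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")
    # Initialise level and seasonals from the first season, trend from the first two.
    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [v - level for v in first]
    for t in range(season_len, len(series)):
        value = series[t]
        s = seasonal[t % season_len]
        prev_level = level
        # Standard additive update equations for level, trend, and seasonal terms.
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (value - level) + (1 - gamma) * s
    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % season_len] for h in range(horizon)]


# Toy traffic pattern repeating every 4 samples (e.g. a very coarse daily cycle).
pattern = [10, 20, 30, 20]
forecast = holt_winters_additive(pattern * 4, season_len=4, horizon=4)
```

On this perfectly periodic series the forecast reproduces the pattern. A baseline alert would then compare each incoming value against the forecast plus or minus some tolerance band, which is a separate design decision.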

NON GOAL
We should not aim to replace the current static baseline alerts.
We should not expect all seasonalities to be captured.
We should not expect a single model to fit all needs.
We should not aim for nice metrics for predefined datasets as this is NOT just for research purposes; the ultimate goal is to put it into production by SkyWalking users at an industrial scale (though this may not be achievable at the end of the OSPP project, I would aim for an MVP).

FP Suppression Mechanism
A mechanism is needed to suppress false positives from services that run in a persistently non-optimal state. For example, disk usage may be very high for an entity because few resources were allocated to an unimportant task; such alerts should be suppressed.
https://www.dynatrace.com/support/help/how-to-use-dynatrace/problem-detection-and-analysis/problem-detection/detection-of-frequent-issues
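One possible shape for such a mechanism, sketched under my own assumptions rather than taken from the Dynatrace page: track how often an entity has breached its threshold recently, and stop alerting once breaching has become its normal state. The class name, window size, and ratio below are hypothetical.

```python
from collections import deque


class FrequentIssueSuppressor:
    """Suppress alerts for an entity that has been in breach most of the time."""

    def __init__(self, window: int = 24, max_breach_ratio: float = 0.5):
        self.history = deque(maxlen=window)  # recent breach flags for one entity
        self.max_breach_ratio = max_breach_ratio

    def should_alert(self, breached: bool) -> bool:
        # Judge chronic state from the history *before* the current observation,
        # and only once a full window has been collected.
        chronic = (
            len(self.history) == self.history.maxlen
            and sum(self.history) / len(self.history) > self.max_breach_ratio
        )
        self.history.append(breached)
        return breached and not chronic


# An entity whose disk usage is always above the threshold.
noisy = FrequentIssueSuppressor(window=4, max_breach_ratio=0.5)
noisy_alerts = [noisy.should_alert(b) for b in [True] * 6]

# A normally healthy entity that suddenly breaches.
healthy = FrequentIssueSuppressor(window=4, max_breach_ratio=0.5)
healthy_alerts = [healthy.should_alert(b) for b in [False] * 4 + [True]]
```

The chronically high entity alerts while the window fills and is then muted, whereas the healthy entity still raises an alert on its first breach.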

Percentile not average? (Dynatrace discussions on data)
https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
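A small worked example of why this matters, using made-up latency samples and a nearest-rank percentile (one of several common percentile definitions; the helper name is hypothetical):

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 95 fast requests and 5 very slow ones (ms).
samples = [100] * 95 + [3000] * 5
avg = mean(samples)            # 245: matches no request that actually happened
p50 = percentile(samples, 50)  # 100: the typical request
p99 = percentile(samples, 99)  # 3000: the slow tail
```

The average of 245 ms describes neither the typical request nor the slow tail, while the 50th and 99th percentiles describe each directly.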

Constraints

1. We need to consider computational overhead, as training and prediction should never burden the other parts of the project; there will always be a tradeoff between performance and resource consumption.
1.1 Do not expect all end-users to have GPUs.
2. The models must not need end-user fine-tuning other than minor configurations.

Data
We will likely generate data with benchmark microservice projects like train-ticket (https://github.com/FudanSELab/train-ticket) with controlled anomaly injections.

We may find other open datasets on metrics [3]
https://netman.aiops.org/wp-content/uploads/2021/12/paper-ISSRE21-PUAD.pdf

https://github.com/AprilCal/TSAGen synthetic data generator (paper 2021)

http://yangyang.li/wp-content/uploads/2019/01/cmc19-Detection-YangyangLi.pdf

It would be better if we could collect some anonymized datasets from actual companies using SkyWalking (do not expect this).

MODELS [TBD]
It is essential to point out that no single model suits all; we should adopt ensemble methods and multiple models according to unique data characteristics as we start experiments. (Note, this may not be achievable in the OSPP project period)

Places to start -
Research papers and projects from academia:
https://www.connectedpapers.com/main/91c4a7a7d6cafddc7a9c8263af5ecdb8f42c9a1f/Adaptive-Anomaly-Detection-in-Performance-Metric-Streams/derivative

https://s.yimg.com/ge/labs/v2/uploads/kdd2015.pdf - A really good paper

https://github.com/yahoo/egads

https://github.com/yahoo/sherlock - time-series anomaly detection for Apache Druid

https://github.com/Stream-AD/MIDAS

On data sampling
http://yangyang.li/wp-content/uploads/2019/01/cmc19-Detection-YangyangLi.pdf

CNN + HMM
https://www.researchgate.net/publication/356679281_Time_Series_Anomaly_Detection_for_KPIs_Based_on_Correlation_Analysis_and_HMM

PU learning - this is new with code! (Semi-supervised)
https://ieeexplore.ieee.org/abstract/document/9700291
https://github.com/PUAD-code/PUAD

Industry blogs (Leading APM platforms)
https://assets.dynatrace.com/content/dam/en/wp/Anomaly-Detection-for-Monitoring-Ruxit.pdf

Integration
The resulting software artifacts could be pluggable into the SkyWalking OAP, and even into similar projects in other observability ecosystems. This is just an initial thought; detailed design documents will be discussed and proposed in the near future.

Data enrichment
Quote: "KPIs are monitored with certain interval, e.g., every minute. Occasionally, a monitoring system does not receive data, leading to missing values. We simply use linear interpolation to fill them based on their adjacent data points." [5]
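The interpolation described in the quote can be sketched as follows, assuming evenly spaced samples with missing points represented as None. The function name is hypothetical, and gaps at the very start or end of the series are left unfilled because they have only one neighbour.

```python
def fill_missing(points):
    """Linearly interpolate None gaps between the nearest known neighbours."""
    filled = list(points)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant increment per sample between two known values.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# One-minute KPI samples with three dropped points.
series = fill_missing([10, None, None, 16, None, 20])
```

Here the gaps become 12, 14, and 18, each lying on the straight line between its known neighbours.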

References
[1] https://www.ibm.com/downloads/cas/EPXPLXYY [The definitive guide to practical AIOps]
[2] http://bbs.aiops.cloudwise.com/d/620/2 [High-level guide on anomaly detection algorithms in AIOps]
[3] https://ieeexplore.ieee.org/abstract/document/8031053 [Adaptive Anomaly Detection in Performance Metric Streams]
[4] https://www.modb.pro/db/48563 [Data enrichment and pure statistical baseline derivation - not ML]
[5] https://netman.aiops.org/wp-content/uploads/2019/04/PID5748305.pdf

1 reply

shenxiangzhuang Oct 31, 2023

I have collected some papers/libraries related to time series anomaly detection from well-known industrial companies such as Amazon, IBM, Microsoft, Salesforce and so on. I thought this might serve as a reference.

The following are some of the more important studies:

2017-Twitter-ESD: twitter/AnomalyDetection
2017-IRISA-SPOT: asiffer/python3-libspot
2020-Alibaba-RobustX: LeeDoYup/RobustSTL
2020-Amazon-GluonTS: awslabs/gluonts
2020-Zillow-Luminaire: zillow/luminaire
2021-Salesforce-Merlion: salesforce/Merlion
2021-Linkedin-Silverkite: linkedin/greykite
2022-IBM-AnomalyKiTS: IBM/anomaly-detection-code-pattern

Superskyyy · 2022-04-28T01:49:14Z

Superskyyy
Apr 28, 2022
Collaborator Author

Update: A possible labelled dataset https://github.com/CloudWise-OpenSource/GAIA-DataSet
Other evaluation datasets (with labels) will be derived from monitoring benchmark systems through SkyWalking.

Three separate but related projects will begin in the coming months:

OSPP project for metrics anomaly detection.
GSOC project for log outlier detection.
MLOps flow orchestration and integration.

0 replies

Fengrui-Liu · 2022-04-29T07:17:15Z

Fengrui-Liu
Apr 29, 2022

Hello, I noticed this proposal on OSPP and am happy to see that the SkyWalking community is paying attention to AIOps. I would be glad to participate in this project.

I agree with your idea that "DNN is an option but not the optimal choice"; we can start from some unsupervised methods. Fortunately, I have tried several well-known algorithms and recorded them in my repository StreamAD.

In addition to the goals you set above, the efficiency and generalization of these anomaly detection methods are also important. Other tasks like change point detection can also be considered.

If possible, I hope that we can make great memories together this summer. 😊😊

1 reply

Superskyyy Apr 29, 2022
Collaborator Author

I visited your blog and project just now, fabulous! You would certainly make a great candidate. I will reach out to you privately for some application-related details. On the other hand, we should do our best to disclose key design decisions to the public community in alignment with the Apache Way.

Superskyyy · 2022-04-29T19:28:19Z

Superskyyy
Apr 29, 2022
Collaborator Author

I'm looking at using Apache DolphinScheduler to orchestrate our workflow: data pulling > preprocessing > train/retrain > deployment. As per the discussions here, there are some limitations because it is not specifically designed for ML for now, but it should be more robust than reinventing the wheel or using a simple crontab (additionally, it's a fellow Apache project), and it lets us focus more on our own functionality. Using a low-code service also makes the workflow transparent, so that users can implement customizations. My concern is: will it be too heavy for users?

Any feedback or better suggestions are welcome.

cc @liuhaoyang as you may be interested.

cc @kezhenxu94 as you are the mentor of our GSOC project and also committer at DolphinScheduler.

4 replies

wu-sheng Apr 30, 2022
Collaborator

How about trying Airflow first? DolphinScheduler is better for large-scale scheduling, which we don't need for now. Also, moving from Airflow to DolphinScheduler should be easy from our side, and we could support both later.

wu-sheng Apr 30, 2022
Collaborator

Airflow has better CLI and script/code interaction, which would make deployment easier, and a single node is enough for us.

Superskyyy Apr 30, 2022
Collaborator Author

Yes, sure! A single-node deployment will be sufficient. I was originally favouring DS because of its low-code feature versus Airflow's code-first approach, but I now think the tradeoff isn't worthwhile at the current stage.

Superskyyy Apr 30, 2022
Collaborator Author

I will start implementing a POC ML pipeline just to test out the overall flow before we get into the serious stuff.

Superskyyy · 2022-05-02T15:42:55Z

Superskyyy
May 2, 2022
Collaborator Author

I have created a repository for this, and I feel it's better to transfer it to the SkyAPM organization. Please help me with an approval, @wu-sheng: https://github.com/Superskyyy/sw-aiops-engine

It says I need permission to create a public repo in the organization. May I have temporary permission?

9 replies

wu-sheng May 2, 2022
Collaborator

Once you are ready, invite me to your repo with the admin role, and I can process the transfer directly.

Superskyyy May 3, 2022
Collaborator Author

Sent 🌱

wu-sheng May 3, 2022
Collaborator

I accepted, but it seems I am not an admin. I can't see the Settings tab.

Superskyyy May 3, 2022
Collaborator Author

I think there's no such thing as admin in a personal repository other than owner... Let me just transfer the repo to you first then? 😆

wu-sheng May 3, 2022
Collaborator

Done, new repo https://github.com/SkyAPM/aiops-engine-for-skywalking

Superskyyy · 2022-06-22T20:09:44Z

Superskyyy
Jun 22, 2022
Collaborator Author

Update: The project is under ongoing exploration with the help of OSPP and GSOC students.

Metrics anomaly detection - OSPP
Log clustering (log overview) + Log reduction (highlighting potentially anomalous logs) - GSOC

Extra help is welcome; ping me if you have experience in data science and would like to take over some tasks.

3 replies

wu-sheng Jun 22, 2022
Collaborator

I could help review your dataflow and architecture design, if you need.

Superskyyy Jun 23, 2022
Collaborator Author

Yes, this definitely would benefit from expert reviews on architecture decisions. Will have a draft done within a month. Currently, some algorithms are still in the experimental phase and we need to verify feasibility, especially for the log clustering technique.

Superskyyy Jun 23, 2022
Collaborator Author

I have a design doc and DFDs in the works.

wu-sheng · 2022-09-19T15:24:49Z

wu-sheng
Sep 19, 2022
Collaborator

I just unpinned this because it has been pinned for months and we have two other important posts.

0 replies
