[WIP] Enabling AI and Statistical Analysis Capabilities for Apache SkyWalking #8883
Replies: 7 comments 18 replies
-
Syncing content from Google Docs for those who may have trouble opening it.

I don't expect these AI functionalities, when used out of the box, to be widely adopted by all end-users in the near future, but I do aim to break new ground in the field by open-sourcing our implementations and providing guidance to growing companies and developers who want to try AIOps but don't know where to start. Practical algorithms are mostly not disclosed and are especially hard for practitioners to access.

WHY NOT JUST MANUAL ALERT RULES?

GOAL
The models should be able to start training as soon as the first hour's data arrives and periodically retrain as more data comes in. The seasonality learning part in Scenario 2 only comes in after sufficient data has been acquired (hours to weeks, depending on granularity). The following 4 scenarios cover potential sub-tasks and their individual goals.

Scenario 1: Dynamic Threshold Value for Stable Metrics
Scenario 2: Dynamic Baseline for Service Load and Traffic Metrics
Scenario 3: Outlier Detection (Bonus task)
Scenario 4: Proactive Detection (Bonus task)

There will always be more data types [2]; we need to evaluate them carefully and support them in multiple phases, but let's focus on the basics first.

Seasonality

NON GOAL

FP Suppression Mechanism

Percentile not average? (Dynatrace discussions on data)

Constraints

Data
We may find other open datasets on metrics [3]
- https://github.com/AprilCal/TSAGen - synthetic data generator (paper 2021)
- http://yangyang.li/wp-content/uploads/2019/01/cmc19-Detection-YangyangLi.pdf
It would be better if we could collect some anonymized datasets from actual companies using SkyWalking (do not expect this).

MODELS [TBD]
Places to start
- https://s.yimg.com/ge/labs/v2/uploads/kdd2015.pdf - A really good paper
- https://github.com/yahoo/egads
- https://github.com/yahoo/sherlock - time-series anomaly detection for Apache Druid
- https://github.com/Stream-AD/MIDAS

On data sampling
PU learning - this is new with code! (Semi-supervised)
Industry blogs (Leading APM platforms)

Integration

Data enrichment

References
-
Update: A possible labelled dataset: https://github.com/CloudWise-OpenSource/GAIA-DataSet

Three separate projects with a combined interest in this work will begin in the coming months.
-
Hello, I noticed this proposal on OSPP, and I am happy to see that the SkyWalking community is paying attention to AIOps. I would be glad to participate in this project. I agree with your idea that "DNN is an option but not the optimal choice"; we can start with some unsupervised methods. Fortunately, I have tried several well-known algorithms and recorded them in my repository StreamAD. In addition to the goals you set above, the efficiency and generalization of these anomaly detection methods are also important. Other tasks like change point detection can also be considered. If possible, I hope we can make some great memories together this summer. :blush::blush:
-
I'm looking at using Apache DolphinScheduler to orchestrate our data pulling > preprocessing > train/retrain > deployment workflow. As per the discussions here, there are some limitations because it is not specifically designed for ML for now, but it should be more robust than reinventing the wheel or using a simple crontab (additionally, it's a fellow Apache project), so we can focus more on our own functionality. Using a low-code service keeps the workflow transparent, so users can implement their own customizations. My concern is: will it be too heavy for users? Any feedback or better suggestions are welcome. cc @liuhaoyang as you may be interested. cc @kezhenxu94 as you are the mentor of our GSOC project and also a committer at DolphinScheduler.
-
I have created a repository for this, and I feel it's better to transfer it to the SkyAPM organization. Please help me with an approval @wu-sheng https://github.com/Superskyyy/sw-aiops-engine It says I need permission to create a public repo in the organization; may I have temporary permission?
-
Update: The project is under ongoing exploration with the help of OSPP and GSOC students.
Extra help is welcome; ping me if you have experience in data science and would like to take on some tasks.
-
I just unpinned this because it has been pinned for months, and we have two important posts.
-
Hello Community,
In the following abstract, I'd like to propose a project idea for the upcoming OSPP 2022 event. For readers who are not familiar with the event, OSPP (https://summer-ospp.ac.cn/) provides opportunities for full-time university students to participate in established open-source ecosystems. A dedicated mentor will guide the student through a predefined project within the ecosystem.
The Background section below gives an overview of the state of AI technology in the Observability (DevOps/DevSecOps) landscape (how commercial companies adopt AI).
Background
With drastic improvements in machine learning capabilities in recent years, practical AI solutions have been adopted at scale in production-ready scenarios. In our landscape, we see major commercial observability platforms (Dynatrace, AppDynamics, New Relic) offering AI-enabled functionalities and actively evolving them over the last few years.
The road to AI-assisted observability can, to my understanding, be divided into three primary phases. We must proceed carefully, step by step (otherwise, it will be a total disaster). The first phase involves reactive approaches: sending alerts to users after an anomaly occurs. The second phase involves proactive detection, meaning we actively monitor potential degradations and alert users before an anomaly arrives. The final phase would be automatic root cause analysis using advanced methods like fault-tree analysis, so that we can quickly (even automatically) carry out actions to recover from incidents. As an established open-source ecosystem, we should first focus on precise reactive anomaly detection to build the basis for later phases.
The aforementioned AI capabilities depend on two key areas of challenge, one of which we have already largely addressed. First, we need rich and precise data from multiple dimensions to enable accurate modelling of historical data, which means observing projects in production across more than one pillar of observability. To date, we have successfully rolled out support for metrics, traces, logs and events for all sorts of backends and even browsers. The remaining challenge in this area is automatic data cleansing and enrichment based on data understanding (traffic often does not arrive all day, every day).
The second key area of challenge is that we need continuously trained models to fit past data and predict the future. Production-friendly machine learning often faces more constraints than lab environments: no GPUs, low tolerance for false alarms, and high throughput. Last but not least, the models should fit well without complex fine-tuning by end-users. Given these requirements, deep neural networks, in my opinion, are not the optimal choice for an out-of-box solution. We can still provide DNN solutions as an alternative for users who can find a GPU server to deploy them on, and such a solution could be a significant research outcome if it requires minimal tuning and is relatively fast, that is, if it offers low training time and acceptable throughput while outperforming traditional ML baselines.
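To make the "continuously trained, no GPU" constraint concrete, here is a minimal sketch of an incrementally updated statistic using Welford's online algorithm. It is only an illustration under my own assumptions, not the project's chosen model: the names `RunningStats` and `zscore` are hypothetical, and a real detector would need more than a mean and a standard deviation.

```python
import math


class RunningStats:
    """Online mean/variance (Welford's algorithm).

    Hypothetical helper for illustration: O(1) memory, one cheap update
    per data point, so it can keep "retraining" as data streams in.
    """

    def __init__(self) -> None:
        self.n = 0        # number of points seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; 0.0 until at least two points are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def zscore(stats: RunningStats, x: float) -> float:
    """How many standard deviations x lies from the running mean."""
    return 0.0 if stats.std == 0 else (x - stats.mean) / stats.std


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Each update touches only three numbers, which is the kind of cost profile the constraints above ask for; the trade-off is that a single global mean ignores seasonality entirely.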
Project Overview
We aim to derive dynamic baselines and thresholds for vital operational metrics aggregated by SkyWalking; the detection coverage should match the current static alert rule capabilities. We detect anomalies in the metrics by comparing the incoming data stream with the trained baseline/thresholds and send out alerts to end-users (after further false-positive suppression mechanisms).
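As a rough sketch of what "compare the incoming stream with a trained threshold, then suppress false positives" could look like, consider the following. Everything here is an assumption for illustration: the class name `DynamicThreshold`, the nearest-rank percentile, the `margin` multiplier and the "N consecutive breaches" suppression rule are not taken from the proposal or from SkyWalking.

```python
import math
from collections import deque


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


class DynamicThreshold:
    """Hypothetical sliding-window threshold with simple alert suppression."""

    def __init__(self, window=60, q=99, margin=1.2, min_breaches=3):
        self.history = deque(maxlen=window)  # recent "normal" points
        self.q = q                           # percentile used as the baseline
        self.margin = margin                 # headroom above the percentile
        self.min_breaches = min_breaches     # consecutive breaches before alerting
        self.breaches = 0

    def observe(self, value):
        """Feed one data point; return True when an alert should fire."""
        if len(self.history) < self.history.maxlen:
            # Warm-up: not enough data to derive a threshold yet.
            self.history.append(value)
            return False
        threshold = percentile(self.history, self.q) * self.margin
        if value > threshold:
            # Assumption: breaching points are not learned from, so a
            # spike does not inflate the threshold it just crossed.
            self.breaches += 1
        else:
            self.breaches = 0
            self.history.append(value)
        return self.breaches >= self.min_breaches


detector = DynamicThreshold(window=5, q=90, margin=1.5, min_breaches=2)
stream = [100, 102, 98, 101, 99, 100, 300, 310, 100]
alerts = [detector.observe(v) for v in stream]
```

In this toy stream the first spike (300) is swallowed by the suppression rule and only the second consecutive breach (310) raises an alert. One known weakness of skipping breaching points is that a genuine, permanent level shift would keep alerting until the window is reset, so a real implementation would need a policy for that.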
Because the project idea is ~~under preliminary research~~, the remaining sections of this proposal can now be accessed from this Google Doc; please feel free to place comments and give suggestions.