Docs: Anomaly Detection Updation #18484

Merged
merged 6 commits into from
Nov 12, 2024
6 changes: 6 additions & 0 deletions openmetadata-docs/content/v1.5.x/collate-menu.md
@@ -657,6 +657,12 @@ site_menu:
  - category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis
    url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis
    isCollateOnly: true
  - category: How-to Guides / Data Quality and Observability / Anomaly Detection
    url: /how-to-guides/data-quality-observability/anomaly-detection
    isCollateOnly: true
  - category: How-to Guides / Data Quality and Observability / Anomaly Detection / Steps to Set Up Anomaly Detection
    url: /how-to-guides/data-quality-observability/anomaly-detection/setting-up
    isCollateOnly: true

  - category: How-to Guides / Data Lineage
    url: /how-to-guides/data-lineage
@@ -0,0 +1,72 @@
---
title: Anomaly Detection in Collate | Automated Data Quality Alerts
slug: /how-to-guides/data-quality-observability/anomaly-detection
---

# Overview

The **Anomaly Detection** feature in Collate helps ensure data quality by automatically detecting unexpected changes, such as spikes or drops in data trends. Instead of requiring users to manually define rigid boundaries for data validation, Collate dynamically learns from your data patterns through regular profiling. This allows for more accurate and flexible anomaly detection, alerting you only when there are significant deviations that might indicate underlying issues.

## Key Benefits of Anomaly Detection

- **Automated Detection of Unexpected Data Changes**: Collate can detect unexpected data behaviors, such as spikes or drops, that deviate from normal trends. This is crucial for identifying potential issues with data pipelines, backend systems, or infrastructure.
- **Dynamic Learning**: The system continuously profiles your data over time, learning its natural variations, including seasonal fluctuations. For example, if sales data varies throughout the year due to holidays, Collate’s dynamic assertions can detect this seasonality and prevent unnecessary error alerts. This allows the system to automatically adjust to your data’s evolving patterns without requiring manual configuration.
- **Flexible Configuration**: For more controlled scenarios, users can still manually define specific boundaries or thresholds to monitor data, such as ensuring values stay within a certain range. This offers both manual and automatic methods for managing data quality.

## Use Cases

### 1. Static Assertions for Simple Tests

- **Problem**: In many cases, users want to perform straightforward data tests, such as ensuring that values are not null or that there are no repeated values.
- **Solution**: Collate enables users to configure simple assertions directly from the UI. For example, users can create tests to ensure:
  - Data should not be null.
  - There should be no duplicate values.
  - Data should not be older than a specific time frame (e.g., one day).
  - Values should be greater than zero.
- **Example**: If you want to ensure that your sales data contains no null values or duplicates, you can easily configure these assertions via the UI.
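
The four assertions listed above can be illustrated outside the UI with a minimal Python sketch. This is not Collate's implementation or API: the function `run_static_checks`, the column names, and the sample rows are all hypothetical, and the one-day freshness window simply mirrors the example in the list.

```python
from datetime import datetime, timedelta, timezone


def run_static_checks(rows, column, ts_column, now, max_age=timedelta(days=1)):
    """Return a dict mapping each check name to True (passed) or False (failed).

    Hypothetical helper for illustration only; not part of Collate.
    """
    if not rows:
        raise ValueError("at least one row is required")
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    newest = max(row[ts_column] for row in rows)
    return {
        # Data should not be null.
        "not_null": len(non_null) == len(values),
        # There should be no duplicate values (nulls are ignored here).
        "unique": len(set(non_null)) == len(non_null),
        # Data should not be older than the allowed time frame.
        "fresh": now - newest <= max_age,
        # Values should be greater than zero.
        "positive": all(v > 0 for v in non_null),
    }


now = datetime(2024, 11, 12, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 120.0, "updated_at": now - timedelta(hours=3)},
    {"order_id": 2, "amount": None, "updated_at": now - timedelta(hours=2)},
    {"order_id": 3, "amount": 80.0, "updated_at": now - timedelta(hours=1)},
]
results = run_static_checks(rows, "amount", "updated_at", now)
print(results)
```

In this sample only the `not_null` check fails, because one `amount` is missing.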

### 2. Dynamic Assertions for Evolving Data

- **Problem**: Some data, such as sales figures, naturally evolves over time. For example, sales data might fluctuate daily or weekly, and manual bounds may not accurately capture these variations.
- **Solution**: Collate uses **dynamic assertions**, which automatically learn from the data by profiling it regularly. Over time, the system establishes a pattern for how the data behaves, allowing it to detect when values significantly deviate from this expected behavior.
- **Example**: If sales suddenly spike or drop beyond what is typical for your historical data, Collate will alert you to this anomaly.
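
The idea of flagging values that deviate significantly from historical behavior can be sketched with a simple z-score test. The document does not describe the model Collate actually uses, so the statistic, the function name `is_anomalous`, and the threshold of three standard deviations are assumptions chosen purely for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history`. Illustrative stand-in, not Collate's model."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


daily_sales = [100, 104, 98, 101, 97, 103, 99, 102]
print(is_anomalous(daily_sales, 105))  # small movement, within normal variation
print(is_anomalous(daily_sales, 160))  # sudden spike
print(is_anomalous(daily_sales, 40))   # sudden drop
```

A fixed bound would need to be retuned whenever the sales level shifts; a test of this kind moves with the history it is given.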

## How Anomaly Detection Works

### 1. Manual Configuration of Tests

Users can manually configure tests for specific data points if they want to maintain tight control over their data quality checks. For instance, you can specify that a value must stay between 10 and 100. This method is useful for data that has well-understood constraints or when precise validation rules are required.

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png"
alt="Manual Configuration of Tests"
caption="Manual Configuration of Tests"
/%}

### 2. Dynamic Assertions

For more complex or evolving datasets, Collate offers **dynamic assertions**. These assertions automatically adapt to your data by learning its natural patterns over time. This learning phase typically takes around five weeks of regular profiling, during which the system builds an understanding of normal data fluctuations.

- **Data Profiling**: Collate continuously scans the data and trains its models based on the profiled data. Once this learning phase is complete, the system can detect significant deviations from expected patterns, alerting users to anomalies.

- **Advantages of Dynamic Assertions**:
  - **Adaptability**: No need to set manual thresholds for evolving datasets.
  - **Efficiency**: Focus on genuine anomalies instead of managing static tests that may quickly become outdated as data evolves.
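
To make the learning phase more concrete, the sketch below learns a separate expected range for each day of the week from five weeks of daily values, so that a regular weekend peak is not reported as an anomaly. It is a simplified stand-in under stated assumptions: the per-weekday grouping, the band of three standard deviations, and the names `learn_weekday_bands` and `check` are hypothetical and do not describe Collate's internal models.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev


def learn_weekday_bands(observations, k=3.0):
    """Learn an expected (low, high) range per weekday.

    observations: iterable of (date, value) pairs from past profiling runs.
    """
    by_weekday = defaultdict(list)
    for day, value in observations:
        by_weekday[day.weekday()].append(value)
    bands = {}
    for weekday, values in by_weekday.items():
        if len(values) < 2:
            continue  # not enough history for this weekday yet
        mu, sigma = mean(values), stdev(values)
        bands[weekday] = (mu - k * sigma, mu + k * sigma)
    return bands


def check(bands, day, value):
    """Return "learning", "ok", or "anomaly" for a new observation."""
    band = bands.get(day.weekday())
    if band is None:
        return "learning"
    low, high = band
    return "ok" if low <= value <= high else "anomaly"


# Five weeks of synthetic history: about 100 on weekdays, about 300 on weekends.
start = date(2024, 10, 7)  # a Monday
history = []
for offset in range(35):
    day = start + timedelta(days=offset)
    base = 300 if day.weekday() >= 5 else 100
    history.append((day, base + (offset // 7 - 2) * 2))

bands = learn_weekday_bands(history)
print(check(bands, date(2024, 11, 16), 300))  # Saturday, usual weekend level
print(check(bands, date(2024, 11, 11), 300))  # Monday, weekend-sized value
print(check({}, date(2024, 11, 11), 100))     # no history yet
```

The same value of 300 is normal on a Saturday and anomalous on a Monday, which is the kind of distinction a single static threshold cannot express.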

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png"
alt="Dynamic Assertions"
caption="Dynamic Assertions"
/%}

### 3. Incidents and Notifications

When a test fails, Collate automatically generates an incident. This applies to anomalies caught by dynamic assertions as well as to failures of rule-based test cases. The resulting notifications help users quickly understand when and where their data may be behaving unexpectedly.

- **Example**: If sales data suddenly shows an abnormal spike or drop, Collate will notify you, allowing you to investigate potential causes such as system malfunctions or external influences.
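
The relationship between test results and incidents can be pictured with a small in-memory model. Everything here is hypothetical: the `Incident` and `IncidentLog` classes, their fields, and the rule of keeping one open incident per test case are assumptions made for the sketch and are not taken from Collate's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    test_name: str
    opened_at: datetime
    failures: list = field(default_factory=list)


class IncidentLog:
    """Keeps at most one open incident per test case (hypothetical model)."""

    def __init__(self):
        self._open = {}

    def record(self, test_name, passed, detail, at):
        """Register one test result; return the affected incident, or None if it passed."""
        if passed:
            return None
        incident = self._open.get(test_name)
        if incident is None:
            # First failure of this test: open a new incident.
            incident = Incident(test_name, opened_at=at)
            self._open[test_name] = incident
        # Repeated failures are attached to the incident that is already open.
        incident.failures.append((at, detail))
        return incident

    def open_incidents(self):
        return list(self._open.values())


log = IncidentLog()
t1 = datetime(2024, 11, 11, 6, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 11, 12, 6, 0, tzinfo=timezone.utc)
first = log.record("sales_dynamic_assertion", False, "value far above the learned range", t1)
second = log.record("sales_dynamic_assertion", False, "value still above the learned range", t2)
log.record("sales_not_null", True, "", t2)
print(len(log.open_incidents()), len(first.failures))
```

Grouping repeated failures under one incident keeps a persistent problem from producing a new notification on every run.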

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png"
alt="Incidents and Notifications"
caption="Incidents and Notifications"
/%}
@@ -0,0 +1,52 @@
---
title: Set Up Anomaly Detection in Collate for Data Quality
slug: /how-to-guides/data-quality-observability/anomaly-detection/setting-up
---

# Steps to Set Up Anomaly Detection

### 1. Create a Test from the UI
- First, select the dataset and navigate to the **Tests** section in the Collate UI.
- Define your test parameters. You can either create a **static test** (e.g., "no null values" or "data should not exceed a certain range") or configure **dynamic assertions** to let the system learn from the data.

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-1.png"
alt="Manual Configuration of Tests"
caption="Manual Configuration of Tests"
/%}

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png"
alt="Manual Configuration of Tests"
caption="Manual Configuration of Tests"
/%}

### 2. Configure Manual Tests
- For more controlled monitoring, set up **manual thresholds** (e.g., sales should not exceed a maximum value of 100). This provides specific control over data validation criteria.

### 3. Enable Dynamic Assertions
- For data that naturally fluctuates or evolves, enable **dynamic assertions**. Collate will start profiling your data regularly to learn its normal behavior.
- Over time (e.g., five weeks), the system will establish expected value ranges and detect any deviations from these patterns.

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png"
alt="Dynamic Assertions"
caption="Dynamic Assertions"
/%}

### 4. Monitor Incidents
- After configuring tests, monitor for any **incidents** triggered by anomalies detected in the system.
- Investigate significant spikes, drops, or unusual behaviors in the data, which may indicate system errors, backend failures, or unexpected external factors.

{% image
src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png"
alt="Incidents and Notifications"
caption="Incidents and Notifications"
/%}

## Best Practices

- **Use Static Assertions for Simple Rules**: For basic data validation, such as preventing null values or enforcing a minimum threshold, static assertions are effective and straightforward to configure.
- **Leverage Dynamic Assertions for Evolving Data**: When dealing with datasets that naturally fluctuate (e.g., sales or user activity), dynamic assertions can save time and ensure incidents are only triggered when significant anomalies occur.
- **Regularly Review Incidents**: Stay on top of incidents generated by anomaly detection to promptly identify and address data quality issues.
- **Combine Manual and Dynamic Methods**: For datasets that have both well-defined boundaries and evolving characteristics, combining manual thresholds with dynamic assertions provides more comprehensive anomaly detection coverage.
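
The last practice, combining manual thresholds with dynamic assertions, can be sketched as a two-stage check: a hard bound that always applies, followed by a learned range that catches values which are allowed but unusual. The function `evaluate`, its return labels, and the three-standard-deviation band are assumptions for illustration, not Collate behavior.

```python
from statistics import mean, stdev


def evaluate(value, history, hard_min, hard_max, k=3.0):
    """Apply a manual bound first, then a learned band (illustrative only)."""
    # Stage 1: manual threshold, independent of any history.
    if not hard_min <= value <= hard_max:
        return "out_of_bounds"
    # Stage 2: dynamic check, only once some history is available.
    if len(history) >= 2:
        mu, sigma = mean(history), stdev(history)
        if abs(value - mu) > k * sigma:
            return "anomaly"
    return "ok"


history = [50, 52, 48, 51, 49]
print(evaluate(51, history, hard_min=10, hard_max=100))   # typical value
print(evaluate(90, history, hard_min=10, hard_max=100))   # allowed, but unusual
print(evaluate(120, history, hard_min=10, hard_max=100))  # violates the manual bound
```

The manual bound protects against values that are never acceptable, while the learned band covers the much narrower range the data normally occupies.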
6 changes: 6 additions & 0 deletions openmetadata-docs/content/v1.6.x-SNAPSHOT/collate-menu.md
@@ -675,6 +675,12 @@ site_menu:
  - category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis
    url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis
    isCollateOnly: true
  - category: How-to Guides / Data Quality and Observability / Anomaly Detection
    url: /how-to-guides/data-quality-observability/anomaly-detection
    isCollateOnly: true
  - category: How-to Guides / Data Quality and Observability / Anomaly Detection / Steps to Set Up Anomaly Detection
    url: /how-to-guides/data-quality-observability/anomaly-detection/setting-up
    isCollateOnly: true

  - category: How-to Guides / Data Lineage
    url: /how-to-guides/data-lineage
@@ -0,0 +1,72 @@
---
title: Anomaly Detection in Collate | Automated Data Quality Alerts
slug: /how-to-guides/data-quality-observability/anomaly-detection
---

# Overview

The **Anomaly Detection** feature in Collate helps ensure data quality by automatically detecting unexpected changes, such as spikes or drops in data trends. Instead of requiring users to manually define rigid boundaries for data validation, Collate dynamically learns from your data patterns through regular profiling. This allows for more accurate and flexible anomaly detection, alerting you only when there are significant deviations that might indicate underlying issues.

## Key Benefits of Anomaly Detection

- **Automated Detection of Unexpected Data Changes**: Collate can detect unexpected data behaviors, such as spikes or drops, that deviate from normal trends. This is crucial for identifying potential issues with data pipelines, backend systems, or infrastructure.
- **Dynamic Learning**: The system continuously profiles your data over time, learning its natural variations, including seasonal fluctuations. For example, if sales data varies throughout the year due to holidays, Collate’s dynamic assertions can detect this seasonality and prevent unnecessary error alerts. This allows the system to automatically adjust to your data’s evolving patterns without requiring manual configuration.
- **Flexible Configuration**: For more controlled scenarios, users can still manually define specific boundaries or thresholds to monitor data, such as ensuring values stay within a certain range. This offers both manual and automatic methods for managing data quality.

## Use Cases

### 1. Static Assertions for Simple Tests

- **Problem**: In many cases, users want to perform straightforward data tests, such as ensuring that values are not null or that there are no repeated values.
- **Solution**: Collate enables users to configure simple assertions directly from the UI. For example, users can create tests to ensure:
  - Data should not be null.
  - There should be no duplicate values.
  - Data should not be older than a specific time frame (e.g., one day).
  - Values should be greater than zero.
- **Example**: If you want to ensure that your sales data contains no null values or duplicates, you can easily configure these assertions via the UI.

### 2. Dynamic Assertions for Evolving Data

- **Problem**: Some data, such as sales figures, naturally evolves over time. For example, sales data might fluctuate daily or weekly, and manual bounds may not accurately capture these variations.
- **Solution**: Collate uses **dynamic assertions**, which automatically learn from the data by profiling it regularly. Over time, the system establishes a pattern for how the data behaves, allowing it to detect when values significantly deviate from this expected behavior.
- **Example**: If sales suddenly spike or drop beyond what is typical for your historical data, Collate will alert you to this anomaly.

## How Anomaly Detection Works

### 1. Manual Configuration of Tests

Users can manually configure tests for specific data points if they want to maintain tight control over their data quality checks. For instance, you can specify that a value must stay between 10 and 100. This method is useful for data that has well-understood constraints or when precise validation rules are required.

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png"
alt="Manual Configuration of Tests"
caption="Manual Configuration of Tests"
/%}

### 2. Dynamic Assertions

For more complex or evolving datasets, Collate offers **dynamic assertions**. These assertions automatically adapt to your data by learning its natural patterns over time. This learning phase typically takes around five weeks of regular profiling, during which the system builds an understanding of normal data fluctuations.

- **Data Profiling**: Collate continuously scans the data and trains its models based on the profiled data. Once this learning phase is complete, the system can detect significant deviations from expected patterns, alerting users to anomalies.

- **Advantages of Dynamic Assertions**:
  - **Adaptability**: No need to set manual thresholds for evolving datasets.
  - **Efficiency**: Focus on genuine anomalies instead of managing static tests that may quickly become outdated as data evolves.

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png"
alt="Dynamic Assertions"
caption="Dynamic Assertions"
/%}

### 3. Incidents and Notifications

When a test fails, Collate automatically generates an incident. This applies to anomalies caught by dynamic assertions as well as to failures of rule-based test cases. The resulting notifications help users quickly understand when and where their data may be behaving unexpectedly.

- **Example**: If sales data suddenly shows an abnormal spike or drop, Collate will notify you, allowing you to investigate potential causes such as system malfunctions or external influences.

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png"
alt="Incidents and Notifications"
caption="Incidents and Notifications"
/%}
@@ -0,0 +1,52 @@
---
title: Set Up Anomaly Detection in Collate for Data Quality
slug: /how-to-guides/data-quality-observability/anomaly-detection/setting-up
---

# Steps to Set Up Anomaly Detection

### 1. Create a Test from the UI
- First, select the dataset and navigate to the **Tests** section in the Collate UI.
- Define your test parameters. You can either create a **static test** (e.g., "no null values" or "data should not exceed a certain range") or configure **dynamic assertions** to let the system learn from the data.

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-1.png"
alt="Manual Configuration of Tests"
caption="Manual Configuration of Tests"
/%}

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png"
alt="Manual Configuration of Tests"
caption="Manual Configuration of Tests"
/%}

### 2. Configure Manual Tests
- For more controlled monitoring, set up **manual thresholds** (e.g., sales should not exceed a maximum value of 100). This provides specific control over data validation criteria.

### 3. Enable Dynamic Assertions
- For data that naturally fluctuates or evolves, enable **dynamic assertions**. Collate will start profiling your data regularly to learn its normal behavior.
- Over time (e.g., five weeks), the system will establish expected value ranges and detect any deviations from these patterns.

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png"
alt="Dynamic Assertions"
caption="Dynamic Assertions"
/%}

### 4. Monitor Incidents
- After configuring tests, monitor for any **incidents** triggered by anomalies detected in the system.
- Investigate significant spikes, drops, or unusual behaviors in the data, which may indicate system errors, backend failures, or unexpected external factors.

{% image
src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png"
alt="Incidents and Notifications"
caption="Incidents and Notifications"
/%}

## Best Practices

- **Use Static Assertions for Simple Rules**: For basic data validation, such as preventing null values or enforcing a minimum threshold, static assertions are effective and straightforward to configure.
- **Leverage Dynamic Assertions for Evolving Data**: When dealing with datasets that naturally fluctuate (e.g., sales or user activity), dynamic assertions can save time and ensure incidents are only triggered when significant anomalies occur.
- **Regularly Review Incidents**: Stay on top of incidents generated by anomaly detection to promptly identify and address data quality issues.
- **Combine Manual and Dynamic Methods**: For datasets that have both well-defined boundaries and evolving characteristics, combining manual thresholds with dynamic assertions provides more comprehensive anomaly detection coverage.