From d5937273ab804493cdf594bcbf7f1bf9f599cb5e Mon Sep 17 00:00:00 2001 From: open-metadata Date: Tue, 12 Nov 2024 12:18:52 +0000 Subject: [PATCH] See https://github.com/open-metadata/OpenMetadata/commit/75ccb1adc3b98be09d05f321ebb363babeb7f36b from refs/heads/main --- content/v1.5.x/collate-menu.md | 6 ++ .../anomaly-detection/index.md | 72 +++++++++++++++++++ .../anomaly-detection/setting-up.md | 52 ++++++++++++++ content/v1.6.x-SNAPSHOT/collate-menu.md | 6 ++ .../anomaly-detection/index.md | 72 +++++++++++++++++++ .../anomaly-detection/setting-up.md | 52 ++++++++++++++ 6 files changed, 260 insertions(+) create mode 100644 content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/index.md create mode 100644 content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md create mode 100644 content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/index.md create mode 100644 content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md diff --git a/content/v1.5.x/collate-menu.md b/content/v1.5.x/collate-menu.md index 05c14df0..18ea4dab 100644 --- a/content/v1.5.x/collate-menu.md +++ b/content/v1.5.x/collate-menu.md @@ -671,6 +671,12 @@ site_menu: - category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis isCollateOnly: true + - category: How-to Guides / Data Quality and Observability / Anomaly Detection + url: /how-to-guides/data-quality-observability/anomaly-detection + isCollateOnly: true + - category: How-to Guides / Data Quality and Observability / Anomaly Detection / Steps to Set Up Anomaly Detection + url: /how-to-guides/data-quality-observability/anomaly-detection/setting-up + isCollateOnly: true - category: How-to Guides / Data Lineage url: /how-to-guides/data-lineage diff --git 
a/content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/index.md b/content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/index.md new file mode 100644 index 00000000..52d705b0 --- /dev/null +++ b/content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/index.md @@ -0,0 +1,72 @@ +--- +title: Anomaly Detection in Collate | Automated Data Quality Alerts +slug: /how-to-guides/data-quality-observability/anomaly-detection +--- + +# Overview + +The **Anomaly Detection** feature in Collate helps ensure data quality by automatically detecting unexpected changes, such as spikes or drops in data trends. Instead of requiring users to manually define rigid boundaries for data validation, Collate dynamically learns from your data patterns through regular profiling. This allows for more accurate and flexible anomaly detection, alerting you only when there are significant deviations that might indicate underlying issues. + +## Key Benefits of Anomaly Detection + +- **Automated Detection of Unexpected Data Changes**: Collate can detect unexpected data behaviors, such as spikes or drops, that deviate from normal trends. This is crucial for identifying potential issues with data pipelines, backend systems, or infrastructure. +- **Dynamic Learning**: The system continuously profiles your data over time, learning its natural variations, including seasonal fluctuations. For example, if sales data varies throughout the year due to holidays, Collate’s dynamic assertions can detect this seasonality and prevent unnecessary error alerts. This allows the system to automatically adjust to your data’s evolving patterns without requiring manual configuration. +- **Flexible Configuration**: For more controlled scenarios, users can still manually define specific boundaries or thresholds to monitor data, such as ensuring values stay within a certain range. This offers both manual and automatic methods for managing data quality. 
+ +## Use Cases + +### 1. Static Assertions for Simple Tests + +- **Problem**: In many cases, users want to perform straightforward data tests, such as ensuring that values are not null or that there are no repeated values. +- **Solution**: Collate enables users to configure simple assertions directly from the UI. For example, users can create tests to ensure: + - Data should not be null. + - There should be no duplicate values. + - Data should not be older than a specific time frame (e.g., one day). + - Values should be greater than zero. +- **Example**: If you want to ensure that your sales data contains no null values or duplicates, you can easily configure these assertions via the UI. + +### 2. Dynamic Assertions for Evolving Data + +- **Problem**: Some data, such as sales figures, naturally evolves over time. For example, sales data might fluctuate daily or weekly, and manual bounds may not accurately capture these variations. +- **Solution**: Collate uses **dynamic assertions**, which automatically learn from the data by profiling it regularly. Over time, the system establishes a pattern for how the data behaves, allowing it to detect when values significantly deviate from this expected behavior. +- **Example**: If sales suddenly spike or drop beyond what is typical for your historical data, Collate will alert you to this anomaly. + +## How Anomaly Detection Works + +### 1. Manual Configuration of Tests + +Users can manually configure tests for specific data points if they want to maintain tight control over their data quality checks. For instance, you can specify that a value must stay between 10 and 100. This method is useful for data that has well-understood constraints or when precise validation rules are required. + +{% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png" + alt="Manual Configuration of Tests" + caption="Manual Configuration of Tests" + /%} + +### 2. 
Dynamic Assertions + +For more complex or evolving datasets, Collate offers **dynamic assertions**. These assertions automatically adapt to your data by learning its natural patterns over time. The profiling process typically takes around five weeks, during which the system builds an understanding of normal data fluctuations. + +- **Data Profiling**: Collate continuously scans the data and trains its models based on the profiled data. Once this learning phase is complete, the system can detect significant deviations from expected patterns, alerting users to anomalies. + +- **Advantages of Dynamic Assertions**: + - **Adaptability**: No need to set manual thresholds for evolving datasets. + - **Efficiency**: Focus on genuine anomalies instead of managing static tests that may quickly become outdated as data evolves. + +{% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png" + alt="Dynamic Assertions" + caption="Dynamic Assertions" + /%} + +### 3. Incidents and Notifications + +When an anomaly is detected, Collate automatically generates incidents, including for rule-based test cases. These notifications help users quickly understand when and where their data may be behaving unexpectedly. + +- **Example**: If sales data suddenly shows an abnormal spike or drop, Collate will notify you, allowing you to investigate potential causes such as system malfunctions or external influences. 
+ +{% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png" + alt="Incidents and Notifications" + caption="Incidents and Notifications" + /%} diff --git a/content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md b/content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md new file mode 100644 index 00000000..63cbca18 --- /dev/null +++ b/content/v1.5.x/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md @@ -0,0 +1,52 @@ +--- +title: Set Up Anomaly Detection in Collate for Data Quality +slug: /how-to-guides/data-quality-observability/anomaly-detection/setting-up +--- + +# Steps to Set Up Anomaly Detection + +### 1. Create a Test from the UI +- First, select the dataset and navigate to the **Tests** section in the Collate UI. +- Define your test parameters. You can either create a **static test** (e.g., "no null values" or "data should not exceed a certain range") or configure **dynamic assertions** to let the system learn from the data. + +{% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-1.png" + alt="Manual Configuration of Tests" + caption="Manual Configuration of Tests" + /%} + + {% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png" + alt="Manual Configuration of Tests" + caption="Manual Configuration of Tests" + /%} + +### 2. Configure Manual Tests +- For more controlled monitoring, set up **manual thresholds** (e.g., sales should not exceed a maximum value of 100). This provides specific control over data validation criteria. + +### 3. Enable Dynamic Assertions +- For data that naturally fluctuates or evolves, enable **dynamic assertions**. Collate will start profiling your data regularly to learn its normal behavior. +- Over time (e.g., five weeks), the system will establish expected value ranges and detect any deviations from these patterns. 
+ +{% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png" + alt="Dynamic Assertions" + caption="Dynamic Assertions" + /%} + +### 4. Monitor Incidents +- After configuring tests, monitor for any **incidents** triggered by anomalies detected in the system. +- Investigate significant spikes, drops, or unusual behaviors in the data, which may indicate system errors, backend failures, or unexpected external factors. + +{% image + src="/images/v1.5/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png" + alt="Incidents and Notifications" + caption="Incidents and Notifications" + /%} + +## Best Practices + +- **Use Static Assertions for Simple Rules**: For basic data validation, such as preventing null values or enforcing a minimum threshold, static assertions are effective and straightforward to configure. +- **Leverage Dynamic Assertions for Evolving Data**: When dealing with datasets that naturally fluctuate (e.g., sales or user activity), dynamic assertions can save time and ensure incidents are only triggered when significant anomalies occur. +- **Regularly Review Incidents**: Stay on top of incidents generated by anomaly detection to promptly identify and address data quality issues. +- **Combine Manual and Dynamic Methods**: For datasets with well-defined boundaries and evolving characteristics, combining manual thresholds and dynamic assertions provides comprehensive anomaly detection coverage. 
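Reviewer sketch (not part of the patch): the static assertions listed in the added pages (no nulls, no duplicates, data no older than one day, values greater than zero) can be illustrated in plain Python. This is only a minimal sketch of the idea. The function name `check_static_assertions`, its parameters, and the check labels are hypothetical and do not correspond to any Collate API; in Collate these tests are configured from the UI.

```python
from datetime import datetime, timedelta


def check_static_assertions(values, last_updated, now, max_age=timedelta(days=1)):
    """Return the names of the static checks that fail (hypothetical helper)."""
    failures = []

    # Data should not be null.
    if any(v is None for v in values):
        failures.append("not_null")

    # There should be no duplicate values (nulls are ignored here).
    present = [v for v in values if v is not None]
    if len(present) != len(set(present)):
        failures.append("unique")

    # Data should not be older than a specific time frame (one day by default).
    if now - last_updated > max_age:
        failures.append("freshness")

    # Values should be greater than zero.
    if any(v <= 0 for v in present):
        failures.append("greater_than_zero")

    return failures
```

Each rule is fixed up front and never adapts to the data, which is what distinguishes a static assertion from the dynamic ones described in the same pages.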
diff --git a/content/v1.6.x-SNAPSHOT/collate-menu.md b/content/v1.6.x-SNAPSHOT/collate-menu.md index d77b6127..7f5b65b9 100644 --- a/content/v1.6.x-SNAPSHOT/collate-menu.md +++ b/content/v1.6.x-SNAPSHOT/collate-menu.md @@ -698,6 +698,12 @@ site_menu: - category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis isCollateOnly: true + - category: How-to Guides / Data Quality and Observability / Anomaly Detection + url: /how-to-guides/data-quality-observability/anomaly-detection + isCollateOnly: true + - category: How-to Guides / Data Quality and Observability / Anomaly Detection / Steps to Set Up Anomaly Detection + url: /how-to-guides/data-quality-observability/anomaly-detection/setting-up + isCollateOnly: true - category: How-to Guides / Data Lineage url: /how-to-guides/data-lineage diff --git a/content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/index.md b/content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/index.md new file mode 100644 index 00000000..dacd30d9 --- /dev/null +++ b/content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/index.md @@ -0,0 +1,72 @@ +--- +title: Anomaly Detection in Collate | Automated Data Quality Alerts +slug: /how-to-guides/data-quality-observability/anomaly-detection +--- + +# Overview + +The **Anomaly Detection** feature in Collate helps ensure data quality by automatically detecting unexpected changes, such as spikes or drops in data trends. Instead of requiring users to manually define rigid boundaries for data validation, Collate dynamically learns from your data patterns through regular profiling. This allows for more accurate and flexible anomaly detection, alerting you only when there are significant deviations that might indicate underlying issues. 
+ +## Key Benefits of Anomaly Detection + +- **Automated Detection of Unexpected Data Changes**: Collate can detect unexpected data behaviors, such as spikes or drops, that deviate from normal trends. This is crucial for identifying potential issues with data pipelines, backend systems, or infrastructure. +- **Dynamic Learning**: The system continuously profiles your data over time, learning its natural variations, including seasonal fluctuations. For example, if sales data varies throughout the year due to holidays, Collate’s dynamic assertions can detect this seasonality and prevent unnecessary error alerts. This allows the system to automatically adjust to your data’s evolving patterns without requiring manual configuration. +- **Flexible Configuration**: For more controlled scenarios, users can still manually define specific boundaries or thresholds to monitor data, such as ensuring values stay within a certain range. This offers both manual and automatic methods for managing data quality. + +## Use Cases + +### 1. Static Assertions for Simple Tests + +- **Problem**: In many cases, users want to perform straightforward data tests, such as ensuring that values are not null or that there are no repeated values. +- **Solution**: Collate enables users to configure simple assertions directly from the UI. For example, users can create tests to ensure: + - Data should not be null. + - There should be no duplicate values. + - Data should not be older than a specific time frame (e.g., one day). + - Values should be greater than zero. +- **Example**: If you want to ensure that your sales data contains no null values or duplicates, you can easily configure these assertions via the UI. + +### 2. Dynamic Assertions for Evolving Data + +- **Problem**: Some data, such as sales figures, naturally evolves over time. For example, sales data might fluctuate daily or weekly, and manual bounds may not accurately capture these variations. 
+- **Solution**: Collate uses **dynamic assertions**, which automatically learn from the data by profiling it regularly. Over time, the system establishes a pattern for how the data behaves, allowing it to detect when values significantly deviate from this expected behavior. +- **Example**: If sales suddenly spike or drop beyond what is typical for your historical data, Collate will alert you to this anomaly. + +## How Anomaly Detection Works + +### 1. Manual Configuration of Tests + +Users can manually configure tests for specific data points if they want to maintain tight control over their data quality checks. For instance, you can specify that a value must stay between 10 and 100. This method is useful for data that has well-understood constraints or when precise validation rules are required. + +{% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png" + alt="Manual Configuration of Tests" + caption="Manual Configuration of Tests" + /%} + +### 2. Dynamic Assertions + +For more complex or evolving datasets, Collate offers **dynamic assertions**. These assertions automatically adapt to your data by learning its natural patterns over time. The profiling process typically takes around five weeks, during which the system builds an understanding of normal data fluctuations. + +- **Data Profiling**: Collate continuously scans the data and trains its models based on the profiled data. Once this learning phase is complete, the system can detect significant deviations from expected patterns, alerting users to anomalies. + +- **Advantages of Dynamic Assertions**: + - **Adaptability**: No need to set manual thresholds for evolving datasets. + - **Efficiency**: Focus on genuine anomalies instead of managing static tests that may quickly become outdated as data evolves. + +{% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png" + alt="Dynamic Assertions" + caption="Dynamic Assertions" + /%} + +### 3. 
Incidents and Notifications + +When an anomaly is detected, Collate automatically generates incidents, including for rule-based test cases. These notifications help users quickly understand when and where their data may be behaving unexpectedly. + +- **Example**: If sales data suddenly shows an abnormal spike or drop, Collate will notify you, allowing you to investigate potential causes such as system malfunctions or external influences. + +{% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png" + alt="Incidents and Notifications" + caption="Incidents and Notifications" + /%} diff --git a/content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md b/content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md new file mode 100644 index 00000000..07507bf9 --- /dev/null +++ b/content/v1.6.x-SNAPSHOT/how-to-guides/data-quality-observability/anomaly-detection/setting-up.md @@ -0,0 +1,52 @@ +--- +title: Set Up Anomaly Detection in Collate for Data Quality +slug: /how-to-guides/data-quality-observability/anomaly-detection/setting-up +--- + +# Steps to Set Up Anomaly Detection + +### 1. Create a Test from the UI +- First, select the dataset and navigate to the **Tests** section in the Collate UI. +- Define your test parameters. You can either create a **static test** (e.g., "no null values" or "data should not exceed a certain range") or configure **dynamic assertions** to let the system learn from the data. + +{% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-1.png" + alt="Manual Configuration of Tests" + caption="Manual Configuration of Tests" + /%} + + {% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-2.png" + alt="Manual Configuration of Tests" + caption="Manual Configuration of Tests" + /%} + +### 2. 
Configure Manual Tests +- For more controlled monitoring, set up **manual thresholds** (e.g., sales should not exceed a maximum value of 100). This provides specific control over data validation criteria. + +### 3. Enable Dynamic Assertions +- For data that naturally fluctuates or evolves, enable **dynamic assertions**. Collate will start profiling your data regularly to learn its normal behavior. +- Over time (e.g., five weeks), the system will establish expected value ranges and detect any deviations from these patterns. + +{% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-3.png" + alt="Dynamic Assertions" + caption="Dynamic Assertions" + /%} + +### 4. Monitor Incidents +- After configuring tests, monitor for any **incidents** triggered by anomalies detected in the system. +- Investigate significant spikes, drops, or unusual behaviors in the data, which may indicate system errors, backend failures, or unexpected external factors. + +{% image + src="/images/v1.6/how-to-guides/anomaly-detection/set-up-anomaly-detection-4.png" + alt="Incidents and Notifications" + caption="Incidents and Notifications" + /%} + +## Best Practices + +- **Use Static Assertions for Simple Rules**: For basic data validation, such as preventing null values or enforcing a minimum threshold, static assertions are effective and straightforward to configure. +- **Leverage Dynamic Assertions for Evolving Data**: When dealing with datasets that naturally fluctuate (e.g., sales or user activity), dynamic assertions can save time and ensure incidents are only triggered when significant anomalies occur. +- **Regularly Review Incidents**: Stay on top of incidents generated by anomaly detection to promptly identify and address data quality issues. 
+- **Combine Manual and Dynamic Methods**: For datasets with well-defined boundaries and evolving characteristics, combining manual thresholds and dynamic assertions provides comprehensive anomaly detection coverage.
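Reviewer sketch (not part of the patch): the added pages describe dynamic assertions as learning expected value ranges from profiled history and flagging significant deviations, but they do not say which model Collate uses. The Python below is therefore only an illustration under a stated assumption: it swaps in a simple mean plus or minus `k` standard deviations band, which handles neither seasonality nor trend. The names `learn_band` and `is_anomaly` and the parameter `k` are hypothetical.

```python
from statistics import mean, stdev


def learn_band(history, k=3.0):
    """Derive an expected (low, high) range from past profiled values.

    Assumption: a mean +/- k * standard deviation band stands in for
    whatever model the real system trains. Needs at least two data points.
    """
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomaly(value, band):
    """Return True when a new value falls outside the learned range."""
    low, high = band
    return value < low or value > high


# Example: daily sales counts collected during the learning period.
daily_sales = [100, 102, 98, 101, 99]
band = learn_band(daily_sales)
spike_detected = is_anomaly(150, band)   # far outside the learned range
normal_day = is_anomaly(101, band)       # within the learned range
```

Unlike a manual threshold such as "between 10 and 100", the band here is recomputed from history, so it moves as the data evolves. That is the property the pages attribute to dynamic assertions.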