From 620a3fc0f84537b83b10b094051f3acdabfb462b Mon Sep 17 00:00:00 2001
From: Sowmya N Dixit
Date: Wed, 27 Dec 2023 11:36:33 +0530
Subject: [PATCH] Sunbird Obsrv opensource release 2.0.0-GA (#13)

* feat: add connector config and connector stats update functions
* Issue #33 feat: add documentation for Dataset, Datasources, Data In and Query APIs
* feat: added descriptions for default configurations
* feat: modified kafka connector input topic
* feat: obsrv setup instructions
* feat: revisiting open source features
* feat: masterdata processor job config
* Build deploy v2 (#19)
* #0 - Refactor Dockerfile and Github actions workflow
* testing new images
* build new image with bug fixes
* update dockerfile
* #0 fix: upgrade packages
* #0 feat: add flink dockerfiles
* feat: update all failed, invalid and duplicate topic names
* feat: update kafka topic names in test cases
* #0 fix: add individual extraction
* feat: update failed event
* Update ErrorConstants.scala
* Issue #0 fix: upgrade ubuntu packages for vulnerabilities
* feat: add exception handling for json deserialization
* Update BaseProcessFunction.scala
* feat: update batch failed event generation
* Update ExtractionFunction.scala
* feat: update invalid json exception handling
* Issue #46 feat: update batch failed event
* Issue #46 fix: remove cloning object
* Issue #46 feat: add error reasons
* Issue #46 feat: add exception stack trace
* #0 fix: update github actions release condition
* Update DatasetModels.scala
* Release 1.3.0 into Main branch (#34)
* Issue #2 feat: Remove kafka connector code
* feat: add function to get all datasets
* Release 1.3.1 Changes (#42)
* Dataset enhancements (#38)
* #0000 [SV] - Fallback to local redis instance if embedded redis is not starting
* #0000 - refactor the denormalization logic: 1. Do not fail the denormalization if the denorm key is missing 2. Add a clear message on whether the denorm is successful, failed or partially successful 3. Handle denorm for both text and number fields
* #0000 - refactor: 1. Created an enum for dataset status and ignore events if the dataset is not in Live status 2. Created an output tag for denorm failure stats 3. Parse event validation failure messages into a case class
* #0000 - refactor: 1. Updated the DruidRouter job to publish data to router topics dynamically 2. Updated the framework to create a dynamicKafkaSink object
* #0000 - mega refactoring: 1. Made getAllDatasets and getAllDatasetSources always query Postgres 2. Created BaseDatasetProcessFunction for all Flink functions to extend, which dynamically resolves dataset config, initializes metrics and handles common failures 3. Refactored serde - merged map and string serialization into one parameterized function 4. Moved failed-event sinking into a common base class 5. Master dataset processor can now denormalize with another master dataset as well
* #0000 - mega refactoring: 1. Added validation to check that the event has a timestamp key which is neither blank nor invalid 2. Added timezone handling to store the data in Druid in the TZ specified by the dataset
* #0000 - minor refactoring: Renamed DatasetRegistry.getDatasetSourceConfig to getAllDatasetSourceConfig
* #0000 - mega refactoring: Refactored logs, error messages and metrics
* #0000 - mega refactoring: Fix unit tests
* #0000 - refactoring: 1. Introduced transformation mode to enable lenient transformations 2. Proper exception handling for the transformer job
* #0000 - refactoring: Fix test cases and code
* #0000 - refactoring: upgrade embedded redis to work with macOS Sonoma (M2)
* #0000 - refactoring: Denormalizer test cases and bug fixes; code coverage is 100% now
* #0000 - refactoring: Router test cases and bug fixes; code coverage is 100% now
* #0000 - refactoring: Validator test cases and bug fixes; code coverage is 100% now
* #0000 - refactoring: Framework test cases and bug fixes
* #0000 - refactoring: kafka connector test cases and bug fixes; code coverage is 100% now
* #0000 - refactoring: improve code coverage and fix bugs; code coverage is now 100%
* #0000 - refactoring: organize imports
* #0000 - refactoring: transformer test cases and bug fixes; code coverage is 100%
* #0000 - refactoring: test cases and bug fixes
* #000:feat: Removed the provided scope of the kafka-client in the framework (#40)
* #0000 - feat: Add dataset-type to system events (#41)
* #0000 - feat: Modify tests for dataset-type in system events
* #0000 - feat: Remove unused getDatasetType function
* #0000 - feat: Remove unused pom test dependencies
* Main conflicts fixes (#44)
* #000:feat: Resolve conflicts
* Release 1.3.1 into Main (#43)
* update workflow file to skip tests (#45)
* Release 1.3.1 into Main (#49)
* #0000 - fix: Fix null dataset_type in DruidRouterFunction (#48)
* Develop to Release-1.0.0-GA (#52) (#53)
* #67 feat: query system configurations from meta store
* #67 fix: Refactor system configuration retrieval and update dynamic router function
* #67 fix: update system config according to review
* #67 fix: update test cases for system config
* #67 fix: update default values in test cases
* #67 fix: add get all system settings method and update test cases
* #67 fix: add test case for covering exception case
* #67 fix: fix data types in test cases
* #67 fix: Refactor event indexing in DynamicRouterFunction
* Issue #67 refactor: SystemConfig read from DB implementation
* #226 fix: update test cases according to the refactor
* #0 fix: Flink base image updates

Co-authored-by: Santhosh Vasabhaktula
Co-authored-by: ManojCKrishna
Co-authored-by: ManojKrishnaChintaluri
Co-authored-by: Praveen <66662436+pveleneni@users.noreply.github.com>
Co-authored-by: Sowmya N Dixit
Co-authored-by: shiva-rakshith
Co-authored-by: Aniket Sakinala
Co-authored-by: Manjunath Davanam
Co-authored-by: Anand Parthasarathy
Co-authored-by: Santhosh
Co-authored-by: Ravi Mula
Co-authored-by: GayathriSrividya
Co-authored-by: Manoj Krishna <92361832+ManojKrishnaChintauri@users.noreply.github.com>
Co-authored-by: Manoj Krishna <92361832+ManojKrishnaChintaluri@users.noreply.github.com>
---
 .github/workflows/build_and_deploy.yaml | 2 +-
 Dockerfile | 2 +-
 data-products/pom.xml | 13 +-
 .../MasterDataProcessorIndexer.scala | 6 +-
 dataset-registry/pom.xml | 4 +-
 .../src/main/resources/dataset-registry.sql | 9 +-
 .../sunbird/obsrv/model/DatasetModels.scala | 45 ++--
 .../obsrv/registry/DatasetRegistry.scala | 44 ++--
 .../service/DatasetRegistryService.scala | 113 +++-----
+++++----- .../BaseDatasetProcessFunction.scala | 195 ++++++++++++++++ .../{base-config.conf => baseconfig.conf} | 0 .../spec/BaseSpecWithDatasetRegistry.scala | 40 +++- .../obsrv/spec/TestDatasetRegistrySpec.scala | 92 ++++++-- framework/pom.xml | 11 +- framework/src/main/resources/baseconfig.conf | 6 +- .../obsrv/core/cache/DedupEngine.scala | 26 +-- .../obsrv/core/cache/RedisConnect.scala | 1 + .../sunbird/obsrv/core/model/Constants.scala | 18 ++ .../obsrv/core/model/ErrorConstants.scala | 27 ++- .../org/sunbird/obsrv/core/model/Models.scala | 80 ++++++- .../obsrv/core/model/SystemConfig.scala | 116 +++++++++- .../serde/{MapSerde.scala => SerdeUtil.scala} | 44 +++- .../obsrv/core/serde/StringSerde.scala | 32 --- .../core/streaming/BaseDeduplication.scala | 38 +--- .../obsrv/core/streaming/BaseJobConfig.scala | 18 +- .../core/streaming/BaseProcessFunction.scala | 132 ++++++----- .../obsrv/core/streaming/BaseStreamTask.scala | 15 +- .../core/streaming/FlinkKafkaConnector.scala | 28 +-- .../sunbird/obsrv/core/util/JSONUtil.scala | 28 ++- .../obsrv/core/util/PostgresConnect.scala | 16 +- .../org/sunbird/obsrv/core/util/Util.scala | 6 +- framework/src/test/resources/base-test.conf | 3 +- framework/src/test/resources/test.conf | 4 +- framework/src/test/resources/test2.conf | 69 ++++++ .../spec/BaseDeduplicationTestSpec.scala | 45 ++++ .../spec/BaseProcessFunctionTestSpec.scala | 44 ++-- .../sunbird/spec/BaseProcessTestConfig.scala | 6 +- .../scala/org/sunbird/spec/BaseSpec.scala | 10 +- .../sunbird/spec/BaseSpecWithPostgres.scala | 21 +- .../org/sunbird/spec/ModelsTestSpec.scala | 118 ++++++++++ .../sunbird/spec/PostgresConnectSpec.scala | 2 +- .../org/sunbird/spec/RedisTestSpec.scala | 24 +- .../org/sunbird/spec/SerdeUtilTestSpec.scala | 75 ++++++ .../org/sunbird/spec/SystemConfigSpec.scala | 114 ++++++++++ .../org/sunbird/spec/TestMapStreamFunc.scala | 33 ++- .../org/sunbird/spec/TestMapStreamTask.scala | 4 +- .../sunbird/spec/TestStringStreamTask.scala | 6 +- pipeline/denormalizer/pom.xml | 38 +++- .../src/main/resources/de-normalization.conf | 2 +- .../functions/DenormalizerFunction.scala | 88 ++++++-- .../DenormalizerWindowFunction.scala | 86 +++++-- .../task/DenormalizerConfig.scala | 13 +- .../task/DenormalizerStreamTask.scala | 13 +- .../task/DenormalizerWindowStreamTask.scala | 23 +- .../obsrv/denormalizer/util/DenormCache.scala | 102 +++++---- .../denormalizer/src/test/resources/test.conf | 4 +- .../DenormalizerStreamTaskTestSpec.scala | 176 +++++++++++++++ ...DenormalizerWindowStreamTaskTestSpec.scala | 213 ++++++++++++++++++ .../obsrv/denormalizer/EventFixture.scala | 15 ++ pipeline/druid-router/pom.xml | 54 +++-- .../functions/DruidRouterFunction.scala | 34 +-- .../functions/DynamicRouterFunction.scala | 116 ++++++++++ .../obsrv/router/task/DruidRouterConfig.scala | 3 + .../router/task/DruidRouterStreamTask.scala | 14 +- .../router/task/DynamicRouterStreamTask.scala | 66 ++++++ .../DynamicRouterStreamTaskTestSpec.scala | 171 ++++++++++++++ .../sunbird/obsrv/router/EventFixture.scala | 7 + .../obsrv/router/TestTimestampKeyParser.scala | 124 ++++++++++ pipeline/extractor/pom.xml | 35 ++- .../src/main/resources/extractor.conf | 5 +- .../functions/ExtractionFunction.scala | 116 +++++++--- .../extractor/task/ExtractorConfig.scala | 26 +-- .../extractor/task/ExtractorStreamTask.scala | 21 +- .../extractor/src/test/resources/test.conf | 10 +- .../extractor/src/test/resources/test2.conf | 24 ++ .../obsrv/extractor/EventFixture.scala | 15 ++ .../extractor/ExtractorStreamTestSpec.scala | 
 pipeline/kafka-connector/pom.xml | 37 ++-
 .../src/main/resources/kafka-connector.conf | 1 -
 .../task/KafkaConnectorConfig.scala | 9 +-
 .../task/KafkaConnectorStreamTask.scala | 43 ++--
 .../src/test/resources/test.conf | 14 ++
 .../KafkaConnectorStreamTestSpec.scala | 126 +++++++++++
 pipeline/master-data-processor/pom.xml | 12 +-
 .../main/resources/master-data-processor.conf | 7 +-
 .../MasterDataProcessorFunction.scala | 62 +++--
 .../task/MasterDataProcessorConfig.scala | 9 +-
 .../task/MasterDataProcessorStreamTask.scala | 21 +-
 .../obsrv/pipeline/util/MasterDataCache.scala | 16 +-
 .../src/test/resources/test.conf | 10 +-
 .../sunbird/obsrv/fixture/EventFixture.scala | 4 +-
 ...asterDataProcessorStreamTaskTestSpec.scala | 46 +++-
 pipeline/pipeline-merged/pom.xml | 7 +-
 .../src/main/resources/merged-pipeline.conf | 11 +-
 .../pipeline/task/MergedPipelineConfig.scala | 16 +-
 .../task/MergedPipelineStreamTask.scala | 6 +-
 .../src/test/resources/test.conf | 11 +-
 .../MergedPipelineStreamTaskTestSpec.scala | 78 ++++++-
 pipeline/pom.xml | 6 +-
 pipeline/preprocessor/pom.xml | 4 +-
 .../main/resources/pipeline-preprocessor.conf | 5 +-
 .../functions/DeduplicationFunction.scala | 52 +++--
 .../functions/EventValidationFunction.scala | 166 +++++++-----
 .../task/PipelinePreprocessorConfig.scala | 23 +-
 .../task/PipelinePreprocessorStreamTask.scala | 14 +-
 .../preprocessor/util/SchemaValidator.scala | 57 ++---
 .../preprocessor/src/test/resources/test.conf | 5 +-
 .../PipelinePreprocessorStreamTestSpec.scala | 149 ++++++++++--
 .../preprocessor/TestSchemaValidator.scala | 84 +++++--
 .../preprocessor/fixture/EventFixtures.scala | 16 +-
 pipeline/transformer/pom.xml | 4 +-
 .../functions/TransformerFunction.scala | 36 ++-
 .../transformer/task/TransformerConfig.scala | 7 +-
 .../task/TransformerStreamTask.scala | 9 +-
 pom.xml | 4 -
 115 files changed, 3759 insertions(+), 994 deletions(-)
 create mode 100644 dataset-registry/src/main/scala/org/sunbird/obsrv/streaming/BaseDatasetProcessFunction.scala
 rename dataset-registry/src/test/resources/{base-config.conf => baseconfig.conf} (100%)
 create mode 100644 framework/src/main/scala/org/sunbird/obsrv/core/model/Constants.scala
 rename framework/src/main/scala/org/sunbird/obsrv/core/serde/{MapSerde.scala => SerdeUtil.scala} (55%)
 delete mode 100644 framework/src/main/scala/org/sunbird/obsrv/core/serde/StringSerde.scala
 create mode 100644 framework/src/test/resources/test2.conf
 create mode 100644 framework/src/test/scala/org/sunbird/spec/BaseDeduplicationTestSpec.scala
 create mode 100644 framework/src/test/scala/org/sunbird/spec/ModelsTestSpec.scala
 create mode 100644 framework/src/test/scala/org/sunbird/spec/SerdeUtilTestSpec.scala
 create mode 100644 framework/src/test/scala/org/sunbird/spec/SystemConfigSpec.scala
 create mode 100644 pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerStreamTaskTestSpec.scala
 create mode 100644 pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerWindowStreamTaskTestSpec.scala
 create mode 100644 pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/EventFixture.scala
 create mode 100644 pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DynamicRouterFunction.scala
 create mode 100644 pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DynamicRouterStreamTask.scala
 create mode 100644 pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/DynamicRouterStreamTaskTestSpec.scala
 create mode 100644
pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/EventFixture.scala create mode 100644 pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/TestTimestampKeyParser.scala create mode 100644 pipeline/extractor/src/test/resources/test2.conf create mode 100644 pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/EventFixture.scala create mode 100644 pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/ExtractorStreamTestSpec.scala rename pipeline/kafka-connector/src/main/scala/org/sunbird/obsrv/{kafkaconnector => connector}/task/KafkaConnectorConfig.scala (73%) rename pipeline/kafka-connector/src/main/scala/org/sunbird/obsrv/{kafkaconnector => connector}/task/KafkaConnectorStreamTask.scala (68%) create mode 100644 pipeline/kafka-connector/src/test/resources/test.conf create mode 100644 pipeline/kafka-connector/src/test/scala/org/sunbird/obsrv/connector/KafkaConnectorStreamTestSpec.scala diff --git a/.github/workflows/build_and_deploy.yaml b/.github/workflows/build_and_deploy.yaml index 90b01883..48c610c9 100644 --- a/.github/workflows/build_and_deploy.yaml +++ b/.github/workflows/build_and_deploy.yaml @@ -48,7 +48,7 @@ jobs: fetch-depth: 0 - name: Maven Build run: | - mvn clean install + mvn clean install -DskipTests - name: Set up Docker Buildx uses: docker/setup-buildx-action@v2 diff --git a/Dockerfile b/Dockerfile index 8f7e4a7a..b9f41aa8 100644 --- a/Dockerfile +++ b/Dockerfile @@ -38,4 +38,4 @@ COPY --from=build-pipeline /app/pipeline/master-data-processor/target/master-dat FROM --platform=linux/x86_64 sunbird/flink:1.15.2-scala_2.12-jdk-11 as kafka-connector-image USER flink -COPY --from=build-pipeline /app/pipeline/kafka-connector/target/kafka-connector-1.0.0.jar $FLINK_HOME/lib +COPY --from=build-pipeline /app/pipeline/kafka-connector/target/kafka-connector-1.0.0.jar $FLINK_HOME/lib \ No newline at end of file diff --git a/data-products/pom.xml b/data-products/pom.xml index e79564e5..51090a71 100644 --- a/data-products/pom.xml +++ b/data-products/pom.xml @@ -10,7 +10,7 @@ 3.1.0 - 2.12.10 + 2.12.11 2.12 1.1.1 @@ -225,6 +225,17 @@ + + + org.scoverage + scoverage-maven-plugin + ${scoverage.plugin.version} + + ${scala.version} + true + true + + diff --git a/data-products/src/main/scala/org/sunbird/obsrv/dataproducts/MasterDataProcessorIndexer.scala b/data-products/src/main/scala/org/sunbird/obsrv/dataproducts/MasterDataProcessorIndexer.scala index 7823c7cb..e1ecfdec 100644 --- a/data-products/src/main/scala/org/sunbird/obsrv/dataproducts/MasterDataProcessorIndexer.scala +++ b/data-products/src/main/scala/org/sunbird/obsrv/dataproducts/MasterDataProcessorIndexer.scala @@ -88,7 +88,7 @@ object MasterDataProcessorIndexer { val response = Unirest.post(config.getString("druid.indexer.url")) .header("Content-Type", "application/json") .body(ingestionSpec).asJson() - response.ifFailure(response => throw new Exception("Exception while submitting ingestion task")) + response.ifFailure(_ => throw new Exception("Exception while submitting ingestion task")) } private def updateDataSourceRef(datasource: DataSource, datasourceRef: String): Unit = { @@ -100,7 +100,7 @@ object MasterDataProcessorIndexer { val response = Unirest.delete(config.getString("druid.datasource.delete.url") + datasourceRef) .header("Content-Type", "application/json") .asJson() - response.ifFailure(response => throw new Exception("Exception while deleting datasource" + datasourceRef)) + response.ifFailure(_ => throw new Exception("Exception while deleting datasource" + datasourceRef)) } 
private def createDataFile(dataset: Dataset, timestamp: Long, outputFilePath: String, objectKey: String): String = { @@ -115,7 +115,7 @@ object MasterDataProcessorIndexer { val sc = new SparkContext(conf) val readWriteConf = ReadWriteConfig(scanCount = 1000, maxPipelineSize = 1000) - val rdd = sc.fromRedisKV("*")(readWriteConfig = readWriteConf) + sc.fromRedisKV("*")(readWriteConfig = readWriteConf) .map(f => JSONUtil.deserialize[mutable.Map[String, AnyRef]](f._2)) .map(f => f.put("syncts", timestamp.asInstanceOf[AnyRef])) .map(f => JSONUtil.serialize(f)) diff --git a/dataset-registry/pom.xml b/dataset-registry/pom.xml index e3950291..fd17db70 100644 --- a/dataset-registry/pom.xml +++ b/dataset-registry/pom.xml @@ -62,9 +62,9 @@ test - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 test diff --git a/dataset-registry/src/main/resources/dataset-registry.sql b/dataset-registry/src/main/resources/dataset-registry.sql index aa997ebe..ff28ae98 100644 --- a/dataset-registry/src/main/resources/dataset-registry.sql +++ b/dataset-registry/src/main/resources/dataset-registry.sql @@ -41,8 +41,9 @@ CREATE TABLE IF NOT EXISTS dataset_transformations ( id text PRIMARY KEY, dataset_id text REFERENCES datasets (id), field_key text NOT NULL, - transformation_function text NOT NULL, + transformation_function json NOT NULL, status text NOT NULL, + mode text, created_by text NOT NULL, updated_by text NOT NULL, created_date timestamp NOT NULL, @@ -53,17 +54,17 @@ CREATE INDEX IF NOT EXISTS dataset_transformations_status ON dataset_transformat CREATE INDEX IF NOT EXISTS dataset_transformations_dataset ON dataset_transformations(dataset_id); CREATE TABLE IF NOT EXISTS dataset_source_config ( - id SERIAL PRIMARY KEY, + id text PRIMARY KEY, dataset_id text NOT NULL REFERENCES datasets (id), connector_type text NOT NULL, connector_config json NOT NULL, - connector_stats json NOT NULL, + connector_stats json, status text NOT NULL, created_by text NOT NULL, updated_by text NOT NULL, created_date timestamp NOT NULL, updated_date timestamp NOT NULL, - UNIQUE(dataset_id) + UNIQUE(connector_type, dataset_id) ); CREATE INDEX IF NOT EXISTS dataset_source_config_status ON dataset_source_config(status); CREATE INDEX IF NOT EXISTS dataset_source_config_dataset ON dataset_source_config(dataset_id); \ No newline at end of file diff --git a/dataset-registry/src/main/scala/org/sunbird/obsrv/model/DatasetModels.scala b/dataset-registry/src/main/scala/org/sunbird/obsrv/model/DatasetModels.scala index cdfcb0a7..49cc51bc 100644 --- a/dataset-registry/src/main/scala/org/sunbird/obsrv/model/DatasetModels.scala +++ b/dataset-registry/src/main/scala/org/sunbird/obsrv/model/DatasetModels.scala @@ -4,8 +4,11 @@ import com.fasterxml.jackson.annotation.JsonProperty import com.fasterxml.jackson.core.`type`.TypeReference import com.fasterxml.jackson.module.scala.JsonScalaEnumeration import org.sunbird.obsrv.core.model.SystemConfig +import org.sunbird.obsrv.model.DatasetStatus.DatasetStatus +import org.sunbird.obsrv.model.TransformMode.TransformMode import org.sunbird.obsrv.model.ValidationMode.ValidationMode +import java.sql.Timestamp import scala.beans.BeanProperty object DatasetModels { @@ -17,7 +20,7 @@ object DatasetModels { case class DedupConfig(@JsonProperty("drop_duplicates") dropDuplicates: Option[Boolean] = Some(false), @JsonProperty("dedup_key") dedupKey: Option[String], - @JsonProperty("dedup_period") dedupPeriod: Option[Integer] = Some(SystemConfig.defaultDedupPeriodInSeconds)) + @JsonProperty("dedup_period") 
dedupPeriod: Option[Integer] = Some(SystemConfig.getInt("defaultDedupPeriodInSeconds", 604800))) case class ValidationConfig(@JsonProperty("validate") validate: Option[Boolean] = Some(true), @JsonProperty("mode") @JsonScalaEnumeration(classOf[ValidationModeType]) mode: Option[ValidationMode]) @@ -30,15 +33,16 @@ object DatasetModels { case class RouterConfig(@JsonProperty("topic") topic: String) - case class DatasetConfig(@JsonProperty("data_key") key: String, @JsonProperty("timestamp_key") tsKey: String, - @JsonProperty("entry_topic") entryTopic: String, @JsonProperty("exclude_fields") excludeFields: Option[List[String]] = None, - @JsonProperty("redis_db_host") redisDBHost: Option[String] = None, @JsonProperty("redis_db_port") redisDBPort: Option[Int] = None, - @JsonProperty("redis_db") redisDB: Option[Int] = None, @JsonProperty("index_data") indexData: Option[Boolean] = None) + case class DatasetConfig(@JsonProperty("data_key") key: String, @JsonProperty("timestamp_key") tsKey: String, @JsonProperty("entry_topic") entryTopic: String, + @JsonProperty("exclude_fields") excludeFields: Option[List[String]] = None, @JsonProperty("redis_db_host") redisDBHost: Option[String] = None, + @JsonProperty("redis_db_port") redisDBPort: Option[Int] = None, @JsonProperty("redis_db") redisDB: Option[Int] = None, + @JsonProperty("index_data") indexData: Option[Boolean] = None, @JsonProperty("timestamp_format") tsFormat: Option[String] = None, + @JsonProperty("dataset_tz") datasetTimezone: Option[String] = None) - case class Dataset(@JsonProperty("id") id: String, @JsonProperty("type") datasetType: String , @JsonProperty("extraction_config") extractionConfig: Option[ExtractionConfig], + case class Dataset(@JsonProperty("id") id: String, @JsonProperty("type") datasetType: String, @JsonProperty("extraction_config") extractionConfig: Option[ExtractionConfig], @JsonProperty("dedup_config") dedupConfig: Option[DedupConfig], @JsonProperty("validation_config") validationConfig: Option[ValidationConfig], @JsonProperty("data_schema") jsonSchema: Option[String], @JsonProperty("denorm_config") denormConfig: Option[DenormConfig], - @JsonProperty("router_config") routerConfig: RouterConfig, datasetConfig: DatasetConfig, @JsonProperty("status") status: String, + @JsonProperty("router_config") routerConfig: RouterConfig, datasetConfig: DatasetConfig, @JsonProperty("status") @JsonScalaEnumeration(classOf[DatasetStatusType]) status: DatasetStatus, @JsonProperty("tags") tags: Option[Array[String]] = None, @JsonProperty("data_version") dataVersion: Option[Int] = None) case class Condition(@JsonProperty("type") `type`: String, @JsonProperty("expr") expr: String) @@ -47,9 +51,9 @@ object DatasetModels { case class DatasetTransformation(@JsonProperty("id") id: String, @JsonProperty("dataset_id") datasetId: String, @JsonProperty("field_key") fieldKey: String, @JsonProperty("transformation_function") transformationFunction: TransformationFunction, - @JsonProperty("status") status: String) + @JsonProperty("status") status: String, @JsonProperty("mode") @JsonScalaEnumeration(classOf[TransformModeType]) mode: Option[TransformMode] = Some(TransformMode.Strict)) - case class ConnectorConfig(@JsonProperty("kafkaBrokers") kafkaBrokers: String, @JsonProperty("topic") topic: String, @JsonProperty("type")databaseType: String, + case class ConnectorConfig(@JsonProperty("kafkaBrokers") kafkaBrokers: String, @JsonProperty("topic") topic: String, @JsonProperty("type") databaseType: String, @JsonProperty("connection") connection: Connection, 
@JsonProperty("tableName") tableName: String, @JsonProperty("databaseName") databaseName: String, @JsonProperty("pollingInterval") pollingInterval: PollingInterval, @JsonProperty("authenticationMechanism") authenticationMechanism: AuthenticationMechanism, @JsonProperty("batchSize") batchSize: Int, @JsonProperty("timestampColumn") timestampColumn: String) @@ -60,19 +64,34 @@ object DatasetModels { case class AuthenticationMechanism(@JsonProperty("encrypted") encrypted: Boolean, @JsonProperty("encryptedValues") encryptedValues: String) - case class ConnectorStats(@JsonProperty("last_fetch_timestamp") lastFetchTimestamp: String, @JsonProperty("records") records: Long, @JsonProperty("avg_batch_read_time") avgBatchReadTime: Long, @JsonProperty("disconnections") disconnections: Int) + case class ConnectorStats(@JsonProperty("last_fetch_timestamp") lastFetchTimestamp: Timestamp, @JsonProperty("records") records: Long, @JsonProperty("avg_batch_read_time") avgBatchReadTime: Long, @JsonProperty("disconnections") disconnections: Int) case class DatasetSourceConfig(@JsonProperty("id") id: String, @JsonProperty("dataset_id") datasetId: String, @JsonProperty("connector_type") connectorType: String, @JsonProperty("connector_config") connectorConfig: ConnectorConfig, - @JsonProperty("connector_stats") connectorStats: ConnectorStats, @JsonProperty("status") status: String) - case class DataSource(@JsonProperty("datasource") datasource: String, @JsonProperty("dataset_id") datasetId: String, - @JsonProperty("ingestion_spec") ingestionSpec: String, @JsonProperty("datasource_ref") datasourceRef: String) + @JsonProperty("status") status: String, @JsonProperty("connector_stats") connectorStats: Option[ConnectorStats] = None) + case class DataSource(@JsonProperty("id") id: String, @JsonProperty("datasource") datasource: String, @JsonProperty("dataset_id") datasetId: String, + @JsonProperty("ingestion_spec") ingestionSpec: String, @JsonProperty("datasource_ref") datasourceRef: String) } class ValidationModeType extends TypeReference[ValidationMode.type] + object ValidationMode extends Enumeration { type ValidationMode = Value val Strict, IgnoreNewFields, DiscardNewFields = Value } + +class TransformModeType extends TypeReference[TransformMode.type] + +object TransformMode extends Enumeration { + type TransformMode = Value + val Strict, Lenient = Value +} + +class DatasetStatusType extends TypeReference[DatasetStatus.type] + +object DatasetStatus extends Enumeration { + type DatasetStatus = Value + val Draft, Publish, Live, Retired, Purged = Value +} \ No newline at end of file diff --git a/dataset-registry/src/main/scala/org/sunbird/obsrv/registry/DatasetRegistry.scala b/dataset-registry/src/main/scala/org/sunbird/obsrv/registry/DatasetRegistry.scala index e71a0915..ad239312 100644 --- a/dataset-registry/src/main/scala/org/sunbird/obsrv/registry/DatasetRegistry.scala +++ b/dataset-registry/src/main/scala/org/sunbird/obsrv/registry/DatasetRegistry.scala @@ -4,56 +4,62 @@ import org.sunbird.obsrv.model.DatasetModels.{DataSource, Dataset, DatasetSource import org.sunbird.obsrv.service.DatasetRegistryService import java.sql.Timestamp +import scala.collection.mutable object DatasetRegistry { - private val datasets: Map[String, Dataset] = DatasetRegistryService.readAllDatasets() + private val datasets: mutable.Map[String, Dataset] = mutable.Map[String, Dataset]() + datasets ++= DatasetRegistryService.readAllDatasets() private val datasetTransformations: Map[String, List[DatasetTransformation]] = 
DatasetRegistryService.readAllDatasetTransformations() - private val datasetSourceConfig: Option[List[DatasetSourceConfig]] = DatasetRegistryService.readAllDatasetSourceConfig() - private val datasources: Map[String, List[DataSource]] = DatasetRegistryService.readAllDatasources() def getAllDatasets(datasetType: String): List[Dataset] = { - datasets.filter(f => f._2.datasetType.equals(datasetType)).values.toList + val datasetList = DatasetRegistryService.readAllDatasets() + datasetList.filter(f => f._2.datasetType.equals(datasetType)).values.toList } def getDataset(id: String): Option[Dataset] = { - datasets.get(id) + val datasetFromCache = datasets.get(id) + if (datasetFromCache.isDefined) datasetFromCache else { + val dataset = DatasetRegistryService.readDataset(id) + if (dataset.isDefined) datasets.put(dataset.get.id, dataset.get) + dataset + } } - def getDatasetSourceConfig(): Option[List[DatasetSourceConfig]] = { - datasetSourceConfig + def getAllDatasetSourceConfig(): Option[List[DatasetSourceConfig]] = { + DatasetRegistryService.readAllDatasetSourceConfig() } - def getDatasetSourceConfigById(datasetId: String): DatasetSourceConfig = { - datasetSourceConfig.map(configList => configList.filter(_.datasetId.equalsIgnoreCase(datasetId))).get.head + def getDatasetSourceConfigById(datasetId: String): Option[List[DatasetSourceConfig]] = { + DatasetRegistryService.readDatasetSourceConfig(datasetId) } - def getDatasetTransformations(id: String): Option[List[DatasetTransformation]] = { - datasetTransformations.get(id) + def getDatasetTransformations(datasetId: String): Option[List[DatasetTransformation]] = { + datasetTransformations.get(datasetId) } def getDatasources(datasetId: String): Option[List[DataSource]] = { - datasources.get(datasetId) + DatasetRegistryService.readDatasources(datasetId) } def getDataSetIds(datasetType: String): List[String] = { datasets.filter(f => f._2.datasetType.equals(datasetType)).keySet.toList } - def updateDatasourceRef(datasource: DataSource, datasourceRef: String): Unit = { + def updateDatasourceRef(datasource: DataSource, datasourceRef: String): Int = { DatasetRegistryService.updateDatasourceRef(datasource, datasourceRef) } - def updateConnectorStats(datasetId: String, lastFetchTimestamp: Timestamp, records: Long): Unit = { - DatasetRegistryService.updateConnectorStats(datasetId, lastFetchTimestamp, records) + def updateConnectorStats(id: String, lastFetchTimestamp: Timestamp, records: Long): Int = { + DatasetRegistryService.updateConnectorStats(id, lastFetchTimestamp, records) } - def updateConnectorDisconnections(datasetId: String, disconnections: Int): Unit = { - DatasetRegistryService.updateConnectorDisconnections(datasetId, disconnections) + def updateConnectorDisconnections(id: String, disconnections: Int): Int = { + DatasetRegistryService.updateConnectorDisconnections(id, disconnections) } - def updateConnectorAvgBatchReadTime(datasetId: String, avgReadTime: Long): Unit = { - DatasetRegistryService.updateConnectorAvgBatchReadTime(datasetId, avgReadTime) + def updateConnectorAvgBatchReadTime(id: String, avgReadTime: Long): Int = { + DatasetRegistryService.updateConnectorAvgBatchReadTime(id, avgReadTime) } } \ No newline at end of file diff --git a/dataset-registry/src/main/scala/org/sunbird/obsrv/service/DatasetRegistryService.scala b/dataset-registry/src/main/scala/org/sunbird/obsrv/service/DatasetRegistryService.scala index a6c0f99b..89efec4c 100644 --- a/dataset-registry/src/main/scala/org/sunbird/obsrv/service/DatasetRegistryService.scala +++ 
b/dataset-registry/src/main/scala/org/sunbird/obsrv/service/DatasetRegistryService.scala @@ -1,19 +1,17 @@ package org.sunbird.obsrv.service import com.typesafe.config.{Config, ConfigFactory} -import org.slf4j.LoggerFactory -import org.sunbird.obsrv.core.streaming.BaseDeduplication import org.sunbird.obsrv.core.util.{JSONUtil, PostgresConnect, PostgresConnectionConfig} -import org.sunbird.obsrv.model.DatasetModels.{ConnectorConfig, ConnectorStats, DataSource, Dataset, DatasetConfig, DatasetSourceConfig, DatasetTransformation, DedupConfig, DenormConfig, ExtractionConfig, RouterConfig, TransformationFunction, ValidationConfig} +import org.sunbird.obsrv.model.DatasetModels._ +import org.sunbird.obsrv.model.{DatasetStatus, TransformMode} import java.io.File import java.sql.{ResultSet, Timestamp} object DatasetRegistryService { - private[this] val logger = LoggerFactory.getLogger(DatasetRegistryService.getClass) - private val configFile = new File("/data/flink/conf/baseconfig.conf") + // $COVERAGE-OFF$ This code only executes within a flink cluster val config: Config = if (configFile.exists()) { println("Loading configuration file cluster baseconfig.conf...") ConfigFactory.parseFile(configFile).resolve() @@ -21,6 +19,7 @@ object DatasetRegistryService { println("Loading configuration file baseconfig.conf inside the jar...") ConfigFactory.load("baseconfig.conf").withFallback(ConfigFactory.systemEnvironment()) } + // $COVERAGE-ON$ private val postgresConfig = PostgresConnectionConfig( config.getString("postgres.user"), config.getString("postgres.password"), @@ -38,10 +37,21 @@ object DatasetRegistryService { val dataset = parseDataset(result) (dataset.id, dataset) }).toMap - } catch { - case ex: Exception => - logger.error("Exception while reading datasets from Postgres", ex) - Map() + } finally { + postgresConnect.closeConnection() + } + } + + def readDataset(id: String): Option[Dataset] = { + + val postgresConnect = new PostgresConnect(postgresConfig) + try { + val rs = postgresConnect.executeQuery(s"SELECT * FROM datasets where id='$id'") + if(rs.next()) { + Some(parseDataset(rs)) + } else { + None + } } finally { postgresConnect.closeConnection() } @@ -56,10 +66,20 @@ object DatasetRegistryService { val datasetSourceConfig = parseDatasetSourceConfig(result) datasetSourceConfig }).toList) - } catch { - case ex: Exception => - ex.printStackTrace() - None + } finally { + postgresConnect.closeConnection() + } + } + + def readDatasetSourceConfig(datasetId: String): Option[List[DatasetSourceConfig]] = { + + val postgresConnect = new PostgresConnect(postgresConfig) + try { + val rs = postgresConnect.executeQuery(s"SELECT * FROM dataset_source_config where dataset_id='$datasetId'") + Option(Iterator.continually((rs, rs.next)).takeWhile(f => f._2).map(f => f._1).map(result => { + val datasetSourceConfig = parseDatasetSourceConfig(result) + datasetSourceConfig + }).toList) } finally { postgresConnect.closeConnection() } @@ -74,66 +94,50 @@ object DatasetRegistryService { val dt = parseDatasetTransformation(result) (dt.datasetId, dt) }).toList.groupBy(f => f._1).mapValues(f => f.map(x => x._2)) - } catch { - case ex: Exception => - logger.error("Exception while reading dataset transformations from Postgres", ex) - Map() } finally { postgresConnect.closeConnection() } } - def readAllDatasources(): Map[String, List[DataSource]] = { + def readDatasources(datasetId: String): Option[List[DataSource]] = { val postgresConnect = new PostgresConnect(postgresConfig) try { - val rs = 
postgresConnect.executeQuery("SELECT * FROM datasources") - Iterator.continually((rs, rs.next)).takeWhile(f => f._2).map(f => f._1).map(result => { - val dt = parseDatasource(result) - (dt.datasetId, dt) - }).toList.groupBy(f => f._1).mapValues(f => f.map(x => x._2)) - } catch { - case ex: Exception => - logger.error("Exception while reading dataset transformations from Postgres", ex) - Map() + val rs = postgresConnect.executeQuery(s"SELECT * FROM datasources where dataset_id='$datasetId'") + Option(Iterator.continually((rs, rs.next)).takeWhile(f => f._2).map(f => f._1).map(result => { + parseDatasource(result) + }).toList) } finally { postgresConnect.closeConnection() } } - def updateDatasourceRef(datasource: DataSource, datasourceRef: String): Unit = { + def updateDatasourceRef(datasource: DataSource, datasourceRef: String): Int = { val query = s"UPDATE datasources set datasource_ref = '$datasourceRef' where datasource='${datasource.datasource}' and dataset_id='${datasource.datasetId}'" - updateRegistry(query, "Exception while updating data source reference in Postgres") + updateRegistry(query) } - def updateConnectorStats(datasetId: String, lastFetchTimestamp: Timestamp, records: Long): Unit = { - val query = s"UPDATE dataset_source_config SET connector_stats = jsonb_set(jsonb_set(connector_stats::jsonb, '{records}'," + + def updateConnectorStats(id: String, lastFetchTimestamp: Timestamp, records: Long): Int = { + val query = s"UPDATE dataset_source_config SET connector_stats = jsonb_set(jsonb_set(coalesce(connector_stats, '{}')::jsonb, '{records}'," + s" ((COALESCE(connector_stats->>'records', '0')::int + $records)::text)::jsonb, true), '{last_fetch_timestamp}', " + - s"to_jsonb('$lastFetchTimestamp'::timestamp), true) WHERE dataset_id = '$datasetId'" - updateRegistry(query, "Exception while updating connector stats in Postgres") + s"to_jsonb('$lastFetchTimestamp'::timestamp), true) WHERE id = '$id'" + updateRegistry(query) } - def updateConnectorDisconnections(datasetId: String, disconnections: Int): Unit = { - val query = s"UPDATE dataset_source_config SET connector_stats = jsonb_set(connector_stats::jsonb, " + - s"'{disconnections}','$disconnections') WHERE dataset_id = '$datasetId'" - updateRegistry(query, "Exception while updating connector disconnections in Postgres") + def updateConnectorDisconnections(id: String, disconnections: Int): Int = { + val query = s"UPDATE dataset_source_config SET connector_stats = jsonb_set(coalesce(connector_stats, '{}')::jsonb, '{disconnections}','$disconnections') WHERE id = '$id'" + updateRegistry(query) } - def updateConnectorAvgBatchReadTime(datasetId: String, avgReadTime: Long): Unit = { - val query = s"UPDATE dataset_source_config SET connector_stats = jsonb_set(connector_stats::jsonb, " + - s"'{avg_batch_read_time}','$avgReadTime') WHERE dataset_id = '$datasetId'" - updateRegistry(query, "Exception while updating connector average batch read time in Postgres") + def updateConnectorAvgBatchReadTime(id: String, avgReadTime: Long): Int = { + val query = s"UPDATE dataset_source_config SET connector_stats = jsonb_set(coalesce(connector_stats, '{}')::jsonb, '{avg_batch_read_time}','$avgReadTime') WHERE id = '$id'" + updateRegistry(query) } - def updateRegistry(query: String, errorMsg: String): Unit = { + private def updateRegistry(query: String): Int = { val postgresConnect = new PostgresConnect(postgresConfig) try { - // TODO: Check if the udpate is successful. 
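// Sketch of the jsonb_set pattern used by the connector-stats updates above (helper name and
// parameters are illustrative): COALESCE the possibly-NULL stats column to '{}', increment the
// numeric field in place, and pass create_if_missing = true so jsonb_set adds the key on first
// write. With connector_stats = {"records": 20} and a delta of 60 the row ends up with
// {"records": 80}, which is what the registry spec further below asserts.
def jsonbIncrementSql(table: String, column: String, field: String, delta: Long, idColumn: String, id: String): String = {
  s"UPDATE $table SET $column = jsonb_set(coalesce($column, '{}')::jsonb, '{$field}', " +
    s"((COALESCE($column->>'$field', '0')::bigint + $delta)::text)::jsonb, true) " +
    s"WHERE $idColumn = '$id'"
}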
Else throw an Exception - postgresConnect.execute(query) - } catch { - case ex: Exception => - logger.error(errorMsg, ex) - Map() + postgresConnect.executeUpdate(query) } finally { postgresConnect.closeConnection() } @@ -162,7 +166,7 @@ object DatasetRegistryService { if (denormConfig == null) None else Some(JSONUtil.deserialize[DenormConfig](denormConfig)), JSONUtil.deserialize[RouterConfig](routerConfig), JSONUtil.deserialize[DatasetConfig](datasetConfig), - status, + DatasetStatus.withName(status), Option(tags), Option(dataVersion) ) @@ -177,19 +181,19 @@ object DatasetRegistryService { val status = rs.getString("status") DatasetSourceConfig(id = id, datasetId = datasetId, connectorType = connectorType, - JSONUtil.deserialize[ConnectorConfig](connectorConfig), - JSONUtil.deserialize[ConnectorStats](connectorStats), - status + JSONUtil.deserialize[ConnectorConfig](connectorConfig), status, + if(connectorStats != null) Some(JSONUtil.deserialize[ConnectorStats](connectorStats)) else None ) } private def parseDatasource(rs: ResultSet): DataSource = { + val id = rs.getString("id") val datasource = rs.getString("datasource") val datasetId = rs.getString("dataset_id") val ingestionSpec = rs.getString("ingestion_spec") val datasourceRef = rs.getString("datasource_ref") - DataSource(datasource, datasetId, ingestionSpec, datasourceRef) + DataSource(id, datasource, datasetId, ingestionSpec, datasourceRef) } private def parseDatasetTransformation(rs: ResultSet): DatasetTransformation = { @@ -198,8 +202,9 @@ object DatasetRegistryService { val fieldKey = rs.getString("field_key") val transformationFunction = rs.getString("transformation_function") val status = rs.getString("status") + val mode = rs.getString("mode") - DatasetTransformation(id, datasetId, fieldKey, JSONUtil.deserialize[TransformationFunction](transformationFunction), status) + DatasetTransformation(id, datasetId, fieldKey, JSONUtil.deserialize[TransformationFunction](transformationFunction), status, Some(if(mode != null) TransformMode.withName(mode) else TransformMode.Strict)) } } \ No newline at end of file diff --git a/dataset-registry/src/main/scala/org/sunbird/obsrv/streaming/BaseDatasetProcessFunction.scala b/dataset-registry/src/main/scala/org/sunbird/obsrv/streaming/BaseDatasetProcessFunction.scala new file mode 100644 index 00000000..4e992eba --- /dev/null +++ b/dataset-registry/src/main/scala/org/sunbird/obsrv/streaming/BaseDatasetProcessFunction.scala @@ -0,0 +1,195 @@ +package org.sunbird.obsrv.streaming + +import org.apache.flink.api.scala.metrics.ScalaGauge +import org.apache.flink.configuration.Configuration +import org.apache.flink.streaming.api.functions.ProcessFunction +import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction +import org.apache.flink.streaming.api.windowing.windows.TimeWindow +import org.sunbird.obsrv.core.model.FunctionalError.FunctionalError +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model.Producer.Producer +import org.sunbird.obsrv.core.model.Stats.Stats +import org.sunbird.obsrv.core.model.StatusCode.StatusCode +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming._ +import org.sunbird.obsrv.core.util.JSONUtil +import org.sunbird.obsrv.model.DatasetModels.Dataset +import org.sunbird.obsrv.registry.DatasetRegistry + +import java.lang +import java.util.concurrent.ConcurrentHashMap +import java.util.concurrent.atomic.AtomicLong +import scala.collection.JavaConverters._ +import scala.collection.mutable + +trait 
SystemEventHandler { + private def getStatus(flags: Map[String, AnyRef], producer: Producer): Option[StatusCode] = { + flags.get(producer.toString).map(f => StatusCode.withName(f.asInstanceOf[String])) + } + + private def getTime(timespans: Map[String, AnyRef], producer: Producer): Option[Long] = { + timespans.get(producer.toString).map(f => f.asInstanceOf[Long]) + } + + private def getStat(obsrvMeta: Map[String, AnyRef], stat: Stats): Option[Long] = { + obsrvMeta.get(stat.toString).map(f => f.asInstanceOf[Long]) + } + + def getError(error: ErrorConstants.Error, producer: Producer, functionalError: FunctionalError): Option[ErrorLog] = { + Some(ErrorLog(pdata_id = producer, pdata_status = StatusCode.failed, error_type = functionalError, error_code = error.errorCode, error_message = error.errorMsg, error_level = ErrorLevel.critical, error_count = Some(1))) + } + + def generateSystemEvent(dataset: Option[String], event: mutable.Map[String, AnyRef], config: BaseJobConfig[_], producer: Producer, error: Option[ErrorLog] = None, dataset_type: Option[String] = None): String = { + val obsrvMeta = event("obsrv_meta").asInstanceOf[Map[String, AnyRef]] + val flags = obsrvMeta("flags").asInstanceOf[Map[String, AnyRef]] + val timespans = obsrvMeta("timespans").asInstanceOf[Map[String, AnyRef]] + + JSONUtil.serialize(SystemEvent( + EventID.METRIC, ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(producer)), dataset = dataset, dataset_type = dataset_type), + data = EData(error = error, pipeline_stats = Some(PipelineStats(extractor_events = None, + extractor_status = getStatus(flags, Producer.extractor), extractor_time = getTime(timespans, Producer.extractor), + validator_status = getStatus(flags, Producer.validator), validator_time = getTime(timespans, Producer.validator), + dedup_status = getStatus(flags, Producer.dedup), dedup_time = getTime(timespans, Producer.dedup), + denorm_status = getStatus(flags, Producer.denorm), denorm_time = getTime(timespans, Producer.denorm), + transform_status = getStatus(flags, Producer.transformer), transform_time = getTime(timespans, Producer.transformer), + total_processing_time = getStat(obsrvMeta, Stats.total_processing_time), latency_time = getStat(obsrvMeta, Stats.latency_time), processing_time = getStat(obsrvMeta, Stats.processing_time) + ))) + )) + } + + def getDatasetId(dataset: Option[String], config: BaseJobConfig[_]): String = { + dataset.getOrElse(config.defaultDatasetID) + } + +} + +abstract class BaseDatasetProcessFunction(config: BaseJobConfig[mutable.Map[String, AnyRef]]) + extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) with SystemEventHandler { + + override def open(parameters: Configuration): Unit = { + super.open(parameters) + } + + def getMetrics(): List[String] + + override def getMetricsList(): MetricsList = { + val metrics = getMetrics() ++ List(config.eventFailedMetricsCount) + MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + } + + private def initMetrics(datasetId: String): Unit = { + if(!metrics.hasDataset(datasetId)) { + val metricMap = new ConcurrentHashMap[String, AtomicLong]() + metricsList.metrics.map(metric => { + metricMap.put(metric, new AtomicLong(0L)) + getRuntimeContext.getMetricGroup.addGroup(config.jobName).addGroup(datasetId) + .gauge[Long, ScalaGauge[Long]](metric, ScalaGauge[Long](() => metrics.getAndReset(datasetId, metric))) + }) + metrics.initDataset(datasetId, metricMap) + } + } + + def 
markFailure(datasetId: Option[String], event: mutable.Map[String, AnyRef], ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, + metrics: Metrics, error: ErrorConstants.Error, producer: Producer, functionalError: FunctionalError, datasetType: Option[String] = None): Unit = { + + metrics.incCounter(getDatasetId(datasetId, config), config.eventFailedMetricsCount) + ctx.output(config.failedEventsOutputTag(), super.markFailed(event, error, producer)) + val errorLog = getError(error, producer, functionalError) + val systemEvent = generateSystemEvent(Some(getDatasetId(datasetId, config)), event, config, producer, errorLog, datasetType) + ctx.output(config.systemEventsOutputTag, systemEvent) + } + + def markCompletion(dataset: Dataset, event: mutable.Map[String, AnyRef], ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, producer: Producer): Unit = { + ctx.output(config.systemEventsOutputTag, generateSystemEvent(Some(dataset.id), super.markComplete(event, dataset.dataVersion), config, producer, dataset_type = Some(dataset.datasetType))) + } + + def processElement(dataset: Dataset, event: mutable.Map[String, AnyRef],context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit + override def processElement(event: mutable.Map[String, AnyRef], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { + + val datasetIdOpt = event.get(config.CONST_DATASET) + if (datasetIdOpt.isEmpty) { + markFailure(None, event, context, metrics, ErrorConstants.MISSING_DATASET_ID, Producer.validator, FunctionalError.MissingDatasetId) + return + } + val datasetId = datasetIdOpt.get.asInstanceOf[String] + initMetrics(datasetId) + val datasetOpt = DatasetRegistry.getDataset(datasetId) + if (datasetOpt.isEmpty) { + markFailure(Some(datasetId), event, context, metrics, ErrorConstants.MISSING_DATASET_CONFIGURATION, Producer.validator, FunctionalError.MissingDatasetId) + return + } + val dataset = datasetOpt.get + if (!super.containsEvent(event)) { + markFailure(Some(datasetId), event, context, metrics, ErrorConstants.EVENT_MISSING, Producer.validator, FunctionalError.MissingEventData) + return + } + processElement(dataset, event, context, metrics) + } +} + +abstract class BaseDatasetWindowProcessFunction(config: BaseJobConfig[mutable.Map[String, AnyRef]]) + extends WindowBaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String](config) with SystemEventHandler { + + override def open(parameters: Configuration): Unit = { + super.open(parameters) + } + + def getMetrics(): List[String] + + override def getMetricsList(): MetricsList = { + val metrics = getMetrics() ++ List(config.eventFailedMetricsCount) + MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + } + + private def initMetrics(datasetId: String): Unit = { + if(!metrics.hasDataset(datasetId)) { + val metricMap = new ConcurrentHashMap[String, AtomicLong]() + metricsList.metrics.map(metric => { + metricMap.put(metric, new AtomicLong(0L)) + getRuntimeContext.getMetricGroup.addGroup(config.jobName).addGroup(datasetId) + .gauge[Long, ScalaGauge[Long]](metric, ScalaGauge[Long](() => metrics.getAndReset(datasetId, metric))) + }) + metrics.initDataset(datasetId, metricMap) + } + } + + def markFailure(datasetId: Option[String], event: mutable.Map[String, AnyRef], ctx: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, 
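// A condensed sketch of the guard chain that processElement above applies before handing the
// event to the dataset-aware processElement. Types are simplified and the literal "dataset" and
// "event" keys stand in for config.CONST_DATASET and the containsEvent check.
def resolveDataset(event: Map[String, AnyRef], lookup: String => Option[String]): Either[String, String] = {
  event.get("dataset") match {
    case None => Left("MISSING_DATASET_ID")
    case Some(id: String) =>
      lookup(id) match {
        case None => Left("MISSING_DATASET_CONFIGURATION")
        case Some(dataset) =>
          if (!event.contains("event")) Left("EVENT_MISSING") else Right(dataset)
      }
    case Some(_) => Left("MISSING_DATASET_ID")
  }
}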
AnyRef], String, TimeWindow]#Context, + metrics: Metrics, error: ErrorConstants.Error, producer: Producer, functionalError: FunctionalError, datasetType: Option[String] = None): Unit = { + metrics.incCounter(getDatasetId(datasetId, config), config.eventFailedMetricsCount) + ctx.output(config.failedEventsOutputTag(), super.markFailed(event, error, producer)) + val errorLog = getError(error, producer, functionalError) + val systemEvent = generateSystemEvent(Some(getDatasetId(datasetId, config)), event, config, producer, errorLog, datasetType) + ctx.output(config.systemEventsOutputTag, systemEvent) + } + + def markCompletion(dataset: Dataset, event: mutable.Map[String, AnyRef], ctx: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, producer: Producer): Unit = { + ctx.output(config.systemEventsOutputTag, generateSystemEvent(Some(dataset.id), super.markComplete(event, dataset.dataVersion), config, producer, dataset_type = Some(dataset.datasetType))) + } + + def processWindow(dataset: Dataset, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, elements: List[mutable.Map[String, AnyRef]], metrics: Metrics): Unit + override def process(datasetId: String, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, elements: lang.Iterable[mutable.Map[String, AnyRef]], metrics: Metrics): Unit = { + + initMetrics(datasetId) + val datasetOpt = DatasetRegistry.getDataset(datasetId) + val eventsList = elements.asScala.toList + if (datasetOpt.isEmpty) { + eventsList.foreach(event => { + markFailure(Some(datasetId), event, context, metrics, ErrorConstants.MISSING_DATASET_CONFIGURATION, Producer.validator, FunctionalError.MissingDatasetId) + }) + return + } + val dataset = datasetOpt.get + val buffer = mutable.Buffer[mutable.Map[String, AnyRef]]() + eventsList.foreach(event => { + if (!super.containsEvent(event)) { + markFailure(Some(datasetId), event, context, metrics, ErrorConstants.EVENT_MISSING, Producer.validator, FunctionalError.MissingEventData) + } else { + buffer.append(event) + } + }) + + if(buffer.nonEmpty) { + processWindow(dataset, context, buffer.toList, metrics) + } + } +} \ No newline at end of file diff --git a/dataset-registry/src/test/resources/base-config.conf b/dataset-registry/src/test/resources/baseconfig.conf similarity index 100% rename from dataset-registry/src/test/resources/base-config.conf rename to dataset-registry/src/test/resources/baseconfig.conf diff --git a/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/BaseSpecWithDatasetRegistry.scala b/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/BaseSpecWithDatasetRegistry.scala index 116bbc54..09321143 100644 --- a/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/BaseSpecWithDatasetRegistry.scala +++ b/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/BaseSpecWithDatasetRegistry.scala @@ -4,11 +4,12 @@ import com.typesafe.config.{Config, ConfigFactory} import org.sunbird.obsrv.core.util.{PostgresConnect, PostgresConnectionConfig} import org.sunbird.spec.BaseSpecWithPostgres -class BaseSpecWithDatasetRegistry extends BaseSpecWithPostgres { +import scala.collection.mutable +class BaseSpecWithDatasetRegistry extends BaseSpecWithPostgres { val config: Config = ConfigFactory.load("test.conf") - val postgresConfig = PostgresConnectionConfig( + val postgresConfig: PostgresConnectionConfig = PostgresConnectionConfig( user = 
config.getString("postgres.user"), password = config.getString("postgres.password"), database = "postgres", @@ -17,31 +18,46 @@ class BaseSpecWithDatasetRegistry extends BaseSpecWithPostgres { maxConnections = config.getInt("postgres.maxConnections") ) - override def beforeAll() { + override def beforeAll(): Unit = { super.beforeAll() val postgresConnect = new PostgresConnect(postgresConfig) + createSystemSettings(postgresConnect) createSchema(postgresConnect) insertTestData(postgresConnect) + postgresConnect.closeConnection() } override def afterAll(): Unit = { + val postgresConnect = new PostgresConnect(postgresConfig) + clearSystemSettings(postgresConnect) super.afterAll() } - private def createSchema(postgresConnect: PostgresConnect) { + private def createSchema(postgresConnect: PostgresConnect): Unit = { postgresConnect.execute("CREATE TABLE IF NOT EXISTS datasets ( id text PRIMARY KEY, type text NOT NULL, validation_config json, extraction_config json, dedup_config json, data_schema json, denorm_config json, router_config json NOT NULL, dataset_config json NOT NULL, status text NOT NULL, tags text[], data_version INT, created_by text NOT NULL, updated_by text NOT NULL, created_date timestamp NOT NULL, updated_date timestamp NOT NULL );") postgresConnect.execute("CREATE TABLE IF NOT EXISTS datasources ( id text PRIMARY KEY, dataset_id text REFERENCES datasets (id), ingestion_spec json NOT NULL, datasource text NOT NULL, datasource_ref text NOT NULL, retention_period json, archival_policy json, purge_policy json, backup_config json NOT NULL, status text NOT NULL, created_by text NOT NULL, updated_by text NOT NULL, created_date Date NOT NULL, updated_date Date NOT NULL );") - postgresConnect.execute("CREATE TABLE IF NOT EXISTS dataset_transformations ( id text PRIMARY KEY, dataset_id text REFERENCES datasets (id), field_key text NOT NULL, transformation_function text NOT NULL, status text NOT NULL, created_by text NOT NULL, updated_by text NOT NULL, created_date Date NOT NULL, updated_date Date NOT NULL, UNIQUE(field_key, dataset_id) );") - postgresConnect.execute("CREATE TABLE IF NOT EXISTS dataset_source_config ( id SERIAL PRIMARY KEY, dataset_id text NOT NULL REFERENCES datasets (id), connector_type text NOT NULL, connector_config json NOT NULL, status text NOT NULL, created_by text NOT NULL, updated_by text NOT NULL, created_date Date NOT NULL, updated_date Date NOT NULL, UNIQUE(dataset_id) );") + postgresConnect.execute("CREATE TABLE IF NOT EXISTS dataset_transformations ( id text PRIMARY KEY, dataset_id text REFERENCES datasets (id), field_key text NOT NULL, transformation_function json NOT NULL, status text NOT NULL, mode text, created_by text NOT NULL, updated_by text NOT NULL, created_date Date NOT NULL, updated_date Date NOT NULL, UNIQUE(field_key, dataset_id) );") + postgresConnect.execute("CREATE TABLE IF NOT EXISTS dataset_source_config ( id text PRIMARY KEY, dataset_id text NOT NULL REFERENCES datasets (id), connector_type text NOT NULL, connector_config json NOT NULL, status text NOT NULL, connector_stats json, created_by text NOT NULL, updated_by text NOT NULL, created_date Date NOT NULL, updated_date Date NOT NULL, UNIQUE(connector_type, dataset_id) );") } private def insertTestData(postgresConnect: PostgresConnect): Unit = { - postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, extraction_config, dedup_config, router_config, dataset_config, status, data_version, created_by, updated_by, created_date, updated_date) values ('d1', 
'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"validate\": true, \"mode\": \"Strict\"}', '{\"is_batch_event\": true, \"extraction_key\": \"events\", \"dedup_config\": {\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}}', '{\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}', '{\"topic\":\"d1-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":\"6340\",\"redis_db\":2}', 'ACTIVE', 2, 'System', 'System', now(), now());") - postgresConnect.execute("update datasets set denorm_config = '{\"redis_db_host\":\"localhost\",\"redis_db_port\":\"6340\",\"denorm_fields\":[{\"denorm_key\":\"vehicleCode\",\"redis_db\":2,\"denorm_out_field\":\"vehicleData\"}]}' where id='d1';") - postgresConnect.execute("insert into dataset_transformations values('tf1', 'd1', 'dealer.email', '{\"type\":\"mask\",\"expr\":\"dealer.email\"}', 'active', 'System', 'System', now(), now());") - postgresConnect.execute("insert into dataset_transformations values('tf2', 'd1', 'dealer.maskedPhone', '{\"type\":\"mask\",\"expr\": \"dealer.phone\"}', 'active', 'System', 'System', now(), now());") - postgresConnect.execute("insert into datasets(id, type, data_schema, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d2', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"topic\":\"d1-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'ACTIVE', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, extraction_config, dedup_config, router_config, dataset_config, status, data_version, created_by, updated_by, created_date, updated_date) values ('d1', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test 
Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"validate\": true, \"mode\": \"Strict\"}', '{\"is_batch_event\": true, \"extraction_key\": \"events\", \"dedup_config\": {\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}}', '{\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}', '{\"topic\":\"d1-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":"+config.getInt("redis.port")+",\"redis_db\":2}', 'Live', 2, 'System', 'System', now(), now());") + postgresConnect.execute("update datasets set denorm_config = '{\"redis_db_host\":\"localhost\",\"redis_db_port\":"+config.getInt("redis.port")+",\"denorm_fields\":[{\"denorm_key\":\"vehicleCode\",\"redis_db\":2,\"denorm_out_field\":\"vehicleData\"}]}' where id='d1';") + postgresConnect.execute("insert into dataset_transformations values('tf1', 'd1', 'dealer.email', '{\"type\":\"mask\",\"expr\":\"dealer.email\"}', 'Live', 'Strict', 'System', 'System', now(), now());") + postgresConnect.execute("insert into dataset_transformations values('tf2', 'd1', 'dealer.maskedPhone', '{\"type\":\"mask\",\"expr\": \"dealer.phone\"}', 'Live', null, 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date, tags) values ('d2', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now(), ARRAY['Tag1','Tag2']);") + } + + def getPrintableMetrics(metricsMap: mutable.Map[String, Long]): Map[String, Map[String, Map[String, Long]]] = { + metricsMap.map(f => { + val keys = f._1.split('.') + val metricValue = f._2 + val jobId = keys.apply(0) + val datasetId = keys.apply(1) + val metric = keys.apply(2) + (jobId, datasetId, metric, metricValue) + }).groupBy(f => f._1).mapValues(f => f.map(p => (p._2, p._3, p._4))).mapValues(f => f.groupBy(p => p._1).mapValues(q => q.map(r => (r._2, r._3)).toMap)) } -} +} \ No newline at end of file diff --git 
a/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/TestDatasetRegistrySpec.scala b/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/TestDatasetRegistrySpec.scala index 27df2676..b37e801a 100644 --- a/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/TestDatasetRegistrySpec.scala +++ b/dataset-registry/src/test/scala/org/sunbird/obsrv/spec/TestDatasetRegistrySpec.scala @@ -2,32 +2,98 @@ package org.sunbird.obsrv.spec import org.scalatest.Matchers import org.scalatestplus.mockito.MockitoSugar +import org.sunbird.obsrv.core.util.PostgresConnect import org.sunbird.obsrv.registry.DatasetRegistry +import java.sql.Timestamp +import java.time.{LocalDateTime, ZoneOffset} + class TestDatasetRegistrySpec extends BaseSpecWithDatasetRegistry with Matchers with MockitoSugar { "TestDatasetRegistrySpec" should "validate all the registry service methods" in { val d1Opt = DatasetRegistry.getDataset("d1") - d1Opt should not be (None) - d1Opt.get.id should be ("d1") - d1Opt.get.dataVersion.get should be (2) + d1Opt should not be None + d1Opt.get.id should be("d1") + d1Opt.get.dataVersion.get should be(2) val d2Opt = DatasetRegistry.getDataset("d2") - d2Opt should not be (None) - d2Opt.get.id should be ("d2") - d2Opt.get.denormConfig should be (None) + d2Opt should not be None + d2Opt.get.id should be("d2") + d2Opt.get.denormConfig should be(None) + + val postgresConnect = new PostgresConnect(postgresConfig) + postgresConnect.execute("insert into datasets(id, type, data_schema, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date, tags) values ('d3', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now(), ARRAY['Tag1','Tag2']);") + postgresConnect.closeConnection() + + val d3Opt = DatasetRegistry.getDataset("d3") + d3Opt should not be None + d3Opt.get.id should be("d3") + d3Opt.get.denormConfig should be(None) + + val d4Opt = DatasetRegistry.getDataset("d4") + d4Opt should be (None) val allDatasets = DatasetRegistry.getAllDatasets("dataset") - allDatasets.size should be (2) + allDatasets.size should be(3) val d1Tfs = DatasetRegistry.getDatasetTransformations("d1") - d1Tfs should not be (None) - d1Tfs.get.size should be (2) + d1Tfs should not be None + d1Tfs.get.size should be(2) + + val ids = DatasetRegistry.getDataSetIds("dataset").sortBy(f => f) + ids.head should be("d1") + ids.apply(1) should be("d2") + ids.apply(2) should be("d3") + + DatasetRegistry.getAllDatasetSourceConfig().get.size should be(2) + val datasetSourceConfigList = DatasetRegistry.getDatasetSourceConfigById("d1").get + val datasetSourceConfig = datasetSourceConfigList.filter(f => f.id.equals("sc1")).head + 
datasetSourceConfig.id should be("sc1") + datasetSourceConfig.datasetId should be("d1") + datasetSourceConfig.connectorType should be("kafka") + datasetSourceConfig.status should be("Live") + + val instant1 = LocalDateTime.now(ZoneOffset.UTC) + DatasetRegistry.updateConnectorStats("sc1", Timestamp.valueOf(instant1), 20L) + DatasetRegistry.updateConnectorDisconnections("sc1", 2) + DatasetRegistry.updateConnectorDisconnections("sc1", 4) + DatasetRegistry.updateConnectorAvgBatchReadTime("sc1", 4) + DatasetRegistry.updateConnectorAvgBatchReadTime("sc1", 5) + val instant2 = LocalDateTime.now(ZoneOffset.UTC) - val ids = DatasetRegistry.getDataSetIds("dataset") - ids.head should be ("d1") - ids.last should be ("d2") + DatasetRegistry.updateConnectorStats("sc1", Timestamp.valueOf(instant2), 60L) + val datasetSourceConfigList2 = DatasetRegistry.getDatasetSourceConfigById("d1").get + val datasetSourceConfig2 = datasetSourceConfigList2.filter(f => f.id.equals("sc1")).head + datasetSourceConfig2.connectorStats.get.records should be(80) + datasetSourceConfig2.connectorStats.get.disconnections should be(4) + datasetSourceConfig2.connectorStats.get.avgBatchReadTime should be(5) + datasetSourceConfig2.connectorStats.get.lastFetchTimestamp.getTime should be(instant2.toInstant(ZoneOffset.UTC).toEpochMilli) + + val datasource = DatasetRegistry.getDatasources("d1").get.head + datasource.datasetId should be("d1") + datasource.datasource should be("d1-datasource") + datasource.datasourceRef should be("d1-datasource-1") + + DatasetRegistry.updateDatasourceRef(datasource, "d1-datasource-2") + val datasource2 = DatasetRegistry.getDatasources("d1").get.head + datasource2.datasourceRef should be("d1-datasource-2") + + DatasetRegistry.getDatasources("d2").get.nonEmpty should be(false) } -} + override def beforeAll(): Unit = { + super.beforeAll() + prepareTestData() + } + + private def prepareTestData(): Unit = { + val postgresConnect = new PostgresConnect(postgresConfig) + postgresConnect.execute("insert into dataset_source_config values('sc1', 'd1', 'kafka', '{\"kafkaBrokers\":\"localhost:9090\",\"topic\":\"test-topic\"}', 'Live', null, 'System', 'System', now(), now());") + postgresConnect.execute("insert into dataset_source_config values('sc2', 'd1', 'rdbms', '{\"type\":\"postgres\",\"tableName\":\"test-table\"}', 'Live', null, 'System', 'System', now(), now());") + + //postgresConnect.execute("CREATE TABLE IF NOT EXISTS datasources ( id text PRIMARY KEY, dataset_id text REFERENCES datasets (id), ingestion_spec json NOT NULL, datasource text NOT NULL, datasource_ref text NOT NULL, retention_period json, archival_policy json, purge_policy json, backup_config json NOT NULL, status text NOT NULL, created_by text NOT NULL, updated_by text NOT NULL, created_date Date NOT NULL, updated_date Date NOT NULL );") + postgresConnect.execute("insert into datasources values('ds1', 'd1', '{}', 'd1-datasource', 'd1-datasource-1', null, null, null, '{}', 'Live', 'System', 'System', now(), now());") + postgresConnect.closeConnection() + } +} \ No newline at end of file diff --git a/framework/pom.xml b/framework/pom.xml index 31402224..52ced63f 100644 --- a/framework/pom.xml +++ b/framework/pom.xml @@ -35,7 +35,6 @@ org.apache.kafka kafka-clients ${kafka.version} - provided joda-time @@ -90,6 +89,12 @@ 3.0.6 test + + org.scalamock + scalamock_2.12 + 5.2.0 + test + junit junit @@ -134,9 +139,9 @@ test - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 test diff --git a/framework/src/main/resources/baseconfig.conf 
b/framework/src/main/resources/baseconfig.conf index 5dc27105..e41f7e4b 100644 --- a/framework/src/main/resources/baseconfig.conf +++ b/framework/src/main/resources/baseconfig.conf @@ -8,6 +8,7 @@ kafka { compression = "snappy" } output.system.event.topic = ${job.env}".system.events" + output.failed.topic = ${job.env}".failed" } job { @@ -54,9 +55,4 @@ postgres { user = "postgres" password = "postgres" database = "postgres" -} - -lms-cassandra { - host = "localhost" - port = "9042" } \ No newline at end of file diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/cache/DedupEngine.scala b/framework/src/main/scala/org/sunbird/obsrv/core/cache/DedupEngine.scala index f04b0015..a03477b7 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/cache/DedupEngine.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/cache/DedupEngine.scala @@ -1,45 +1,23 @@ package org.sunbird.obsrv.core.cache -import org.slf4j.LoggerFactory import redis.clients.jedis.Jedis import redis.clients.jedis.exceptions.JedisException class DedupEngine(redisConnect: RedisConnect, store: Int, expirySeconds: Int) extends Serializable { - private[this] val logger = LoggerFactory.getLogger(classOf[DedupEngine]) - private val serialVersionUID = 6089562751616425354L private[this] var redisConnection: Jedis = redisConnect.getConnection redisConnection.select(store) @throws[JedisException] def isUniqueEvent(checksum: String): Boolean = { - var unique = false - try { - unique = !redisConnection.exists(checksum) - } catch { - case ex: JedisException => - logger.error("DedupEngine:isUniqueEvent() - Exception", ex) - this.redisConnection.close() - this.redisConnection = redisConnect.getConnection(this.store, backoffTimeInMillis = 10000) - unique = !this.redisConnection.exists(checksum) - } - unique + !redisConnection.exists(checksum) } @throws[JedisException] def storeChecksum(checksum: String): Unit = { - try - redisConnection.setex(checksum, expirySeconds, "") - catch { - case ex: JedisException => - logger.error("DedupEngine:storeChecksum() - Exception", ex) - this.redisConnection.close() - this.redisConnection = redisConnect.getConnection(this.store, backoffTimeInMillis = 10000) - this.redisConnection.select(this.store) - this.redisConnection.setex(checksum, expirySeconds, "") - } + redisConnection.setex(checksum, expirySeconds, "") } def getRedisConnection: Jedis = redisConnection diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/cache/RedisConnect.scala b/framework/src/main/scala/org/sunbird/obsrv/core/cache/RedisConnect.scala index 1bffdb7a..d96d6610 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/cache/RedisConnect.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/cache/RedisConnect.scala @@ -16,6 +16,7 @@ class RedisConnect(redisHost: String, redisPort: Int, defaultTimeOut: Int) exten catch { case e: InterruptedException => logger.error("RedisConnect:getConnection() - Exception", e) + e.printStackTrace() } // $COVERAGE-ON$ logger.info("Obtaining new Redis connection...") diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/model/Constants.scala b/framework/src/main/scala/org/sunbird/obsrv/core/model/Constants.scala new file mode 100644 index 00000000..2cfbd307 --- /dev/null +++ b/framework/src/main/scala/org/sunbird/obsrv/core/model/Constants.scala @@ -0,0 +1,18 @@ +package org.sunbird.obsrv.core.model + +object Constants { + + val EVENT = "event" + val INVALID_JSON = "invalid_json" + val OBSRV_META = "obsrv_meta" + val SRC = "src" + val ERROR_CODE = "error_code" + 
val ERROR_MSG = "error_msg" + val ERROR_REASON = "error_reason" + val FAILED = "failed" + val ERROR = "error" + val LEVEL = "level" + val TOPIC = "topic" + val MESSAGE = "message" + +} diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/model/ErrorConstants.scala b/framework/src/main/scala/org/sunbird/obsrv/core/model/ErrorConstants.scala index bbcd5828..efac0fbe 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/model/ErrorConstants.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/model/ErrorConstants.scala @@ -9,23 +9,32 @@ object ErrorConstants extends Enumeration { } val NO_IMPLEMENTATION_FOUND = ErrorInternalValue("ERR_0001", "Unimplemented method") - val NO_EXTRACTION_DATA_FOUND = ErrorInternalValue("ERR_EXT_1001", "Unable to extract the data from the extraction key") - val EXTRACTED_DATA_NOT_A_LIST = ErrorInternalValue("ERR_EXT_1002", "The extracted data is not an list") - val EVENT_SIZE_EXCEEDED = ErrorInternalValue("ERR_EXT_1003", ("Event size has exceeded max configured size of " + SystemConfig.maxEventSize)) - val EVENT_MISSING = ErrorInternalValue("ERR_EXT_1006", "Event missing in the batch event") + val EXTRACTED_DATA_NOT_A_LIST = ErrorInternalValue("ERR_EXT_1002", "The extracted data is not a list") + val EVENT_SIZE_EXCEEDED = ErrorInternalValue("ERR_EXT_1003", "Event size has exceeded max configured size") val MISSING_DATASET_ID = ErrorInternalValue("ERR_EXT_1004", "Dataset Id is missing from the data") val MISSING_DATASET_CONFIGURATION = ErrorInternalValue("ERR_EXT_1005", "Dataset configuration is missing") - + val EVENT_MISSING = ErrorInternalValue("ERR_EXT_1006", "Event missing in the batch event") val NO_DEDUP_KEY_FOUND = ErrorInternalValue("ERR_DEDUP_1007", "No dedup key found or missing data") - val DEDUP_KEY_NOT_A_STRING = ErrorInternalValue("ERR_DEDUP_1008", "Dedup key value is not a String or Text") + val DEDUP_KEY_NOT_A_STRING_OR_NUMBER = ErrorInternalValue("ERR_DEDUP_1008", "Dedup key value is not a String or Text") val DUPLICATE_BATCH_EVENT_FOUND = ErrorInternalValue("ERR_EXT_1009", "Duplicate batch event found") - val DUPLICATE_EVENT_FOUND = ErrorInternalValue("ERR_PP_1010", "Duplicate event found") val JSON_SCHEMA_NOT_FOUND = ErrorInternalValue("ERR_PP_1011", "Json schema not found for the dataset") val INVALID_JSON_SCHEMA = ErrorInternalValue("ERR_PP_1012", "Invalid json schema") val SCHEMA_VALIDATION_FAILED = ErrorInternalValue("ERR_PP_1013", "Event failed the schema validation") - val DENORM_KEY_MISSING = ErrorInternalValue("ERR_DENORM_1014", "No denorm key found or missing data for the specified key") - val DENORM_KEY_NOT_A_STRING = ErrorInternalValue("ERR_DENORM_1015", "Denorm key value is not a String or Text") + val DENORM_KEY_NOT_A_STRING_OR_NUMBER = ErrorInternalValue("ERR_DENORM_1015", "Denorm key value is not a String or Number") + val DENORM_DATA_NOT_FOUND = ErrorInternalValue("ERR_DENORM_1016", "Denorm data not found for the given key") + val MISSING_DATASET_CONFIG_KEY = ErrorInternalValue("ERR_MASTER_DATA_1017", "Master dataset configuration key is missing") + val ERR_INVALID_EVENT = ErrorInternalValue("ERR_EXT_1018", "Invalid JSON event, error while deserializing the event") + val INDEX_KEY_MISSING_OR_BLANK = ErrorInternalValue("ERR_ROUTER_1019", "Unable to index data as the timestamp key is missing or blank or not a datetime value") + val INVALID_EXPR_FUNCTION = ErrorInternalValue("ERR_TRANSFORM_1020", "Transformation expression function is not valid") + val ERR_EVAL_EXPR_FUNCTION = 
ErrorInternalValue("ERR_TRANSFORM_1021", "Unable to evaluate the transformation expression function") + val ERR_UNKNOWN_TRANSFORM_EXCEPTION = ErrorInternalValue("ERR_TRANSFORM_1022", "Unable to evaluate the transformation expression function") + val ERR_TRANSFORMATION_FAILED = ErrorInternalValue("ERR_TRANSFORM_1023", "Atleast one mandatory transformation has failed") + val TRANSFORMATION_FIELD_MISSING = ErrorInternalValue("ERR_TRANSFORM_1024", "Transformation field is either missing or blank") + val SYSTEM_SETTING_INVALID_TYPE = ErrorInternalValue("ERR_SYSTEM_SETTING_1025", "Invalid value type for system setting") + val SYSTEM_SETTING_NOT_FOUND = ErrorInternalValue("ERR_SYSTEM_SETTING_1026", "System setting not found for requested key") + val SYSTEM_SETTING_DEFAULT_VALUE_NOT_FOUND = ErrorInternalValue("ERR_SYSTEM_SETTING_1027", "Default value not found for requested key") + } diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/model/Models.scala b/framework/src/main/scala/org/sunbird/obsrv/core/model/Models.scala index ebdbd315..87c91486 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/model/Models.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/model/Models.scala @@ -1,9 +1,83 @@ package org.sunbird.obsrv.core.model +import com.fasterxml.jackson.core.`type`.TypeReference +import com.fasterxml.jackson.module.scala.JsonScalaEnumeration +import org.sunbird.obsrv.core.model.ErrorLevel.ErrorLevel +import org.sunbird.obsrv.core.model.EventID.EventID +import org.sunbird.obsrv.core.model.FunctionalError.FunctionalError +import org.sunbird.obsrv.core.model.ModuleID.ModuleID +import org.sunbird.obsrv.core.model.PDataType.PDataType +import org.sunbird.obsrv.core.model.Producer.Producer +import org.sunbird.obsrv.core.model.StatusCode.StatusCode +import com.fasterxml.jackson.annotation.JsonProperty +import org.sunbird.obsrv.core.exception.ObsrvException + object Models { - case class PData(val id: String, val `type`: String, val pid: String) - - case class SystemEvent(val pdata: PData, data: Map[String, AnyRef] ) + case class PData(id: String,@JsonScalaEnumeration(classOf[PDataTypeType]) `type`: PDataType,@JsonScalaEnumeration(classOf[ProducerType]) pid: Option[Producer]) + + case class ContextData(@JsonScalaEnumeration(classOf[ModuleIDType]) module: ModuleID, pdata: PData, dataset: Option[String] = None, dataset_type: Option[String] = None, eid: Option[String] = None) + + case class ErrorLog(@JsonScalaEnumeration(classOf[ProducerType]) pdata_id: Producer, @JsonScalaEnumeration(classOf[StatusCodeType]) pdata_status: StatusCode, @JsonScalaEnumeration(classOf[FunctionalErrorType]) error_type: FunctionalError, error_code: String, error_message: String,@JsonScalaEnumeration(classOf[ErrorLevelType]) error_level: ErrorLevel, error_count:Option[Int] = None) + + case class PipelineStats(extractor_events: Option[Int] = None, @JsonScalaEnumeration(classOf[StatusCodeType]) extractor_status: Option[StatusCode] = None, + extractor_time: Option[Long] = None, @JsonScalaEnumeration(classOf[StatusCodeType]) validator_status: Option[StatusCode] = None, validator_time: Option[Long] = None, + @JsonScalaEnumeration(classOf[StatusCodeType]) dedup_status: Option[StatusCode] = None, dedup_time: Option[Long] = None, @JsonScalaEnumeration(classOf[StatusCodeType]) denorm_status: Option[StatusCode] = None, + denorm_time: Option[Long] = None, @JsonScalaEnumeration(classOf[StatusCodeType]) transform_status: Option[StatusCode] = None, transform_time: Option[Long] = None, + total_processing_time: 
Option[Long] = None, latency_time: Option[Long] = None, processing_time: Option[Long] = None) + + case class EData(error: Option[ErrorLog] = None, pipeline_stats: Option[PipelineStats] = None, extra: Option[Map[String, AnyRef]] = None) + + case class SystemEvent(@JsonScalaEnumeration(classOf[EventIDType]) etype: EventID, ctx: ContextData, data: EData, ets: Long = System.currentTimeMillis()) + case class SystemSetting(key: String, value: String, category: String, valueType: String, label: Option[String]) + +} + +class EventIDType extends TypeReference[EventID.type] +object EventID extends Enumeration { + type EventID = Value + val LOG, METRIC = Value +} + +class ErrorLevelType extends TypeReference[ErrorLevel.type] +object ErrorLevel extends Enumeration { + type ErrorLevel = Value + val debug, info, warn, critical = Value +} + +class FunctionalErrorType extends TypeReference[FunctionalError.type] +object FunctionalError extends Enumeration { + type FunctionalError = Value + val DedupFailed, RequiredFieldsMissing, DataTypeMismatch, AdditionalFieldsFound, UnknownValidationError, MissingDatasetId, MissingEventData, MissingTimestampKey, + EventSizeExceeded, ExtractionDataFormatInvalid, DenormKeyMissing, DenormKeyInvalid, DenormDataNotFound, InvalidJsonData, + TransformParseError, TransformEvalError, TransformFailedError, MissingMasterDatasetKey, TransformFieldMissing = Value +} + +class ProducerType extends TypeReference[Producer.type] +object Producer extends Enumeration { + type Producer = Value + val extractor, dedup, validator, denorm, transformer, router, masterdataprocessor = Value +} + +class ModuleIDType extends TypeReference[ModuleID.type] +object ModuleID extends Enumeration { + type ModuleID = Value + val ingestion, processing, storage, query = Value +} + +class StatusCodeType extends TypeReference[StatusCode.type] +object StatusCode extends Enumeration { + type StatusCode = Value + val success, failed, skipped, partial = Value +} + +class PDataTypeType extends TypeReference[PDataType.type] +object PDataType extends Enumeration { + type PDataType = Value + val flink, spark, druid, kafka, api = Value +} +object Stats extends Enumeration { + type Stats = Value + val total_processing_time, latency_time, processing_time = Value } diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/model/SystemConfig.scala b/framework/src/main/scala/org/sunbird/obsrv/core/model/SystemConfig.scala index ee21152f..118e8c53 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/model/SystemConfig.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/model/SystemConfig.scala @@ -1,13 +1,117 @@ package org.sunbird.obsrv.core.model +import com.typesafe.config.{Config, ConfigFactory} +import org.sunbird.obsrv.core.exception.ObsrvException +import org.sunbird.obsrv.core.model.Models.SystemSetting +import org.sunbird.obsrv.core.util.{PostgresConnect, PostgresConnectionConfig} + +import java.io.File +import java.sql.ResultSet + object SystemConfig { - // TODO: Fetch the system config from postgres db - val defaultDedupPeriodInSeconds: Int = 604800 // 7 days - val maxEventSize: Long = 1048576 - val defaultDatasetId = "ALL" + private def getSystemConfig(key: String): Option[SystemSetting] = { + SystemConfigService.getSystemSetting(key) + } + + @throws[ObsrvException] + private def getConfigValueOpt(key: String, requiredType: String): Option[String] = { + + getSystemConfig(key).map(config => { + if (!config.valueType.equalsIgnoreCase(requiredType)) throw new 
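// The Models above rely on jackson-module-scala's enumeration support: each scala Enumeration
// gets a TypeReference subclass and the field is annotated with @JsonScalaEnumeration so the
// value is (de)serialized by name. A self-contained sketch with made-up names:
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, JsonScalaEnumeration}

object Color extends Enumeration { type Color = Value; val red, green = Value }
class ColorType extends TypeReference[Color.type]
case class Paint(@JsonScalaEnumeration(classOf[ColorType]) color: Color.Color)

object EnumJsonSketch {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
    val json = mapper.writeValueAsString(Paint(Color.red))   // {"color":"red"}
    println(mapper.readValue(json, classOf[Paint]))          // Paint(red)
  }
}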
ObsrvException(ErrorConstants.SYSTEM_SETTING_INVALID_TYPE) + config.value + }).orElse(None) + } + + private def getConfigValue(key: String, requiredType: String): String = { + + getSystemConfig(key).map(config => { + if (!config.valueType.equalsIgnoreCase(requiredType)) throw new ObsrvException(ErrorConstants.SYSTEM_SETTING_INVALID_TYPE) + config.value + }).orElse(throw new ObsrvException(ErrorConstants.SYSTEM_SETTING_NOT_FOUND)).get + } + + def getString(key: String): String = { + getConfigValue(key, requiredType = "string") + } + + def getString(key: String, defaultValue: String): String = { + getConfigValueOpt(key, requiredType = "string").getOrElse(defaultValue) + } + + def getInt(key: String): Int = { + getConfigValue(key, requiredType = "int").toInt + } + + def getInt(key: String, defaultValue: Int): Int = { + getConfigValueOpt(key, requiredType = "int").getOrElse(defaultValue.toString).toInt + } + + def getLong(key: String): Long = { + getConfigValue(key, requiredType = "long").toLong + } + + def getLong(key: String, defaultValue: Long): Long = { + getConfigValueOpt(key, requiredType = "long").getOrElse(defaultValue.toString).toLong + } + + def getBoolean(key: String): Boolean = { + getConfigValue(key, requiredType = "boolean").toBoolean + } + + def getBoolean(key: String, defaultValue: Boolean): Boolean = { + getConfigValueOpt(key, requiredType = "boolean").getOrElse(defaultValue.toString).toBoolean + } + +} + +object SystemConfigService { + + private val configFile = new File("/data/flink/conf/baseconfig.conf") + // $COVERAGE-OFF$ + val config: Config = if (configFile.exists()) { + println("Loading configuration file cluster baseconfig.conf...") + ConfigFactory.parseFile(configFile).resolve() + } else { + // $COVERAGE-ON$ + println("Loading configuration file baseconfig.conf inside the jar...") + ConfigFactory.load("baseconfig.conf").withFallback(ConfigFactory.systemEnvironment()) + } + private val postgresConfig = PostgresConnectionConfig( + config.getString("postgres.user"), + config.getString("postgres.password"), + config.getString("postgres.database"), + config.getString("postgres.host"), + config.getInt("postgres.port"), + config.getInt("postgres.maxConnections")) + + @throws[Exception] + def getAllSystemSettings: List[SystemSetting] = { + val postgresConnect = new PostgresConnect(postgresConfig) + val rs = postgresConnect.executeQuery("SELECT * FROM system_settings") + val result = Iterator.continually((rs, rs.next)).takeWhile(f => f._2).map(f => f._1).map(result => { + parseSystemSetting(result) + }).toList + postgresConnect.closeConnection() + result + } + + @throws[Exception] + def getSystemSetting(key: String): Option[SystemSetting] = { + val postgresConnect = new PostgresConnect(postgresConfig) + val rs = postgresConnect.executeQuery(s"SELECT * FROM system_settings WHERE key = '$key'") + if (rs.next) { + Option(parseSystemSetting(rs)) + } else None + } - // secret key length should be 16, 24 or 32 characters - val encryptionSecretKey = "ckW5GFkTtMDNGEr5k67YpQMEBJNX3x2f" + private def parseSystemSetting(rs: ResultSet): SystemSetting = { + val key = rs.getString("key") + val value = rs.getString("value") + val category = rs.getString("category") + val valueType = rs.getString("valuetype") + val label = rs.getString("label") + SystemSetting(key, value, category, valueType, Option(label)) + } } diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/serde/MapSerde.scala b/framework/src/main/scala/org/sunbird/obsrv/core/serde/SerdeUtil.scala similarity index 55% rename 
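// Usage sketch for the typed getters above (the "maxEventSize" key is illustrative; actual keys
// live in the system_settings table). A key with no default raises SYSTEM_SETTING_NOT_FOUND and
// a value-type mismatch raises SYSTEM_SETTING_INVALID_TYPE, per getConfigValue above.
def readPipelineSettings(): (Long, String) = {
  val maxEventSize = SystemConfig.getLong("maxEventSize", 1048576L)          // falls back to the default
  val defaultDatasetId = SystemConfig.getString("defaultDatasetId", "ALL")   // key used by BaseJobConfig below
  (maxEventSize, defaultDatasetId)
}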
from framework/src/main/scala/org/sunbird/obsrv/core/serde/MapSerde.scala rename to framework/src/main/scala/org/sunbird/obsrv/core/serde/SerdeUtil.scala index 299bab95..56525db4 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/serde/MapSerde.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/serde/SerdeUtil.scala @@ -1,7 +1,5 @@ package org.sunbird.obsrv.core.serde -import java.nio.charset.StandardCharsets - import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.java.typeutils.TypeExtractor import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema @@ -9,17 +7,26 @@ import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDe import org.apache.flink.util.Collector import org.apache.kafka.clients.consumer.ConsumerRecord import org.apache.kafka.clients.producer.ProducerRecord +import org.sunbird.obsrv.core.model.Constants import org.sunbird.obsrv.core.util.JSONUtil + +import java.nio.charset.StandardCharsets import scala.collection.mutable class MapDeserializationSchema extends KafkaRecordDeserializationSchema[mutable.Map[String, AnyRef]] { private val serialVersionUID = -3224825136576915426L + override def getProducedType: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]], out: Collector[mutable.Map[String, AnyRef]]): Unit = { - val msg = JSONUtil.deserialize[mutable.Map[String, AnyRef]](record.value()) + val msg = try { + JSONUtil.deserialize[mutable.Map[String, AnyRef]](record.value()) + } catch { + case _: Exception => + mutable.Map[String, AnyRef](Constants.INVALID_JSON -> new String(record.value, "UTF-8")) + } initObsrvMeta(msg, record) out.collect(msg) } @@ -35,16 +42,37 @@ class MapDeserializationSchema extends KafkaRecordDeserializationSchema[mutable. 
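// Assumed downstream handling (not shown in this file): since malformed JSON is now collected as
// a single-entry map keyed by Constants.INVALID_JSON instead of failing deserialization, a
// process function can detect it cheaply and route the raw payload to the failed topic.
def isInvalidJson(event: scala.collection.mutable.Map[String, AnyRef]): Boolean =
  event.contains(Constants.INVALID_JSON)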
)) } } + } -class MapSerializationSchema(topic: String, key: Option[String] = None) extends KafkaRecordSerializationSchema[mutable.Map[String, AnyRef]] { +class StringDeserializationSchema extends KafkaRecordDeserializationSchema[String] { + + private val serialVersionUID = -3224825136576915426L + + override def getProducedType: TypeInformation[String] = TypeExtractor.getForClass(classOf[String]) + + override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]], out: Collector[String]): Unit = { + out.collect(new String(record.value(), StandardCharsets.UTF_8)) + } +} + +class SerializationSchema[T](topic: String) extends KafkaRecordSerializationSchema[T] { private val serialVersionUID = -4284080856874185929L - override def serialize(element: mutable.Map[String, AnyRef], context: KafkaRecordSerializationSchema.KafkaSinkContext, timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = { + override def serialize(element: T, context: KafkaRecordSerializationSchema.KafkaSinkContext, timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = { val out = JSONUtil.serialize(element) - key.map { kafkaKey => - new ProducerRecord[Array[Byte], Array[Byte]](topic, kafkaKey.getBytes(StandardCharsets.UTF_8), out.getBytes(StandardCharsets.UTF_8)) - }.getOrElse(new ProducerRecord[Array[Byte], Array[Byte]](topic, out.getBytes(StandardCharsets.UTF_8))) + new ProducerRecord[Array[Byte], Array[Byte]](topic, out.getBytes(StandardCharsets.UTF_8)) } } + +class DynamicMapSerializationSchema() extends KafkaRecordSerializationSchema[mutable.Map[String, AnyRef]] { + + private val serialVersionUID = -4284080856874185929L + + override def serialize(element: mutable.Map[String, AnyRef], context: KafkaRecordSerializationSchema.KafkaSinkContext, timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = { + val out = JSONUtil.serialize(element.get(Constants.MESSAGE)) + val topic = element.get(Constants.TOPIC).get.asInstanceOf[String] + new ProducerRecord[Array[Byte], Array[Byte]](topic, out.getBytes(StandardCharsets.UTF_8)) + } +} \ No newline at end of file diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/serde/StringSerde.scala b/framework/src/main/scala/org/sunbird/obsrv/core/serde/StringSerde.scala deleted file mode 100644 index 17768453..00000000 --- a/framework/src/main/scala/org/sunbird/obsrv/core/serde/StringSerde.scala +++ /dev/null @@ -1,32 +0,0 @@ -package org.sunbird.obsrv.core.serde - -import org.apache.flink.api.common.typeinfo.TypeInformation -import org.apache.flink.api.java.typeutils.TypeExtractor -import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema -import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema -import org.apache.flink.util.Collector -import org.apache.kafka.clients.consumer.ConsumerRecord -import org.apache.kafka.clients.producer.ProducerRecord - -import java.nio.charset.StandardCharsets - -class StringDeserializationSchema extends KafkaRecordDeserializationSchema[String] { - - private val serialVersionUID = -3224825136576915426L - override def getProducedType: TypeInformation[String] = TypeExtractor.getForClass(classOf[String]) - - override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]], out: Collector[String]): Unit = { - out.collect(new String(record.value(), StandardCharsets.UTF_8)) - } -} - -class StringSerializationSchema(topic: String, key: Option[String] = None) extends KafkaRecordSerializationSchema[String] { - - private val 
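// Usage sketch for DynamicMapSerializationSchema above: each element carries its own destination
// topic, so one Kafka sink can fan events out to per-dataset topics. The topic value passed by a
// caller is illustrative.
import scala.collection.mutable
def toDynamicSinkElement(topic: String, event: mutable.Map[String, AnyRef]): mutable.Map[String, AnyRef] =
  mutable.Map[String, AnyRef](Constants.TOPIC -> topic, Constants.MESSAGE -> event)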
serialVersionUID = -4284080856874185929L - - override def serialize(element: String, context: KafkaRecordSerializationSchema.KafkaSinkContext, timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = { - key.map { kafkaKey => - new ProducerRecord[Array[Byte], Array[Byte]](topic, kafkaKey.getBytes(StandardCharsets.UTF_8), element.getBytes(StandardCharsets.UTF_8)) - }.getOrElse(new ProducerRecord[Array[Byte], Array[Byte]](topic, element.getBytes(StandardCharsets.UTF_8))) - } -} diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseDeduplication.scala b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseDeduplication.scala index 0e61610b..40a69191 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseDeduplication.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseDeduplication.scala @@ -1,39 +1,24 @@ package org.sunbird.obsrv.core.streaming -import org.apache.flink.streaming.api.functions.ProcessFunction import org.slf4j.LoggerFactory import org.sunbird.obsrv.core.cache.DedupEngine import org.sunbird.obsrv.core.exception.ObsrvException -import org.sunbird.obsrv.core.model.ErrorConstants -import org.sunbird.obsrv.core.model.Models.{PData, SystemEvent} +import org.sunbird.obsrv.core.model._ import org.sunbird.obsrv.core.util.JSONUtil -import scala.collection.mutable - trait BaseDeduplication { private[this] val logger = LoggerFactory.getLogger(classOf[BaseDeduplication]) - def isDuplicate(datasetId: String, dedupKey: Option[String], event: String, - context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, - config: BaseJobConfig[_]) + def isDuplicate(datasetId: String, dedupKey: Option[String], event: String) (implicit deDupEngine: DedupEngine): Boolean = { - try { - val key = datasetId+":"+getDedupKey(dedupKey, event) - if (!deDupEngine.isUniqueEvent(key)) { - logger.debug(s"Event with mid: $key is duplicate") - true - } else { - deDupEngine.storeChecksum(key) - false - } - } catch { - case ex: ObsrvException => - logger.warn("BaseDeduplication:isDuplicate()-Exception", ex.getMessage) - val sysEvent = SystemEvent(PData(config.jobName, "flink", "deduplication"), Map("error_code" -> ex.error.errorCode, "error_msg" -> ex.error.errorMsg)) - context.output(config.systemEventsOutputTag, JSONUtil.serialize(sysEvent)) - false + val key = datasetId + ":" + getDedupKey(dedupKey, event) + if (!deDupEngine.isUniqueEvent(key)) { + true + } else { + deDupEngine.storeChecksum(key) + false } } @@ -45,10 +30,11 @@ trait BaseDeduplication { if (node.isMissingNode) { throw new ObsrvException(ErrorConstants.NO_DEDUP_KEY_FOUND) } - if (!node.isTextual) { - throw new ObsrvException(ErrorConstants.DEDUP_KEY_NOT_A_STRING) + if (!node.isTextual && !node.isNumber) { + logger.warn(s"Dedup | Dedup key is not a string or number | dedupKey=$dedupKey | keyType=${node.getNodeType}") + throw new ObsrvException(ErrorConstants.DEDUP_KEY_NOT_A_STRING_OR_NUMBER) } node.asText() } -} +} \ No newline at end of file diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseJobConfig.scala b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseJobConfig.scala index ff15af4f..dc6eaa66 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseJobConfig.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseJobConfig.scala @@ -17,7 +17,7 @@ abstract class BaseJobConfig[T](val config: Config, val jobName: String) extends implicit val metricTypeInfo: 
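Note on the deduplication change above: isDuplicate is now a pure Redis check. Exceptions from key extraction propagate to the caller instead of being converted into a system event, and the dedup key may be either a string or a number. A self-contained sketch of the key-extraction rule, using a JSON Pointer in place of JSONUtil.getKey (an assumption; the project helper may resolve dotted paths differently):

import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object DedupKeySketch {
  private val mapper = JsonMapper.builder().addModule(DefaultScalaModule).build()

  // Mirrors getDedupKey: missing key -> NO_DEDUP_KEY_FOUND, unsupported type ->
  // DEDUP_KEY_NOT_A_STRING_OR_NUMBER, otherwise the key rendered as text.
  def extractDedupKey(event: String, dedupKey: Option[String]): Either[String, String] =
    dedupKey match {
      case None => Left("NO_DEDUP_KEY_FOUND")
      case Some(path) =>
        val node = mapper.readTree(event).at("/" + path.replace(".", "/"))
        if (node.isMissingNode) Left("NO_DEDUP_KEY_FOUND")
        else if (!node.isTextual && !node.isNumber) Left("DEDUP_KEY_NOT_A_STRING_OR_NUMBER")
        else Right(node.asText())
    }
}

// extractDedupKey("""{"event":{"id":1234}}""", Some("event.id")) == Right("1234")
// extractDedupKey("""{"event":{"id":1234}}""", Some("event"))    == Left("DEDUP_KEY_NOT_A_STRING_OR_NUMBER")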
TypeInformation[String] = TypeExtractor.getForClass(classOf[String]) - val defaultDatasetID: String = SystemConfig.defaultDatasetId + def defaultDatasetID: String = SystemConfig.getString("defaultDatasetId", "ALL") private val kafkaProducerBrokerServers: String = config.getString("kafka.producer.broker-servers") private val kafkaConsumerBrokerServers: String = config.getString("kafka.consumer.broker-servers") // Producer Properties @@ -25,7 +25,7 @@ abstract class BaseJobConfig[T](val config: Config, val jobName: String) extends private val kafkaProducerBatchSize: Int = config.getInt("kafka.producer.batch.size") private val kafkaProducerLingerMs: Int = config.getInt("kafka.producer.linger.ms") private val kafkaProducerCompression: String = if (config.hasPath("kafka.producer.compression")) config.getString("kafka.producer.compression") else "snappy" - val groupId: String = config.getString("kafka.groupId") + private val groupId: String = config.getString("kafka.groupId") val restartAttempts: Int = config.getInt("task.restart-strategy.attempts") val delayBetweenAttempts: Long = config.getLong("task.restart-strategy.delay") val kafkaConsumerParallelism: Int = config.getInt("task.consumer.parallelism") @@ -45,14 +45,14 @@ abstract class BaseJobConfig[T](val config: Config, val jobName: String) extends val systemEventsProducer = "system-events-sink" // Checkpointing config - val enableCompressedCheckpointing: Boolean = config.getBoolean("job.enable.distributed.checkpointing") + val enableCompressedCheckpointing: Boolean = if (config.hasPath("job.enable.distributed.checkpointing")) config.getBoolean("job.enable.distributed.checkpointing") else false val checkpointingInterval: Int = config.getInt("task.checkpointing.interval") val checkpointingPauseSeconds: Int = config.getInt("task.checkpointing.pause.between.seconds") - val enableDistributedCheckpointing: Option[Boolean] = if (config.hasPath("job")) Option(config.getBoolean("job.enable.distributed.checkpointing")) else None - val checkpointingBaseUrl: Option[String] = if (config.hasPath("job")) Option(config.getString("job.statebackend.base.url")) else None + val enableDistributedCheckpointing: Option[Boolean] = if (config.hasPath("job.enable.distributed.checkpointing")) Option(config.getBoolean("job.enable.distributed.checkpointing")) else None + val checkpointingBaseUrl: Option[String] = if (config.hasPath("job.statebackend.base.url")) Option(config.getString("job.statebackend.base.url")) else None // Base Methods - def datasetType(): String = if(config.hasPath("dataset.type")) config.getString("dataset.type") else "dataset" + def datasetType(): String = if (config.hasPath("dataset.type")) config.getString("dataset.type") else "dataset" def inputTopic(): String @@ -60,6 +60,12 @@ abstract class BaseJobConfig[T](val config: Config, val jobName: String) extends def successTag(): OutputTag[T] + // Event Failures Common Variables + val failedEventProducer = "failed-events-sink" + val eventFailedMetricsCount: String = "failed-event-count" + val kafkaFailedTopic: String = config.getString("kafka.output.failed.topic") + def failedEventsOutputTag(): OutputTag[T] + def kafkaConsumerProperties(kafkaBrokerServers: Option[String] = None, kafkaConsumerGroup: Option[String] = None): Properties = { val properties = new Properties() properties.setProperty("bootstrap.servers", kafkaBrokerServers.getOrElse(kafkaConsumerBrokerServers)) diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseProcessFunction.scala 
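Note on the config changes above: BaseJobConfig now guards optional settings with hasPath on the exact key rather than on the parent block, and defaultDatasetID is a def so the SystemConfig lookup happens at call time rather than at construction. The read-with-default pattern in isolation (Typesafe Config only; the keys below are illustrative):

import com.typesafe.config.{Config, ConfigFactory}

object ConfigFallbackSketch {
  // Mirrors the pattern used throughout BaseJobConfig: read an optional path with a default.
  def getStringOrElse(config: Config, path: String, default: String): String =
    if (config.hasPath(path)) config.getString(path) else default

  def getBooleanOrElse(config: Config, path: String, default: Boolean): Boolean =
    if (config.hasPath(path)) config.getBoolean(path) else default

  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.parseString("""kafka { producer { compression = "snappy" } }""")
    // The compression key is present and read; the checkpointing flag is absent and falls back.
    println(getStringOrElse(config, "kafka.producer.compression", "snappy"))
    println(getBooleanOrElse(config, "job.enable.distributed.checkpointing", default = false))
  }
}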
b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseProcessFunction.scala index 8ab0e664..387842a5 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseProcessFunction.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseProcessFunction.scala @@ -6,10 +6,11 @@ import org.apache.flink.streaming.api.functions.ProcessFunction import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction import org.apache.flink.streaming.api.windowing.windows.TimeWindow import org.apache.flink.util.Collector -import org.slf4j.LoggerFactory import org.sunbird.obsrv.core.model.ErrorConstants.Error -import org.sunbird.obsrv.core.model.SystemConfig -import org.sunbird.obsrv.core.util.Util +import org.sunbird.obsrv.core.model.Producer.Producer +import org.sunbird.obsrv.core.model.StatusCode.StatusCode +import org.sunbird.obsrv.core.model.{Constants, Stats, StatusCode, SystemConfig} +import org.sunbird.obsrv.core.util.{JSONUtil, Util} import java.lang import java.util.concurrent.ConcurrentHashMap @@ -18,13 +19,21 @@ import scala.collection.mutable case class MetricsList(datasets: List[String], metrics: List[String]) -case class Metrics(metrics: Map[String, ConcurrentHashMap[String, AtomicLong]]) { +case class Metrics(metrics: mutable.Map[String, ConcurrentHashMap[String, AtomicLong]]) { private def getMetric(dataset: String, metric: String): AtomicLong = { val datasetMetrics: ConcurrentHashMap[String, AtomicLong] = metrics.getOrElse(dataset, new ConcurrentHashMap[String, AtomicLong]()) datasetMetrics.getOrDefault(metric, new AtomicLong()) } + def hasDataset(dataset: String): Boolean = { + metrics.contains(dataset) + } + + def initDataset(dataset: String, counters: ConcurrentHashMap[String, AtomicLong]): Unit = { + metrics.put(dataset, counters) + } + def incCounter(dataset: String, metric: String): Unit = { getMetric(dataset, metric).getAndIncrement() } @@ -47,27 +56,29 @@ case class Metrics(metrics: Map[String, ConcurrentHashMap[String, AtomicLong]]) trait JobMetrics { def registerMetrics(datasets: List[String], metrics: List[String]): Metrics = { - val allDatasets = datasets ++ List(SystemConfig.defaultDatasetId) + val allDatasets = datasets ++ List(SystemConfig.getString("defaultDatasetId", "ALL")) val datasetMetricMap: Map[String, ConcurrentHashMap[String, AtomicLong]] = allDatasets.map(dataset => { val metricMap = new ConcurrentHashMap[String, AtomicLong]() metrics.foreach { metric => metricMap.put(metric, new AtomicLong(0L)) } (dataset, metricMap) }).toMap + val mutableMap = mutable.Map[String, ConcurrentHashMap[String, AtomicLong]]() + mutableMap ++= datasetMetricMap - Metrics(datasetMetricMap) + Metrics(mutableMap) } } trait BaseFunction { - private def addFlags(obsrvMeta: mutable.Map[String, AnyRef], flags: Map[String, AnyRef]) = { + def addFlags(obsrvMeta: mutable.Map[String, AnyRef], flags: Map[String, AnyRef]): Option[AnyRef] = { obsrvMeta.put("flags", obsrvMeta("flags").asInstanceOf[Map[String, AnyRef]] ++ flags) } - private def addError(obsrvMeta: mutable.Map[String, AnyRef], error: Map[String, AnyRef]) = { + private def addError(obsrvMeta: mutable.Map[String, AnyRef], error: Map[String, AnyRef]): Option[AnyRef] = { obsrvMeta.put("error", error) } - private def addTimespan(obsrvMeta: mutable.Map[String, AnyRef], jobName: String): Unit = { + def addTimespan(obsrvMeta: mutable.Map[String, AnyRef], producer: Producer): Unit = { val prevTS = if (obsrvMeta.contains("prevProcessingTime")) { 
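Note on the Metrics change above: the backing map is now mutable so that datasets which appear after job start can be registered on the fly through hasDataset and initDataset. A trimmed-down model of that registry, built on the same ConcurrentHashMap and AtomicLong primitives (not the project class itself):

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

object MetricsRegistrySketch {
  private val metrics = mutable.Map[String, ConcurrentHashMap[String, AtomicLong]]()

  def hasDataset(dataset: String): Boolean = metrics.contains(dataset)

  // Register counters for a dataset discovered at runtime.
  def initDataset(dataset: String, counters: List[String]): Unit = {
    val map = new ConcurrentHashMap[String, AtomicLong]()
    counters.foreach(c => map.put(c, new AtomicLong(0L)))
    metrics.put(dataset, map)
  }

  def incCounter(dataset: String, metric: String): Unit =
    metrics(dataset).computeIfAbsent(metric, _ => new AtomicLong(0L)).incrementAndGet()

  // Gauge-style read: return the current value and reset it to zero.
  def getAndReset(dataset: String, metric: String): Long =
    metrics(dataset).getOrDefault(metric, new AtomicLong(0L)).getAndSet(0L)
}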
obsrvMeta("prevProcessingTime").asInstanceOf[Long] } else { @@ -75,43 +86,49 @@ trait BaseFunction { } val currentTS = System.currentTimeMillis() val span = currentTS - prevTS - obsrvMeta.put("timespans", obsrvMeta("timespans").asInstanceOf[Map[String, AnyRef]] ++ Map(jobName -> span)) + obsrvMeta.put("timespans", obsrvMeta("timespans").asInstanceOf[Map[String, AnyRef]] ++ Map(producer.toString -> span)) obsrvMeta.put("prevProcessingTime", currentTS.asInstanceOf[AnyRef]) } - def markFailed(event: mutable.Map[String, AnyRef], error: Error, jobName: String): mutable.Map[String, AnyRef] = { - val obsrvMeta = Util.getMutableMap(event("obsrv_meta").asInstanceOf[Map[String, AnyRef]]) - addError(obsrvMeta, Map("src" -> jobName, "error_code" -> error.errorCode, "error_msg" -> error.errorMsg)) - addFlags(obsrvMeta, Map(jobName -> "failed")) - addTimespan(obsrvMeta, jobName) - event.put("obsrv_meta", obsrvMeta.toMap) + def markFailed(event: mutable.Map[String, AnyRef], error: Error, producer: Producer): mutable.Map[String, AnyRef] = { + val obsrvMeta = Util.getMutableMap(event(Constants.OBSRV_META).asInstanceOf[Map[String, AnyRef]]) + addError(obsrvMeta, Map(Constants.SRC -> producer.toString, Constants.ERROR_CODE -> error.errorCode, Constants.ERROR_MSG -> error.errorMsg)) + addFlags(obsrvMeta, Map(producer.toString -> StatusCode.failed.toString)) + addTimespan(obsrvMeta, producer) + event.remove(Constants.OBSRV_META) + event.put(Constants.EVENT, JSONUtil.serialize(event)) + event.put(Constants.OBSRV_META, obsrvMeta.toMap) event } - def markSkipped(event: mutable.Map[String, AnyRef], jobName: String): mutable.Map[String, AnyRef] = { - val obsrvMeta = Util.getMutableMap(event("obsrv_meta").asInstanceOf[Map[String, AnyRef]]) - addFlags(obsrvMeta, Map(jobName -> "skipped")) - addTimespan(obsrvMeta, jobName) - event.put("obsrv_meta", obsrvMeta.toMap) - event + def markSkipped(event: mutable.Map[String, AnyRef], producer: Producer): mutable.Map[String, AnyRef] = { + markStatus(event, producer, StatusCode.skipped) + } + + def markSuccess(event: mutable.Map[String, AnyRef], producer: Producer): mutable.Map[String, AnyRef] = { + markStatus(event, producer, StatusCode.success) } - def markSuccess(event: mutable.Map[String, AnyRef], jobName: String): mutable.Map[String, AnyRef] = { + def markPartial(event: mutable.Map[String, AnyRef], producer: Producer): mutable.Map[String, AnyRef] = { + markStatus(event, producer, StatusCode.partial) + } + + private def markStatus(event: mutable.Map[String, AnyRef], producer: Producer, statusCode: StatusCode): mutable.Map[String, AnyRef] = { val obsrvMeta = Util.getMutableMap(event("obsrv_meta").asInstanceOf[Map[String, AnyRef]]) - addFlags(obsrvMeta, Map(jobName -> "success")) - addTimespan(obsrvMeta, jobName) + addFlags(obsrvMeta, Map(producer.toString -> statusCode.toString)) + addTimespan(obsrvMeta, producer) event.put("obsrv_meta", obsrvMeta.toMap) event } - def markComplete(event: mutable.Map[String, AnyRef], dataVersion: Option[Int]) : mutable.Map[String, AnyRef] = { + def markComplete(event: mutable.Map[String, AnyRef], dataVersion: Option[Int]): mutable.Map[String, AnyRef] = { val obsrvMeta = Util.getMutableMap(event("obsrv_meta").asInstanceOf[Map[String, AnyRef]]) val syncts = obsrvMeta("syncts").asInstanceOf[Long] val processingStartTime = obsrvMeta("processingStartTime").asInstanceOf[Long] val processingEndTime = System.currentTimeMillis() - obsrvMeta.put("total_processing_time", (processingEndTime - syncts).asInstanceOf[AnyRef]) - obsrvMeta.put("latency_time", 
(processingStartTime - syncts).asInstanceOf[AnyRef]) - obsrvMeta.put("processing_time", (processingEndTime - processingStartTime).asInstanceOf[AnyRef]) + obsrvMeta.put(Stats.total_processing_time.toString, (processingEndTime - syncts).asInstanceOf[AnyRef]) + obsrvMeta.put(Stats.latency_time.toString, (processingStartTime - syncts).asInstanceOf[AnyRef]) + obsrvMeta.put(Stats.processing_time.toString, (processingEndTime - processingStartTime).asInstanceOf[AnyRef]) obsrvMeta.put("data_version", dataVersion.getOrElse(1).asInstanceOf[AnyRef]) event.put("obsrv_meta", obsrvMeta.toMap) event @@ -123,19 +140,29 @@ trait BaseFunction { } } -abstract class BaseProcessFunction[T, R](config: BaseJobConfig[R]) extends ProcessFunction[T, R] with BaseDeduplication with JobMetrics with BaseFunction { +abstract class BaseProcessFunction[T, R](config: BaseJobConfig[R]) extends ProcessFunction[T, R] with JobMetrics with BaseFunction { - private[this] val logger = LoggerFactory.getLogger(this.getClass) - private val metricsList = getMetricsList() - private val metrics: Metrics = registerMetrics(metricsList.datasets, metricsList.metrics) + protected val metricsList: MetricsList = getMetricsList() + protected val metrics: Metrics = registerMetrics(metricsList.datasets, metricsList.metrics) override def open(parameters: Configuration): Unit = { - (metricsList.datasets ++ List(SystemConfig.defaultDatasetId)).map { dataset => + metricsList.datasets.map { dataset => metricsList.metrics.map(metric => { getRuntimeContext.getMetricGroup.addGroup(config.jobName).addGroup(dataset) - .gauge[Long, ScalaGauge[Long]](metric, ScalaGauge[Long](() => metrics.getAndReset(dataset, metric))) + .gauge[Long, ScalaGauge[Long]](metric, ScalaGauge[Long](() => + // $COVERAGE-OFF$ + metrics.getAndReset(dataset, metric) + // $COVERAGE-ON$ + )) }) } + val defaultDatasetId = SystemConfig.getString("defaultDatasetId", "ALL") + getRuntimeContext.getMetricGroup.addGroup(config.jobName).addGroup(defaultDatasetId) + .gauge[Long, ScalaGauge[Long]](config.eventFailedMetricsCount, ScalaGauge[Long](() => + // $COVERAGE-OFF$ + metrics.getAndReset(defaultDatasetId, config.eventFailedMetricsCount) + // $COVERAGE-ON$ + )) } def processElement(event: T, context: ProcessFunction[T, R]#Context, metrics: Metrics): Unit @@ -143,29 +170,34 @@ abstract class BaseProcessFunction[T, R](config: BaseJobConfig[R]) extends Proce def getMetricsList(): MetricsList override def processElement(event: T, context: ProcessFunction[T, R]#Context, out: Collector[R]): Unit = { - try { - processElement(event, context, metrics) - } catch { - case exception: Exception => - logger.error(s"${config.jobName}:processElement - Exception", exception) - } + processElement(event, context, metrics) } } -abstract class WindowBaseProcessFunction[I, O, K](config: BaseJobConfig[O]) extends ProcessWindowFunction[I, O, K, TimeWindow] with BaseDeduplication with JobMetrics with BaseFunction { +abstract class WindowBaseProcessFunction[I, O, K](config: BaseJobConfig[O]) extends ProcessWindowFunction[I, O, K, TimeWindow] with JobMetrics with BaseFunction { - private[this] val logger = LoggerFactory.getLogger(this.getClass) - private val metricsList = getMetricsList() - private val metrics: Metrics = registerMetrics(metricsList.datasets, metricsList.metrics) + protected val metricsList: MetricsList = getMetricsList() + protected val metrics: Metrics = registerMetrics(metricsList.datasets, metricsList.metrics) override def open(parameters: Configuration): Unit = { - (metricsList.datasets ++ 
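Note on markComplete above: the three timing fields are now written under the Stats enum names. Their relationship is simple but worth spelling out: total time is measured from syncts, latency is the wait before processing started, and processing time is the remainder. A tiny worked example:

object ProcessingTimesSketch {
  // syncts: when the event was synced in; processingStartTime: when the job picked it up;
  // processingEndTime: when the pipeline finished with it.
  def stats(syncts: Long, processingStartTime: Long, processingEndTime: Long): Map[String, Long] = Map(
    "total_processing_time" -> (processingEndTime - syncts),
    "latency_time" -> (processingStartTime - syncts),
    "processing_time" -> (processingEndTime - processingStartTime)
  )
}

// stats(1000L, 1200L, 1500L) == Map("total_processing_time" -> 500, "latency_time" -> 200, "processing_time" -> 300)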
List(SystemConfig.defaultDatasetId)).map { dataset => + metricsList.datasets.map { dataset => metricsList.metrics.map(metric => { getRuntimeContext.getMetricGroup.addGroup(config.jobName).addGroup(dataset) - .gauge[Long, ScalaGauge[Long]](metric, ScalaGauge[Long](() => metrics.getAndReset(dataset, metric))) + .gauge[Long, ScalaGauge[Long]](metric, ScalaGauge[Long](() => + // $COVERAGE-OFF$ + metrics.getAndReset(dataset, metric) + // $COVERAGE-ON$ + )) }) } + val defaultDatasetId = SystemConfig.getString("defaultDatasetId", "ALL") + getRuntimeContext.getMetricGroup.addGroup(config.jobName).addGroup(defaultDatasetId) + .gauge[Long, ScalaGauge[Long]](config.eventFailedMetricsCount, ScalaGauge[Long](() => + // $COVERAGE-OFF$ + metrics.getAndReset(defaultDatasetId, config.eventFailedMetricsCount) + // $COVERAGE-ON$ + )) } def getMetricsList(): MetricsList @@ -176,11 +208,7 @@ abstract class WindowBaseProcessFunction[I, O, K](config: BaseJobConfig[O]) exte metrics: Metrics): Unit override def process(key: K, context: ProcessWindowFunction[I, O, K, TimeWindow]#Context, elements: lang.Iterable[I], out: Collector[O]): Unit = { - try { - process(key, context, elements, metrics) - } catch { - case exception: Exception => logger.error(s"${config.jobName}:processElement - Exception", exception) - } + process(key, context, elements, metrics) } } \ No newline at end of file diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseStreamTask.scala b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseStreamTask.scala index 4862da7b..8ebdb8a7 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseStreamTask.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/BaseStreamTask.scala @@ -2,13 +2,24 @@ package org.sunbird.obsrv.core.streaming import org.apache.flink.api.common.eventtime.WatermarkStrategy -import org.apache.flink.streaming.api.datastream.DataStream +import org.apache.flink.streaming.api.datastream.{DataStream, DataStreamSink, SingleOutputStreamOperator} import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment import java.util.Properties import scala.collection.mutable -abstract class BaseStreamTask[T] { +class BaseStreamTaskSink[T] { + def addDefaultSinks(dataStream: SingleOutputStreamOperator[T], config: BaseJobConfig[T], kafkaConnector: FlinkKafkaConnector): DataStreamSink[T] = { + + dataStream.getSideOutput(config.systemEventsOutputTag).sinkTo(kafkaConnector.kafkaSink[String](config.kafkaSystemTopic)) + .name(config.jobName + "-" + config.systemEventsProducer).uid(config.jobName + "-" + config.systemEventsProducer).setParallelism(config.downstreamOperatorsParallelism) + + dataStream.getSideOutput(config.failedEventsOutputTag()).sinkTo(kafkaConnector.kafkaSink[T](config.kafkaFailedTopic)) + .name(config.jobName + "-" + config.failedEventProducer).uid(config.jobName + "-" + config.failedEventProducer).setParallelism(config.downstreamOperatorsParallelism) + } +} + +abstract class BaseStreamTask[T] extends BaseStreamTaskSink[T] { def process() diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/FlinkKafkaConnector.scala b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/FlinkKafkaConnector.scala index 0120bd58..508e1e7c 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/streaming/FlinkKafkaConnector.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/streaming/FlinkKafkaConnector.scala @@ -8,18 +8,13 @@ import org.apache.kafka.clients.consumer.OffsetResetStrategy 
import org.sunbird.obsrv.core.serde._ import java.util.Properties -import scala.collection.mutable import scala.collection.JavaConverters._ +import scala.collection.mutable class FlinkKafkaConnector(config: BaseJobConfig[_]) extends Serializable { def kafkaStringSource(kafkaTopic: String): KafkaSource[String] = { - KafkaSource.builder[String]() - .setTopics(kafkaTopic) - .setDeserializer(new StringDeserializationSchema) - .setProperties(config.kafkaConsumerProperties()) - .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)) - .build() + kafkaStringSource(List(kafkaTopic), config.kafkaConsumerProperties()) } def kafkaStringSource(kafkaTopic: List[String], consumerProperties: Properties): KafkaSource[String] = { @@ -31,36 +26,31 @@ class FlinkKafkaConnector(config: BaseJobConfig[_]) extends Serializable { .build() } - def kafkaStringSink(kafkaTopic: String): KafkaSink[String] = { - KafkaSink.builder[String]() + def kafkaSink[T](kafkaTopic: String): KafkaSink[T] = { + KafkaSink.builder[T]() .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) - .setRecordSerializer(new StringSerializationSchema(kafkaTopic)) + .setRecordSerializer(new SerializationSchema(kafkaTopic)) .setKafkaProducerConfig(config.kafkaProducerProperties) .build() } def kafkaMapSource(kafkaTopic: String): KafkaSource[mutable.Map[String, AnyRef]] = { - KafkaSource.builder[mutable.Map[String, AnyRef]]() - .setTopics(kafkaTopic) - .setDeserializer(new MapDeserializationSchema) - .setProperties(config.kafkaConsumerProperties()) - .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)) - .build() + kafkaMapSource(List(kafkaTopic), config.kafkaConsumerProperties()) } def kafkaMapSource(kafkaTopics: List[String], consumerProperties: Properties): KafkaSource[mutable.Map[String, AnyRef]] = { KafkaSource.builder[mutable.Map[String, AnyRef]]() .setTopics(kafkaTopics.asJava) .setDeserializer(new MapDeserializationSchema) - .setProperties(consumerProperties) + .setProperties(config.kafkaConsumerProperties()) .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)) .build() } - def kafkaMapSink(kafkaTopic: String): KafkaSink[mutable.Map[String, AnyRef]] = { + def kafkaMapDynamicSink(): KafkaSink[mutable.Map[String, AnyRef]] = { KafkaSink.builder[mutable.Map[String, AnyRef]]() .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) - .setRecordSerializer(new MapSerializationSchema(kafkaTopic)) + .setRecordSerializer(new DynamicMapSerializationSchema()) .setKafkaProducerConfig(config.kafkaProducerProperties) .build() } diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/util/JSONUtil.scala b/framework/src/main/scala/org/sunbird/obsrv/core/util/JSONUtil.scala index 338a47e8..67156256 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/util/JSONUtil.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/util/JSONUtil.scala @@ -1,14 +1,14 @@ package org.sunbird.obsrv.core.util -import java.lang.reflect.{ParameterizedType, Type} import com.fasterxml.jackson.annotation.JsonInclude.Include import com.fasterxml.jackson.core.JsonGenerator.Feature -import com.fasterxml.jackson.databind.{DeserializationFeature, JsonNode, MapperFeature, ObjectMapper, SerializationFeature} import com.fasterxml.jackson.core.`type`.TypeReference import com.fasterxml.jackson.databind.json.JsonMapper -import com.fasterxml.jackson.module.scala.{ClassTagExtensions, DefaultScalaModule, ScalaObjectMapper} +import com.fasterxml.jackson.databind.node.JsonNodeType 
+import com.fasterxml.jackson.databind.{DeserializationFeature, JsonNode, SerializationFeature} +import com.fasterxml.jackson.module.scala.{ClassTagExtensions, DefaultScalaModule} -import scala.collection.mutable +import java.lang.reflect.{ParameterizedType, Type} object JSONUtil { @@ -17,13 +17,13 @@ object JSONUtil { .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES) .disable(SerializationFeature.FAIL_ON_EMPTY_BEANS) .enable(Feature.WRITE_BIGDECIMAL_AS_PLAIN) - .build() :: ClassTagExtensions + .build() - mapper.setSerializationInclusion(Include.NON_NULL) + mapper.setSerializationInclusion(Include.NON_ABSENT) @throws(classOf[Exception]) - def serialize(obj: AnyRef): String = { - mapper.writeValueAsString(obj) + def serialize(obj: Any): String = { + if(obj.isInstanceOf[String]) obj.asInstanceOf[String] else mapper.writeValueAsString(obj) } def deserialize[T: Manifest](json: String): T = { @@ -34,12 +34,16 @@ object JSONUtil { mapper.readValue(json, typeReference[T]) } - def isJSON(jsonString: String): Boolean = { + def getJsonType(jsonString: String): String = { try { - mapper.readTree(jsonString) - true + val node = mapper.readTree(jsonString) + node.getNodeType match { + case JsonNodeType.ARRAY => "ARRAY" + case JsonNodeType.OBJECT => "OBJECT" + case _ => "NOT_A_JSON" + } } catch { - case _: Exception => false + case _: Exception => "NOT_A_JSON" } } diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/util/PostgresConnect.scala b/framework/src/main/scala/org/sunbird/obsrv/core/util/PostgresConnect.scala index 86cd61a6..8322351c 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/util/PostgresConnect.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/util/PostgresConnect.scala @@ -51,12 +51,26 @@ class PostgresConnect(config: PostgresConnectionConfig) { catch { case ex: SQLException => logger.error("PostgresConnect:execute() - Exception", ex) - reset + reset() statement.execute(query) } // $COVERAGE-ON$ } + def executeUpdate(query: String): Int = { + try { + statement.executeUpdate(query) + } + // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked if postgres connection is stale + catch { + case ex: SQLException => + logger.error("PostgresConnect:execute() - Exception", ex) + reset() + statement.executeUpdate(query) + } + // $COVERAGE-ON$ + } + def executeQuery(query:String):ResultSet = statement.executeQuery(query) } diff --git a/framework/src/main/scala/org/sunbird/obsrv/core/util/Util.scala b/framework/src/main/scala/org/sunbird/obsrv/core/util/Util.scala index d9ab6bc0..55453dec 100644 --- a/framework/src/main/scala/org/sunbird/obsrv/core/util/Util.scala +++ b/framework/src/main/scala/org/sunbird/obsrv/core/util/Util.scala @@ -5,9 +5,9 @@ import scala.collection.mutable object Util { def getMutableMap(immutableMap: Map[String, AnyRef]): mutable.Map[String, AnyRef] = { - val mutableMap = mutable.Map[String, AnyRef](); - mutableMap ++= immutableMap; + val mutableMap = mutable.Map[String, AnyRef]() + mutableMap ++= immutableMap mutableMap } -} +} \ No newline at end of file diff --git a/framework/src/test/resources/base-test.conf b/framework/src/test/resources/base-test.conf index ce48f132..aa395059 100644 --- a/framework/src/test/resources/base-test.conf +++ b/framework/src/test/resources/base-test.conf @@ -17,6 +17,7 @@ kafka { compression = "snappy" } output.system.event.topic = "flink.system.events" + output.failed.topic = "flink.failed" } job { @@ -57,7 +58,7 @@ redis { redis-meta { host = localhost - port = 6379 + port = 6340 
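Note on PostgresConnect above: execute and the new executeUpdate share the same recover-once-on-SQLException shape. If more statement types are added, the retry could be factored into one helper; a sketch of that option, under the assumption that reset() re-creates both the connection and the statement (which the existing execute already relies on):

import java.sql.SQLException

object RetryOnceSketch {
  // Run a JDBC operation; if it fails with SQLException (typically a stale connection),
  // reset and run it exactly once more, mirroring PostgresConnect.execute/executeUpdate.
  def retryOnce[T](reset: () => Unit)(op: () => T): T =
    try op()
    catch {
      case _: SQLException =>
        reset()
        op()
    }
}

// Hypothetical usage inside PostgresConnect:
// def executeUpdate(query: String): Int = RetryOnceSketch.retryOnce(() => reset())(() => statement.executeUpdate(query))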
} postgres { diff --git a/framework/src/test/resources/test.conf b/framework/src/test/resources/test.conf index 056d3989..ed3cd4fd 100644 --- a/framework/src/test/resources/test.conf +++ b/framework/src/test/resources/test.conf @@ -47,7 +47,7 @@ redis.connection.timeout = 30000 redis { host = 127.0.0.1 - port = 6341 + port = 6340 database { duplicationstore.id = 12 key.expiry.seconds = 3600 @@ -56,7 +56,7 @@ redis { redis-meta { host = localhost - port = 6341 + port = 6340 } postgres { diff --git a/framework/src/test/resources/test2.conf b/framework/src/test/resources/test2.conf new file mode 100644 index 00000000..b85fd6ce --- /dev/null +++ b/framework/src/test/resources/test2.conf @@ -0,0 +1,69 @@ +kafka { + map.input.topic = "local.map.input" + map.output.topic = "local.map.output" + event.input.topic = "local.event.input" + event.output.topic = "local.event.output" + string.input.topic = "local.string.input" + string.output.topic = "local.string.output" + producer.broker-servers = "localhost:9093" + consumer.broker-servers = "localhost:9093" + groupId = "pipeline-preprocessor-group" + producer { + max-request-size = 102400 + batch.size = 8192 + linger.ms = 1 + } + output.system.event.topic = "flink.system.events" + output.failed.topic = "flink.failed" + event.duplicate.topic = "local.duplicate.output" +} + +job { + env = "local" + statebackend { + blob { + storage { + account = "blob.storage.account" + container = "obsrv-container" + checkpointing.dir = "flink-jobs" + } + } + } +} + +kafka.output.metrics.topic = "pipeline_metrics" +task { + checkpointing.interval = 60000 + checkpointing.pause.between.seconds = 30000 + restart-strategy.attempts = 1 + restart-strategy.delay = 10000 + parallelism = 1 + consumer.parallelism = 1 + downstream.operators.parallelism = 1 +} + +redis.connection.timeout = 30000 + +redis { + host = 127.0.0.1 + port = 6340 + database { + duplicationstore.id = 12 + key.expiry.seconds = 3600 + } +} + +redis-meta { + host = localhost + port = 6340 +} + +postgres { + host = localhost + port = 5432 + maxConnection = 2 + user = "postgres" + password = "postgres" +} + +dataset.type = "master-dataset" \ No newline at end of file diff --git a/framework/src/test/scala/org/sunbird/spec/BaseDeduplicationTestSpec.scala b/framework/src/test/scala/org/sunbird/spec/BaseDeduplicationTestSpec.scala new file mode 100644 index 00000000..961b1474 --- /dev/null +++ b/framework/src/test/scala/org/sunbird/spec/BaseDeduplicationTestSpec.scala @@ -0,0 +1,45 @@ +package org.sunbird.spec + +import com.typesafe.config.{Config, ConfigFactory} +import org.scalatest.Matchers +import org.scalatestplus.mockito.MockitoSugar +import org.sunbird.obsrv.core.cache.{DedupEngine, RedisConnect} +import org.sunbird.obsrv.core.exception.ObsrvException +import org.sunbird.obsrv.core.model.ErrorConstants +import org.sunbird.obsrv.core.streaming.BaseDeduplication +class BaseDeduplicationTestSpec extends BaseSpec with Matchers with MockitoSugar { + + val config: Config = ConfigFactory.load("base-test.conf") + val baseConfig = new BaseProcessTestConfig(config) + val SAMPLE_EVENT: String = """{"dataset":"d1","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealer":"KUNUnited","locationId":"KUN1"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""".stripMargin + + "BaseDeduplicationTestSpec" should "be able to cover all scenarios of deduplication check" in { + val redisConnection = new RedisConnect(baseConfig.redisHost, baseConfig.redisPort, 
baseConfig.redisConnectionTimeout) + val dedupEngine = new DedupEngine(redisConnection, 0, 4309535) + val dedupFn = new DeduplicationFn(dedupEngine) + + dedupFn.validateDedup("d1", Some("event.id"), SAMPLE_EVENT) should be (false) + dedupFn.validateDedup("d1", Some("event.id"), SAMPLE_EVENT) should be (true) + + the[ObsrvException] thrownBy { + dedupFn.validateDedup("d1", Some("event"), SAMPLE_EVENT) + } should have message ErrorConstants.DEDUP_KEY_NOT_A_STRING_OR_NUMBER.errorMsg + + the[ObsrvException] thrownBy { + dedupFn.validateDedup("d1", Some("event.mid"), SAMPLE_EVENT) + } should have message ErrorConstants.NO_DEDUP_KEY_FOUND.errorMsg + + the[ObsrvException] thrownBy { + dedupFn.validateDedup("d1", None, SAMPLE_EVENT) + } should have message ErrorConstants.NO_DEDUP_KEY_FOUND.errorMsg + + } +} + +class DeduplicationFn(dedupEngine: DedupEngine) extends BaseDeduplication { + + def validateDedup(datasetId: String, dedupKey: Option[String], event: String):Boolean = { + isDuplicate(datasetId, dedupKey, event)(dedupEngine) + } + +} \ No newline at end of file diff --git a/framework/src/test/scala/org/sunbird/spec/BaseProcessFunctionTestSpec.scala b/framework/src/test/scala/org/sunbird/spec/BaseProcessFunctionTestSpec.scala index 61eb53f1..bac2b0ae 100644 --- a/framework/src/test/scala/org/sunbird/spec/BaseProcessFunctionTestSpec.scala +++ b/framework/src/test/scala/org/sunbird/spec/BaseProcessFunctionTestSpec.scala @@ -13,9 +13,8 @@ import org.apache.flink.streaming.api.windowing.time.Time import org.apache.flink.test.util.MiniClusterWithClientResource import org.apache.kafka.common.serialization.StringDeserializer import org.scalatest.Matchers -import org.sunbird.obsrv.core.model.ErrorConstants import org.sunbird.obsrv.core.streaming._ -import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, Util} +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnectionConfig, Util, PostgresConnect} import java.util.concurrent.ConcurrentHashMap import java.util.concurrent.atomic.AtomicLong @@ -24,14 +23,20 @@ import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.duration._ -class BaseProcessFunctionTestSpec extends BaseSpec with Matchers { - +class BaseProcessFunctionTestSpec extends BaseSpecWithPostgres with Matchers { val flinkCluster = new MiniClusterWithClientResource(new MiniClusterResourceConfiguration.Builder() .setNumberSlotsPerTaskManager(1) .setNumberTaskManagers(1) .build) val config: Config = ConfigFactory.load("base-test.conf") + val postgresConfig: PostgresConnectionConfig = PostgresConnectionConfig( + config.getString("postgres.user"), + config.getString("postgres.password"), + config.getString("postgres.database"), + config.getString("postgres.host"), + config.getInt("postgres.port"), + config.getInt("postgres.maxConnections")) val bsMapConfig = new BaseProcessTestMapConfig(config) val bsConfig = new BaseProcessTestConfig(config) val kafkaConnector = new FlinkKafkaConnector(bsConfig) @@ -52,7 +57,8 @@ class BaseProcessFunctionTestSpec extends BaseSpec with Matchers { override def beforeAll(): Unit = { super.beforeAll() - + val postgresConnect = new PostgresConnect(postgresConfig) + createSystemSettings(postgresConnect) EmbeddedKafka.start()(embeddedKafkaConfig) createTestTopics(bsConfig.testTopics) @@ -66,6 +72,8 @@ class BaseProcessFunctionTestSpec extends BaseSpec with Matchers { } override protected def afterAll(): Unit = { + val postgresConnect = new PostgresConnect(postgresConfig) + 
clearSystemSettings(postgresConnect) super.afterAll() flinkCluster.after() EmbeddedKafka.stop() @@ -85,7 +93,7 @@ class BaseProcessFunctionTestSpec extends BaseSpec with Matchers { .process(new TestMapStreamFunc(bsMapConfig)).name("TestMapEventStream") mapStream.getSideOutput(bsConfig.mapOutputTag) - .sinkTo(kafkaConnector.kafkaMapSink(bsConfig.kafkaMapOutputTopic)) + .sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](bsConfig.kafkaMapOutputTopic)) .name("Map-Event-Producer") val stringStream = @@ -95,7 +103,7 @@ class BaseProcessFunctionTestSpec extends BaseSpec with Matchers { }).window(TumblingProcessingTimeWindows.of(Time.seconds(2))).process(new TestStringWindowStreamFunc(bsConfig)).name("TestStringEventStream") stringStream.getSideOutput(bsConfig.stringOutputTag) - .sinkTo(kafkaConnector.kafkaStringSink(bsConfig.kafkaStringOutputTopic)) + .sinkTo(kafkaConnector.kafkaSink[String](bsConfig.kafkaStringOutputTopic)) .name("String-Producer") Future { @@ -131,26 +139,10 @@ class BaseProcessFunctionTestSpec extends BaseSpec with Matchers { val mutableMap = Util.getMutableMap(map) mutableMap.getClass.getCanonicalName should be ("scala.collection.mutable.HashMap") noException shouldBe thrownBy(JSONUtil.convertValue(map)) - - ErrorConstants.NO_IMPLEMENTATION_FOUND.errorCode should be ("ERR_0001") - ErrorConstants.NO_EXTRACTION_DATA_FOUND.errorCode should be ("ERR_EXT_1001") - ErrorConstants.EXTRACTED_DATA_NOT_A_LIST.errorCode should be ("ERR_EXT_1002") - ErrorConstants.EVENT_SIZE_EXCEEDED.errorCode should be ("ERR_EXT_1003") - ErrorConstants.EVENT_MISSING.errorCode should be ("ERR_EXT_1006") - ErrorConstants.MISSING_DATASET_ID.errorCode should be ("ERR_EXT_1004") - ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode should be ("ERR_EXT_1005") - ErrorConstants.NO_DEDUP_KEY_FOUND.errorCode should be ("ERR_DEDUP_1007") - ErrorConstants.DEDUP_KEY_NOT_A_STRING.errorCode should be ("ERR_DEDUP_1008") - ErrorConstants.DUPLICATE_BATCH_EVENT_FOUND.errorCode should be ("ERR_EXT_1009") - ErrorConstants.DUPLICATE_EVENT_FOUND.errorCode should be ("ERR_PP_1010") - ErrorConstants.JSON_SCHEMA_NOT_FOUND.errorCode should be ("ERR_PP_1011") - ErrorConstants.INVALID_JSON_SCHEMA.errorCode should be ("ERR_PP_1012") - ErrorConstants.SCHEMA_VALIDATION_FAILED.errorCode should be ("ERR_PP_1013") - ErrorConstants.DENORM_KEY_MISSING.errorCode should be ("ERR_DENORM_1014") - ErrorConstants.DENORM_KEY_NOT_A_STRING.errorCode should be ("ERR_DENORM_1015") - - val metrics = Metrics(Map("test" -> new ConcurrentHashMap[String, AtomicLong]())) + val metrics = Metrics(mutable.Map("test" -> new ConcurrentHashMap[String, AtomicLong]())) metrics.reset("test1", "m1") + + bsConfig.datasetType() should be ("dataset") } "TestBaseStreamTask" should "validate the getMapDataStream method" in { diff --git a/framework/src/test/scala/org/sunbird/spec/BaseProcessTestConfig.scala b/framework/src/test/scala/org/sunbird/spec/BaseProcessTestConfig.scala index 6384c2c1..8cf3521f 100644 --- a/framework/src/test/scala/org/sunbird/spec/BaseProcessTestConfig.scala +++ b/framework/src/test/scala/org/sunbird/spec/BaseProcessTestConfig.scala @@ -35,6 +35,8 @@ class BaseProcessTestConfig(override val config: Config) extends BaseJobConfig[S override def inputConsumer(): String = "testConsumer" override def successTag(): OutputTag[String] = stringOutputTag + + override def failedEventsOutputTag(): OutputTag[String] = stringOutputTag } class BaseProcessTestMapConfig(override val config: Config) extends BaseJobConfig[Map[String, AnyRef]](config, 
"Test-job") { @@ -68,4 +70,6 @@ class BaseProcessTestMapConfig(override val config: Config) extends BaseJobConfi override def inputConsumer(): String = "testConsumer" override def successTag(): OutputTag[Map[String, AnyRef]] = mapOutputTag -} \ No newline at end of file + + override def failedEventsOutputTag(): OutputTag[Map[String, AnyRef]] = mapOutputTag +} diff --git a/framework/src/test/scala/org/sunbird/spec/BaseSpec.scala b/framework/src/test/scala/org/sunbird/spec/BaseSpec.scala index 134121d3..904a320b 100644 --- a/framework/src/test/scala/org/sunbird/spec/BaseSpec.scala +++ b/framework/src/test/scala/org/sunbird/spec/BaseSpec.scala @@ -1,6 +1,5 @@ package org.sunbird.spec -import io.zonky.test.db.postgres.embedded.EmbeddedPostgres import org.scalatest.{BeforeAndAfterAll, FlatSpec} import redis.embedded.RedisServer @@ -8,10 +7,15 @@ class BaseSpec extends FlatSpec with BeforeAndAfterAll { var redisServer: RedisServer = _ - override def beforeAll() { + override def beforeAll(): Unit = { super.beforeAll() redisServer = new RedisServer(6340) - redisServer.start() + try { + redisServer.start() + } catch { + case _: Exception => Console.err.println("### Unable to start redis server. Falling back to use locally run redis if any ###") + } + } override protected def afterAll(): Unit = { diff --git a/framework/src/test/scala/org/sunbird/spec/BaseSpecWithPostgres.scala b/framework/src/test/scala/org/sunbird/spec/BaseSpecWithPostgres.scala index d44191be..86f0afa6 100644 --- a/framework/src/test/scala/org/sunbird/spec/BaseSpecWithPostgres.scala +++ b/framework/src/test/scala/org/sunbird/spec/BaseSpecWithPostgres.scala @@ -2,6 +2,7 @@ package org.sunbird.spec import io.zonky.test.db.postgres.embedded.EmbeddedPostgres import org.scalatest.{BeforeAndAfterAll, FlatSpec} +import org.sunbird.obsrv.core.util.PostgresConnect import redis.embedded.RedisServer class BaseSpecWithPostgres extends FlatSpec with BeforeAndAfterAll { @@ -9,10 +10,14 @@ class BaseSpecWithPostgres extends FlatSpec with BeforeAndAfterAll { var embeddedPostgres: EmbeddedPostgres = _ var redisServer: RedisServer = _ - override def beforeAll() { + override def beforeAll(): Unit = { super.beforeAll() redisServer = new RedisServer(6340) - redisServer.start() + try { + redisServer.start() + } catch { + case _: Exception => Console.err.println("### Unable to start redis server. 
Falling back to use locally run redis if any ###") + } embeddedPostgres = EmbeddedPostgres.builder.setPort(5432).start() // Defaults to 5432 port } @@ -22,4 +27,16 @@ class BaseSpecWithPostgres extends FlatSpec with BeforeAndAfterAll { embeddedPostgres.close() } + def createSystemSettings(postgresConnect: PostgresConnect): Unit = { + postgresConnect.execute("CREATE TABLE IF NOT EXISTS system_settings ( key text NOT NULL, value text NOT NULL, category text NOT NULL DEFAULT 'SYSTEM'::text, valuetype text NOT NULL, created_date timestamp NOT NULL DEFAULT now(), updated_date timestamp, label text, PRIMARY KEY (\"key\"));") + postgresConnect.execute("insert into system_settings values('defaultDedupPeriodInSeconds', '604801', 'system', 'int', now(), now(), 'Dedup Period in Seconds');") + postgresConnect.execute("insert into system_settings values('maxEventSize', '1048676', 'system', 'long', now(), now(), 'Max Event Size');") + postgresConnect.execute("insert into system_settings values('defaultDatasetId', 'ALL', 'system', 'string', now(), now(), 'Default Dataset Id');") + postgresConnect.execute("insert into system_settings values('encryptionSecretKey', 'ckW5GFkTtMDNGEr5k67YpQMEBJNX3x2f', 'system', 'string', now(), now(), 'Encryption Secret Key');") + postgresConnect.execute("insert into system_settings values('enable', 'true', 'system', 'boolean', now(), now(), 'Enable flag');") + } + + def clearSystemSettings(postgresConnect: PostgresConnect): Unit = { + postgresConnect.execute("DROP TABLE system_settings;") + } } diff --git a/framework/src/test/scala/org/sunbird/spec/ModelsTestSpec.scala b/framework/src/test/scala/org/sunbird/spec/ModelsTestSpec.scala new file mode 100644 index 00000000..4ca0ad5e --- /dev/null +++ b/framework/src/test/scala/org/sunbird/spec/ModelsTestSpec.scala @@ -0,0 +1,118 @@ +package org.sunbird.spec + +import com.fasterxml.jackson.module.scala.JsonScalaEnumeration +import com.typesafe.config.{Config, ConfigFactory} +import org.apache.kafka.clients.producer.ProducerConfig +import org.scalatest.{FlatSpec, Matchers} +import org.sunbird.obsrv.core.model.FunctionalError.FunctionalError +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.util.{DatasetKeySelector, JSONUtil} + +import scala.collection.mutable + +case class FuncErrorList(@JsonScalaEnumeration(classOf[FunctionalErrorType]) list: List[FunctionalError]) +class ModelsTestSpec extends FlatSpec with Matchers { + + "ModelsTestSpec" should "cover all error constants" in { + + ErrorConstants.NO_IMPLEMENTATION_FOUND.errorCode should be("ERR_0001") + ErrorConstants.NO_EXTRACTION_DATA_FOUND.errorCode should be("ERR_EXT_1001") + ErrorConstants.EXTRACTED_DATA_NOT_A_LIST.errorCode should be("ERR_EXT_1002") + ErrorConstants.EVENT_SIZE_EXCEEDED.errorCode should be("ERR_EXT_1003") + ErrorConstants.EVENT_MISSING.errorCode should be("ERR_EXT_1006") + ErrorConstants.MISSING_DATASET_ID.errorCode should be("ERR_EXT_1004") + ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode should be("ERR_EXT_1005") + ErrorConstants.NO_DEDUP_KEY_FOUND.errorCode should be("ERR_DEDUP_1007") + ErrorConstants.DEDUP_KEY_NOT_A_STRING_OR_NUMBER.errorCode should be("ERR_DEDUP_1008") + ErrorConstants.DUPLICATE_BATCH_EVENT_FOUND.errorCode should be("ERR_EXT_1009") + ErrorConstants.DUPLICATE_EVENT_FOUND.errorCode should be("ERR_PP_1010") + ErrorConstants.JSON_SCHEMA_NOT_FOUND.errorCode should be("ERR_PP_1011") + ErrorConstants.INVALID_JSON_SCHEMA.errorCode should be("ERR_PP_1012") + 
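Note on the system_settings seed data above: SystemConfigSpec later in this patch expects typed getters that honour the valuetype column. A matching type returns the stored value, a mismatch raises "Invalid value type for system setting", and an absent key either raises "System setting not found for requested key" or falls back to the supplied default. A hypothetical helper showing just that dispatch (not the project's SystemConfig implementation):

object SystemSettingSketch {
  // A stored row: the raw string value plus its declared valuetype.
  final case class Setting(value: String, valueType: String)

  def getInt(row: Option[Setting], default: Option[Int]): Int = row match {
    case Some(Setting(v, "int")) => v.toInt
    case Some(_)                 => throw new Exception("Invalid value type for system setting")
    case None                    => default.getOrElse(throw new Exception("System setting not found for requested key"))
  }
}

// getInt(Some(SystemSettingSketch.Setting("604801", "int")), Some(604800))    == 604801
// getInt(None, Some(604800))                                                  == 604800
// getInt(Some(SystemSettingSketch.Setting("604801", "double")), None)         throws "Invalid value type for system setting"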
ErrorConstants.SCHEMA_VALIDATION_FAILED.errorCode should be("ERR_PP_1013") + ErrorConstants.DENORM_KEY_MISSING.errorCode should be("ERR_DENORM_1014") + ErrorConstants.DENORM_KEY_NOT_A_STRING_OR_NUMBER.errorCode should be("ERR_DENORM_1015") + ErrorConstants.DENORM_DATA_NOT_FOUND.errorCode should be("ERR_DENORM_1016") + ErrorConstants.MISSING_DATASET_CONFIG_KEY.errorCode should be("ERR_MASTER_DATA_1017") + ErrorConstants.ERR_INVALID_EVENT.errorCode should be("ERR_EXT_1018") + ErrorConstants.INDEX_KEY_MISSING_OR_BLANK.errorCode should be("ERR_ROUTER_1019") + ErrorConstants.INVALID_EXPR_FUNCTION.errorCode should be("ERR_TRANSFORM_1020") + ErrorConstants.ERR_EVAL_EXPR_FUNCTION.errorCode should be("ERR_TRANSFORM_1021") + ErrorConstants.ERR_UNKNOWN_TRANSFORM_EXCEPTION.errorCode should be("ERR_TRANSFORM_1022") + ErrorConstants.ERR_TRANSFORMATION_FAILED.errorCode should be("ERR_TRANSFORM_1023") + } + + it should "cover system event model" in { + + Stats.withName("latency_time") should be (Stats.latency_time) + Stats.withName("processing_time") should be (Stats.processing_time) + Stats.withName("total_processing_time") should be (Stats.total_processing_time) + + PDataType.withName("flink") should be(PDataType.flink) + PDataType.withName("api") should be(PDataType.api) + PDataType.withName("kafka") should be(PDataType.kafka) + PDataType.withName("druid") should be(PDataType.druid) + PDataType.withName("spark") should be(PDataType.spark) + + StatusCode.withName("failed") should be (StatusCode.failed) + StatusCode.withName("partial") should be (StatusCode.partial) + StatusCode.withName("skipped") should be (StatusCode.skipped) + StatusCode.withName("success") should be (StatusCode.success) + + ModuleID.withName("ingestion") should be(ModuleID.ingestion) + ModuleID.withName("processing") should be(ModuleID.processing) + ModuleID.withName("storage") should be(ModuleID.storage) + ModuleID.withName("query") should be(ModuleID.query) + + ModuleID.withName("ingestion") should be(ModuleID.ingestion) + ModuleID.withName("processing") should be(ModuleID.processing) + ModuleID.withName("storage") should be(ModuleID.storage) + ModuleID.withName("query") should be(ModuleID.query) + + Producer.withName("extractor") should be(Producer.extractor) + Producer.withName("validator") should be(Producer.validator) + Producer.withName("dedup") should be(Producer.dedup) + Producer.withName("denorm") should be(Producer.denorm) + Producer.withName("transformer") should be(Producer.transformer) + Producer.withName("router") should be(Producer.router) + Producer.withName("masterdataprocessor") should be(Producer.masterdataprocessor) + + EventID.withName("METRIC") should be (EventID.METRIC) + EventID.withName("LOG") should be (EventID.LOG) + + ErrorLevel.withName("info") should be (ErrorLevel.info) + ErrorLevel.withName("warn") should be (ErrorLevel.warn) + ErrorLevel.withName("debug") should be (ErrorLevel.debug) + ErrorLevel.withName("critical") should be (ErrorLevel.critical) + + val funcErrorsStringList = FunctionalError.values.map(f => f.toString).toList + val funcErrors = JSONUtil.deserialize[FuncErrorList](JSONUtil.serialize(Map("list" -> funcErrorsStringList))) + funcErrors.list.contains(FunctionalError.MissingTimestampKey) should be (true) + + val sysEvent = SystemEvent(etype = EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(id = "testjob", `type` = PDataType.flink, pid = Some(Producer.router)), dataset = Some("d1"), eid = Some("event1")), + data = EData( + error = Some(ErrorLog(pdata_id = 
Producer.router, pdata_status = StatusCode.failed, error_type = FunctionalError.MissingTimestampKey, error_code = ErrorConstants.DENORM_KEY_MISSING.errorCode, error_message = ErrorConstants.DENORM_KEY_MISSING.errorMsg, error_level = ErrorLevel.warn, error_count = Some(1))), + pipeline_stats = Some(PipelineStats(extractor_events = Some(2), extractor_status = Some(StatusCode.success), extractor_time = Some(123l), validator_status = Some(StatusCode.success), validator_time = Some(786l), dedup_status = Some(StatusCode.skipped), dedup_time = Some(0l), denorm_status = Some(StatusCode.partial), denorm_time = Some(345l), transform_status = Some(StatusCode.success), transform_time = Some(98l), total_processing_time = Some(1543l), latency_time = Some(23l), processing_time = Some(1520l))), Some(Map("duration" -> 2000.asInstanceOf[AnyRef])) + ) + ) + sysEvent.etype should be (EventID.METRIC) + + val config: Config = ConfigFactory.load("test2.conf") + val bsMapConfig = new BaseProcessTestMapConfig(config) + bsMapConfig.kafkaProducerProperties.get(ProducerConfig.COMPRESSION_TYPE_CONFIG).asInstanceOf[String] should be ("snappy") + bsMapConfig.kafkaConsumerProperties() + bsMapConfig.enableDistributedCheckpointing should be (None) + bsMapConfig.checkpointingBaseUrl should be (None) + bsMapConfig.datasetType() should be ("master-dataset") + + val dsk = new DatasetKeySelector() + dsk.getKey(mutable.Map("dataset" -> "d1".asInstanceOf[AnyRef])) should be ("d1") + + JSONUtil.getJsonType("""{"test":123}""") should be ("OBJECT") + JSONUtil.getJsonType("""{"test":123""") should be ("NOT_A_JSON") + JSONUtil.getJsonType("""123""") should be ("NOT_A_JSON") + + } + +} diff --git a/framework/src/test/scala/org/sunbird/spec/PostgresConnectSpec.scala b/framework/src/test/scala/org/sunbird/spec/PostgresConnectSpec.scala index 619769fd..088c2768 100644 --- a/framework/src/test/scala/org/sunbird/spec/PostgresConnectSpec.scala +++ b/framework/src/test/scala/org/sunbird/spec/PostgresConnectSpec.scala @@ -34,7 +34,7 @@ class PostgresConnectSpec extends BaseSpecWithPostgres with Matchers with Mockit assertEquals("custchannel", rs.getString("channel")) } - val resetConnection = postgresConnect.reset + val resetConnection: Unit = postgresConnect.reset() assertNotNull(resetConnection) postgresConnect.closeConnection() } diff --git a/framework/src/test/scala/org/sunbird/spec/RedisTestSpec.scala b/framework/src/test/scala/org/sunbird/spec/RedisTestSpec.scala index a12543aa..9e897b74 100644 --- a/framework/src/test/scala/org/sunbird/spec/RedisTestSpec.scala +++ b/framework/src/test/scala/org/sunbird/spec/RedisTestSpec.scala @@ -17,34 +17,38 @@ class RedisTestSpec extends BaseSpec with Matchers with MockitoSugar { val redisConnection = new RedisConnect(baseConfig.redisHost, baseConfig.redisPort, baseConfig.redisConnectionTimeout) val status = redisConnection.getConnection(2) status.isConnected should be(true) + + val status2 = redisConnection.getConnection(2, 1000l) + status2.isConnected should be(true) } - "DedupEngine functionality" should "be able to identify if the key is unique or duplicate & it should able throw jedis excption for invalid action" in intercept[JedisException] { + "DedupEngine functionality" should "be able to identify if the key is unique or duplicate & it should able throw jedis exception for invalid action" in { val redisConnection = new RedisConnect(baseConfig.redisHost, baseConfig.redisPort, baseConfig.redisConnectionTimeout) val dedupEngine = new DedupEngine(redisConnection, 2, 200) - 
dedupEngine.getRedisConnection should not be (null) + dedupEngine.getRedisConnection should not be null dedupEngine.isUniqueEvent("key-1") should be(true) dedupEngine.storeChecksum("key-1") dedupEngine.isUniqueEvent("key-1") should be(false) - dedupEngine.isUniqueEvent(null) + a[JedisException] should be thrownBy {dedupEngine.isUniqueEvent(null)} dedupEngine.closeConnectionPool() } - it should "be able to reconnect when a jedis exception for invalid action is thrown" in intercept[JedisException] { + it should "be able to reconnect when a jedis exception for invalid action is thrown" in { val redisConnection = new RedisConnect(baseConfig.redisHost, baseConfig.redisPort, baseConfig.redisConnectionTimeout) val dedupEngine = new DedupEngine(redisConnection, 0, 4309535) dedupEngine.isUniqueEvent("event-id-3") should be(true) - dedupEngine.storeChecksum(null) - dedupEngine.getRedisConnection should not be(null) + a[JedisException] should be thrownBy {dedupEngine.storeChecksum(null)} + dedupEngine.getRedisConnection should not be null } - - "RestUtil functionality" should "be able to return response" in { val restUtil = new RestUtil() - val url = "https://httpbin.org/json"; - val response = restUtil.get(url); + val url = "https://httpbin.org/json" + val response = restUtil.get(url, Some(Map("x-auth" -> "123"))) response should not be null + + val response2 = restUtil.get("https://httpbin.org/json") + response2 should not be null } } \ No newline at end of file diff --git a/framework/src/test/scala/org/sunbird/spec/SerdeUtilTestSpec.scala b/framework/src/test/scala/org/sunbird/spec/SerdeUtilTestSpec.scala new file mode 100644 index 00000000..74ac75f6 --- /dev/null +++ b/framework/src/test/scala/org/sunbird/spec/SerdeUtilTestSpec.scala @@ -0,0 +1,75 @@ +package org.sunbird.spec + +import org.apache.flink.util.Collector +import org.apache.kafka.clients.consumer.ConsumerRecord +import org.apache.kafka.common.record.TimestampType +import org.scalamock.matchers.ArgCapture.CaptureAll +import org.scalamock.scalatest.MockFactory +import org.scalatest.{FlatSpec, Matchers} +import org.sunbird.obsrv.core.serde.{MapDeserializationSchema, StringDeserializationSchema} + +import java.nio.charset.StandardCharsets +import scala.collection.mutable + +class SerdeUtilTestSpec extends FlatSpec with Matchers with MockFactory { + + + + "SerdeUtil" should "test all the map and string deserialization classes" in { + + val strCollector: Collector[String] = mock[Collector[String]] + val mapCollector: Collector[mutable.Map[String, AnyRef]] = mock[Collector[mutable.Map[String, AnyRef]]] + val key = "key1".getBytes(StandardCharsets.UTF_8) + val validEvent = """{"event":{"id":1234}}""".getBytes(StandardCharsets.UTF_8) + val eventWithObsrvMeta = """{"event":{"id":1234},"obsrv_meta":{}}""".getBytes(StandardCharsets.UTF_8) + val invalidEvent = """{"event":{"id":1234}""".getBytes(StandardCharsets.UTF_8) + + val validRecord = new ConsumerRecord[Array[Byte], Array[Byte]]("test-topic", 0, 1234l, 1701447470737l, TimestampType.CREATE_TIME, -1l, -1, -1, key, validEvent) + val validRecordWithObsrvMeta = new ConsumerRecord[Array[Byte], Array[Byte]]("test-topic", 0, 1234l, 1701447470737l, TimestampType.CREATE_TIME, -1l, -1, -1, key, eventWithObsrvMeta) + val invalidRecord = new ConsumerRecord[Array[Byte], Array[Byte]]("test-topic", 0, 1234l, 1701447470737l, TimestampType.CREATE_TIME, -1l, -1, -1, key, invalidEvent) + + + val sds = new StringDeserializationSchema() + (strCollector.collect _).expects("""{"event":{"id":1234}}""") + 
sds.deserialize(validRecord, strCollector) + + val c = CaptureAll[mutable.Map[String, AnyRef]]() + val mds = new MapDeserializationSchema() + mapCollector.collect _ expects capture(c) repeat 3 + mds.deserialize(validRecord, mapCollector) + mds.deserialize(validRecordWithObsrvMeta, mapCollector) + mds.deserialize(invalidRecord, mapCollector) + //(mapCollector.collect _).verify(*).once() + val validMsg: mutable.Map[String, AnyRef] = c.values.apply(0) + val validMsgWithObsrvMeta: mutable.Map[String, AnyRef] = c.values.apply(1) + val invalidMsg: mutable.Map[String, AnyRef] = c.values.apply(2) + Console.println("validMsg", validMsg) + validMsg.get("obsrv_meta").isDefined should be (true) + val validObsrvMeta = validMsg.get("obsrv_meta").get.asInstanceOf[Map[String, AnyRef]] + val validEventMsg = validMsg.get("event").get.asInstanceOf[Map[String, AnyRef]] + validObsrvMeta.get("syncts").get.asInstanceOf[Long] should be (1701447470737l) + validObsrvMeta.get("processingStartTime").get.asInstanceOf[Long] should be >= 1701447470737l + validEventMsg.get("id").get.asInstanceOf[Int] should be (1234) + + Console.println("validMsgWithObsrvMeta", validMsgWithObsrvMeta) + validMsgWithObsrvMeta.get("obsrv_meta").isDefined should be(true) + validMsgWithObsrvMeta.get("event").isDefined should be(true) + val validObsrvMeta2 = validMsgWithObsrvMeta.get("obsrv_meta").get.asInstanceOf[Map[String, AnyRef]] + val validEventMsg2 = validMsgWithObsrvMeta.get("event").get.asInstanceOf[Map[String, AnyRef]] + validObsrvMeta2.keys.size should be(0) + validEventMsg2.get("id").get.asInstanceOf[Int] should be (1234) + + Console.println("invalidMsg", invalidMsg) + invalidMsg.get("obsrv_meta").isDefined should be(true) + invalidMsg.get("event").isDefined should be(false) + val invalidObsrvMeta = invalidMsg.get("obsrv_meta").get.asInstanceOf[Map[String, AnyRef]] + val invalidEventMsg = invalidMsg.get("invalid_json").get.asInstanceOf[String] + invalidObsrvMeta.get("syncts").get.asInstanceOf[Long] should be(1701447470737l) + invalidObsrvMeta.get("processingStartTime").get.asInstanceOf[Long] should be >= 1701447470737l + invalidEventMsg should be("""{"event":{"id":1234}""") + } + + it should "test generic serialization schema" in { + + } +} diff --git a/framework/src/test/scala/org/sunbird/spec/SystemConfigSpec.scala b/framework/src/test/scala/org/sunbird/spec/SystemConfigSpec.scala new file mode 100644 index 00000000..1907e82d --- /dev/null +++ b/framework/src/test/scala/org/sunbird/spec/SystemConfigSpec.scala @@ -0,0 +1,114 @@ +package org.sunbird.spec + +import com.typesafe.config.{Config, ConfigFactory} +import org.scalamock.scalatest.MockFactory +import org.scalatest.Matchers +import org.sunbird.obsrv.core.model.{SystemConfig, SystemConfigService} +import org.sunbird.obsrv.core.util.{PostgresConnect, PostgresConnectionConfig} + +class SystemConfigSpec extends BaseSpecWithPostgres with Matchers with MockFactory { + val configFile: Config = ConfigFactory.load("base-test.conf") + val postgresConfig: PostgresConnectionConfig = PostgresConnectionConfig( + configFile.getString("postgres.user"), + configFile.getString("postgres.password"), + configFile.getString("postgres.database"), + configFile.getString("postgres.host"), + configFile.getInt("postgres.port"), + configFile.getInt("postgres.maxConnections")) + + override def beforeAll(): Unit = { + super.beforeAll() + val postgresConnect = new PostgresConnect(postgresConfig) + createSystemSettings(postgresConnect) + } + + override def afterAll(): Unit = { + val postgresConnect = new 
PostgresConnect(postgresConfig) + clearSystemSettings(postgresConnect) + super.afterAll() + } + + def createInvalidSystemSettings(postgresConnect: PostgresConnect): Unit = { + postgresConnect.execute("CREATE TABLE IF NOT EXISTS system_settings ( key text NOT NULL, value text NOT NULL, category text NOT NULL DEFAULT 'SYSTEM'::text, valuetype text NOT NULL, created_date timestamp NOT NULL DEFAULT now(), updated_date timestamp, label text, PRIMARY KEY (\"key\"));") + postgresConnect.execute("insert into system_settings values('defaultDedupPeriodInSeconds', '604801', 'system', 'double', now(), now(), 'Dedup Period in Seconds');") + postgresConnect.execute("insert into system_settings values('maxEventSize', '1048676', 'system', 'inv', now(), now(), 'Max Event Size');") + postgresConnect.execute("insert into system_settings values('defaultDatasetId', 'ALL', 'system', 'random', now(), now(), 'Default Dataset Id');") + postgresConnect.execute("insert into system_settings values('encryptionSecretKey', 'ckW5GFkTtMDNGEr5k67YpQMEBJNX3x2f', 'system', 'text', now(), now(), 'Encryption Secret Key');") + } + + "SystemConfig" should "populate configurations with values from database" in { + SystemConfig.getInt("defaultDedupPeriodInSeconds") should be(604801) + SystemConfig.getInt("defaultDedupPeriodInSeconds", 604800) should be(604801) + SystemConfig.getLong("maxEventSize", 100L) should be(1048676L) + SystemConfig.getString("defaultDatasetId", "NEW") should be("ALL") + SystemConfig.getString("encryptionSecretKey", "test") should be("ckW5GFkTtMDNGEr5k67YpQMEBJNX3x2f") + SystemConfig.getBoolean("enable", false) should be(true) + } + + "SystemConfig" should "return default values when keys are not present in db" in { + val postgresConnect = new PostgresConnect(postgresConfig) + postgresConnect.execute("TRUNCATE TABLE system_settings;") + SystemConfig.getInt("defaultDedupPeriodInSeconds", 604800) should be(604800) + SystemConfig.getLong("maxEventSize", 100L) should be(100L) + SystemConfig.getString("defaultDatasetId", "NEW") should be("NEW") + SystemConfig.getString("encryptionSecretKey", "test") should be("test") + SystemConfig.getBoolean("enable", false) should be(false) + } + + "SystemConfig" should "throw exception when valueType doesn't match" in { + val postgresConnect = new PostgresConnect(postgresConfig) + clearSystemSettings(postgresConnect) + createInvalidSystemSettings(postgresConnect) + val thrown = intercept[Exception] { + SystemConfig.getInt("defaultDedupPeriodInSeconds", 604800) + } + thrown.getMessage should be("Invalid value type for system setting") + } + + "SystemConfig" should "throw exception when valueType doesn't match without default value" in { + val postgresConnect = new PostgresConnect(postgresConfig) + clearSystemSettings(postgresConnect) + createInvalidSystemSettings(postgresConnect) + val thrown = intercept[Exception] { + SystemConfig.getInt("defaultDedupPeriodInSeconds") + } + thrown.getMessage should be("Invalid value type for system setting") + } + + "SystemConfigService" should "return all system settings" in { + val systemSettings = SystemConfigService.getAllSystemSettings + systemSettings.size should be(4) + systemSettings.map(f => { + f.key match { + case "defaultDedupPeriodInSeconds" => f.value should be("604801") + case "maxEventSize" => f.value should be("1048676") + case "defaultDatasetId" => f.value should be("ALL") + case "encryptionSecretKey" => f.value should be("ckW5GFkTtMDNGEr5k67YpQMEBJNX3x2f") + case "enable" => f.value should be("true") + } + }) + } + + 
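For context on the value-type and missing-key assertions in this spec, here is a minimal sketch of how such a type-checked lookup could behave. It assumes a simple (key, value, valuetype) row shape matching the system_settings table created above; the framework's actual SystemConfig/SystemConfigService implementation may differ.

  // Illustrative sketch only - not the framework's actual SystemConfig code.
  case class SystemSettingRow(key: String, value: String, valuetype: String)

  def readIntSetting(row: Option[SystemSettingRow], defaultValue: Option[Int]): Int = row match {
    case Some(s) if s.valuetype == "int" => s.value.toInt                         // e.g. "604801"
    case Some(_) => throw new Exception("Invalid value type for system setting")  // e.g. valuetype 'double' or 'inv'
    case None => defaultValue.getOrElse(throw new Exception("System setting not found for requested key"))
  }
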
"SystemConfig" should "throw exception when the key is not present in db" in { + var thrown = intercept[Exception] { + SystemConfig.getInt("invalidKey") + } + thrown.getMessage should be("System setting not found for requested key") + + thrown = intercept[Exception] { + SystemConfig.getString("invalidKey") + } + thrown.getMessage should be("System setting not found for requested key") + + thrown = intercept[Exception] { + SystemConfig.getBoolean("invalidKey") + } + thrown.getMessage should be("System setting not found for requested key") + + thrown = intercept[Exception] { + SystemConfig.getLong("invalidKey") + } + thrown.getMessage should be("System setting not found for requested key") + } + +} diff --git a/framework/src/test/scala/org/sunbird/spec/TestMapStreamFunc.scala b/framework/src/test/scala/org/sunbird/spec/TestMapStreamFunc.scala index 3e06de54..457f26ba 100644 --- a/framework/src/test/scala/org/sunbird/spec/TestMapStreamFunc.scala +++ b/framework/src/test/scala/org/sunbird/spec/TestMapStreamFunc.scala @@ -1,16 +1,20 @@ package org.sunbird.spec -import scala.collection.mutable.Map import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.streaming.api.functions.ProcessFunction import org.sunbird.obsrv.core.cache.{DedupEngine, RedisConnect} -import org.sunbird.obsrv.core.model.ErrorConstants -import org.sunbird.obsrv.core.streaming.{BaseProcessFunction, Metrics, MetricsList} +import org.sunbird.obsrv.core.model.{Constants, ErrorConstants, Producer} +import org.sunbird.obsrv.core.streaming.{BaseDeduplication, BaseProcessFunction, Metrics, MetricsList} import org.sunbird.obsrv.core.util.JSONUtil +import java.util.concurrent.ConcurrentHashMap +import java.util.concurrent.atomic.AtomicLong +import scala.collection.mutable +import scala.collection.mutable.Map + class TestMapStreamFunc(config: BaseProcessTestMapConfig)(implicit val stringTypeInfo: TypeInformation[String]) - extends BaseProcessFunction[Map[String, AnyRef], Map[String, AnyRef]](config) { + extends BaseProcessFunction[Map[String, AnyRef], Map[String, AnyRef]](config) with BaseDeduplication { override def getMetricsList(): MetricsList = { MetricsList(List("ALL"), List(config.mapEventCount)) @@ -23,27 +27,32 @@ class TestMapStreamFunc(config: BaseProcessTestMapConfig)(implicit val stringTyp metrics.reset("ALL", config.mapEventCount) metrics.incCounter("ALL", config.mapEventCount) metrics.getAndReset("ALL", config.mapEventCount) - context.output(config.mapOutputTag, event) + assert(metrics.hasDataset("ALL")) + metrics.initDataset("d2", new ConcurrentHashMap[String, AtomicLong]()) + + context.output(config.mapOutputTag, mutable.Map(Constants.TOPIC -> config.kafkaMapOutputTopic, Constants.MESSAGE -> event)) - super.markSuccess(event, "test-job") - super.markFailed(event, ErrorConstants.NO_IMPLEMENTATION_FOUND, config.jobName) - super.markSkipped(event, config.jobName) + super.markSuccess(event, Producer.extractor) + super.markFailed(event, ErrorConstants.NO_IMPLEMENTATION_FOUND, Producer.extractor) + super.markSkipped(event, Producer.extractor) super.markComplete(event, None) + super.markPartial(event, Producer.extractor) assert(super.containsEvent(event)) + assert(!super.containsEvent(mutable.Map("test" -> "123".asInstanceOf[AnyRef]))) assert(!super.containsEvent(Map("dataset" -> "d1"))) val eventStr = JSONUtil.serialize(event) val code = JSONUtil.getKey("event.vehicleCode", eventStr).textValue() val redisConnection = new RedisConnect(config.redisHost, config.redisPort, config.redisConnectionTimeout) 
implicit val dedupEngine = new DedupEngine(redisConnection, 2, 200) - val isDup = super.isDuplicate("D1", Option("event.id"), eventStr, context, config) + val isDup = super.isDuplicate("D1", Option("event.id"), eventStr) code match { case "HYUN-CRE-D6" => assert(!isDup) case "HYUN-CRE-D7" => assert(isDup) } - assert(!super.isDuplicate("D1", None, eventStr, context, config)) - assert(!super.isDuplicate("D1", Option("mid"), eventStr, context, config)) - assert(!super.isDuplicate("D1", Option("event"), eventStr, context, config)) + assert(!super.isDuplicate("D1", None, eventStr)) + assert(!super.isDuplicate("D1", Option("mid"), eventStr)) + assert(!super.isDuplicate("D1", Option("event"), eventStr)) } } diff --git a/framework/src/test/scala/org/sunbird/spec/TestMapStreamTask.scala b/framework/src/test/scala/org/sunbird/spec/TestMapStreamTask.scala index b64f2468..09ce0c90 100644 --- a/framework/src/test/scala/org/sunbird/spec/TestMapStreamTask.scala +++ b/framework/src/test/scala/org/sunbird/spec/TestMapStreamTask.scala @@ -13,15 +13,17 @@ class TestMapStreamTask(config: BaseProcessTestMapConfig, kafkaConnector: FlinkK implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) val dataStream = getMapDataStream(env, config, kafkaConnector) processStream(dataStream) + val dataStream2 = getMapDataStream(env, config, List(config.inputTopic()), config.kafkaConsumerProperties(), config.inputConsumer(), kafkaConnector) env.execute(config.jobName) } override def processStream(dataStream: DataStream[mutable.Map[String, AnyRef]]): DataStream[mutable.Map[String, AnyRef]] = { val stream = dataStream.process(new TestMapStreamFunc(config)) stream.getSideOutput(config.mapOutputTag) - .sinkTo(kafkaConnector.kafkaMapSink(config.kafkaMapOutputTopic)) + .sinkTo(kafkaConnector.kafkaMapDynamicSink()) .name("Map-Event-Producer") + addDefaultSinks(stream, config, kafkaConnector) stream.getSideOutput(config.mapOutputTag) } } diff --git a/framework/src/test/scala/org/sunbird/spec/TestStringStreamTask.scala b/framework/src/test/scala/org/sunbird/spec/TestStringStreamTask.scala index 116efbe3..5bc1cc1f 100644 --- a/framework/src/test/scala/org/sunbird/spec/TestStringStreamTask.scala +++ b/framework/src/test/scala/org/sunbird/spec/TestStringStreamTask.scala @@ -1,26 +1,24 @@ package org.sunbird.spec -import org.apache.flink.api.java.functions.KeySelector import org.apache.flink.api.scala.createTypeInformation import org.apache.flink.streaming.api.datastream.DataStream import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment import org.sunbird.obsrv.core.streaming.{BaseStreamTask, FlinkKafkaConnector} import org.sunbird.obsrv.core.util.FlinkUtil -import scala.collection.mutable - class TestStringStreamTask(config: BaseProcessTestConfig, kafkaConnector: FlinkKafkaConnector) extends BaseStreamTask[String] { override def process(): Unit = { implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) val dataStream = getStringDataStream(env, config, kafkaConnector) processStream(dataStream) + val dataStream2 = getStringDataStream(env, config, List(config.inputTopic()), config.kafkaConsumerProperties(), config.inputConsumer(), kafkaConnector) env.execute(config.jobName) } override def processStream(dataStream: DataStream[String]): DataStream[String] = { val stream = dataStream.process(new TestStringStreamFunc(config)).name("TestStringEventStream") stream.getSideOutput(config.stringOutputTag) - 
.sinkTo(kafkaConnector.kafkaStringSink(config.kafkaStringOutputTopic)) + .sinkTo(kafkaConnector.kafkaSink[String](config.kafkaStringOutputTopic)) .name("String-Event-Producer") stream.getSideOutput(config.stringOutputTag) diff --git a/pipeline/denormalizer/pom.xml b/pipeline/denormalizer/pom.xml index f3462d71..2df98cd3 100644 --- a/pipeline/denormalizer/pom.xml +++ b/pipeline/denormalizer/pom.xml @@ -4,9 +4,6 @@ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 4.0.0 - - 3.0.1 - org.sunbird.obsrv @@ -45,6 +42,18 @@ dataset-registry 1.0.0 + + org.apache.kafka + kafka-clients + ${kafka.version} + test + + + org.apache.kafka + kafka_${scala.maj.version} + ${kafka.version} + test + org.sunbird.obsrv framework @@ -52,6 +61,13 @@ test-jar test + + org.sunbird.obsrv + dataset-registry + 1.0.0 + test-jar + test + org.apache.flink flink-test-utils @@ -66,9 +82,21 @@ tests - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 + test + + + io.github.embeddedkafka + embedded-kafka_2.12 + 3.4.0 + test + + + io.zonky.test + embedded-postgres + 2.0.3 test diff --git a/pipeline/denormalizer/src/main/resources/de-normalization.conf b/pipeline/denormalizer/src/main/resources/de-normalization.conf index 1272058b..63b5bc9e 100644 --- a/pipeline/denormalizer/src/main/resources/de-normalization.conf +++ b/pipeline/denormalizer/src/main/resources/de-normalization.conf @@ -3,7 +3,7 @@ include "baseconfig.conf" kafka { input.topic = ${job.env}".unique" output.denorm.topic = ${job.env}".denorm" - output.denorm.failed.topic = ${job.env}".denorm.failed" + output.denorm.failed.topic = ${job.env}".failed" groupId = ${job.env}"-denormalizer-group" } diff --git a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerFunction.scala b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerFunction.scala index d294009c..45a41c67 100644 --- a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerFunction.scala +++ b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerFunction.scala @@ -3,25 +3,28 @@ package org.sunbird.obsrv.denormalizer.functions import org.apache.flink.configuration.Configuration import org.apache.flink.streaming.api.functions.ProcessFunction import org.slf4j.LoggerFactory -import org.sunbird.obsrv.core.exception.ObsrvException -import org.sunbird.obsrv.core.streaming.{BaseProcessFunction, Metrics, MetricsList} -import org.sunbird.obsrv.core.util.Util +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model.Producer.Producer +import org.sunbird.obsrv.core.model.StatusCode.StatusCode +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.Metrics +import org.sunbird.obsrv.core.util.{JSONUtil, Util} import org.sunbird.obsrv.denormalizer.task.DenormalizerConfig -import org.sunbird.obsrv.denormalizer.util.DenormCache +import org.sunbird.obsrv.denormalizer.util.{DenormCache, DenormEvent} +import org.sunbird.obsrv.model.DatasetModels.Dataset import org.sunbird.obsrv.registry.DatasetRegistry +import org.sunbird.obsrv.streaming.BaseDatasetProcessFunction import scala.collection.mutable -class DenormalizerFunction(config: DenormalizerConfig) - extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) { +class DenormalizerFunction(config: DenormalizerConfig) extends 
BaseDatasetProcessFunction(config) { private[this] val logger = LoggerFactory.getLogger(classOf[DenormalizerFunction]) private[this] var denormCache: DenormCache = _ - override def getMetricsList(): MetricsList = { - val metrics = List(config.denormSuccess, config.denormTotal, config.denormFailed, config.eventsSkipped) - MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + override def getMetrics(): List[String] = { + List(config.denormSuccess, config.denormTotal, config.denormFailed, config.eventsSkipped, config.denormPartialSuccess) } override def open(parameters: Configuration): Unit = { @@ -35,29 +38,64 @@ class DenormalizerFunction(config: DenormalizerConfig) denormCache.close() } - override def processElement(msg: mutable.Map[String, AnyRef], + override def processElement(dataset: Dataset, msg: mutable.Map[String, AnyRef], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { - val datasetId = msg(config.CONST_DATASET).asInstanceOf[String] // DatasetId cannot be empty at this stage - metrics.incCounter(datasetId, config.denormTotal) - val dataset = DatasetRegistry.getDataset(datasetId).get - val event = Util.getMutableMap(msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) - + metrics.incCounter(dataset.id, config.denormTotal) + denormCache.open(dataset) if (dataset.denormConfig.isDefined) { - try { - msg.put(config.CONST_EVENT, denormCache.denormEvent(datasetId, event, dataset.denormConfig.get.denormFields).toMap) - metrics.incCounter(datasetId, config.denormSuccess) - context.output(config.denormEventsTag, markSuccess(msg, config.jobName)) - } catch { - case ex: ObsrvException => - metrics.incCounter(datasetId, config.denormFailed) - context.output(config.denormFailedTag, markFailed(msg, ex.error, config.jobName)) + val event = DenormEvent(msg) + val denormEvent = denormCache.denormEvent(dataset.id, event, dataset.denormConfig.get.denormFields) + val status = getDenormStatus(denormEvent) + context.output(config.denormEventsTag, markStatus(denormEvent.msg, Producer.denorm, status)) + status match { + case StatusCode.success => metrics.incCounter(dataset.id, config.denormSuccess) + case _ => + metrics.incCounter(dataset.id, if (status == StatusCode.partial) config.denormPartialSuccess else config.denormFailed) + generateSystemEvent(dataset, denormEvent, context) + logData(dataset.id, denormEvent) } } else { - metrics.incCounter(datasetId, config.eventsSkipped) - context.output(config.denormEventsTag, markSkipped(msg, config.jobName)) + metrics.incCounter(dataset.id, config.eventsSkipped) + context.output(config.denormEventsTag, markSkipped(msg, Producer.denorm)) } } + private def logData(datasetId: String, denormEvent: DenormEvent): Unit = { + logger.warn(s"Denormalizer | Denorm operation is not successful | dataset=$datasetId | denormStatus=${JSONUtil.serialize(denormEvent.fieldStatus)}") + } + + private def generateSystemEvent(dataset: Dataset, denormEvent: DenormEvent, context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context): Unit = { + + denormEvent.fieldStatus.filter(f => !f._2.success).groupBy(f => f._2.error.get).map(f => (f._1, f._2.size)) + .foreach(f => { + val functionalError = f._1 match { + case ErrorConstants.DENORM_KEY_MISSING => FunctionalError.DenormKeyMissing + case ErrorConstants.DENORM_KEY_NOT_A_STRING_OR_NUMBER => FunctionalError.DenormKeyInvalid + case ErrorConstants.DENORM_DATA_NOT_FOUND => FunctionalError.DenormDataNotFound + } + 
context.output(config.systemEventsOutputTag, JSONUtil.serialize(SystemEvent( + EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.denorm)), dataset = Some(dataset.id), dataset_type = Some(dataset.datasetType)), + data = EData(error = Some(ErrorLog(pdata_id = Producer.denorm, pdata_status = StatusCode.failed, error_type = functionalError, error_code = f._1.errorCode, error_message = f._1.errorMsg, error_level = ErrorLevel.critical, error_count = Some(f._2)))) + ))) + }) + } + + private def getDenormStatus(denormEvent: DenormEvent): StatusCode = { + val totalFieldsCount = denormEvent.fieldStatus.size + val successCount = denormEvent.fieldStatus.values.count(f => f.success) + if (totalFieldsCount == successCount) StatusCode.success else if (successCount > 0) StatusCode.partial else StatusCode.failed + + } + + private def markStatus(event: mutable.Map[String, AnyRef], producer: Producer, status: StatusCode): mutable.Map[String, AnyRef] = { + val obsrvMeta = Util.getMutableMap(event("obsrv_meta").asInstanceOf[Map[String, AnyRef]]) + addFlags(obsrvMeta, Map(producer.toString -> status.toString)) + addTimespan(obsrvMeta, producer) + event.put("obsrv_meta", obsrvMeta.toMap) + event + } + } diff --git a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerWindowFunction.scala b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerWindowFunction.scala index bf8ebf3d..ce603520 100644 --- a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerWindowFunction.scala +++ b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/functions/DenormalizerWindowFunction.scala @@ -5,26 +5,29 @@ import org.apache.flink.configuration.Configuration import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction import org.apache.flink.streaming.api.windowing.windows.TimeWindow import org.slf4j.LoggerFactory -import org.sunbird.obsrv.core.streaming.{Metrics, MetricsList, WindowBaseProcessFunction} +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model.Producer.Producer +import org.sunbird.obsrv.core.model.StatusCode.StatusCode +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.Metrics +import org.sunbird.obsrv.core.util.{JSONUtil, Util} import org.sunbird.obsrv.denormalizer.task.DenormalizerConfig import org.sunbird.obsrv.denormalizer.util._ import org.sunbird.obsrv.model.DatasetModels.Dataset import org.sunbird.obsrv.registry.DatasetRegistry +import org.sunbird.obsrv.streaming.BaseDatasetWindowProcessFunction -import java.lang -import scala.collection.JavaConverters._ import scala.collection.mutable class DenormalizerWindowFunction(config: DenormalizerConfig)(implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]]) - extends WindowBaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String](config) { + extends BaseDatasetWindowProcessFunction(config) { private[this] val logger = LoggerFactory.getLogger(classOf[DenormalizerWindowFunction]) private[this] var denormCache: DenormCache = _ - override def getMetricsList(): MetricsList = { - val metrics = List(config.denormSuccess, config.denormTotal, config.denormFailed, config.eventsSkipped) - MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + override def getMetrics(): List[String] = { + List(config.denormSuccess, config.denormTotal, 
config.denormFailed, config.eventsSkipped, config.denormPartialSuccess) } override def open(parameters: Configuration): Unit = { @@ -38,21 +41,20 @@ class DenormalizerWindowFunction(config: DenormalizerConfig)(implicit val eventT denormCache.close() } - override def process(datasetId: String, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, elements: lang.Iterable[mutable.Map[String, AnyRef]], metrics: Metrics): Unit = { + override def processWindow(dataset: Dataset, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, elements: List[mutable.Map[String, AnyRef]], metrics: Metrics): Unit = { - val eventsList = elements.asScala.toList - metrics.incCounter(datasetId, config.denormTotal, eventsList.size.toLong) - val dataset = DatasetRegistry.getDataset(datasetId).get - val denormEvents = eventsList.map(msg => { - DenormEvent(msg, None, None) + metrics.incCounter(dataset.id, config.denormTotal, elements.size.toLong) + denormCache.open(dataset) + val denormEvents = elements.map(msg => { + DenormEvent(msg) }) if (dataset.denormConfig.isDefined) { denormalize(denormEvents, dataset, metrics, context) } else { - metrics.incCounter(datasetId, config.eventsSkipped, eventsList.size.toLong) - eventsList.foreach(msg => { - context.output(config.denormEventsTag, markSkipped(msg, config.jobName)) + metrics.incCounter(dataset.id, config.eventsSkipped, elements.size.toLong) + elements.foreach(msg => { + context.output(config.denormEventsTag, markSkipped(msg, Producer.denorm)) }) } } @@ -61,16 +63,54 @@ class DenormalizerWindowFunction(config: DenormalizerConfig)(implicit val eventT context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context): Unit = { val datasetId = dataset.id - val denormEvents = denormCache.denormMultipleEvents(datasetId, events, dataset.denormConfig.get.denormFields) denormEvents.foreach(denormEvent => { - if (denormEvent.error.isEmpty) { - metrics.incCounter(datasetId, config.denormSuccess) - context.output(config.denormEventsTag, markSuccess(denormEvent.msg, config.jobName)) - } else { - metrics.incCounter(datasetId, config.denormFailed) - context.output(config.denormFailedTag, markFailed(denormEvent.msg, denormEvent.error.get, config.jobName)) + val status = getDenormStatus(denormEvent) + context.output(config.denormEventsTag, markStatus(denormEvent.msg, Producer.denorm, status.toString)) + status match { + case StatusCode.success => metrics.incCounter(dataset.id, config.denormSuccess) + case _ => + metrics.incCounter(dataset.id, if (status == StatusCode.partial) config.denormPartialSuccess else config.denormFailed) + generateSystemEvent(dataset, denormEvent, context) + logData(dataset.id, denormEvent) } }) } + + private def logData(datasetId: String, denormEvent: DenormEvent): Unit = { + logger.warn(s"Denormalizer | Denorm operation is not successful | dataset=$datasetId | denormStatus=${JSONUtil.serialize(denormEvent.fieldStatus)}") + } + + private def generateSystemEvent(dataset: Dataset, denormEvent: DenormEvent, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context): Unit = { + + denormEvent.fieldStatus.filter(f => !f._2.success).groupBy(f => f._2.error.get).map(f => (f._1, f._2.size)) + .foreach(f => { + val functionalError = f._1 match { + case ErrorConstants.DENORM_KEY_MISSING => FunctionalError.DenormKeyMissing + case 
ErrorConstants.DENORM_KEY_NOT_A_STRING_OR_NUMBER => FunctionalError.DenormKeyInvalid + case ErrorConstants.DENORM_DATA_NOT_FOUND => FunctionalError.DenormDataNotFound + } + context.output(config.systemEventsOutputTag, JSONUtil.serialize(SystemEvent( + EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.denorm)), dataset = Some(dataset.id), dataset_type = Some(dataset.datasetType)), + data = EData(error = Some(ErrorLog(pdata_id = Producer.denorm, pdata_status = StatusCode.failed, error_type = functionalError, error_code = f._1.errorCode, error_message = f._1.errorMsg, error_level = ErrorLevel.critical, error_count = Some(f._2)))) + ))) + }) + + } + + private def getDenormStatus(denormEvent: DenormEvent): StatusCode = { + val totalFieldsCount = denormEvent.fieldStatus.size + val successCount = denormEvent.fieldStatus.values.count(f => f.success) + if (totalFieldsCount == successCount) StatusCode.success else if (successCount > 0) StatusCode.partial else StatusCode.failed + + } + + private def markStatus(event: mutable.Map[String, AnyRef], producer: Producer, status: String): mutable.Map[String, AnyRef] = { + val obsrvMeta = Util.getMutableMap(event("obsrv_meta").asInstanceOf[Map[String, AnyRef]]) + addFlags(obsrvMeta, Map(producer.toString -> status)) + addTimespan(obsrvMeta, producer) + event.put("obsrv_meta", obsrvMeta.toMap) + event + } } \ No newline at end of file diff --git a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerConfig.scala b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerConfig.scala index 1d24793c..118c0307 100644 --- a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerConfig.scala +++ b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerConfig.scala @@ -5,9 +5,10 @@ import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.java.typeutils.TypeExtractor import org.apache.flink.streaming.api.scala.OutputTag import org.sunbird.obsrv.core.streaming.BaseJobConfig + import scala.collection.mutable -class DenormalizerConfig(override val config: Config) extends BaseJobConfig[mutable.Map[String, AnyRef]](config, "DenormalizerJob" ) { +class DenormalizerConfig(override val config: Config) extends BaseJobConfig[mutable.Map[String, AnyRef]](config, "DenormalizerJob") { private val serialVersionUID = 2905979434303791379L @@ -17,23 +18,20 @@ class DenormalizerConfig(override val config: Config) extends BaseJobConfig[muta // Kafka Topics Configuration val kafkaInputTopic: String = config.getString("kafka.input.topic") val denormOutputTopic: String = config.getString("kafka.output.denorm.topic") - val denormFailedTopic: String = config.getString("kafka.output.denorm.failed.topic") // Windows val windowTime: Int = config.getInt("task.window.time.in.seconds") val windowCount: Int = config.getInt("task.window.count") val DENORM_EVENTS_PRODUCER = "denorm-events-producer" - val DENORM_FAILED_EVENTS_PRODUCER = "denorm-failed-events-producer" private val DENORM_EVENTS = "denorm_events" - private val FAILED_EVENTS = "denorm_failed_events" val denormEventsTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](DENORM_EVENTS) - val denormFailedTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](FAILED_EVENTS) - val eventsSkipped = "events-skipped" + val eventsSkipped = "denorm-skipped" val 
denormFailed = "denorm-failed" + val denormPartialSuccess = "denorm-partial-success" val denormSuccess = "denorm-success" val denormTotal = "denorm-total" @@ -46,5 +44,6 @@ class DenormalizerConfig(override val config: Config) extends BaseJobConfig[muta override def inputTopic(): String = kafkaInputTopic override def inputConsumer(): String = denormalizationConsumer override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = denormEventsTag + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") -} +} \ No newline at end of file diff --git a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerStreamTask.scala b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerStreamTask.scala index 9d620a8f..b23cd612 100644 --- a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerStreamTask.scala +++ b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerStreamTask.scala @@ -22,22 +22,25 @@ class DenormalizerStreamTask(config: DenormalizerConfig, kafkaConnector: FlinkKa def process(): Unit = { implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) - val dataStream = getMapDataStream(env, config, kafkaConnector) - processStream(dataStream) + process(env) env.execute(config.jobName) } // $COVERAGE-ON$ + def process(env: StreamExecutionEnvironment): Unit = { + val dataStream = getMapDataStream(env, config, kafkaConnector) + processStream(dataStream) + } + override def processStream(dataStream: DataStream[mutable.Map[String, AnyRef]]): DataStream[mutable.Map[String, AnyRef]] = { val denormStream = dataStream .process(new DenormalizerFunction(config)).name(config.denormalizationFunction).uid(config.denormalizationFunction) .setParallelism(config.downstreamOperatorsParallelism) - denormStream.getSideOutput(config.denormEventsTag).sinkTo(kafkaConnector.kafkaMapSink(config.denormOutputTopic)) + denormStream.getSideOutput(config.denormEventsTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.denormOutputTopic)) .name(config.DENORM_EVENTS_PRODUCER).uid(config.DENORM_EVENTS_PRODUCER).setParallelism(config.downstreamOperatorsParallelism) - denormStream.getSideOutput(config.denormFailedTag).sinkTo(kafkaConnector.kafkaMapSink(config.denormFailedTopic)) - .name(config.DENORM_FAILED_EVENTS_PRODUCER).uid(config.DENORM_FAILED_EVENTS_PRODUCER).setParallelism(config.downstreamOperatorsParallelism) + addDefaultSinks(denormStream, config, kafkaConnector) denormStream.getSideOutput(config.successTag()) } } diff --git a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerWindowStreamTask.scala b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerWindowStreamTask.scala index 81fb2368..73b6c6ec 100644 --- a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerWindowStreamTask.scala +++ b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/task/DenormalizerWindowStreamTask.scala @@ -9,7 +9,7 @@ import org.apache.flink.streaming.api.datastream.WindowedStream import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment import org.apache.flink.streaming.api.windowing.time.Time import org.apache.flink.streaming.api.windowing.windows.TimeWindow -import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector +import 
org.sunbird.obsrv.core.streaming.{BaseStreamTaskSink, FlinkKafkaConnector} import org.sunbird.obsrv.core.util.{DatasetKeySelector, FlinkUtil, TumblingProcessingTimeCountWindows} import org.sunbird.obsrv.denormalizer.functions.DenormalizerWindowFunction @@ -19,15 +19,21 @@ import scala.collection.mutable /** * Denormalization stream task does the following pipeline processing in a sequence: */ -class DenormalizerWindowStreamTask(config: DenormalizerConfig, kafkaConnector: FlinkKafkaConnector) { +class DenormalizerWindowStreamTask(config: DenormalizerConfig, kafkaConnector: FlinkKafkaConnector) extends BaseStreamTaskSink[mutable.Map[String, AnyRef]] { private val serialVersionUID = -7729362727131516112L + // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink cluster def process(): Unit = { implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) - implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) + process(env) + env.execute(config.jobName) + } + // $COVERAGE-ON$ + def process(env: StreamExecutionEnvironment): Unit = { + implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) val source = kafkaConnector.kafkaMapSource(config.inputTopic()) val windowedStream: WindowedStream[mutable.Map[String, AnyRef], String, TimeWindow] = env.fromSource(source, WatermarkStrategy.noWatermarks[mutable.Map[String, AnyRef]](), config.denormalizationConsumer).uid(config.denormalizationConsumer) .setParallelism(config.kafkaConsumerParallelism).rebalance() @@ -35,16 +41,15 @@ class DenormalizerWindowStreamTask(config: DenormalizerConfig, kafkaConnector: F .window(TumblingProcessingTimeCountWindows.of(Time.seconds(config.windowTime), config.windowCount)) val denormStream = windowedStream - .process(new DenormalizerWindowFunction(config)).name(config.denormalizationFunction).uid(config.denormalizationFunction) - .setParallelism(config.downstreamOperatorsParallelism) + .process(new DenormalizerWindowFunction(config)).name(config.denormalizationFunction).uid(config.denormalizationFunction) + .setParallelism(config.downstreamOperatorsParallelism) - denormStream.getSideOutput(config.denormEventsTag).sinkTo(kafkaConnector.kafkaMapSink(config.denormOutputTopic)) + denormStream.getSideOutput(config.denormEventsTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.denormOutputTopic)) .name(config.DENORM_EVENTS_PRODUCER).uid(config.DENORM_EVENTS_PRODUCER).setParallelism(config.downstreamOperatorsParallelism) - denormStream.getSideOutput(config.denormFailedTag).sinkTo(kafkaConnector.kafkaMapSink(config.denormFailedTopic)) - .name(config.DENORM_FAILED_EVENTS_PRODUCER).uid(config.DENORM_FAILED_EVENTS_PRODUCER).setParallelism(config.downstreamOperatorsParallelism) - env.execute(config.jobName) + addDefaultSinks(denormStream, config, kafkaConnector) } + } // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink cluster diff --git a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/util/DenormCache.scala b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/util/DenormCache.scala index 2e3aa3a1..dd94a251 100644 --- a/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/util/DenormCache.scala +++ b/pipeline/denormalizer/src/main/scala/org/sunbird/obsrv/denormalizer/util/DenormCache.scala @@ -1,8 +1,6 @@ 
package org.sunbird.obsrv.denormalizer.util -import org.slf4j.LoggerFactory import org.sunbird.obsrv.core.cache.RedisConnect -import org.sunbird.obsrv.core.exception.ObsrvException import org.sunbird.obsrv.core.model.ErrorConstants import org.sunbird.obsrv.core.model.ErrorConstants.Error import org.sunbird.obsrv.core.util.{JSONUtil, Util} @@ -12,11 +10,12 @@ import redis.clients.jedis.{Pipeline, Response} import scala.collection.mutable -case class DenormEvent(msg: mutable.Map[String, AnyRef], var responses: Option[mutable.Map[String, Response[String]]], var error: Option[Error]) +case class DenormFieldStatus(fieldValue: String, var success: Boolean, var error: Option[Error]) + +case class DenormEvent(msg: mutable.Map[String, AnyRef], var responses: Option[mutable.Map[String, Response[String]]] = None, var fieldStatus: mutable.Map[String, DenormFieldStatus] = mutable.Map[String, DenormFieldStatus]()) class DenormCache(val config: DenormalizerConfig) { - private[this] val logger = LoggerFactory.getLogger(classOf[DenormCache]) private val datasetPipelineMap: mutable.Map[String, Pipeline] = mutable.Map[String, Pipeline]() def close(): Unit = { @@ -25,25 +24,42 @@ class DenormCache(val config: DenormalizerConfig) { def open(datasets: List[Dataset]): Unit = { datasets.map(dataset => { - if (dataset.denormConfig.isDefined) { - val denormConfig = dataset.denormConfig.get - val redisConnect = new RedisConnect(denormConfig.redisDBHost, denormConfig.redisDBPort, config.redisConnectionTimeout) - val pipeline: Pipeline = redisConnect.getConnection(0).pipelined() - datasetPipelineMap.put(dataset.id, pipeline) - } + open(dataset) }) } - def denormEvent(datasetId: String, event: mutable.Map[String, AnyRef], denormFieldConfigs: List[DenormFieldConfig]): mutable.Map[String, AnyRef] = { - val pipeline = this.datasetPipelineMap(datasetId) - pipeline.clear() + def open(dataset: Dataset): Unit = { + if (!datasetPipelineMap.contains(dataset.id) && dataset.denormConfig.isDefined) { + val denormConfig = dataset.denormConfig.get + val redisConnect = new RedisConnect(denormConfig.redisDBHost, denormConfig.redisDBPort, config.redisConnectionTimeout) + val pipeline: Pipeline = redisConnect.getConnection(0).pipelined() + datasetPipelineMap.put(dataset.id, pipeline) + } + } + + private def processDenorm(denormEvent: DenormEvent, pipeline: Pipeline, denormFieldConfigs: List[DenormFieldConfig]): Unit = { + val responses: mutable.Map[String, Response[String]] = mutable.Map[String, Response[String]]() + val fieldStatus: mutable.Map[String, DenormFieldStatus] = mutable.Map[String, DenormFieldStatus]() + val event = Util.getMutableMap(denormEvent.msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) val eventStr = JSONUtil.serialize(event) denormFieldConfigs.foreach(fieldConfig => { - responses.put(fieldConfig.denormOutField, getFromCache(pipeline, fieldConfig, eventStr)) + val denormFieldStatus = extractField(fieldConfig, eventStr) + fieldStatus.put(fieldConfig.denormOutField, denormFieldStatus) + if (!denormFieldStatus.fieldValue.isBlank) { + responses.put(fieldConfig.denormOutField, getFromCache(pipeline, denormFieldStatus.fieldValue, fieldConfig)) + } }) + denormEvent.fieldStatus = fieldStatus + denormEvent.responses = Some(responses) + } + + def denormEvent(datasetId: String, denormEvent: DenormEvent, denormFieldConfigs: List[DenormFieldConfig]): DenormEvent = { + val pipeline = this.datasetPipelineMap(datasetId) + pipeline.clear() + processDenorm(denormEvent, pipeline, denormFieldConfigs) pipeline.sync() - 
updateEvent(event, responses) + updateEvent(denormEvent) } def denormMultipleEvents(datasetId: String, events: List[DenormEvent], denormFieldConfigs: List[DenormFieldConfig]): List[DenormEvent] = { @@ -51,62 +67,50 @@ class DenormCache(val config: DenormalizerConfig) { pipeline.clear() events.foreach(denormEvent => { - val responses: mutable.Map[String, Response[String]] = mutable.Map[String, Response[String]]() - val event = Util.getMutableMap(denormEvent.msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) - val eventStr = JSONUtil.serialize(event) - try { - denormFieldConfigs.foreach(fieldConfig => { - responses.put(fieldConfig.denormOutField, getFromCache(pipeline, fieldConfig, eventStr)) - }) - denormEvent.responses = Some(responses) - } catch { - case ex: ObsrvException => - logger.error("DenormCache:denormMultipleEvents() - Exception", ex) - denormEvent.error = Some(ex.error) - } + processDenorm(denormEvent, pipeline, denormFieldConfigs) }) pipeline.sync() updateMultipleEvents(events) } - private def getFromCache(pipeline: Pipeline, fieldConfig: DenormFieldConfig, eventStr: String): Response[String] = { - pipeline.select(fieldConfig.redisDB) + private def extractField(fieldConfig: DenormFieldConfig, eventStr: String): DenormFieldStatus = { val denormFieldNode = JSONUtil.getKey(fieldConfig.denormKey, eventStr) if (denormFieldNode.isMissingNode) { - throw new ObsrvException(ErrorConstants.DENORM_KEY_MISSING) - } - if (!denormFieldNode.isTextual) { - throw new ObsrvException(ErrorConstants.DENORM_KEY_NOT_A_STRING) + DenormFieldStatus("", success = false, Some(ErrorConstants.DENORM_KEY_MISSING)) + } else { + if (denormFieldNode.isTextual || denormFieldNode.isNumber) { + DenormFieldStatus(denormFieldNode.asText(), success = false, None) + } else { + DenormFieldStatus("", success = false, Some(ErrorConstants.DENORM_KEY_NOT_A_STRING_OR_NUMBER)) + } } - val denormField = denormFieldNode.asText() + } + + private def getFromCache(pipeline: Pipeline, denormField: String, fieldConfig: DenormFieldConfig): Response[String] = { + pipeline.select(fieldConfig.redisDB) pipeline.get(denormField) } - private def updateEvent(event: mutable.Map[String, AnyRef], responses: mutable.Map[String, Response[String]]): mutable.Map[String, AnyRef] = { + private def updateEvent(denormEvent: DenormEvent): DenormEvent = { - responses.map(f => { + val event = Util.getMutableMap(denormEvent.msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) + denormEvent.responses.get.foreach(f => { if (f._2.get() != null) { + denormEvent.fieldStatus(f._1).success = true event.put(f._1, JSONUtil.deserialize[Map[String, AnyRef]](f._2.get())) + } else { + denormEvent.fieldStatus(f._1).error = Some(ErrorConstants.DENORM_DATA_NOT_FOUND) } }) - event + denormEvent.msg.put(config.CONST_EVENT, event.toMap) + denormEvent } private def updateMultipleEvents(events: List[DenormEvent]): List[DenormEvent] = { events.map(denormEvent => { - if (denormEvent.responses.isDefined) { - val event = Util.getMutableMap(denormEvent.msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) - denormEvent.responses.get.map(f => { - if (f._2.get() != null) { - event.put(f._1, JSONUtil.deserialize[Map[String, AnyRef]](f._2.get())) - } - }) - denormEvent.msg.put(config.CONST_EVENT, event.toMap) - } - denormEvent + updateEvent(denormEvent) }) } - } \ No newline at end of file diff --git a/pipeline/denormalizer/src/test/resources/test.conf b/pipeline/denormalizer/src/test/resources/test.conf index 441d01bf..f7f61beb 100644 --- 
a/pipeline/denormalizer/src/test/resources/test.conf +++ b/pipeline/denormalizer/src/test/resources/test.conf @@ -3,12 +3,12 @@ include "base-test.conf" kafka { input.topic = "flink.unique" output.denorm.topic = "flink.denorm" - output.denorm.failed.topic = "flink.denorm.failed" + output.denorm.failed.topic = "flink.failed" groupId = "flink-denormalizer-group" } task { - window.time.in.seconds = 5 + window.time.in.seconds = 2 window.count = 30 window.shards = 1400 consumer.parallelism = 1 diff --git a/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerStreamTaskTestSpec.scala b/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerStreamTaskTestSpec.scala new file mode 100644 index 00000000..bd9658eb --- /dev/null +++ b/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerStreamTaskTestSpec.scala @@ -0,0 +1,176 @@ +package org.sunbird.obsrv.denormalizer + +import io.github.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig} +import org.apache.flink.configuration.Configuration +import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.test.util.MiniClusterWithClientResource +import org.apache.kafka.common.serialization.StringDeserializer +import org.scalatest.Matchers._ +import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.core.cache.RedisConnect +import org.sunbird.obsrv.core.model.Models.SystemEvent +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnect} +import org.sunbird.obsrv.denormalizer.task.{DenormalizerConfig, DenormalizerStreamTask} +import org.sunbird.obsrv.denormalizer.util.DenormCache +import org.sunbird.obsrv.model.DatasetModels._ +import org.sunbird.obsrv.model.DatasetStatus +import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry + +import scala.collection.mutable +import scala.concurrent.ExecutionContext.Implicits.global +import scala.concurrent.Future +import scala.concurrent.duration._ + +class DenormalizerStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { + + val flinkCluster = new MiniClusterWithClientResource(new MiniClusterResourceConfiguration.Builder() + .setConfiguration(testConfiguration()) + .setNumberSlotsPerTaskManager(1) + .setNumberTaskManagers(1) + .build) + + val denormConfig = new DenormalizerConfig(config) + val redisPort: Int = denormConfig.redisPort + val kafkaConnector = new FlinkKafkaConnector(denormConfig) + val customKafkaConsumerProperties: Map[String, String] = Map[String, String]("auto.offset.reset" -> "earliest", "group.id" -> "test-event-schema-group") + implicit val embeddedKafkaConfig: EmbeddedKafkaConfig = + EmbeddedKafkaConfig( + kafkaPort = 9093, + zooKeeperPort = 2183, + customConsumerProperties = customKafkaConsumerProperties + ) + implicit val deserializer: StringDeserializer = new StringDeserializer() + + def testConfiguration(): Configuration = { + val config = new Configuration() + config.setString("metrics.reporter", "job_metrics_reporter") + config.setString("metrics.reporter.job_metrics_reporter.class", classOf[BaseMetricsReporter].getName) + config + } + + override def beforeAll(): Unit = { + super.beforeAll() + BaseMetricsReporter.gaugeMetrics.clear() + EmbeddedKafka.start()(embeddedKafkaConfig) + val postgresConnect = new PostgresConnect(postgresConfig) + insertTestData(postgresConnect) 
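// Test setup context: the insertTestData helper invoked above (defined further below) points dataset 'd1'
// at a two-field denorm_config - vehicleCode resolved against Redis DB 3 and dealer.dealerCode against
// Redis DB 4 - and pre-loads the matching master records for HYUN-CRE-D6 and D123.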
+ postgresConnect.closeConnection() + createTestTopics() + publishMessagesToKafka() + flinkCluster.before() + } + + private def publishMessagesToKafka(): Unit = { + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.SUCCESS_DENORM) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.SKIP_DENORM) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.DENORM_MISSING_KEYS) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.DENORM_MISSING_DATA_AND_INVALIDKEY) + } + + private def insertTestData(postgresConnect: PostgresConnect): Unit = { + postgresConnect.execute("update datasets set denorm_config = '" + s"""{"redis_db_host":"localhost","redis_db_port":$redisPort,"denorm_fields":[{"denorm_key":"vehicleCode","redis_db":3,"denorm_out_field":"vehicle_data"},{"denorm_key":"dealer.dealerCode","redis_db":4,"denorm_out_field":"dealer_data"}]}""" + "' where id='d1';") + val redisConnection = new RedisConnect(denormConfig.redisHost, denormConfig.redisPort, denormConfig.redisConnectionTimeout) + redisConnection.getConnection(3).set("HYUN-CRE-D6", EventFixture.DENORM_DATA_1) + redisConnection.getConnection(4).set("D123", EventFixture.DENORM_DATA_2) + } + + override def afterAll(): Unit = { + val redisConnection = new RedisConnect(denormConfig.redisHost, denormConfig.redisPort, denormConfig.redisConnectionTimeout) + redisConnection.getConnection(3).flushAll() + redisConnection.getConnection(4).flushAll() + + super.afterAll() + flinkCluster.after() + EmbeddedKafka.stop() + } + + def createTestTopics(): Unit = { + List( + config.getString("kafka.output.system.event.topic"), config.getString("kafka.output.denorm.topic"), config.getString("kafka.input.topic") + ).foreach(EmbeddedKafka.createCustomTopic(_)) + } + + "DenormalizerStreamTaskTestSpec" should "validate the denorm stream task" in { + + implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(denormConfig) + val task = new DenormalizerStreamTask(denormConfig, kafkaConnector) + task.process(env) + Future { + env.execute(denormConfig.jobName) + } + + val outputs = EmbeddedKafka.consumeNumberMessagesFrom[String](denormConfig.denormOutputTopic, 4, timeout = 30.seconds) + validateOutputs(outputs) + + val systemEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](denormConfig.kafkaSystemTopic, 3, timeout = 30.seconds) + validateSystemEvents(systemEvents) + + val mutableMetricsMap = mutable.Map[String, Long]() + BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### DenormalizerStreamTaskTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) + validateMetrics(mutableMetricsMap) + } + + it should "validate dynamic cache creation within DenormCache" in { + val denormCache = new DenormCache(denormConfig) + noException should be thrownBy { + denormCache.open(Dataset(id = "d123", datasetType = "dataset", extractionConfig = None, dedupConfig = None, validationConfig = None, jsonSchema = None, + denormConfig = Some(DenormConfig(redisDBHost = "localhost", redisDBPort = redisPort, denormFields = List(DenormFieldConfig(denormKey = "vehicleCode", redisDB = 3, denormOutField = "vehicle_data")))), routerConfig = RouterConfig(""), + datasetConfig = DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest"), status = DatasetStatus.Live)) + } + } + + private def 
validateOutputs(outputs: List[String]): Unit = { + outputs.size should be(4) + outputs.zipWithIndex.foreach { + case (elem, idx) => + val msg = JSONUtil.deserialize[Map[String, AnyRef]](elem) + val event = JSONUtil.serialize(msg(Constants.EVENT)) + idx match { + case 0 => event should be("""{"vehicle_data":{"model":"Creta","price":"2200000","variant":"SX(O)","fuel":"Diesel","code":"HYUN-CRE-D6","currencyCode":"INR","currency":"Indian Rupee","manufacturer":"Hyundai","modelYear":"2023","transmission":"automatic"},"dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","dealer_data":{"code":"D123","name":"KUN United","licenseNumber":"1234124","authorized":"yes"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + case 1 => event should be("""{"dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + case 2 => event should be("""{"dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"id":"2345","date":"2023-03-01","dealer_data":{"code":"D123","name":"KUN United","licenseNumber":"1234124","authorized":"yes"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + case 3 => event should be("""{"dealer":{"dealerCode":"D124","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":["HYUN-CRE-D7"],"id":"4567","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + } + } + } + + private def validateSystemEvents(systemEvents: List[String]): Unit = { + systemEvents.size should be(3) + systemEvents.foreach(f => { + val event = JSONUtil.deserialize[SystemEvent](f) + event.etype should be(EventID.METRIC) + event.ctx.module should be(ModuleID.processing) + event.ctx.pdata.id should be(denormConfig.jobName) + event.ctx.pdata.`type` should be(PDataType.flink) + event.ctx.pdata.pid.get should be(Producer.denorm) + event.data.error.isDefined should be(true) + val errorLog = event.data.error.get + errorLog.error_level should be(ErrorLevel.critical) + errorLog.pdata_id should be(Producer.denorm) + errorLog.pdata_status should be(StatusCode.failed) + errorLog.error_count.get should be(1) + errorLog.error_code match { + case ErrorConstants.DENORM_KEY_MISSING.errorCode => + errorLog.error_type should be(FunctionalError.DenormKeyMissing) + case ErrorConstants.DENORM_KEY_NOT_A_STRING_OR_NUMBER.errorCode => + errorLog.error_type should be(FunctionalError.DenormKeyInvalid) + case ErrorConstants.DENORM_DATA_NOT_FOUND.errorCode => + errorLog.error_type should be(FunctionalError.DenormDataNotFound) + } + }) + } + + private def validateMetrics(mutableMetricsMap: mutable.Map[String, Long]): Unit = { + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormTotal}") should be(3) + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormFailed}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormSuccess}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormPartialSuccess}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d2.${denormConfig.denormTotal}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d2.${denormConfig.eventsSkipped}") should be(1) + } + +} diff 
--git a/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerWindowStreamTaskTestSpec.scala b/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerWindowStreamTaskTestSpec.scala new file mode 100644 index 00000000..52d06e8b --- /dev/null +++ b/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/DenormalizerWindowStreamTaskTestSpec.scala @@ -0,0 +1,213 @@ +package org.sunbird.obsrv.denormalizer + +import io.github.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig} +import org.apache.flink.configuration.Configuration +import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.test.util.MiniClusterWithClientResource +import org.apache.kafka.common.serialization.StringDeserializer +import org.scalatest.Matchers._ +import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.core.cache.RedisConnect +import org.sunbird.obsrv.core.model.Models.SystemEvent +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnect} +import org.sunbird.obsrv.denormalizer.task.{DenormalizerConfig, DenormalizerWindowStreamTask} +import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry + +import scala.collection.mutable +import scala.concurrent.ExecutionContext.Implicits.global +import scala.concurrent.Future +import scala.concurrent.duration._ + +class DenormalizerWindowStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { + + val flinkCluster = new MiniClusterWithClientResource(new MiniClusterResourceConfiguration.Builder() + .setConfiguration(testConfiguration()) + .setNumberSlotsPerTaskManager(1) + .setNumberTaskManagers(1) + .build) + + val denormConfig = new DenormalizerConfig(config) + val redisPort: Int = denormConfig.redisPort + val kafkaConnector = new FlinkKafkaConnector(denormConfig) + val customKafkaConsumerProperties: Map[String, String] = Map[String, String]("auto.offset.reset" -> "earliest", "group.id" -> "test-event-schema-group") + implicit val embeddedKafkaConfig: EmbeddedKafkaConfig = + EmbeddedKafkaConfig( + kafkaPort = 9093, + zooKeeperPort = 2183, + customConsumerProperties = customKafkaConsumerProperties + ) + implicit val deserializer: StringDeserializer = new StringDeserializer() + + def testConfiguration(): Configuration = { + val config = new Configuration() + config.setString("metrics.reporter", "job_metrics_reporter") + config.setString("metrics.reporter.job_metrics_reporter.class", classOf[BaseMetricsReporter].getName) + config + } + + override def beforeAll(): Unit = { + super.beforeAll() + BaseMetricsReporter.gaugeMetrics.clear() + EmbeddedKafka.start()(embeddedKafkaConfig) + val postgresConnect = new PostgresConnect(postgresConfig) + insertTestData(postgresConnect) + postgresConnect.closeConnection() + createTestTopics() + publishMessagesToKafka() + flinkCluster.before() + } + + private def publishMessagesToKafka(): Unit = { + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.SUCCESS_DENORM) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.SKIP_DENORM) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.DENORM_MISSING_KEYS) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), 
EventFixture.DENORM_MISSING_DATA_AND_INVALIDKEY) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.INVALID_DATASET_ID) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.MISSING_EVENT_KEY) + } + + private def insertTestData(postgresConnect: PostgresConnect): Unit = { + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, extraction_config, dedup_config, router_config, dataset_config, status, data_version, created_by, updated_by, created_date, updated_date) values ('d3', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"validate\": true, \"mode\": \"Strict\"}', '{\"is_batch_event\": true, \"extraction_key\": \"events\", \"dedup_config\": {\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}}', '{\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}', '{\"topic\":\"d1-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":"+config.getInt("redis.port")+",\"redis_db\":2}', 'Live', 2, 'System', 'System', now(), now());") + postgresConnect.execute("update datasets set denorm_config = '" + s"""{"redis_db_host":"localhost","redis_db_port":$redisPort,"denorm_fields":[{"denorm_key":"vehicleCode","redis_db":3,"denorm_out_field":"vehicle_data"},{"denorm_key":"dealer.dealerCode","redis_db":4,"denorm_out_field":"dealer_data"}]}""" + "' where id='d1';") + val redisConnection = new RedisConnect(denormConfig.redisHost, denormConfig.redisPort, denormConfig.redisConnectionTimeout) + redisConnection.getConnection(3).set("HYUN-CRE-D6", EventFixture.DENORM_DATA_1) + redisConnection.getConnection(4).set("D123", EventFixture.DENORM_DATA_2) + } + + override def afterAll(): Unit = { + val redisConnection = new RedisConnect(denormConfig.redisHost, denormConfig.redisPort, denormConfig.redisConnectionTimeout) + redisConnection.getConnection(3).flushAll() + redisConnection.getConnection(4).flushAll() + + super.afterAll() + flinkCluster.after() + EmbeddedKafka.stop() + } + + def createTestTopics(): Unit = { + List( + config.getString("kafka.output.system.event.topic"), config.getString("kafka.output.denorm.topic"), config.getString("kafka.input.topic") + ).foreach(EmbeddedKafka.createCustomTopic(_)) + } + + "DenormalizerWindowStreamTaskTestSpec" should "validate the denorm window stream task" in { + + implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(denormConfig) + val task = new DenormalizerWindowStreamTask(denormConfig, kafkaConnector) + task.process(env) + Future { + env.execute(denormConfig.jobName) + } + + val outputs = EmbeddedKafka.consumeNumberMessagesFrom[String](denormConfig.denormOutputTopic, 4, timeout = 
30.seconds) + validateOutputs(outputs) + + val systemEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](denormConfig.kafkaSystemTopic, 5, timeout = 30.seconds) + validateSystemEvents(systemEvents) + + val mutableMetricsMap = mutable.Map[String, Long]() + BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### DenormalizerStreamWindowTaskTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) + validateMetrics(mutableMetricsMap) + } + + private def validateOutputs(outputs: List[String]): Unit = { + outputs.size should be(4) + outputs.zipWithIndex.foreach { + case (elem, idx) => + //TODO: Add validations for obsrv_meta + val msg = JSONUtil.deserialize[Map[String, AnyRef]](elem) + val event = JSONUtil.serialize(msg(Constants.EVENT)) + idx match { + case 0 => event should be("""{"vehicle_data":{"model":"Creta","price":"2200000","variant":"SX(O)","fuel":"Diesel","code":"HYUN-CRE-D6","currencyCode":"INR","currency":"Indian Rupee","manufacturer":"Hyundai","modelYear":"2023","transmission":"automatic"},"dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","dealer_data":{"code":"D123","name":"KUN United","licenseNumber":"1234124","authorized":"yes"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + case 1 => event should be("""{"dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"id":"2345","date":"2023-03-01","dealer_data":{"code":"D123","name":"KUN United","licenseNumber":"1234124","authorized":"yes"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + case 2 => event should be("""{"dealer":{"dealerCode":"D124","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":["HYUN-CRE-D7"],"id":"4567","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + case 3 => event should be("""{"dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""") + } + + } + } + + private def validateSystemEvents(systemEvents: List[String]): Unit = { + systemEvents.size should be(5) + systemEvents.count(f => { + val event = JSONUtil.deserialize[SystemEvent](f) + Producer.validator.equals(event.ctx.pdata.pid.get) + }) should be (2) + systemEvents.count(f => { + val event = JSONUtil.deserialize[SystemEvent](f) + FunctionalError.MissingEventData.equals(event.data.error.get.error_type) + }) should be(1) + systemEvents.count(f => { + val event = JSONUtil.deserialize[SystemEvent](f) + Producer.denorm.equals(event.ctx.pdata.pid.get) + }) should be(3) + + systemEvents.foreach(se => { + val event = JSONUtil.deserialize[SystemEvent](se) + val error = event.data.error + if (event.ctx.dataset.getOrElse("ALL").equals("ALL")) + event.ctx.dataset_type should be(None) + else if (error.isDefined) { + val errorCode = error.get.error_code + if (errorCode.equals(ErrorConstants.MISSING_DATASET_ID.errorCode) || + errorCode.equals(ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode) || + errorCode.equals(ErrorConstants.EVENT_MISSING.errorCode)) { + event.ctx.dataset_type should be(None) + } + } + else + event.ctx.dataset_type should be(Some("dataset")) + }) + + 
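// The loop below asserts that every system event is a METRIC emitted by this Flink job and that
// each error_code maps back to the producer and FunctionalError that raised it: denorm failures
// (key missing, key not a string/number, data not found) from Producer.denorm, and the missing
// dataset-configuration / missing event failures from Producer.validator.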
systemEvents.foreach(f => { + val event = JSONUtil.deserialize[SystemEvent](f) + event.etype should be(EventID.METRIC) + event.ctx.module should be(ModuleID.processing) + event.ctx.pdata.id should be(denormConfig.jobName) + event.ctx.pdata.`type` should be(PDataType.flink) + event.data.error.isDefined should be(true) + val errorLog = event.data.error.get + errorLog.error_level should be(ErrorLevel.critical) + errorLog.pdata_status should be(StatusCode.failed) + errorLog.error_count.get should be(1) + errorLog.error_code match { + case ErrorConstants.DENORM_KEY_MISSING.errorCode => + event.ctx.pdata.pid.get should be(Producer.denorm) + errorLog.pdata_id should be(Producer.denorm) + errorLog.error_type should be(FunctionalError.DenormKeyMissing) + case ErrorConstants.DENORM_KEY_NOT_A_STRING_OR_NUMBER.errorCode => + event.ctx.pdata.pid.get should be(Producer.denorm) + errorLog.pdata_id should be(Producer.denorm) + errorLog.error_type should be(FunctionalError.DenormKeyInvalid) + case ErrorConstants.DENORM_DATA_NOT_FOUND.errorCode => + event.ctx.pdata.pid.get should be(Producer.denorm) + errorLog.pdata_id should be(Producer.denorm) + errorLog.error_type should be(FunctionalError.DenormDataNotFound) + case ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode => + event.ctx.pdata.pid.get should be(Producer.validator) + errorLog.pdata_id should be(Producer.validator) + errorLog.error_type should be(FunctionalError.MissingDatasetId) + case ErrorConstants.EVENT_MISSING.errorCode => + event.ctx.pdata.pid.get should be(Producer.validator) + errorLog.pdata_id should be(Producer.validator) + errorLog.error_type should be(FunctionalError.MissingEventData) + } + }) + } + + private def validateMetrics(mutableMetricsMap: mutable.Map[String, Long]): Unit = { + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormTotal}") should be(3) + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormFailed}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormSuccess}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d1.${denormConfig.denormPartialSuccess}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d2.${denormConfig.denormTotal}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d2.${denormConfig.eventsSkipped}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.d3.${denormConfig.eventFailedMetricsCount}") should be(1) + mutableMetricsMap(s"${denormConfig.jobName}.dxyz.${denormConfig.eventFailedMetricsCount}") should be(1) + } + +} diff --git a/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/EventFixture.scala b/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/EventFixture.scala new file mode 100644 index 00000000..0b1a0b01 --- /dev/null +++ b/pipeline/denormalizer/src/test/scala/org/sunbird/obsrv/denormalizer/EventFixture.scala @@ -0,0 +1,15 @@ +package org.sunbird.obsrv.denormalizer + +object EventFixture { + + val SUCCESS_DENORM = """{"dataset":"d1","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val SKIP_DENORM = 
"""{"dataset":"d2","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + + val DENORM_MISSING_KEYS = """{"dataset":"d1","event":{"id":"2345","date":"2023-03-01","dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val DENORM_MISSING_DATA_AND_INVALIDKEY = """{"dataset":"d1","event":{"id":"4567","vehicleCode":["HYUN-CRE-D7"],"date":"2023-03-01","dealer":{"dealerCode":"D124","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val INVALID_DATASET_ID = """{"dataset":"dxyz","event":{"id":"4567","vehicleCode":["HYUN-CRE-D7"],"date":"2023-03-01","dealer":{"dealerCode":"D124","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val MISSING_EVENT_KEY = """{"dataset":"d3","event1":{"id":"4567","vehicleCode":["HYUN-CRE-D7"],"date":"2023-03-01","dealer":{"dealerCode":"D124","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + + val DENORM_DATA_1 = """{"code":"HYUN-CRE-D6","manufacturer":"Hyundai","model":"Creta","variant":"SX(O)","modelYear":"2023","price":"2200000","currencyCode":"INR","currency":"Indian Rupee","transmission":"automatic","fuel":"Diesel"}""" + val DENORM_DATA_2 = """{"code":"D123","name":"KUN United","licenseNumber":"1234124","authorized":"yes"}""" +} \ No newline at end of file diff --git a/pipeline/druid-router/pom.xml b/pipeline/druid-router/pom.xml index 4945f84d..41e2e390 100644 --- a/pipeline/druid-router/pom.xml +++ b/pipeline/druid-router/pom.xml @@ -4,9 +4,6 @@ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 4.0.0 - - 3.0.1 - org.sunbird.obsrv @@ -45,13 +42,6 @@ dataset-registry 1.0.0 - - org.sunbird.obsrv - framework - 1.0.0 - test-jar - test - com.github.java-json-tools json-schema-validator @@ -76,6 +66,32 @@ guava 32.1.2-jre + + org.sunbird.obsrv + framework + 1.0.0 + test-jar + test + + + org.sunbird.obsrv + dataset-registry + 1.0.0 + test-jar + test + + + org.apache.kafka + kafka-clients + ${kafka.version} + test + + + org.apache.kafka + kafka_${scala.maj.version} + ${kafka.version} + test + org.apache.flink flink-test-utils @@ -90,9 +106,21 @@ tests - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 + test + + + io.github.embeddedkafka + embedded-kafka_2.12 + 3.4.0 + test + + + io.zonky.test + embedded-postgres + 2.0.3 test @@ -166,7 +194,7 @@ - org.sunbird.obsrv.router.task.DruidRouterStreamTask + org.sunbird.obsrv.router.task.DynamicRouterStreamTask diff --git a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DruidRouterFunction.scala b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DruidRouterFunction.scala index 0cd9e4e1..d1f2c5e6 100644 --- a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DruidRouterFunction.scala +++ b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DruidRouterFunction.scala @@ -13,6 +13,8 @@ import 
org.sunbird.obsrv.router.task.DruidRouterConfig import scala.collection.mutable +// $COVERAGE-OFF$ Disabling scoverage as the below function is deprecated +@Deprecated class DruidRouterFunction(config: DruidRouterConfig) extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) { private[this] val logger = LoggerFactory.getLogger(classOf[DruidRouterFunction]) @@ -33,18 +35,24 @@ class DruidRouterFunction(config: DruidRouterConfig) extends BaseProcessFunction override def processElement(msg: mutable.Map[String, AnyRef], ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { - - implicit val mapTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) - val datasetId = msg(config.CONST_DATASET).asInstanceOf[String] // DatasetId cannot be empty at this stage - metrics.incCounter(datasetId, config.routerTotalCount) - val dataset = DatasetRegistry.getDataset(datasetId).get - val event = Util.getMutableMap(msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) - event.put(config.CONST_OBSRV_META, msg(config.CONST_OBSRV_META)) - val routerConfig = dataset.routerConfig - ctx.output(OutputTag[mutable.Map[String, AnyRef]](routerConfig.topic), event) - metrics.incCounter(datasetId, config.routerSuccessCount) - - msg.remove(config.CONST_EVENT) - ctx.output(config.statsOutputTag, markComplete(msg, dataset.dataVersion)) + try { + implicit val mapTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) + val datasetId = msg(config.CONST_DATASET).asInstanceOf[String] // DatasetId cannot be empty at this stage + metrics.incCounter(datasetId, config.routerTotalCount) + val dataset = DatasetRegistry.getDataset(datasetId).get + val event = Util.getMutableMap(msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) + event.put(config.CONST_OBSRV_META, msg(config.CONST_OBSRV_META)) + val routerConfig = dataset.routerConfig + ctx.output(OutputTag[mutable.Map[String, AnyRef]](routerConfig.topic), event) + metrics.incCounter(datasetId, config.routerSuccessCount) + + msg.remove(config.CONST_EVENT) + ctx.output(config.statsOutputTag, markComplete(msg, dataset.dataVersion)) + } catch { + case ex: Exception => + logger.error("DruidRouterFunction:processElement() - Exception: ", ex.getMessage) + ex.printStackTrace() + } } } +// $COVERAGE-ON$ \ No newline at end of file diff --git a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DynamicRouterFunction.scala b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DynamicRouterFunction.scala new file mode 100644 index 00000000..ed50c8eb --- /dev/null +++ b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/functions/DynamicRouterFunction.scala @@ -0,0 +1,116 @@ +package org.sunbird.obsrv.router.functions + +import com.fasterxml.jackson.databind.JsonNode +import com.fasterxml.jackson.databind.node.JsonNodeType +import org.apache.flink.configuration.Configuration +import org.apache.flink.streaming.api.functions.ProcessFunction +import org.joda.time.format.DateTimeFormat +import org.joda.time.{DateTime, DateTimeZone} +import org.slf4j.LoggerFactory +import org.sunbird.obsrv.core.model.{Constants, ErrorConstants, FunctionalError, Producer} +import org.sunbird.obsrv.core.streaming.Metrics +import org.sunbird.obsrv.core.util.{JSONUtil, Util} +import org.sunbird.obsrv.model.DatasetModels.{Dataset, 
DatasetConfig} +import org.sunbird.obsrv.router.task.DruidRouterConfig +import org.sunbird.obsrv.streaming.BaseDatasetProcessFunction + +import java.util.TimeZone +import scala.collection.mutable + +case class TimestampKey(isValid: Boolean, value: AnyRef) + +class DynamicRouterFunction(config: DruidRouterConfig) extends BaseDatasetProcessFunction(config) { + + private[this] val logger = LoggerFactory.getLogger(classOf[DynamicRouterFunction]) + + override def open(parameters: Configuration): Unit = { + super.open(parameters) + } + + override def close(): Unit = { + super.close() + } + + override def getMetrics(): List[String] = { + List(config.routerTotalCount, config.routerSuccessCount) + } + + override def processElement(dataset: Dataset, msg: mutable.Map[String, AnyRef], + ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, + metrics: Metrics): Unit = { + + metrics.incCounter(dataset.id, config.routerTotalCount) + val event = Util.getMutableMap(msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) + event.put(config.CONST_OBSRV_META, msg(config.CONST_OBSRV_META).asInstanceOf[Map[String, AnyRef]]) + val tsKeyData = TimestampKeyParser.parseTimestampKey(dataset.datasetConfig, event) + event.put("indexTS", tsKeyData.value) + if (tsKeyData.isValid) { + val routerConfig = dataset.routerConfig + val topicEventMap = mutable.Map(Constants.TOPIC -> routerConfig.topic, Constants.MESSAGE -> event) + ctx.output(config.routerOutputTag, topicEventMap) + metrics.incCounter(dataset.id, config.routerSuccessCount) + markCompletion(dataset, super.markComplete(event, dataset.dataVersion), ctx, Producer.router) + } else { + markFailure(Some(dataset.id), msg, ctx, metrics, ErrorConstants.INDEX_KEY_MISSING_OR_BLANK, Producer.router, FunctionalError.MissingTimestampKey, datasetType = Some(dataset.datasetType)) + } + } + +} + +object TimestampKeyParser { + + def parseTimestampKey(datasetConfig: DatasetConfig, event: mutable.Map[String, AnyRef]): TimestampKey = { + val indexKey = datasetConfig.tsKey + val node = JSONUtil.getKey(indexKey, JSONUtil.serialize(event)) + node.getNodeType match { + case JsonNodeType.NUMBER => onNumber(datasetConfig, node) + case JsonNodeType.STRING => onText(datasetConfig, node) + case _ => TimestampKey(isValid = false, null) + } + } + + private def onNumber(datasetConfig: DatasetConfig, node: JsonNode): TimestampKey = { + val length = node.asText().length + val value = node.numberValue().longValue() + // TODO: [P3] Crude implementation. Checking if the epoch timestamp format is one of seconds, milli-seconds, micro-second and nano-seconds. 
Find a elegant approach + if (length == 10 || length == 13 || length == 16 || length == 19) { + val tfValue:Long = if (length == 10) (value * 1000).longValue() else if (length == 16) (value / 1000).longValue() else if (length == 19) (value / 1000000).longValue() else value + TimestampKey(isValid = true, addTimeZone(datasetConfig, new DateTime(tfValue)).asInstanceOf[AnyRef]) + } else { + TimestampKey(isValid = false, 0.asInstanceOf[AnyRef]) + } + } + + private def onText(datasetConfig: DatasetConfig, node: JsonNode): TimestampKey = { + val value = node.textValue() + if (datasetConfig.tsFormat.isDefined) { + parseDateTime(datasetConfig, value) + } else { + TimestampKey(isValid = true, value) + } + } + + private def parseDateTime(datasetConfig: DatasetConfig, value: String): TimestampKey = { + try { + datasetConfig.tsFormat.get match { + case "epoch" => TimestampKey(isValid = true, addTimeZone(datasetConfig, new DateTime(value.toLong)).asInstanceOf[AnyRef]) + case _ => + val dtf = DateTimeFormat.forPattern(datasetConfig.tsFormat.get) + TimestampKey(isValid = true, addTimeZone(datasetConfig, dtf.parseDateTime(value)).asInstanceOf[AnyRef]) + } + } catch { + case _: Exception => TimestampKey(isValid = false, null) + } + } + + private def addTimeZone(datasetConfig: DatasetConfig, dateTime: DateTime): Long = { + if (datasetConfig.datasetTimezone.isDefined) { + val tz = DateTimeZone.forTimeZone(TimeZone.getTimeZone(datasetConfig.datasetTimezone.get)) + val offsetInMilliseconds = tz.getOffset(dateTime) + dateTime.plusMillis(offsetInMilliseconds).getMillis + } else { + dateTime.getMillis + } + } + +} diff --git a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterConfig.scala b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterConfig.scala index b67267a4..31106b00 100644 --- a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterConfig.scala +++ b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterConfig.scala @@ -22,6 +22,7 @@ class DruidRouterConfig(override val config: Config) extends BaseJobConfig[mutab val routerSuccessCount = "router-success-count" val statsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("processing_stats") + val routerOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("druid-routing-output") // Functions val druidRouterFunction = "DruidRouterFunction" @@ -41,4 +42,6 @@ class DruidRouterConfig(override val config: Config) extends BaseJobConfig[mutab override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = { statsOutputTag } + + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") } diff --git a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterStreamTask.scala b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterStreamTask.scala index bff7b644..b77e110a 100644 --- a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterStreamTask.scala +++ b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DruidRouterStreamTask.scala @@ -1,7 +1,6 @@ package org.sunbird.obsrv.router.task import com.typesafe.config.ConfigFactory -import org.apache.flink.api.common.eventtime.WatermarkStrategy import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.java.typeutils.TypeExtractor import 
org.apache.flink.api.java.utils.ParameterTool @@ -19,19 +18,18 @@ import scala.collection.mutable /** * Druid Router stream task routes every event into its respective topic configured at dataset level */ - +// $COVERAGE-OFF$ Disabling scoverage as this stream task is deprecated +@Deprecated class DruidRouterStreamTask(config: DruidRouterConfig, kafkaConnector: FlinkKafkaConnector) extends BaseStreamTask[mutable.Map[String, AnyRef]] { private val serialVersionUID = 146697324640926024L - // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink cluster def process(): Unit = { implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) val dataStream = getMapDataStream(env, config, kafkaConnector) processStream(dataStream) env.execute(config.jobName) } - // $COVERAGE-ON$ override def processStream(dataStream: DataStream[mutable.Map[String, AnyRef]]): DataStream[mutable.Map[String, AnyRef]] = { @@ -42,20 +40,22 @@ class DruidRouterStreamTask(config: DruidRouterConfig, kafkaConnector: FlinkKafk .setParallelism(config.downstreamOperatorsParallelism) datasets.map(dataset => { routerStream.getSideOutput(OutputTag[mutable.Map[String, AnyRef]](dataset.routerConfig.topic)) - .sinkTo(kafkaConnector.kafkaMapSink(dataset.routerConfig.topic)) + .sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](dataset.routerConfig.topic)) .name(dataset.id + "-" + config.druidRouterProducer).uid(dataset.id + "-" + config.druidRouterProducer) .setParallelism(config.downstreamOperatorsParallelism) }) - routerStream.getSideOutput(config.statsOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaStatsTopic)) + routerStream.getSideOutput(config.statsOutputTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaStatsTopic)) .name(config.processingStatsProducer).uid(config.processingStatsProducer).setParallelism(config.downstreamOperatorsParallelism) + addDefaultSinks(routerStream, config, kafkaConnector) routerStream.getSideOutput(config.successTag()) } } - +// $COVERAGE-ON$ // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink cluster +@Deprecated object DruidRouterStreamTask { def main(args: Array[String]): Unit = { diff --git a/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DynamicRouterStreamTask.scala b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DynamicRouterStreamTask.scala new file mode 100644 index 00000000..9e17a974 --- /dev/null +++ b/pipeline/druid-router/src/main/scala/org/sunbird/obsrv/router/task/DynamicRouterStreamTask.scala @@ -0,0 +1,66 @@ +package org.sunbird.obsrv.router.task + +import com.typesafe.config.ConfigFactory +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.java.typeutils.TypeExtractor +import org.apache.flink.api.java.utils.ParameterTool +import org.apache.flink.streaming.api.datastream.DataStream +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.sunbird.obsrv.core.streaming.{BaseStreamTask, FlinkKafkaConnector} +import org.sunbird.obsrv.core.util.FlinkUtil +import org.sunbird.obsrv.router.functions.DynamicRouterFunction + +import java.io.File +import scala.collection.mutable + +/** + * Druid Router stream task routes every event into its respective topic configured at dataset level + */ + +class DynamicRouterStreamTask(config: DruidRouterConfig, kafkaConnector: FlinkKafkaConnector) extends BaseStreamTask[mutable.Map[String, AnyRef]] { + + private 
val serialVersionUID = 146697324640926024L + + // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink cluster + def process(): Unit = { + implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) + process(env) + env.execute(config.jobName) + } + // $COVERAGE-ON$ + + def process(env: StreamExecutionEnvironment): Unit = { + val dataStream = getMapDataStream(env, config, kafkaConnector) + processStream(dataStream) + } + + override def processStream(dataStream: DataStream[mutable.Map[String, AnyRef]]): DataStream[mutable.Map[String, AnyRef]] = { + + implicit val mapTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) + + val routerStream = dataStream.process(new DynamicRouterFunction(config)).name(config.druidRouterFunction).uid(config.druidRouterFunction) + .setParallelism(config.downstreamOperatorsParallelism) + + routerStream.getSideOutput(config.routerOutputTag).sinkTo(kafkaConnector.kafkaMapDynamicSink()) + .name(config.druidRouterProducer).uid(config.druidRouterProducer).setParallelism(config.downstreamOperatorsParallelism) + + addDefaultSinks(routerStream, config, kafkaConnector) + routerStream.getSideOutput(config.successTag()) + } +} + +// $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink cluster +object DynamicRouterStreamTask { + + def main(args: Array[String]): Unit = { + val configFilePath = Option(ParameterTool.fromArgs(args).get("config.file.path")) + val config = configFilePath.map { + path => ConfigFactory.parseFile(new File(path)).resolve() + }.getOrElse(ConfigFactory.load("druid-router.conf").withFallback(ConfigFactory.systemEnvironment())) + val druidRouterConfig = new DruidRouterConfig(config) + val kafkaUtil = new FlinkKafkaConnector(druidRouterConfig) + val task = new DynamicRouterStreamTask(druidRouterConfig, kafkaUtil) + task.process() + } +} +// $COVERAGE-ON$ \ No newline at end of file diff --git a/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/DynamicRouterStreamTaskTestSpec.scala b/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/DynamicRouterStreamTaskTestSpec.scala new file mode 100644 index 00000000..0c45a555 --- /dev/null +++ b/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/DynamicRouterStreamTaskTestSpec.scala @@ -0,0 +1,171 @@ +package org.sunbird.obsrv.router + +import io.github.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig} +import org.apache.flink.configuration.Configuration +import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.test.util.MiniClusterWithClientResource +import org.apache.kafka.common.serialization.StringDeserializer +import org.scalatest.Matchers._ +import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.core.model.Models.SystemEvent +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnect} +import org.sunbird.obsrv.router.task.{DruidRouterConfig, DynamicRouterStreamTask} +import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry + +import scala.collection.mutable +import scala.concurrent.ExecutionContext.Implicits.global +import scala.concurrent.Future +import scala.concurrent.duration._ + +class DynamicRouterStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { + + val 
flinkCluster = new MiniClusterWithClientResource(new MiniClusterResourceConfiguration.Builder() + .setConfiguration(testConfiguration()) + .setNumberSlotsPerTaskManager(1) + .setNumberTaskManagers(1) + .build) + + val routerConfig = new DruidRouterConfig(config) + val kafkaConnector = new FlinkKafkaConnector(routerConfig) + val customKafkaConsumerProperties: Map[String, String] = Map[String, String]("auto.offset.reset" -> "earliest", "group.id" -> "test-event-schema-group") + implicit val embeddedKafkaConfig: EmbeddedKafkaConfig = + EmbeddedKafkaConfig( + kafkaPort = 9093, + zooKeeperPort = 2183, + customConsumerProperties = customKafkaConsumerProperties + ) + implicit val deserializer: StringDeserializer = new StringDeserializer() + + def testConfiguration(): Configuration = { + val config = new Configuration() + config.setString("metrics.reporter", "job_metrics_reporter") + config.setString("metrics.reporter.job_metrics_reporter.class", classOf[BaseMetricsReporter].getName) + config + } + + override def beforeAll(): Unit = { + super.beforeAll() + BaseMetricsReporter.gaugeMetrics.clear() + EmbeddedKafka.start()(embeddedKafkaConfig) + val postgresConnect = new PostgresConnect(postgresConfig) + insertTestData(postgresConnect) + postgresConnect.closeConnection() + createTestTopics() + publishMessagesToKafka() + flinkCluster.before() + } + + private def publishMessagesToKafka(): Unit = { + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.SUCCESS_EVENT) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.FAILED_EVENT) + } + + private def insertTestData(postgresConnect: PostgresConnect): Unit = { + postgresConnect.execute("update datasets set dataset_config = '" + """{"data_key":"id","timestamp_key":"date1","entry_topic":"ingest"}""" + "' where id='d2';") + + } + + override def afterAll(): Unit = { + + super.afterAll() + flinkCluster.after() + EmbeddedKafka.stop() + } + + def createTestTopics(): Unit = { + List( + routerConfig.kafkaSystemTopic, routerConfig.kafkaInputTopic, "d1-events", routerConfig.kafkaFailedTopic + ).foreach(EmbeddedKafka.createCustomTopic(_)) + } + + "DynamicRouterStreamTaskTestSpec" should "validate the router stream task" in { + + implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(routerConfig) + val task = new DynamicRouterStreamTask(routerConfig, kafkaConnector) + task.process(env) + Future { + env.execute(routerConfig.jobName) + } + + val outputs = EmbeddedKafka.consumeNumberMessagesFrom[String]("d1-events", 1, timeout = 30.seconds) + validateOutputs(outputs) + + val failedEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](routerConfig.kafkaFailedTopic, 1, timeout = 30.seconds) + validateFailedEvents(failedEvents) + + val systemEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](routerConfig.kafkaSystemTopic, 2, timeout = 30.seconds) + validateSystemEvents(systemEvents) + + val mutableMetricsMap = mutable.Map[String, Long]() + BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### DynamicRouterStreamTaskTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) + validateMetrics(mutableMetricsMap) + } + + private def validateOutputs(outputs: List[String]): Unit = { + outputs.size should be(1) + Console.println("Output", outputs.head) + } + + private def validateFailedEvents(failedEvents: List[String]): Unit = { + failedEvents.size should be(1) 
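// The single failed event is EventFixture.FAILED_EVENT (dataset d2): insertTestData points d2's
// timestamp_key at "date1", which is absent from the payload, so TimestampKeyParser returns
// TimestampKey(isValid = false, null) and DynamicRouterFunction diverts the event to the
// failed-events sink with INDEX_KEY_MISSING_OR_BLANK instead of routing it to its dataset topic,
// e.g. TimestampKeyParser.parseTimestampKey(d2Config, event).isValid == false (d2Config here is
// just an illustrative name for d2's DatasetConfig).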
+ Console.println("Output", failedEvents.head) + } + + private def validateSystemEvents(systemEvents: List[String]): Unit = { + systemEvents.size should be(2) + + systemEvents.foreach(se => { + val event = JSONUtil.deserialize[SystemEvent](se) + val error = event.data.error + if (event.ctx.dataset.getOrElse("ALL").equals("ALL")) + event.ctx.dataset_type should be(None) + else if (error.isDefined) { + val errorCode = error.get.error_code + if (errorCode.equals(ErrorConstants.MISSING_DATASET_ID.errorCode) || + errorCode.equals(ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode) || + errorCode.equals(ErrorConstants.EVENT_MISSING.errorCode)) { + event.ctx.dataset_type should be(None) + } + } + else + event.ctx.dataset_type should be(Some("dataset")) + }) + + systemEvents.foreach(f => { + val event = JSONUtil.deserialize[SystemEvent](f) + event.etype should be(EventID.METRIC) + event.ctx.module should be(ModuleID.processing) + event.ctx.pdata.id should be(routerConfig.jobName) + event.ctx.pdata.`type` should be(PDataType.flink) + event.ctx.pdata.pid.get should be(Producer.router) + if(event.data.error.isDefined) { + val errorLog = event.data.error.get + errorLog.error_level should be(ErrorLevel.critical) + errorLog.pdata_id should be(Producer.router) + errorLog.pdata_status should be(StatusCode.failed) + errorLog.error_count.get should be(1) + errorLog.error_code should be(ErrorConstants.INDEX_KEY_MISSING_OR_BLANK.errorCode) + errorLog.error_message should be(ErrorConstants.INDEX_KEY_MISSING_OR_BLANK.errorMsg) + errorLog.error_type should be(FunctionalError.MissingTimestampKey) + } else { + event.data.pipeline_stats.isDefined should be (true) + event.data.pipeline_stats.get.latency_time.isDefined should be (true) + event.data.pipeline_stats.get.processing_time.isDefined should be (true) + event.data.pipeline_stats.get.total_processing_time.isDefined should be (true) + } + + }) + } + + private def validateMetrics(mutableMetricsMap: mutable.Map[String, Long]): Unit = { + mutableMetricsMap(s"${routerConfig.jobName}.d1.${routerConfig.routerTotalCount}") should be(1) + mutableMetricsMap(s"${routerConfig.jobName}.d1.${routerConfig.routerSuccessCount}") should be(1) + mutableMetricsMap(s"${routerConfig.jobName}.d2.${routerConfig.routerTotalCount}") should be(1) + mutableMetricsMap(s"${routerConfig.jobName}.d2.${routerConfig.eventFailedMetricsCount}") should be(1) + } + +} diff --git a/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/EventFixture.scala b/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/EventFixture.scala new file mode 100644 index 00000000..7856b0cc --- /dev/null +++ b/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/EventFixture.scala @@ -0,0 +1,7 @@ +package org.sunbird.obsrv.router + +object EventFixture { + + val SUCCESS_EVENT = """{"dataset":"d1","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val FAILED_EVENT = """{"dataset":"d2","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"D123","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" +} diff --git a/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/TestTimestampKeyParser.scala 
b/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/TestTimestampKeyParser.scala new file mode 100644 index 00000000..f35567f0 --- /dev/null +++ b/pipeline/druid-router/src/test/scala/org/sunbird/obsrv/router/TestTimestampKeyParser.scala @@ -0,0 +1,124 @@ +package org.sunbird.obsrv.router + +import org.scalatest.{FlatSpec, Matchers} +import org.sunbird.obsrv.core.util.JSONUtil +import org.sunbird.obsrv.model.DatasetModels.DatasetConfig +import org.sunbird.obsrv.router.functions.TimestampKeyParser + +import scala.collection.mutable + +class TestTimestampKeyParser extends FlatSpec with Matchers { + + "TimestampKeyParser" should "validate all scenarios of timestamp key in number format" in { + + + // Validate text date field without providing dateformat and timezone + val result1 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"2023-03-01"}""")) + result1.isValid should be (true) + result1.value.asInstanceOf[String] should be ("2023-03-01") + + // Validate missing timestamp key scenario + val result2 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date1", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"2023-03-01"}""")) + result2.isValid should be(false) + result2.value should be(null) + + // Validate number date field which is not epoch + val result3 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":20232201}""")) + result3.isValid should be(false) + result3.value.asInstanceOf[Int] should be(0) + + // Validate number date field which is epoch in seconds + val result4 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":1701373165}""")) + result4.isValid should be(true) + result4.value.asInstanceOf[Long] should be(1701373165000l) + + // Validate number date field which is epoch in milli-seconds + val result5 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":1701373165123}""")) + result5.isValid should be(true) + result5.value.asInstanceOf[Long] should be(1701373165123l) + + // Validate number date field which is epoch in micro-seconds + val result6 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + 
JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":1701373165123111}""")) + result6.isValid should be(true) + result6.value.asInstanceOf[Long] should be(1701373165123l) + + // Validate number date field which is epoch in nano-seconds + val result7 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":1701373165123111000}""")) + result7.isValid should be(true) + result7.value.asInstanceOf[Long] should be(1701373165123l) + + // Validate number date field which is not an epoch in milli, micro or nano seconds + val result8 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":170137316512}""")) + result8.isValid should be(false) + result8.value.asInstanceOf[Int] should be(0) + + // Validate number date field which is an epoch with timezone present + val result9 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = None, datasetTimezone = Some("GMT+05:30")), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":1701373165123}""")) + result9.isValid should be(true) + result9.value.asInstanceOf[Long] should be(1701392965123l) + } + + it should "validate all scenarios of timestamp key in text format" in { + + // Validate epoch data in text format + val result1 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = Some("epoch"), datasetTimezone = Some("GMT+05:30")), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"1701373165123"}""")) + result1.isValid should be(true) + result1.value.asInstanceOf[Long] should be(1701392965123l) + + // Validate invalid epoch data in text format (would reset to millis from 1970-01-01 if not epoch in millis) + val result2 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = Some("epoch"), datasetTimezone = Some("GMT+05:30")), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"170137316512"}""")) + result2.isValid should be(true) + result2.value.asInstanceOf[Long] should be(170157116512l) + + // Validate date parser without timezone + val result3 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = Some("yyyy-MM-dd"), datasetTimezone = None), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"2023-03-01"}""")) + result3.isValid should be(true) + result3.value.asInstanceOf[Long] should be(1677609000000l) + + // Validate date parser with timezone + val result4 = TimestampKeyParser.parseTimestampKey( + 
DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = Some("yyyy-MM-dd"), datasetTimezone = Some("GMT+05:30")), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"2023-03-01"}""")) + result4.isValid should be(true) + result4.value.asInstanceOf[Long] should be(1677628800000l) + + // Validate date parser with date time in nano seconds + val result5 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = Some("yyyy-MM-dd'T'HH:mm:ss.SSSSSSSSS"), datasetTimezone = Some("GMT+05:30")), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"2023-03-01T12:45:32.123456789"}""")) + result5.isValid should be(true) + result5.value.asInstanceOf[Long] should be(1677674732123l) + + // Validate date parser with data in invalid format + val result6 = TimestampKeyParser.parseTimestampKey( + DatasetConfig(key = "id", tsKey = "date", entryTopic = "ingest", excludeFields = None, redisDBHost = None, redisDBPort = None, redisDB = None, indexData = None, tsFormat = Some("yyyy-MM-dd'T'HH:mm:ss.SSS"), datasetTimezone = Some("GMT+05:30")), + JSONUtil.deserialize[mutable.Map[String, AnyRef]]("""{"id":1234, "date":"2023-03-01T12:45:32.123456"}""")) + result6.isValid should be(false) + result6.value should be(null) + } + +} \ No newline at end of file diff --git a/pipeline/extractor/pom.xml b/pipeline/extractor/pom.xml index b73d697d..95ac031f 100644 --- a/pipeline/extractor/pom.xml +++ b/pipeline/extractor/pom.xml @@ -47,6 +47,18 @@ framework 1.0.0 + + org.apache.kafka + kafka-clients + ${kafka.version} + test + + + org.apache.kafka + kafka_${scala.maj.version} + ${kafka.version} + test + org.sunbird.obsrv framework @@ -54,6 +66,13 @@ test-jar test + + org.sunbird.obsrv + dataset-registry + 1.0.0 + test-jar + test + org.apache.flink flink-test-utils @@ -68,9 +87,21 @@ tests - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 + test + + + io.github.embeddedkafka + embedded-kafka_2.12 + 3.4.0 + test + + + io.zonky.test + embedded-postgres + 2.0.3 test diff --git a/pipeline/extractor/src/main/resources/extractor.conf b/pipeline/extractor/src/main/resources/extractor.conf index a406d3c6..103649d0 100644 --- a/pipeline/extractor/src/main/resources/extractor.conf +++ b/pipeline/extractor/src/main/resources/extractor.conf @@ -3,9 +3,8 @@ include "baseconfig.conf" kafka { input.topic = ${job.env}".ingest" output.raw.topic = ${job.env}".raw" - output.extractor.duplicate.topic = ${job.env}".extractor.duplicate" - output.failed.topic = ${job.env}".failed" - output.batch.failed.topic = ${job.env}".extractor.failed" + output.extractor.duplicate.topic = ${job.env}".failed" + output.batch.failed.topic = ${job.env}".failed" event.max.size = "1048576" # Max is only 1MB groupId = ${job.env}"-extractor-group" producer { diff --git a/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/functions/ExtractionFunction.scala b/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/functions/ExtractionFunction.scala index f8f4520c..f1fea9fb 100644 --- a/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/functions/ExtractionFunction.scala +++ b/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/functions/ExtractionFunction.scala @@ -2,12 +2,14 @@ package 
org.sunbird.obsrv.extractor.functions import org.apache.flink.configuration.Configuration import org.apache.flink.streaming.api.functions.ProcessFunction +import org.slf4j.LoggerFactory import org.sunbird.obsrv.core.cache.{DedupEngine, RedisConnect} import org.sunbird.obsrv.core.exception.ObsrvException -import org.sunbird.obsrv.core.model.ErrorConstants import org.sunbird.obsrv.core.model.ErrorConstants.Error -import org.sunbird.obsrv.core.model.Models.{PData, SystemEvent} -import org.sunbird.obsrv.core.streaming.{BaseProcessFunction, Metrics, MetricsList} +import org.sunbird.obsrv.core.model.FunctionalError.FunctionalError +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.{BaseDeduplication, BaseProcessFunction, Metrics, MetricsList} import org.sunbird.obsrv.core.util.Util.getMutableMap import org.sunbird.obsrv.core.util.{JSONUtil, Util} import org.sunbird.obsrv.extractor.task.ExtractorConfig @@ -16,44 +18,55 @@ import org.sunbird.obsrv.registry.DatasetRegistry import scala.collection.mutable -class ExtractionFunction(config: ExtractorConfig, @transient var dedupEngine: DedupEngine = null) - extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) { +class ExtractionFunction(config: ExtractorConfig) + extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) with BaseDeduplication { + + @transient private var dedupEngine: DedupEngine = null + private[this] val logger = LoggerFactory.getLogger(classOf[ExtractionFunction]) override def getMetricsList(): MetricsList = { - val metrics = List(config.successEventCount, config.systemEventCount, config.failedEventCount, config.failedExtractionCount, + val metrics = List(config.successEventCount, config.systemEventCount, config.eventFailedMetricsCount, config.failedExtractionCount, config.skippedExtractionCount, config.duplicateExtractionCount, config.totalEventCount, config.successExtractionCount) MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) } override def open(parameters: Configuration): Unit = { super.open(parameters) - if (dedupEngine == null) { - val redisConnect = new RedisConnect(config.redisHost, config.redisPort, config.redisConnectionTimeout) - dedupEngine = new DedupEngine(redisConnect, config.dedupStore, config.cacheExpiryInSeconds) - } + val redisConnect = new RedisConnect(config.redisHost, config.redisPort, config.redisConnectionTimeout) + dedupEngine = new DedupEngine(redisConnect, config.dedupStore, config.cacheExpiryInSeconds) } override def processElement(batchEvent: mutable.Map[String, AnyRef], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { metrics.incCounter(config.defaultDatasetID, config.totalEventCount) - val datasetId = batchEvent.get(config.CONST_DATASET) - if (datasetId.isEmpty) { + if (batchEvent.contains(Constants.INVALID_JSON)) { + context.output(config.failedBatchEventOutputTag, markBatchFailed(batchEvent, ErrorConstants.ERR_INVALID_EVENT)) + metrics.incCounter(config.defaultDatasetID, config.eventFailedMetricsCount) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(config.defaultDatasetID), ErrorConstants.ERR_INVALID_EVENT, FunctionalError.InvalidJsonData)) + return + } + val eventAsText = JSONUtil.serialize(batchEvent) + val datasetIdOpt = batchEvent.get(config.CONST_DATASET) + if (datasetIdOpt.isEmpty) { 
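// A batch event without a dataset id cannot be attributed to any dataset, so the failure metric is
// recorded against config.defaultDatasetID and a MissingDatasetId system event is emitted alongside
// the failed batch event.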
context.output(config.failedBatchEventOutputTag, markBatchFailed(batchEvent, ErrorConstants.MISSING_DATASET_ID)) - metrics.incCounter(config.defaultDatasetID, config.failedExtractionCount) + metrics.incCounter(config.defaultDatasetID, config.eventFailedMetricsCount) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(config.defaultDatasetID), ErrorConstants.MISSING_DATASET_ID, FunctionalError.MissingDatasetId)) return } - val datasetOpt = DatasetRegistry.getDataset(datasetId.get.asInstanceOf[String]) + val datasetId = datasetIdOpt.get.asInstanceOf[String] + metrics.incCounter(datasetId, config.totalEventCount) + val datasetOpt = DatasetRegistry.getDataset(datasetId) if (datasetOpt.isEmpty) { context.output(config.failedBatchEventOutputTag, markBatchFailed(batchEvent, ErrorConstants.MISSING_DATASET_CONFIGURATION)) - metrics.incCounter(config.defaultDatasetID, config.failedExtractionCount) + metrics.incCounter(datasetId, config.failedExtractionCount) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(datasetId), ErrorConstants.MISSING_DATASET_CONFIGURATION, FunctionalError.MissingDatasetId)) return } val dataset = datasetOpt.get if (!containsEvent(batchEvent) && dataset.extractionConfig.isDefined && dataset.extractionConfig.get.isBatchEvent.get) { - val eventAsText = JSONUtil.serialize(batchEvent) if (dataset.extractionConfig.get.dedupConfig.isDefined && dataset.extractionConfig.get.dedupConfig.get.dropDuplicates.get) { - val isDup = isDuplicate(dataset.id, dataset.extractionConfig.get.dedupConfig.get.dedupKey, eventAsText, context, config)(dedupEngine) + val isDup = isDuplicate(dataset, dataset.extractionConfig.get.dedupConfig.get.dedupKey, eventAsText, context) if (isDup) { metrics.incCounter(dataset.id, config.duplicateExtractionCount) context.output(config.duplicateEventOutputTag, markBatchFailed(batchEvent, ErrorConstants.DUPLICATE_BATCH_EVENT_FOUND)) @@ -66,20 +79,40 @@ class ExtractionFunction(config: ExtractorConfig, @transient var dedupEngine: De } } + private def isDuplicate(dataset: Dataset, dedupKey: Option[String], event: String, + context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context): Boolean = { + try { + super.isDuplicate(dataset.id, dedupKey, event)(dedupEngine) + } catch { + case ex: ObsrvException => + val sysEvent = JSONUtil.serialize(SystemEvent( + EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.extractor)), dataset = Some(dataset.id), dataset_type = Some(dataset.datasetType)), + data = EData(error = Some(ErrorLog(pdata_id = Producer.dedup, pdata_status = StatusCode.skipped, error_type = FunctionalError.DedupFailed, error_code = ex.error.errorCode, error_message = ex.error.errorMsg, error_level = ErrorLevel.warn))) + )) + logger.warn("BaseDeduplication:isDuplicate() | Exception", ex) + context.output(config.systemEventsOutputTag, sysEvent) + false + } + } + private def skipExtraction(dataset: Dataset, batchEvent: mutable.Map[String, AnyRef], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { val obsrvMeta = batchEvent(config.CONST_OBSRV_META).asInstanceOf[Map[String, AnyRef]] if (!super.containsEvent(batchEvent)) { - metrics.incCounter(dataset.id, config.failedEventCount) - context.output(config.failedEventsOutputTag, markBatchFailed(batchEvent, ErrorConstants.EVENT_MISSING)) + metrics.incCounter(dataset.id, config.eventFailedMetricsCount) + 
context.output(config.failedEventsOutputTag(), markBatchFailed(batchEvent, ErrorConstants.EVENT_MISSING)) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(dataset.id), ErrorConstants.EVENT_MISSING, FunctionalError.MissingEventData, dataset_type = Some(dataset.datasetType))) return } val eventData = Util.getMutableMap(batchEvent(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]]) val eventJson = JSONUtil.serialize(eventData) val eventSize = eventJson.getBytes("UTF-8").length if (eventSize > config.eventMaxSize) { - metrics.incCounter(dataset.id, config.failedEventCount) - context.output(config.failedEventsOutputTag, markEventFailed(dataset.id, eventData, ErrorConstants.EVENT_SIZE_EXCEEDED, obsrvMeta)) + metrics.incCounter(dataset.id, config.eventFailedMetricsCount) + context.output(config.failedEventsOutputTag(), markEventFailed(dataset.id, eventData, ErrorConstants.EVENT_SIZE_EXCEEDED, obsrvMeta)) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(dataset.id), ErrorConstants.EVENT_SIZE_EXCEEDED, FunctionalError.EventSizeExceeded, dataset_type = Some(dataset.datasetType))) + logger.error(s"Extractor | Event size exceeded max configured value | dataset=${dataset.id} | Event size is $eventSize, Max configured size is ${config.eventMaxSize}") } else { metrics.incCounter(dataset.id, config.skippedExtractionCount) context.output(config.rawEventsOutputTag, markEventSkipped(dataset.id, eventData, obsrvMeta)) @@ -96,21 +129,24 @@ class ExtractionFunction(config: ExtractorConfig, @transient var dedupEngine: De val eventJson = JSONUtil.serialize(eventData) val eventSize = eventJson.getBytes("UTF-8").length if (eventSize > config.eventMaxSize) { - metrics.incCounter(dataset.id, config.failedEventCount) - context.output(config.failedEventsOutputTag, markEventFailed(dataset.id, eventData, ErrorConstants.EVENT_SIZE_EXCEEDED, obsrvMeta)) + metrics.incCounter(dataset.id, config.eventFailedMetricsCount) + context.output(config.failedEventsOutputTag(), markEventFailed(dataset.id, eventData, ErrorConstants.EVENT_SIZE_EXCEEDED, obsrvMeta)) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(dataset.id), ErrorConstants.EVENT_SIZE_EXCEEDED, FunctionalError.EventSizeExceeded, dataset_type = Some(dataset.datasetType))) + logger.error(s"Extractor | Event size exceeded max configured value | dataset=${dataset.id} | Event size is $eventSize, Max configured size is ${config.eventMaxSize}") } else { metrics.incCounter(dataset.id, config.successEventCount) context.output(config.rawEventsOutputTag, markEventSuccess(dataset.id, eventData, obsrvMeta)) } }) - context.output(config.systemEventsOutputTag, JSONUtil.serialize(generateSystemEvent(dataset.id, eventsList.size))) + context.output(config.systemEventsOutputTag, JSONUtil.serialize(successSystemEvent(dataset, eventsList.size))) metrics.incCounter(dataset.id, config.systemEventCount) metrics.incCounter(dataset.id, config.successExtractionCount) } catch { case ex: ObsrvException => - context.output(config.failedBatchEventOutputTag, markBatchFailed(batchEvent, ex.error)) metrics.incCounter(dataset.id, config.failedExtractionCount) - case re: Exception => re.printStackTrace() + context.output(config.failedBatchEventOutputTag, markBatchFailed(batchEvent, ex.error)) + context.output(config.systemEventsOutputTag, failedSystemEvent(Some(dataset.id), ex.error, FunctionalError.ExtractionDataFormatInvalid, dataset_type = Some(dataset.datasetType))) + logger.error(s"Extractor | Exception extracting data | 
dataset=${dataset.id}", ex) } } @@ -133,8 +169,20 @@ class ExtractionFunction(config: ExtractorConfig, @transient var dedupEngine: De /** * Method to Generate a System Event to capture the extraction information and metrics */ - private def generateSystemEvent(dataset: String, totalEvents: Int): SystemEvent = { - SystemEvent(PData(config.jobName, "flink", ""), Map("totalEvents" -> totalEvents.asInstanceOf[AnyRef], "dataset" -> dataset.asInstanceOf[AnyRef])); // TODO: Generate a system event + private def successSystemEvent(dataset: Dataset, totalEvents: Int): SystemEvent = { + SystemEvent( + EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.extractor)), dataset = Some(dataset.id), dataset_type = Some(dataset.datasetType)), + data = EData(error = None, pipeline_stats = Some(PipelineStats(Some(totalEvents), Some(StatusCode.success)))) + ) + } + + private def failedSystemEvent(dataset: Option[String], error: Error, functionalError: FunctionalError, dataset_type: Option[String] = None): String = { + + JSONUtil.serialize(SystemEvent( + EventID.METRIC, ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.extractor)), dataset = dataset, dataset_type = dataset_type), + data = EData(error = Some(ErrorLog(Producer.extractor, StatusCode.failed, functionalError, error.errorCode, error.errorMsg, ErrorLevel.critical)), pipeline_stats = None) + )) } /** @@ -143,32 +191,30 @@ class ExtractionFunction(config: ExtractorConfig, @transient var dedupEngine: De private def markEventFailed(dataset: String, event: mutable.Map[String, AnyRef], error: Error, obsrvMeta: Map[String, AnyRef]): mutable.Map[String, AnyRef] = { val wrapperEvent = createWrapperEvent(dataset, event) updateEvent(wrapperEvent, obsrvMeta) - super.markFailed(wrapperEvent, error, config.jobName) + super.markFailed(wrapperEvent, error, Producer.extractor) wrapperEvent } private def markBatchFailed(batchEvent: mutable.Map[String, AnyRef], error: Error): mutable.Map[String, AnyRef] = { - super.markFailed(batchEvent, error, config.jobName) + super.markFailed(batchEvent, error, Producer.extractor) batchEvent } private def markEventSuccess(dataset: String, event: mutable.Map[String, AnyRef], obsrvMeta: Map[String, AnyRef]): mutable.Map[String, AnyRef] = { val wrapperEvent = createWrapperEvent(dataset, event) updateEvent(wrapperEvent, obsrvMeta) - super.markSuccess(wrapperEvent, config.jobName) + super.markSuccess(wrapperEvent, Producer.extractor) wrapperEvent } private def markEventSkipped(dataset: String, event: mutable.Map[String, AnyRef], obsrvMeta: Map[String, AnyRef]): mutable.Map[String, AnyRef] = { val wrapperEvent = createWrapperEvent(dataset, event) updateEvent(wrapperEvent, obsrvMeta) - super.markSkipped(wrapperEvent, config.jobName) + super.markSkipped(wrapperEvent, Producer.extractor) wrapperEvent } - private def createWrapperEvent(dataset: String, event: mutable.Map[String, AnyRef]): mutable.Map[String, AnyRef] = { mutable.Map(config.CONST_DATASET -> dataset, config.CONST_EVENT -> event.toMap) } -} - +} \ No newline at end of file diff --git a/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorConfig.scala b/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorConfig.scala index 421f43e5..17c1bac9 100644 --- a/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorConfig.scala +++ 
b/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorConfig.scala @@ -1,6 +1,5 @@ package org.sunbird.obsrv.extractor.task -import scala.collection.mutable import com.typesafe.config.Config import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.java.typeutils.TypeExtractor @@ -8,6 +7,8 @@ import org.apache.flink.streaming.api.scala.OutputTag import org.sunbird.obsrv.core.model.SystemConfig import org.sunbird.obsrv.core.streaming.BaseJobConfig +import scala.collection.mutable + class ExtractorConfig(override val config: Config) extends BaseJobConfig[mutable.Map[String, AnyRef]](config, "ExtractorJob") { private val serialVersionUID = 2905979434303791379L @@ -16,32 +17,28 @@ class ExtractorConfig(override val config: Config) extends BaseJobConfig[mutable implicit val stringTypeInfo: TypeInformation[String] = TypeExtractor.getForClass(classOf[String]) val dedupStore: Int = config.getInt("redis.database.extractor.duplication.store.id") - val cacheExpiryInSeconds: Int = SystemConfig.defaultDedupPeriodInSeconds + def cacheExpiryInSeconds: Int = SystemConfig.getInt("defaultDedupPeriodInSeconds", 604800) // Kafka Topics Configuration val kafkaInputTopic: String = config.getString("kafka.input.topic") val kafkaSuccessTopic: String = config.getString("kafka.output.raw.topic") val kafkaDuplicateTopic: String = config.getString("kafka.output.extractor.duplicate.topic") - val kafkaFailedTopic: String = config.getString("kafka.output.failed.topic") val kafkaBatchFailedTopic: String = config.getString("kafka.output.batch.failed.topic") - val eventMaxSize: Long = SystemConfig.maxEventSize + def eventMaxSize: Long = if(config.hasPath("kafka.event.max.size")) config.getInt("kafka.event.max.size") else SystemConfig.getLong("maxEventSize", 1048576L) private val RAW_EVENTS_OUTPUT_TAG = "raw-events" - private val FAILED_EVENTS_OUTPUT_TAG = "failed-events" private val FAILED_BATCH_EVENTS_OUTPUT_TAG = "failed-batch-events" private val DUPLICATE_EVENTS_OUTPUT_TAG = "duplicate-batch-events" // Metric List - val totalEventCount = "total-event-count" - val successEventCount = "success-event-count" - val failedEventCount = "failed-event-count" - val failedExtractionCount = "failed-extraction-count" - val successExtractionCount = "success-extraction-count" - val duplicateExtractionCount = "duplicate-extraction-count" - val skippedExtractionCount = "skipped-extraction-count" + val totalEventCount = "extractor-total-count" + val successEventCount = "extractor-event-count" + val failedExtractionCount = "extractor-failed-count" + val successExtractionCount = "extractor-success-count" + val duplicateExtractionCount = "extractor-duplicate-count" + val skippedExtractionCount = "extractor-skipped-count" val rawEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](RAW_EVENTS_OUTPUT_TAG) - val failedEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](FAILED_EVENTS_OUTPUT_TAG) val failedBatchEventOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](FAILED_BATCH_EVENTS_OUTPUT_TAG) val duplicateEventOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](id = DUPLICATE_EVENTS_OUTPUT_TAG) @@ -52,10 +49,9 @@ class ExtractorConfig(override val config: Config) extends BaseJobConfig[mutable val extractorDuplicateProducer = "extractor-duplicate-events-sink" val extractorBatchFailedEventsProducer = 
"extractor-batch-failed-events-sink" val extractorRawEventsProducer = "extractor-raw-events-sink" - val extractorFailedEventsProducer = "extractor-failed-events-sink" override def inputTopic(): String = kafkaInputTopic override def inputConsumer(): String = "extractor-consumer" override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = rawEventsOutputTag - + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") } diff --git a/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorStreamTask.scala b/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorStreamTask.scala index b64b55ad..521ffc61 100644 --- a/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorStreamTask.scala +++ b/pipeline/extractor/src/main/scala/org/sunbird/obsrv/extractor/task/ExtractorStreamTask.scala @@ -22,33 +22,32 @@ class ExtractorStreamTask(config: ExtractorConfig, kafkaConnector: FlinkKafkaCon def process(): Unit = { implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config) - val dataStream = getMapDataStream(env, config, kafkaConnector) - processStream(dataStream) + process(env) env.execute(config.jobName) } // $COVERAGE-ON$ + def process(env: StreamExecutionEnvironment): Unit = { + val dataStream = getMapDataStream(env, config, kafkaConnector) + processStream(dataStream) + } + override def processStream(dataStream: DataStream[mutable.Map[String, AnyRef]]): DataStream[mutable.Map[String, AnyRef]] = { val extractorStream = dataStream.process(new ExtractionFunction(config)) .name(config.extractionFunction).uid(config.extractionFunction) .setParallelism(config.downstreamOperatorsParallelism) - extractorStream.getSideOutput(config.failedBatchEventOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaBatchFailedTopic)) + extractorStream.getSideOutput(config.failedBatchEventOutputTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaBatchFailedTopic)) .name(config.extractorBatchFailedEventsProducer).uid(config.extractorBatchFailedEventsProducer).setParallelism(config.downstreamOperatorsParallelism) - extractorStream.getSideOutput(config.successTag()).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaSuccessTopic)) + extractorStream.getSideOutput(config.successTag()).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaSuccessTopic)) .name(config.extractorRawEventsProducer).uid(config.extractorRawEventsProducer).setParallelism(config.downstreamOperatorsParallelism) - extractorStream.getSideOutput(config.duplicateEventOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaDuplicateTopic)) + extractorStream.getSideOutput(config.duplicateEventOutputTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaDuplicateTopic)) .name(config.extractorDuplicateProducer).uid(config.extractorDuplicateProducer).setParallelism(config.downstreamOperatorsParallelism) - extractorStream.getSideOutput(config.systemEventsOutputTag).sinkTo(kafkaConnector.kafkaStringSink(config.kafkaSystemTopic)) - .name(config.systemEventsProducer).uid(config.systemEventsProducer).setParallelism(config.downstreamOperatorsParallelism) - - extractorStream.getSideOutput(config.failedEventsOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaFailedTopic)) - .name(config.extractorFailedEventsProducer).uid(config.extractorFailedEventsProducer).setParallelism(config.downstreamOperatorsParallelism) - + 
addDefaultSinks(extractorStream, config, kafkaConnector) extractorStream.getSideOutput(config.successTag()) } } diff --git a/pipeline/extractor/src/test/resources/test.conf b/pipeline/extractor/src/test/resources/test.conf index 6cfcefa4..33066c5c 100644 --- a/pipeline/extractor/src/test/resources/test.conf +++ b/pipeline/extractor/src/test/resources/test.conf @@ -3,10 +3,10 @@ include "base-test.conf" kafka { input.topic = "flink.ingest" output.raw.topic = "flink.raw" - output.extractor.duplicate.topic = "flink.extractor.duplicate" - output.failed.topic = "flink.failed" - output.batch.failed.topic = "flink.extractor.failed" - event.max.size = "1048576" # Max is only 1MB + output.extractor.duplicate.topic = "flink.failed" + + output.batch.failed.topic = "flink.failed" + event.max.size = "300" # Intentionally small so the large-event fixtures exceed the limit in tests groupId = "flink-extractor-group" } @@ -17,7 +17,7 @@ task { redis { host = 127.0.0.1 - port = 6379 + port = 6340 database { extractor.duplication.store.id = 1 key.expiry.seconds = 3600 diff --git a/pipeline/extractor/src/test/resources/test2.conf b/pipeline/extractor/src/test/resources/test2.conf new file mode 100644 index 00000000..a381de66 --- /dev/null +++ b/pipeline/extractor/src/test/resources/test2.conf @@ -0,0 +1,24 @@ +include "base-test.conf" + +kafka { + input.topic = "flink.ingest" + output.raw.topic = "flink.raw" + output.extractor.duplicate.topic = "flink.failed" + + output.batch.failed.topic = "flink.failed" + groupId = "flink-extractor-group" +} + +task { + consumer.parallelism = 1 + downstream.operators.parallelism = 1 +} + +redis { + host = 127.0.0.1 + port = 6340 + database { + extractor.duplication.store.id = 1 + key.expiry.seconds = 3600 + } +} \ No newline at end of file diff --git a/pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/EventFixture.scala b/pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/EventFixture.scala new file mode 100644 index 00000000..800587a7 --- /dev/null +++ b/pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/EventFixture.scala @@ -0,0 +1,15 @@ +package org.sunbird.obsrv.extractor + +object EventFixture { + + val MISSING_DEDUP_KEY = """{"dataset":"d1","id1":"event1","events":[{"id":"1","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}]}""" + val INVALID_JSON = """{"dataset":"d1","event":{"id":"2","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""" + + val VALID_EVENT = """{"dataset":"d1","id":"event4","event":{"id":"3","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val VALID_BATCH = """{"dataset":"d1","id":"event5","events":[{"id":"4","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}]}""" + + val LARGE_JSON_BATCH = 
"""{"dataset":"d1","id":"event2","events":[{"id":"5","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19},"randomKey":"eRJcFJvUoQnlC9ZNa2b2NT84aAv4Trr9m6GFwxaL6Qn1srmWBl7ldsKnBvs6ah2l0KN6M3Vp4eoGLBiIMYsi3gHWklc8sbt6"}]}""" + val LARGE_JSON_EVENT = """{"dataset":"d1","id":"event3","event":{"id":"6","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19},"randomKey":"eRJcFJvUoQnlC9ZNa2b2NT84aAv4Trr9m6GFwxaL6Qn1srmWBl7ldsKnBvs6ah2l0KN6M3Vp4eoGLBiIMYsi3gHWklc8sbt6"}}""" + + +} diff --git a/pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/ExtractorStreamTestSpec.scala b/pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/ExtractorStreamTestSpec.scala new file mode 100644 index 00000000..6ada824b --- /dev/null +++ b/pipeline/extractor/src/test/scala/org/sunbird/obsrv/extractor/ExtractorStreamTestSpec.scala @@ -0,0 +1,167 @@ +package org.sunbird.obsrv.extractor + +import com.typesafe.config.{Config, ConfigFactory} +import io.github.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig} +import org.apache.flink.configuration.Configuration +import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.test.util.MiniClusterWithClientResource +import org.apache.kafka.common.serialization.StringDeserializer +import org.scalatest.Matchers._ +import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.core.cache.RedisConnect +import org.sunbird.obsrv.core.model.Models.SystemEvent +import org.sunbird.obsrv.core.model.SystemConfig +import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil} +import org.sunbird.obsrv.extractor.task.{ExtractorConfig, ExtractorStreamTask} +import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry + +import scala.collection.mutable +import scala.concurrent.ExecutionContext.Implicits.global +import scala.concurrent.Future +import scala.concurrent.duration._ + +class ExtractorStreamTestSpec extends BaseSpecWithDatasetRegistry { + + val flinkCluster = new MiniClusterWithClientResource(new MiniClusterResourceConfiguration.Builder() + .setConfiguration(testConfiguration()) + .setNumberSlotsPerTaskManager(1) + .setNumberTaskManagers(1) + .build) + + val pConfig = new ExtractorConfig(config) + val kafkaConnector = new FlinkKafkaConnector(pConfig) + val customKafkaConsumerProperties: Map[String, String] = Map[String, String]("auto.offset.reset" -> "earliest", "group.id" -> "test-event-schema-group") + implicit val embeddedKafkaConfig: EmbeddedKafkaConfig = + EmbeddedKafkaConfig( + kafkaPort = 9093, + zooKeeperPort = 2183, + customConsumerProperties = customKafkaConsumerProperties + ) + implicit val deserializer: StringDeserializer = new StringDeserializer() + + def testConfiguration(): Configuration = { + val config = new Configuration() + config.setString("metrics.reporter", "job_metrics_reporter") + config.setString("metrics.reporter.job_metrics_reporter.class", classOf[BaseMetricsReporter].getName) + config + } + + override def beforeAll(): Unit = { + super.beforeAll() + BaseMetricsReporter.gaugeMetrics.clear() + 
EmbeddedKafka.start()(embeddedKafkaConfig) + createTestTopics() + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixture.INVALID_JSON) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixture.MISSING_DEDUP_KEY) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixture.LARGE_JSON_EVENT) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixture.LARGE_JSON_BATCH) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixture.VALID_EVENT) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixture.VALID_BATCH) + + flinkCluster.before() + } + + override def afterAll(): Unit = { + val redisConnection = new RedisConnect(pConfig.redisHost, pConfig.redisPort, pConfig.redisConnectionTimeout) + redisConnection.getConnection(config.getInt("redis.database.extractor.duplication.store.id")).flushAll() + super.afterAll() + flinkCluster.after() + EmbeddedKafka.stop() + } + + def createTestTopics(): Unit = { + List( + pConfig.kafkaInputTopic, pConfig.kafkaFailedTopic, pConfig.kafkaSystemTopic, pConfig.kafkaDuplicateTopic, pConfig.kafkaBatchFailedTopic + ).foreach(EmbeddedKafka.createCustomTopic(_)) + } + + "ExtractorStreamTestSpec" should "validate the negative scenarios in extractor job" in { + + implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(pConfig) + val task = new ExtractorStreamTask(pConfig, kafkaConnector) + task.process(env) + Future { + env.execute(pConfig.jobName) + } + val batchFailedEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaBatchFailedTopic, 1, timeout = 30.seconds) + val invalidEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaFailedTopic, 2, timeout = 30.seconds) + val systemEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaSystemTopic, 6, timeout = 30.seconds) + val outputEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaSuccessTopic, 3, timeout = 30.seconds) + + validateOutputEvents(outputEvents) + validateBatchFailedEvents(batchFailedEvents) + validateInvalidEvents(invalidEvents) + validateSystemEvents(systemEvents) + + val mutableMetricsMap = mutable.Map[String, Long]() + BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### ExtractorStreamTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) + validateMetrics(mutableMetricsMap) + + val config2: Config = ConfigFactory.load("test2.conf") + val extractorConfig = new ExtractorConfig(config2) + extractorConfig.eventMaxSize should be (SystemConfig.getLong("maxEventSize", 1048576L)) + } + + private def validateOutputEvents(outputEvents: List[String]) = { + outputEvents.size should be (3) + //TODO: Add assertions for all 3 events + /* + (OutEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}},"obsrv_meta":{"flags":{"extractor":"success"},"syncts":1701760331686,"prevProcessingTime":1701760337492,"error":{},"processingStartTime":1701760337087,"timespans":{"extractor":405}},"dataset":"d1"}) + 
(OutEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"3","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}},"obsrv_meta":{"flags":{"extractor":"skipped"},"syncts":1701760331771,"prevProcessingTime":1701760337761,"error":{},"processingStartTime":1701760337089,"timespans":{"extractor":672}},"dataset":"d1"}) + (OutEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"4","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}},"obsrv_meta":{"flags":{"extractor":"success"},"syncts":1701760331794,"prevProcessingTime":1701760337777,"error":{},"processingStartTime":1701760337092,"timespans":{"extractor":685}},"dataset":"d1"}) + */ + } + + private def validateBatchFailedEvents(batchFailedEvents: List[String]): Unit = { + batchFailedEvents.size should be(1) + //TODO: Add assertions for all 1 events + /* + (BatchFailedEvent,{"event":"{\"invalid_json\":\"{\\\"dataset\\\":\\\"d1\\\",\\\"event\\\":{\\\"id\\\":\\\"2\\\",\\\"vehicleCode\\\":\\\"HYUN-CRE-D6\\\",\\\"date\\\":\\\"2023-03-01\\\",\\\"dealer\\\":{\\\"dealerCode\\\":\\\"KUNUnited\\\",\\\"locationId\\\":\\\"KUN1\\\",\\\"email\\\":\\\"dealer1@gmail.com\\\",\\\"phone\\\":\\\"9849012345\\\"},\\\"metrics\\\":{\\\"bookingsTaken\\\":50,\\\"deliveriesPromised\\\":20,\\\"deliveriesDone\\\":19}}\"}","obsrv_meta":{"flags":{"extractor":"failed"},"syncts":1701758716432,"prevProcessingTime":1701758721945,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"extractor"},"error_code":"ERR_EXT_1018","error_msg":"Invalid JSON event, error while deserializing the event"},"processingStartTime":1701758721739,"timespans":{"extractor":206}},"invalid_json":"{\"dataset\":\"d1\",\"event\":{\"id\":\"2\",\"vehicleCode\":\"HYUN-CRE-D6\",\"date\":\"2023-03-01\",\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}}"}) + */ + } + + private def validateInvalidEvents(invalidEvents: List[String]): Unit = { + invalidEvents.size should be(2) + //TODO: Add assertions for all 2 events + /* + (FailedEvent,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"6\",\"randomKey\":\"eRJcFJvUoQnlC9ZNa2b2NT84aAv4Trr9m6GFwxaL6Qn1srmWBl7ldsKnBvs6ah2l0KN6M3Vp4eoGLBiIMYsi3gHWklc8sbt6\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"d1\"}","obsrv_meta":{"flags":{"extractor":"failed"},"syncts":1701758716560,"prevProcessingTime":1701758722479,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"extractor"},"error_code":"ERR_EXT_1003","error_msg":"Event size has exceeded max configured size of 1048576"},"processingStartTime":1701758721888,"timespans":{"extractor":591}},"dataset":"d1"}) + 
(FailedEvent,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"5\",\"randomKey\":\"eRJcFJvUoQnlC9ZNa2b2NT84aAv4Trr9m6GFwxaL6Qn1srmWBl7ldsKnBvs6ah2l0KN6M3Vp4eoGLBiIMYsi3gHWklc8sbt6\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"d1\"}","obsrv_meta":{"flags":{"extractor":"failed"},"syncts":1701758716590,"prevProcessingTime":1701758722521,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"extractor"},"error_code":"ERR_EXT_1003","error_msg":"Event size has exceeded max configured size of 1048576"},"processingStartTime":1701758721888,"timespans":{"extractor":633}},"dataset":"d1"}) + */ + } + + private def validateSystemEvents(systemEvents: List[String]): Unit = { + systemEvents.size should be(6) + + systemEvents.foreach(se => { + val event = JSONUtil.deserialize[SystemEvent](se) + if(event.ctx.dataset.getOrElse("ALL").equals("ALL")) + event.ctx.dataset_type should be(None) + else + event.ctx.dataset_type should be(Some("dataset")) + }) + + //TODO: Add assertions for all 6 events + /* + (SysEvent,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"ExtractorJob","type":"flink","pid":"extractor"},"dataset":"ALL"},"data":{"error":{"pdata_id":"extractor","pdata_status":"failed","error_type":"InvalidJsonData","error_code":"ERR_EXT_1018","error_message":"Invalid JSON event, error while deserializing the event","error_level":"critical"}},"ets":1701760337333}) + (SysEvent,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"ExtractorJob","type":"flink","pid":"extractor"},"dataset":"d1", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"dedup","pdata_status":"skipped","error_type":"DedupFailed","error_code":"ERR_DEDUP_1007","error_message":"No dedup key found or missing data","error_level":"warn"}},"ets":1701760337474}) + (SysEvent,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"ExtractorJob","type":"flink","pid":"extractor"},"dataset":"d1", "dataset_type": "dataset"},"data":{"pipeline_stats":{"extractor_events":1,"extractor_status":"success"}},"ets":1701760337655}) + (SysEvent,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"ExtractorJob","type":"flink","pid":"extractor"},"dataset":"d1", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"extractor","pdata_status":"failed","error_type":"EventSizeExceeded","error_code":"ERR_EXT_1003","error_message":"Event size has exceeded max configured size of 1048576","error_level":"critical"}},"ets":1701760337724}) + (SysEvent,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"ExtractorJob","type":"flink","pid":"extractor"},"dataset":"d1", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"extractor","pdata_status":"failed","error_type":"EventSizeExceeded","error_code":"ERR_EXT_1003","error_message":"Event size has exceeded max configured size of 1048576","error_level":"critical"}},"ets":1701760337754}) + (SysEvent,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"ExtractorJob","type":"flink","pid":"extractor"},"dataset":"d1", "dataset_type": "dataset"},"data":{"pipeline_stats":{"extractor_events":1,"extractor_status":"success"}},"ets":1701760337754}) + */ + } + + private def validateMetrics(mutableMetricsMap: mutable.Map[String, Long]): Unit = { + + mutableMetricsMap(s"${pConfig.jobName}.ALL.${pConfig.eventFailedMetricsCount}") should be(1) + 
mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.totalEventCount}") should be(5) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.eventFailedMetricsCount}") should be(2) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.skippedExtractionCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.successEventCount}") should be(2) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.successExtractionCount}") should be(3) + } + +} \ No newline at end of file diff --git a/pipeline/kafka-connector/pom.xml b/pipeline/kafka-connector/pom.xml index bdf2fe8f..65aa4d68 100644 --- a/pipeline/kafka-connector/pom.xml +++ b/pipeline/kafka-connector/pom.xml @@ -64,6 +64,13 @@ test-jar test + + org.sunbird.obsrv + dataset-registry + 1.0.0 + test-jar + test + org.apache.flink flink-test-utils @@ -78,9 +85,33 @@ tests - it.ozimov + org.apache.kafka + kafka-clients + ${kafka.version} + test + + + org.apache.kafka + kafka_${scala.maj.version} + ${kafka.version} + test + + + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 + test + + + io.github.embeddedkafka + embedded-kafka_2.12 + 3.4.0 + test + + + io.zonky.test + embedded-postgres + 2.0.3 test @@ -148,7 +179,7 @@ - org.sunbird.obsrv.kafkaconnector.task.KafkaConnectorStreamTask + org.sunbird.obsrv.connector.task.KafkaConnectorStreamTask configList.filter(_.connectorType.equalsIgnoreCase("kafka")).map { dataSourceConfig => - val dataStream: DataStream[String] = - getStringDataStream(env, config, List(dataSourceConfig.connectorConfig.topic), + val dataStream: DataStream[String] = getStringDataStream(env, config, List(dataSourceConfig.connectorConfig.topic), config.kafkaConsumerProperties(kafkaBrokerServers = Some(dataSourceConfig.connectorConfig.kafkaBrokers), kafkaConsumerGroup = Some(s"kafka-${dataSourceConfig.connectorConfig.topic}-consumer")), - consumerSourceName = s"kafka-${dataSourceConfig.connectorConfig.topic}", kafkaConnector) + consumerSourceName = s"kafka-${dataSourceConfig.connectorConfig.topic}", kafkaConnector) val datasetId = dataSourceConfig.datasetId val kafkaOutputTopic = DatasetRegistry.getDataset(datasetId).get.datasetConfig.entryTopic - val resultMapStream: DataStream[String] = dataStream - .filter{msg: String => JSONUtil.isJSON(msg)}.returns(classOf[String]) // TODO: Add a metric to capture invalid JSON messages - .map { streamMap: String => { - val mutableMap = JSONUtil.deserialize[mutable.Map[String, AnyRef]](streamMap) - mutableMap.put("dataset", datasetId) - mutableMap.put("syncts", java.lang.Long.valueOf(new DateTime(DateTimeZone.UTC).getMillis)) - JSONUtil.serialize(mutableMap) + val resultStream: DataStream[String] = dataStream.map { streamData: String => { + val syncts = java.lang.Long.valueOf(new DateTime(DateTimeZone.UTC).getMillis) + JSONUtil.getJsonType(streamData) match { + case "ARRAY" => s"""{"dataset":"$datasetId","syncts":$syncts,"events":$streamData}""" + case _ => s"""{"dataset":"$datasetId","syncts":$syncts,"event":$streamData}""" } + } }.returns(classOf[String]) - resultMapStream.sinkTo(kafkaConnector.kafkaStringSink(kafkaOutputTopic)) + resultStream.sinkTo(kafkaConnector.kafkaSink[String](kafkaOutputTopic)) .name(s"$datasetId-kafka-connector-sink").uid(s"$datasetId-kafka-connector-sink") .setParallelism(config.downstreamOperatorsParallelism) } - env.execute(config.jobName) } } - override def processStream(dataStream: DataStream[String]): DataStream[String] = { - null - } - // $COVERAGE-ON$ } // $COVERAGE-OFF$ Disabling scoverage as the below code can only be invoked within flink 
cluster @@ -69,4 +68,4 @@ object KafkaConnectorStreamTask { task.process() } } -// $COVERAGE-ON$ +// $COVERAGE-ON$ \ No newline at end of file diff --git a/pipeline/kafka-connector/src/test/resources/test.conf b/pipeline/kafka-connector/src/test/resources/test.conf new file mode 100644 index 00000000..87306136 --- /dev/null +++ b/pipeline/kafka-connector/src/test/resources/test.conf @@ -0,0 +1,14 @@ +include "base-test.conf" + +kafka { + input.topic = "flink.test" + groupId = "flink-kafkaconnector-group" + producer { + max-request-size = 5242880 + } +} + +task { + consumer.parallelism = 1 + downstream.operators.parallelism = 1 +} \ No newline at end of file diff --git a/pipeline/kafka-connector/src/test/scala/org/sunbird/obsrv/connector/KafkaConnectorStreamTestSpec.scala b/pipeline/kafka-connector/src/test/scala/org/sunbird/obsrv/connector/KafkaConnectorStreamTestSpec.scala new file mode 100644 index 00000000..bf86eafa --- /dev/null +++ b/pipeline/kafka-connector/src/test/scala/org/sunbird/obsrv/connector/KafkaConnectorStreamTestSpec.scala @@ -0,0 +1,126 @@ +package org.sunbird.obsrv.connector + +import io.github.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig} +import org.apache.flink.configuration.Configuration +import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.test.util.MiniClusterWithClientResource +import org.apache.kafka.common.serialization.StringDeserializer +import org.scalatest.Matchers._ +import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.connector.task.{KafkaConnectorConfig, KafkaConnectorStreamTask} +import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnect} +import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry + +import scala.concurrent.ExecutionContext.Implicits.global +import scala.concurrent.Future +import scala.concurrent.duration._ + +class KafkaConnectorStreamTestSpec extends BaseSpecWithDatasetRegistry { + + val flinkCluster = new MiniClusterWithClientResource(new MiniClusterResourceConfiguration.Builder() + .setConfiguration(testConfiguration()) + .setNumberSlotsPerTaskManager(1) + .setNumberTaskManagers(1) + .build) + + val pConfig = new KafkaConnectorConfig(config) + val kafkaConnector = new FlinkKafkaConnector(pConfig) + val customKafkaConsumerProperties: Map[String, String] = Map[String, String]("auto.offset.reset" -> "earliest", "group.id" -> "test-event-schema-group") + implicit val embeddedKafkaConfig: EmbeddedKafkaConfig = + EmbeddedKafkaConfig( + kafkaPort = 9093, + zooKeeperPort = 2183, + customConsumerProperties = customKafkaConsumerProperties + ) + implicit val deserializer: StringDeserializer = new StringDeserializer() + private val VALID_JSON_EVENT = """{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""" + private val VALID_JSON_EVENT_ARRAY = """[{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}]""" + private val INVALID_JSON_EVENT = 
"""{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}""" + + def testConfiguration(): Configuration = { + val config = new Configuration() + config.setString("metrics.reporter", "job_metrics_reporter") + config.setString("metrics.reporter.job_metrics_reporter.class", classOf[BaseMetricsReporter].getName) + config + } + + override def beforeAll(): Unit = { + super.beforeAll() + BaseMetricsReporter.gaugeMetrics.clear() + EmbeddedKafka.start()(embeddedKafkaConfig) + prepareTestData() + createTestTopics() + EmbeddedKafka.publishStringMessageToKafka("d1-topic", VALID_JSON_EVENT) + EmbeddedKafka.publishStringMessageToKafka("d2-topic", VALID_JSON_EVENT_ARRAY) + EmbeddedKafka.publishStringMessageToKafka("d3-topic", INVALID_JSON_EVENT) + + flinkCluster.before() + } + + private def prepareTestData(): Unit = { + val postgresConnect = new PostgresConnect(postgresConfig) + postgresConnect.execute("insert into datasets(id, type, data_schema, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date, tags) values ('d3', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now(), ARRAY['Tag1','Tag2']);") + postgresConnect.execute("insert into dataset_source_config values('sc1', 'd1', 'kafka', '{\"kafkaBrokers\":\"localhost:9093\",\"topic\":\"d1-topic\"}', 'Live', null, 'System', 'System', now(), now());") + postgresConnect.execute("insert into dataset_source_config values('sc2', 'd1', 'rdbms', '{\"type\":\"postgres\",\"tableName\":\"test-table\"}', 'Live', null, 'System', 'System', now(), now());") + postgresConnect.execute("insert into dataset_source_config values('sc3', 'd2', 'kafka', '{\"kafkaBrokers\":\"localhost:9093\",\"topic\":\"d2-topic\"}', 'Live', null, 'System', 'System', now(), now());") + postgresConnect.execute("insert into dataset_source_config values('sc4', 'd3', 'kafka', '{\"kafkaBrokers\":\"localhost:9093\",\"topic\":\"d3-topic\"}', 'Live', null, 'System', 'System', now(), now());") + postgresConnect.closeConnection() + } + + override def afterAll(): Unit = { + super.afterAll() + flinkCluster.after() + EmbeddedKafka.stop() + } + + def createTestTopics(): Unit = { + List( + "d1-topic", "d2-topic", "d3-topic", pConfig.kafkaSystemTopic, "ingest" + ).foreach(EmbeddedKafka.createCustomTopic(_)) + } + + "KafkaConnectorStreamTestSpec" should "validate the kafka connector job" in { + + implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(pConfig) + val task = new 
KafkaConnectorStreamTask(pConfig, kafkaConnector) + task.process(env) + Future { + env.execute(pConfig.jobName) + } + + val ingestEvents = EmbeddedKafka.consumeNumberMessagesFrom[String]("ingest", 3, timeout = 30.seconds) + validateIngestEvents(ingestEvents) + + pConfig.inputTopic() should be ("") + pConfig.inputConsumer() should be ("") + pConfig.successTag().getId should be ("dummy-events") + pConfig.failedEventsOutputTag().getId should be ("failed-events") + } + + private def validateIngestEvents(ingestEvents: List[String]): Unit = { + ingestEvents.size should be(3) + ingestEvents.foreach{event: String => { + if(event.contains(""""dataset":"d1"""")) { + JSONUtil.getJsonType(event) should be ("OBJECT") + val eventMap = JSONUtil.deserialize[Map[String, AnyRef]](event) + eventMap.get("dataset").get.asInstanceOf[String] should be ("d1") + eventMap.get("syncts").isDefined should be (true) + eventMap.contains("event") should be (true) + } else if(event.contains(""""dataset":"d2"""")) { + JSONUtil.getJsonType(event) should be("OBJECT") + val eventMap = JSONUtil.deserialize[Map[String, AnyRef]](event) + eventMap.get("dataset").get.asInstanceOf[String] should be("d2") + eventMap.get("syncts").isDefined should be(true) + eventMap.contains("events") should be(true) + JSONUtil.getJsonType(JSONUtil.serialize(eventMap.get("events"))) should be("ARRAY") + } else { + JSONUtil.getJsonType(event) should be ("NOT_A_JSON") + event.contains(""""event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}""") should be(true) + } + }} + + } + +} \ No newline at end of file diff --git a/pipeline/master-data-processor/pom.xml b/pipeline/master-data-processor/pom.xml index 52783714..370ec621 100644 --- a/pipeline/master-data-processor/pom.xml +++ b/pipeline/master-data-processor/pom.xml @@ -4,9 +4,6 @@ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 4.0.0 - - 3.0.1 - org.sunbird.obsrv @@ -55,6 +52,11 @@ preprocessor 1.0.0 + + org.sunbird.obsrv.pipeline + denormalizer + 1.0.0 + org.sunbird.obsrv.pipeline transformer @@ -129,9 +131,9 @@ tests - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 test diff --git a/pipeline/master-data-processor/src/main/resources/master-data-processor.conf b/pipeline/master-data-processor/src/main/resources/master-data-processor.conf index 686d2f35..149e795b 100644 --- a/pipeline/master-data-processor/src/main/resources/master-data-processor.conf +++ b/pipeline/master-data-processor/src/main/resources/master-data-processor.conf @@ -3,13 +3,14 @@ include "baseconfig.conf" kafka { input.topic = ${job.env}".masterdata.ingest" output.raw.topic = ${job.env}".masterdata.raw" - output.extractor.duplicate.topic = ${job.env}".masterdata.extractor.duplicate" + output.extractor.duplicate.topic = ${job.env}".masterdata.failed" output.failed.topic = ${job.env}".masterdata.failed" output.batch.failed.topic = ${job.env}".masterdata.extractor.failed" event.max.size = "1048576" # Max is only 1MB - output.invalid.topic = ${job.env}".masterdata.invalid" + output.invalid.topic = ${job.env}".masterdata.failed" output.unique.topic = ${job.env}".masterdata.unique" - output.duplicate.topic = ${job.env}".masterdata.duplicate" + output.duplicate.topic = ${job.env}".masterdata.failed" + output.denorm.topic 
= ${job.env}".masterdata.denorm" output.transform.topic = ${job.env}".masterdata.transform" stats.topic = ${job.env}".masterdata.stats" groupId = ${job.env}"-masterdata-pipeline-group" diff --git a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/function/MasterDataProcessorFunction.scala b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/function/MasterDataProcessorFunction.scala index a9f6c12b..a7ca7471 100644 --- a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/function/MasterDataProcessorFunction.scala +++ b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/function/MasterDataProcessorFunction.scala @@ -3,20 +3,20 @@ package org.sunbird.obsrv.pipeline.function import org.apache.flink.configuration.Configuration import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction import org.apache.flink.streaming.api.windowing.windows.TimeWindow +import org.json4s.native.JsonMethods._ import org.slf4j.LoggerFactory -import org.sunbird.obsrv.core.streaming.{Metrics, MetricsList, WindowBaseProcessFunction} +import org.sunbird.obsrv.core.model.{ErrorConstants, FunctionalError, Producer} +import org.sunbird.obsrv.core.streaming.Metrics +import org.sunbird.obsrv.core.util.JSONUtil +import org.sunbird.obsrv.model.DatasetModels.Dataset import org.sunbird.obsrv.pipeline.task.MasterDataProcessorConfig import org.sunbird.obsrv.pipeline.util.MasterDataCache import org.sunbird.obsrv.registry.DatasetRegistry -import org.json4s._ -import org.json4s.native.JsonMethods._ -import org.sunbird.obsrv.core.util.JSONUtil +import org.sunbird.obsrv.streaming.BaseDatasetWindowProcessFunction -import java.lang import scala.collection.mutable -import scala.collection.JavaConverters._ -class MasterDataProcessorFunction(config: MasterDataProcessorConfig) extends WindowBaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String](config) { +class MasterDataProcessorFunction(config: MasterDataProcessorConfig) extends BaseDatasetWindowProcessFunction(config) { private[this] val logger = LoggerFactory.getLogger(classOf[MasterDataProcessorFunction]) private[this] var masterDataCache: MasterDataCache = _ @@ -32,41 +32,33 @@ class MasterDataProcessorFunction(config: MasterDataProcessorConfig) extends Win masterDataCache.close() } - override def getMetricsList(): MetricsList = { - val metrics = List(config.successEventCount, config.systemEventCount, config.totalEventCount, config.successInsertCount, config.successUpdateCount, config.failedCount) - MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + override def getMetrics(): List[String] = { + List(config.successEventCount, config.systemEventCount, config.totalEventCount, config.successInsertCount, config.successUpdateCount) } - override def process(datasetId: String, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, elements: lang.Iterable[mutable.Map[String, AnyRef]], metrics: Metrics): Unit = { - implicit val jsonFormats: Formats = DefaultFormats.withLong + override def processWindow(dataset: Dataset, context: ProcessWindowFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef], String, TimeWindow]#Context, elements: List[mutable.Map[String, AnyRef]], metrics: Metrics): Unit = { - implicit class JsonHelper(json: JValue) { - def customExtract[T](path: String)(implicit mf: Manifest[T]): T = { - path.split('.').foldLeft(json)({ case (acc: 
JValue, node: String) => acc \ node }).extract[T] + metrics.incCounter(dataset.id, config.totalEventCount, elements.size.toLong) + masterDataCache.open(dataset) + val eventsMap = elements.map(msg => { + val event = JSONUtil.serialize(msg(config.CONST_EVENT)) + val json = parse(event, useBigIntForLong = false) + val node = JSONUtil.getKey(dataset.datasetConfig.key, event) + if (node.isMissingNode) { + markFailure(Some(dataset.id), msg, context, metrics, ErrorConstants.MISSING_DATASET_CONFIG_KEY, Producer.masterdataprocessor, FunctionalError.MissingMasterDatasetKey, datasetType = Some(dataset.datasetType)) } - } - - val eventsList = elements.asScala.toList - metrics.incCounter(datasetId, config.totalEventCount, eventsList.size.toLong) - val dataset = DatasetRegistry.getDataset(datasetId).get - val eventsMap = eventsList.map(msg => { - val json = parse(JSONUtil.serialize(msg(config.CONST_EVENT)), useBigIntForLong = false) - val key = json.customExtract[String](dataset.datasetConfig.key) - if (key == null) { - metrics.incCounter(datasetId, config.failedCount) - context.output(config.failedEventsTag, msg) - } - (key, json) + (node.asText(), json) }).toMap - val validEventsMap = eventsMap.filter(f => f._1 != null) + val validEventsMap = eventsMap.filter(f => f._1.nonEmpty) val result = masterDataCache.process(dataset, validEventsMap) - metrics.incCounter(datasetId, config.successInsertCount, result._1) - metrics.incCounter(datasetId, config.successUpdateCount, result._2) - metrics.incCounter(datasetId, config.successEventCount, eventsList.size.toLong) + metrics.incCounter(dataset.id, config.successInsertCount, result._1) + metrics.incCounter(dataset.id, config.successUpdateCount, result._2) + metrics.incCounter(dataset.id, config.successEventCount, validEventsMap.size.toLong) - eventsList.foreach(event => { + elements.foreach(event => { event.remove(config.CONST_EVENT) - context.output(config.successTag(), markComplete(event, dataset.dataVersion)) + markCompletion(dataset, super.markComplete(event, dataset.dataVersion), context, Producer.masterdataprocessor) }) } -} + +} \ No newline at end of file diff --git a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorConfig.scala b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorConfig.scala index afa42a6d..824edd29 100644 --- a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorConfig.scala +++ b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorConfig.scala @@ -13,28 +13,23 @@ class MasterDataProcessorConfig(override val config: Config) extends BaseJobConf private val serialVersionUID = 2905979434303791379L implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) - // Kafka Topics Configuration - val kafkaStatsTopic: String = config.getString("kafka.stats.topic") - val kafkaFailedTopic: String = config.getString("kafka.output.failed.topic") - // Metric List val totalEventCount = "total-event-count" val successEventCount = "success-event-count" val successInsertCount = "success-insert-count" val successUpdateCount = "success-update-count" - val failedCount = "event-failed-count" val windowTime: Int = config.getInt("task.window.time.in.seconds") val windowCount: Int = config.getInt("task.window.count") - val failedEventsTag: OutputTag[mutable.Map[String, AnyRef]] = 
OutputTag[mutable.Map[String, AnyRef]]("failed_events") private val statsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("processing_stats") // Functions val masterDataProcessFunction = "MasterDataProcessorFunction" - val failedEventsProducer = "MasterDataFailedEventsProducer" override def inputTopic(): String = config.getString("kafka.input.topic") override def inputConsumer(): String = "master-data-consumer" override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = statsOutputTag + + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") } diff --git a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorStreamTask.scala b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorStreamTask.scala index d1714a4b..7527a6c9 100644 --- a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorStreamTask.scala +++ b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/task/MasterDataProcessorStreamTask.scala @@ -1,18 +1,15 @@ package org.sunbird.obsrv.pipeline.task import com.typesafe.config.{Config, ConfigFactory} -import org.apache.flink.api.common.typeinfo.TypeInformation -import org.apache.flink.api.java.typeutils.TypeExtractor import org.apache.flink.api.java.utils.ParameterTool import org.apache.flink.streaming.api.datastream.DataStream import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment import org.apache.flink.streaming.api.windowing.time.Time import org.sunbird.obsrv.core.streaming.{BaseStreamTask, FlinkKafkaConnector} import org.sunbird.obsrv.core.util.{DatasetKeySelector, FlinkUtil, TumblingProcessingTimeCountWindows} -import org.sunbird.obsrv.extractor.functions.ExtractionFunction +import org.sunbird.obsrv.denormalizer.task.{DenormalizerConfig, DenormalizerStreamTask} import org.sunbird.obsrv.extractor.task.{ExtractorConfig, ExtractorStreamTask} import org.sunbird.obsrv.pipeline.function.MasterDataProcessorFunction -import org.sunbird.obsrv.preprocessor.functions.EventValidationFunction import org.sunbird.obsrv.preprocessor.task.{PipelinePreprocessorConfig, PipelinePreprocessorStreamTask} import org.sunbird.obsrv.transformer.task.{TransformerConfig, TransformerStreamTask} @@ -39,7 +36,7 @@ class MasterDataProcessorStreamTask(config: Config, masterDataConfig: MasterData /** * Created an overloaded process function to enable unit testing * - * @param env + * @param env StreamExecutionEnvironment */ def process(env: StreamExecutionEnvironment): Unit = { @@ -51,11 +48,14 @@ class MasterDataProcessorStreamTask(config: Config, masterDataConfig: MasterData val extractorTask = new ExtractorStreamTask(new ExtractorConfig(config), kafkaConnector) val preprocessorTask = new PipelinePreprocessorStreamTask(new PipelinePreprocessorConfig(config), kafkaConnector) + val denormalizerTask = new DenormalizerStreamTask(new DenormalizerConfig(config), kafkaConnector) val transformerTask = new TransformerStreamTask(new TransformerConfig(config), kafkaConnector) val transformedStream = transformerTask.processStream( - preprocessorTask.processStream( - extractorTask.processStream(dataStream) + denormalizerTask.processStream( + preprocessorTask.processStream( + extractorTask.processStream(dataStream) + ) ) ) @@ -65,12 +65,7 @@ class MasterDataProcessorStreamTask(config: Config, masterDataConfig: MasterData val 
processedStream = windowedStream.process(new MasterDataProcessorFunction(masterDataConfig)).name(masterDataConfig.masterDataProcessFunction) .uid(masterDataConfig.masterDataProcessFunction).setParallelism(masterDataConfig.downstreamOperatorsParallelism) - processedStream.getSideOutput(masterDataConfig.failedEventsTag).sinkTo(kafkaConnector.kafkaMapSink(masterDataConfig.kafkaFailedTopic)) - .name(masterDataConfig.failedEventsProducer).uid(masterDataConfig.failedEventsProducer).setParallelism(masterDataConfig.downstreamOperatorsParallelism) - - processedStream.getSideOutput(masterDataConfig.successTag()).sinkTo(kafkaConnector.kafkaMapSink(masterDataConfig.kafkaStatsTopic)) - .name("stats-producer").uid("stats-producer").setParallelism(masterDataConfig.downstreamOperatorsParallelism) - + addDefaultSinks(processedStream, masterDataConfig, kafkaConnector) processedStream.getSideOutput(masterDataConfig.successTag()) } } diff --git a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/util/MasterDataCache.scala b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/util/MasterDataCache.scala index 3449d819..e07f4399 100644 --- a/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/util/MasterDataCache.scala +++ b/pipeline/master-data-processor/src/main/scala/org/sunbird/obsrv/pipeline/util/MasterDataCache.scala @@ -1,12 +1,12 @@ package org.sunbird.obsrv.pipeline.util -import org.json4s.{DefaultFormats, Formats, JNothing, JValue} +import org.json4s.native.JsonMethods._ +import org.json4s.{JNothing, JValue} import org.slf4j.LoggerFactory import org.sunbird.obsrv.core.cache.RedisConnect -import org.sunbird.obsrv.model.DatasetModels.{Dataset, DatasetConfig} +import org.sunbird.obsrv.model.DatasetModels.Dataset import org.sunbird.obsrv.pipeline.task.MasterDataProcessorConfig import redis.clients.jedis.{Pipeline, Response} -import org.json4s.native.JsonMethods._ import scala.collection.mutable @@ -21,11 +21,17 @@ class MasterDataCache(val config: MasterDataProcessorConfig) { def open(datasets: List[Dataset]): Unit = { datasets.map(dataset => { + open(dataset) + }) + } + + def open(dataset: Dataset): Unit = { + if (!datasetPipelineMap.contains(dataset.id)) { val datasetConfig = dataset.datasetConfig val redisConnect = new RedisConnect(datasetConfig.redisDBHost.get, datasetConfig.redisDBPort.get, config.redisConnectionTimeout) val pipeline: Pipeline = redisConnect.getConnection(0).pipelined() datasetPipelineMap.put(dataset.id, pipeline) - }) + } } def process(dataset: Dataset, eventMap: Map[String, JValue]): (Int, Int) = { @@ -48,7 +54,7 @@ class MasterDataCache(val config: MasterDataProcessorConfig) { responses.map(f => (f._1, f._2.get())) } - private def updateCache(dataset: Dataset, dataFromCache: mutable.Map[String, String], eventMap: Map[String, JValue], pipeline: Pipeline ): Unit = { + private def updateCache(dataset: Dataset, dataFromCache: mutable.Map[String, String], eventMap: Map[String, JValue], pipeline: Pipeline): Unit = { pipeline.clear() pipeline.select(dataset.datasetConfig.redisDB.get) eventMap.foreach(f => { diff --git a/pipeline/master-data-processor/src/test/resources/test.conf b/pipeline/master-data-processor/src/test/resources/test.conf index 2fe3f3fb..2c8f0236 100644 --- a/pipeline/master-data-processor/src/test/resources/test.conf +++ b/pipeline/master-data-processor/src/test/resources/test.conf @@ -7,17 +7,17 @@ job { kafka { input.topic = ${job.env}".masterdata.ingest" output.raw.topic = ${job.env}".masterdata.raw" - 
output.extractor.duplicate.topic = ${job.env}".masterdata.extractor.duplicate" + output.extractor.duplicate.topic = ${job.env}".masterdata.failed" output.failed.topic = ${job.env}".masterdata.failed" - output.batch.failed.topic = ${job.env}".masterdata.extractor.failed" + output.batch.failed.topic = ${job.env}".masterdata.failed" event.max.size = "1048576" # Max is only 1MB - output.invalid.topic = ${job.env}".masterdata.invalid" + output.invalid.topic = ${job.env}".masterdata.failed" output.unique.topic = ${job.env}".masterdata.unique" - output.duplicate.topic = ${job.env}".masterdata.duplicate" + output.duplicate.topic = ${job.env}".masterdata.failed" + output.denorm.topic = ${job.env}".masterdata.denorm" output.transform.topic = ${job.env}".masterdata.transform" stats.topic = ${job.env}".masterdata.stats" groupId = ${job.env}"-masterdata-pipeline-group" - groupId = ${job.env}"-single-pipeline-group" producer { max-request-size = 5242880 } diff --git a/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/fixture/EventFixture.scala b/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/fixture/EventFixture.scala index 77304c8c..e48f8120 100644 --- a/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/fixture/EventFixture.scala +++ b/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/fixture/EventFixture.scala @@ -5,8 +5,6 @@ object EventFixture { val VALID_BATCH_EVENT_D3_INSERT = """{"dataset":"d3","id":"event1","events":[{"code":"HYUN-CRE-D6","manufacturer":"Hyundai","model":"Creta","variant":"SX(O)","modelYear":"2023","price":"2200000","currencyCode":"INR","currency":"Indian Rupee","transmission":"automatic","fuel":"Diesel"}]}""" val VALID_BATCH_EVENT_D3_INSERT_2 = """{"dataset":"d3","id":"event1","events":[{"code":"HYUN-TUC-D6","manufacturer":"Hyundai","model":"Tucson","variant":"Signature","modelYear":"2023","price":"4000000","currencyCode":"INR","currency":"Indian Rupee","transmission":"automatic","fuel":"Diesel"}]}""" val VALID_BATCH_EVENT_D3_UPDATE = """{"dataset":"d3","id":"event1","events":[{"code":"HYUN-CRE-D6","safety":"3 Star (Global NCAP)","seatingCapacity":5}]}""" - val VALID_BATCH_EVENT_D4 = """{"dataset":"d4","event":{"code":"JEEP-CP-D3","manufacturer":"Jeep","model":"Compass","variant":"Model S (O) Diesel 4x4 AT","modelYear":"2023","price":"3800000","currencyCode":"INR","currency":"Indian Rupee","transmission":"automatic","fuel":"Diesel","safety":"5 Star (Euro NCAP)","seatingCapacity":5}}""" - - + val MISSING_DATA_KEY_EVENT_D4 = """{"dataset":"d5","event":{"code1":"JEEP-CP-D3","manufacturer":"Jeep","model":"Compass","variant":"Model S (O) Diesel 4x4 AT","modelYear":"2023","price":"3800000","currencyCode":"INR","currency":"Indian Rupee","transmission":"automatic","fuel":"Diesel","safety":"5 Star (Euro NCAP)","seatingCapacity":5}}""" } diff --git a/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/pipeline/MasterDataProcessorStreamTaskTestSpec.scala b/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/pipeline/MasterDataProcessorStreamTaskTestSpec.scala index ee23eecc..575e2228 100644 --- a/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/pipeline/MasterDataProcessorStreamTaskTestSpec.scala +++ b/pipeline/master-data-processor/src/test/scala/org/sunbird/obsrv/pipeline/MasterDataProcessorStreamTaskTestSpec.scala @@ -9,8 +9,10 @@ import org.apache.kafka.common.serialization.StringDeserializer import org.scalatest.Matchers._ import org.sunbird.obsrv.BaseMetricsReporter import 
org.sunbird.obsrv.core.cache.RedisConnect +import org.sunbird.obsrv.core.model.ErrorConstants +import org.sunbird.obsrv.core.model.Models.SystemEvent import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector -import org.sunbird.obsrv.core.util.{FlinkUtil, PostgresConnect} +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnect} import org.sunbird.obsrv.fixture.EventFixture import org.sunbird.obsrv.pipeline.task.{MasterDataProcessorConfig, MasterDataProcessorStreamTask} import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry @@ -57,15 +59,23 @@ class MasterDataProcessorStreamTaskTestSpec extends BaseSpecWithDatasetRegistry EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.VALID_BATCH_EVENT_D3_INSERT_2) EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.VALID_BATCH_EVENT_D4) EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.VALID_BATCH_EVENT_D3_UPDATE) + EmbeddedKafka.publishStringMessageToKafka(config.getString("kafka.input.topic"), EventFixture.MISSING_DATA_KEY_EVENT_D4) flinkCluster.before() } private def insertTestData(postgresConnect: PostgresConnect): Unit = { - postgresConnect.execute("insert into datasets(id, type, extraction_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d3', 'master-dataset', '{\"is_batch_event\": true, \"extraction_key\": \"events\"}', '{\"topic\":\"d3-events\"}', '{\"data_key\":\"code\",\"timestamp_key\":\"date\",\"entry_topic\":\"masterdata.ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":6340,\"redis_db\":3}', 'ACTIVE', 'System', 'System', now(), now());") - postgresConnect.execute("insert into datasets(id, type, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d4', 'master-dataset', '{\"topic\":\"d4-events\"}', '{\"data_key\":\"code\",\"timestamp_key\":\"date\",\"entry_topic\":\"masterdata-ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":6340,\"redis_db\":4}', 'ACTIVE', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, extraction_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d3', 'master-dataset', '{\"is_batch_event\": true, \"extraction_key\": \"events\"}', '{\"topic\":\"d3-events\"}', '{\"data_key\":\"code\",\"timestamp_key\":\"date\",\"entry_topic\":\"masterdata.ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":"+masterDataConfig.redisPort+",\"redis_db\":3}', 'Live', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d4', 'master-dataset', '{\"topic\":\"d4-events\"}', '{\"data_key\":\"code\",\"timestamp_key\":\"date\",\"entry_topic\":\"masterdata-ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":"+masterDataConfig.redisPort+",\"redis_db\":4}', 'Live', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d5', 'master-dataset', '{\"topic\":\"d4-events\"}', '{\"data_key\":\"code\",\"timestamp_key\":\"date\",\"entry_topic\":\"masterdata-ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":"+masterDataConfig.redisPort+",\"redis_db\":4}', 'Live', 'System', 'System', now(), 
now());") } override def afterAll(): Unit = { + val redisConnection = new RedisConnect(masterDataConfig.redisHost, masterDataConfig.redisPort, masterDataConfig.redisConnectionTimeout) + redisConnection.getConnection(config.getInt("redis.database.extractor.duplication.store.id")).flushAll() + redisConnection.getConnection(config.getInt("redis.database.preprocessor.duplication.store.id")).flushAll() + redisConnection.getConnection(3).flushAll() + redisConnection.getConnection(4).flushAll() + super.afterAll() flinkCluster.after() EmbeddedKafka.stop() @@ -73,7 +83,7 @@ class MasterDataProcessorStreamTaskTestSpec extends BaseSpecWithDatasetRegistry def createTestTopics(): Unit = { List( - config.getString("kafka.stats.topic"), config.getString("kafka.output.transform.topic"), config.getString("kafka.output.duplicate.topic"), + config.getString("kafka.output.system.event.topic"), config.getString("kafka.output.transform.topic"), config.getString("kafka.output.denorm.topic"), config.getString("kafka.output.duplicate.topic"), config.getString("kafka.output.unique.topic"), config.getString("kafka.output.invalid.topic"), config.getString("kafka.output.batch.failed.topic"), config.getString("kafka.output.failed.topic"), config.getString("kafka.output.extractor.duplicate.topic"), config.getString("kafka.output.raw.topic"), config.getString("kafka.input.topic") ).foreach(EmbeddedKafka.createCustomTopic(_)) @@ -88,11 +98,32 @@ class MasterDataProcessorStreamTaskTestSpec extends BaseSpecWithDatasetRegistry env.execute(masterDataConfig.jobName) } - val input = EmbeddedKafka.consumeNumberMessagesFrom[String](config.getString("kafka.stats.topic"), 4, timeout = 30.seconds) - input.size should be (4) + val sysEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](config.getString("kafka.output.system.event.topic"), 8, timeout = 30.seconds) + sysEvents.size should be(8) + + sysEvents.foreach(se => { + val event = JSONUtil.deserialize[SystemEvent](se) + val error = event.data.error + if (event.ctx.dataset.getOrElse("ALL").equals("ALL")) + event.ctx.dataset_type should be(None) + else if (error.isDefined) { + val errorCode = error.get.error_code + if (errorCode.equals(ErrorConstants.MISSING_DATASET_ID.errorCode) || + errorCode.equals(ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode) || + errorCode.equals(ErrorConstants.EVENT_MISSING.errorCode)) { + event.ctx.dataset_type should be(None) + } + } + else + event.ctx.dataset_type should be(Some("master-dataset")) + }) + + val failedEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](masterDataConfig.kafkaFailedTopic, 1, timeout = 30.seconds) + failedEvents.size should be(1) val mutableMetricsMap = mutable.Map[String, Long](); BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### MasterDataProcessorStreamTaskTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) masterDataConfig.successTag().getId should be ("processing_stats") @@ -106,6 +137,9 @@ class MasterDataProcessorStreamTaskTestSpec extends BaseSpecWithDatasetRegistry mutableMetricsMap(s"${masterDataConfig.jobName}.d4.${masterDataConfig.successInsertCount}") should be(1) mutableMetricsMap(s"${masterDataConfig.jobName}.d4.${masterDataConfig.successUpdateCount}") should be(0) + mutableMetricsMap(s"${masterDataConfig.jobName}.d5.${masterDataConfig.totalEventCount}") should be(1) + mutableMetricsMap(s"${masterDataConfig.jobName}.d5.${masterDataConfig.eventFailedMetricsCount}") should be(1) + 
val redisConnection = new RedisConnect(masterDataConfig.redisHost, masterDataConfig.redisPort, masterDataConfig.redisConnectionTimeout) val jedis1 = redisConnection.getConnection(3) val event1 = jedis1.get("HYUN-CRE-D6") diff --git a/pipeline/pipeline-merged/pom.xml b/pipeline/pipeline-merged/pom.xml index e19bc800..f3db71fe 100644 --- a/pipeline/pipeline-merged/pom.xml +++ b/pipeline/pipeline-merged/pom.xml @@ -4,9 +4,6 @@ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 4.0.0 - - 3.0.1 - org.sunbird.obsrv @@ -134,9 +131,9 @@ tests - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 test diff --git a/pipeline/pipeline-merged/src/main/resources/merged-pipeline.conf b/pipeline/pipeline-merged/src/main/resources/merged-pipeline.conf index 7746a8e9..75f43376 100644 --- a/pipeline/pipeline-merged/src/main/resources/merged-pipeline.conf +++ b/pipeline/pipeline-merged/src/main/resources/merged-pipeline.conf @@ -3,15 +3,14 @@ include "baseconfig.conf" kafka { input.topic = ${job.env}".ingest" output.raw.topic = ${job.env}".raw" - output.extractor.duplicate.topic = ${job.env}".extractor.duplicate" - output.failed.topic = ${job.env}".failed" - output.batch.failed.topic = ${job.env}".extractor.failed" + output.extractor.duplicate.topic = ${job.env}".failed" + output.batch.failed.topic = ${job.env}".failed" event.max.size = "1048576" # Max is only 1MB - output.invalid.topic = ${job.env}".invalid" + output.invalid.topic = ${job.env}".failed" output.unique.topic = ${job.env}".unique" - output.duplicate.topic = ${job.env}".duplicate" + output.duplicate.topic = ${job.env}".failed" output.denorm.topic = ${job.env}".denorm" - output.denorm.failed.topic = ${job.env}".denorm.failed" + output.denorm.failed.topic = ${job.env}".failed" output.transform.topic = ${job.env}".transform" stats.topic = ${job.env}".stats" groupId = ${job.env}"-single-pipeline-group" diff --git a/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineConfig.scala b/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineConfig.scala index b37662af..c6df88d3 100644 --- a/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineConfig.scala +++ b/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineConfig.scala @@ -14,19 +14,11 @@ class MergedPipelineConfig(override val config: Config) extends BaseJobConfig[mu implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) // Kafka Topics Configuration - val kafkaInputTopic: String = config.getString("kafka.input.topic") - val kafkaStatsTopic: String = config.getString("kafka.stats.topic") + override def inputTopic(): String = config.getString("kafka.input.topic") - val statsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("processing_stats") - - // Functions - val druidRouterFunction = "DruidRouterFunction" + override def inputConsumer(): String = "pipeline-consumer" - // Producers - val druidRouterProducer = "druid-router-sink" - val processingStatsProducer = "processing-stats-sink" + override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("processing_stats") - override def inputTopic(): String = kafkaInputTopic - override def inputConsumer(): String = "pipeline-consumer" - 
override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = statsOutputTag + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") } diff --git a/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineStreamTask.scala b/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineStreamTask.scala index 93c8ccca..f7d8dce9 100644 --- a/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineStreamTask.scala +++ b/pipeline/pipeline-merged/src/main/scala/org/sunbird/obsrv/pipeline/task/MergedPipelineStreamTask.scala @@ -9,7 +9,7 @@ import org.sunbird.obsrv.core.util.FlinkUtil import org.sunbird.obsrv.denormalizer.task.{DenormalizerConfig, DenormalizerStreamTask} import org.sunbird.obsrv.extractor.task.{ExtractorConfig, ExtractorStreamTask} import org.sunbird.obsrv.preprocessor.task.{PipelinePreprocessorConfig, PipelinePreprocessorStreamTask} -import org.sunbird.obsrv.router.task.{DruidRouterConfig, DruidRouterStreamTask} +import org.sunbird.obsrv.router.task.{DruidRouterConfig, DynamicRouterStreamTask} import org.sunbird.obsrv.transformer.task.{TransformerConfig, TransformerStreamTask} import java.io.File @@ -34,7 +34,7 @@ class MergedPipelineStreamTask(config: Config, mergedPipelineConfig: MergedPipel /** * Created an overloaded process function to enable unit testing - * @param env + * @param env StreamExecutionEnvironment */ def process(env: StreamExecutionEnvironment): Unit = { @@ -48,7 +48,7 @@ class MergedPipelineStreamTask(config: Config, mergedPipelineConfig: MergedPipel val preprocessorTask = new PipelinePreprocessorStreamTask(new PipelinePreprocessorConfig(config), kafkaConnector) val denormalizerTask = new DenormalizerStreamTask(new DenormalizerConfig(config), kafkaConnector) val transformerTask = new TransformerStreamTask(new TransformerConfig(config), kafkaConnector) - val routerTask = new DruidRouterStreamTask(new DruidRouterConfig(config), kafkaConnector) + val routerTask = new DynamicRouterStreamTask(new DruidRouterConfig(config), kafkaConnector) routerTask.processStream( transformerTask.processStream( diff --git a/pipeline/pipeline-merged/src/test/resources/test.conf b/pipeline/pipeline-merged/src/test/resources/test.conf index 6c8175d1..d2b959c3 100644 --- a/pipeline/pipeline-merged/src/test/resources/test.conf +++ b/pipeline/pipeline-merged/src/test/resources/test.conf @@ -7,15 +7,14 @@ job { kafka { input.topic = ${job.env}".ingest" output.raw.topic = ${job.env}".raw" - output.extractor.duplicate.topic = ${job.env}".extractor.duplicate" - output.failed.topic = ${job.env}".failed" - output.batch.failed.topic = ${job.env}".extractor.failed" + output.extractor.duplicate.topic = ${job.env}".failed" + output.batch.failed.topic = ${job.env}".failed" event.max.size = "1048576" # Max is only 1MB - output.invalid.topic = ${job.env}".invalid" + output.invalid.topic = ${job.env}".failed" output.unique.topic = ${job.env}".unique" - output.duplicate.topic = ${job.env}".duplicate" + output.duplicate.topic = ${job.env}".failed" output.denorm.topic = ${job.env}".denorm" - output.denorm.failed.topic = ${job.env}".denorm.failed" + output.denorm.failed.topic = ${job.env}".failed" output.transform.topic = ${job.env}".transform" stats.topic = ${job.env}".stats" groupId = ${job.env}"-single-pipeline-group" diff --git a/pipeline/pipeline-merged/src/test/scala/org/sunbird/obsrv/pipeline/MergedPipelineStreamTaskTestSpec.scala 
b/pipeline/pipeline-merged/src/test/scala/org/sunbird/obsrv/pipeline/MergedPipelineStreamTaskTestSpec.scala index 40b31493..f3cf86b2 100644 --- a/pipeline/pipeline-merged/src/test/scala/org/sunbird/obsrv/pipeline/MergedPipelineStreamTaskTestSpec.scala +++ b/pipeline/pipeline-merged/src/test/scala/org/sunbird/obsrv/pipeline/MergedPipelineStreamTaskTestSpec.scala @@ -8,11 +8,14 @@ import org.apache.flink.test.util.MiniClusterWithClientResource import org.apache.kafka.common.serialization.StringDeserializer import org.scalatest.Matchers._ import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.core.cache.RedisConnect import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector -import org.sunbird.obsrv.core.util.FlinkUtil +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil} +import org.sunbird.obsrv.extractor.task.ExtractorConfig import org.sunbird.obsrv.fixture.EventFixture import org.sunbird.obsrv.pipeline.task.{MergedPipelineConfig, MergedPipelineStreamTask} import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry +import org.sunbird.obsrv.transformer.task.TransformerConfig import scala.collection.mutable import scala.concurrent.ExecutionContext.Implicits.global @@ -28,7 +31,6 @@ class MergedPipelineStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { .build) val mergedPipelineConfig = new MergedPipelineConfig(config) - //val mockKafkaUtil: FlinkKafkaConnector = mock[FlinkKafkaConnector](Mockito.withSettings().serializable()) val kafkaConnector = new FlinkKafkaConnector(mergedPipelineConfig) val customKafkaConsumerProperties: Map[String, String] = Map[String, String]("auto.offset.reset" -> "earliest", "group.id" -> "test-event-schema-group") implicit val embeddedKafkaConfig: EmbeddedKafkaConfig = @@ -63,6 +65,9 @@ class MergedPipelineStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { } override def afterAll(): Unit = { + val redisConnection = new RedisConnect(mergedPipelineConfig.redisHost, mergedPipelineConfig.redisPort, mergedPipelineConfig.redisConnectionTimeout) + redisConnection.getConnection(config.getInt("redis.database.extractor.duplication.store.id")).flushAll() + redisConnection.getConnection(config.getInt("redis.database.preprocessor.duplication.store.id")).flushAll() super.afterAll() flinkCluster.after() EmbeddedKafka.stop() @@ -70,10 +75,11 @@ class MergedPipelineStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { def createTestTopics(): Unit = { List( - config.getString("kafka.stats.topic"), config.getString("kafka.output.transform.topic"), config.getString("kafka.output.denorm.failed.topic"), + config.getString("kafka.output.system.event.topic"), config.getString("kafka.output.transform.topic"), config.getString("kafka.output.denorm.failed.topic"), config.getString("kafka.output.denorm.topic"), config.getString("kafka.output.duplicate.topic"), config.getString("kafka.output.unique.topic"), config.getString("kafka.output.invalid.topic"), config.getString("kafka.output.batch.failed.topic"), config.getString("kafka.output.failed.topic"), - config.getString("kafka.output.extractor.duplicate.topic"), config.getString("kafka.output.raw.topic"), config.getString("kafka.input.topic") + config.getString("kafka.output.extractor.duplicate.topic"), config.getString("kafka.output.raw.topic"), config.getString("kafka.input.topic"), + "d1-events", "d2-events" ).foreach(EmbeddedKafka.createCustomTopic(_)) } @@ -84,20 +90,70 @@ class MergedPipelineStreamTaskTestSpec extends BaseSpecWithDatasetRegistry { task.process(env) Future { 
env.execute(mergedPipelineConfig.jobName) - Thread.sleep(10000) } - val stats = EmbeddedKafka.consumeNumberMessagesFrom[String](mergedPipelineConfig.kafkaStatsTopic, 1, timeout = 20.seconds) - stats.foreach(Console.println("Event:", _)) + try { + val d1Events = EmbeddedKafka.consumeNumberMessagesFrom[String]("d1-events", 1, timeout = 30.seconds) + d1Events.size should be (1) + val d2Events = EmbeddedKafka.consumeNumberMessagesFrom[String]("d2-events", 1, timeout = 30.seconds) + d2Events.size should be (1) + } catch { + case ex: Exception => ex.printStackTrace() + } + try { + val systemEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](config.getString("kafka.output.system.event.topic"), 7, timeout = 30.seconds) + systemEvents.size should be(7) + } catch { + case ex: Exception => ex.printStackTrace() + } val mutableMetricsMap = mutable.Map[String, Long](); BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### MergedPipelineStreamTaskTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) + + mutableMetricsMap("ExtractorJob.d1.extractor-total-count") should be(4) + mutableMetricsMap("ExtractorJob.d1.extractor-duplicate-count") should be(1) + mutableMetricsMap("ExtractorJob.d1.extractor-event-count") should be(1) + mutableMetricsMap("ExtractorJob.d1.extractor-success-count") should be(1) + mutableMetricsMap("ExtractorJob.d1.extractor-failed-count") should be(2) + mutableMetricsMap("ExtractorJob.d2.extractor-total-count") should be(2) + mutableMetricsMap("ExtractorJob.d2.failed-event-count") should be(1) + mutableMetricsMap("ExtractorJob.d2.extractor-skipped-count") should be(1) + + mutableMetricsMap("PipelinePreprocessorJob.d1.validator-total-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d1.validator-success-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d1.dedup-total-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d1.dedup-success-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d2.validator-total-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d2.validator-skipped-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d2.dedup-total-count") should be(1) + mutableMetricsMap("PipelinePreprocessorJob.d2.dedup-skipped-count") should be(1) + + mutableMetricsMap("DenormalizerJob.d1.denorm-total") should be(1) + mutableMetricsMap("DenormalizerJob.d1.denorm-failed") should be(1) + mutableMetricsMap("DenormalizerJob.d2.denorm-total") should be(1) + mutableMetricsMap("DenormalizerJob.d2.denorm-skipped") should be(1) + + mutableMetricsMap("TransformerJob.d1.transform-total-count") should be(1) + mutableMetricsMap("TransformerJob.d1.transform-success-count") should be(1) + mutableMetricsMap("TransformerJob.d2.transform-total-count") should be(1) + mutableMetricsMap("TransformerJob.d2.transform-skipped-count") should be(1) + + mutableMetricsMap("DruidRouterJob.d1.router-total-count") should be(1) + mutableMetricsMap("DruidRouterJob.d1.router-success-count") should be(1) + mutableMetricsMap("DruidRouterJob.d2.router-total-count") should be(1) + mutableMetricsMap("DruidRouterJob.d2.router-success-count") should be(1) + + val extractorConfig = new ExtractorConfig(config) + extractorConfig.inputTopic() should be (config.getString("kafka.input.topic")) + extractorConfig.inputConsumer() should be ("extractor-consumer") + + val transformerConfig = new TransformerConfig(config) + 
transformerConfig.inputTopic() should be(config.getString("kafka.input.topic")) + transformerConfig.inputConsumer() should be("transformer-consumer") - mutableMetricsMap.foreach(println(_)) - //TODO: Add assertions mergedPipelineConfig.successTag().getId should be ("processing_stats") - + mergedPipelineConfig.failedEventsOutputTag().getId should be ("failed-events") } - } diff --git a/pipeline/pom.xml b/pipeline/pom.xml index 2934fa49..25d19b66 100644 --- a/pipeline/pom.xml +++ b/pipeline/pom.xml @@ -5,10 +5,6 @@ http://maven.apache.org/maven-v4_0_0.xsd"> 4.0.0 - - 3.0.0 - - org.sunbird.obsrv pipeline 1.0 @@ -26,7 +22,7 @@ transformer druid-router pipeline-merged - kafka-connector + kafka-connector master-data-processor diff --git a/pipeline/preprocessor/pom.xml b/pipeline/preprocessor/pom.xml index 96171103..1fb410ea 100644 --- a/pipeline/preprocessor/pom.xml +++ b/pipeline/preprocessor/pom.xml @@ -120,9 +120,9 @@ test - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 test diff --git a/pipeline/preprocessor/src/main/resources/pipeline-preprocessor.conf b/pipeline/preprocessor/src/main/resources/pipeline-preprocessor.conf index a539195b..7e845e1d 100644 --- a/pipeline/preprocessor/src/main/resources/pipeline-preprocessor.conf +++ b/pipeline/preprocessor/src/main/resources/pipeline-preprocessor.conf @@ -2,10 +2,9 @@ include "baseconfig.conf" kafka { input.topic = ${job.env}".raw" - output.failed.topic = ${job.env}".failed" - output.invalid.topic = ${job.env}".invalid" + output.invalid.topic = ${job.env}".failed" output.unique.topic = ${job.env}".unique" - output.duplicate.topic = ${job.env}".duplicate" + output.duplicate.topic = ${job.env}".failed" groupId = ${job.env}"-pipeline-preprocessor-group" } diff --git a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/DeduplicationFunction.scala b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/DeduplicationFunction.scala index 93522e7e..21e32b2e 100644 --- a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/DeduplicationFunction.scala +++ b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/DeduplicationFunction.scala @@ -5,27 +5,25 @@ import org.apache.flink.configuration.Configuration import org.apache.flink.streaming.api.functions.ProcessFunction import org.slf4j.LoggerFactory import org.sunbird.obsrv.core.cache._ -import org.sunbird.obsrv.core.model.ErrorConstants +import org.sunbird.obsrv.core.exception.ObsrvException +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model._ import org.sunbird.obsrv.core.streaming._ import org.sunbird.obsrv.core.util.JSONUtil +import org.sunbird.obsrv.model.DatasetModels.Dataset import org.sunbird.obsrv.preprocessor.task.PipelinePreprocessorConfig -import org.sunbird.obsrv.registry.DatasetRegistry +import org.sunbird.obsrv.streaming.BaseDatasetProcessFunction import scala.collection.mutable -class DeduplicationFunction(config: PipelinePreprocessorConfig) - (implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]]) - extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) { +class DeduplicationFunction(config: PipelinePreprocessorConfig)(implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]]) + extends BaseDatasetProcessFunction(config) with BaseDeduplication { - @transient private var dedupEngine: DedupEngine = null private[this] val logger = 
LoggerFactory.getLogger(classOf[DeduplicationFunction]) + @transient private var dedupEngine: DedupEngine = null - override def getMetricsList(): MetricsList = { - val metrics = List( - config.duplicationTotalMetricsCount, config.duplicationSkippedEventMetricsCount, config.duplicationEventMetricsCount, - config.duplicationProcessedEventMetricsCount, config.eventFailedMetricsCount - ) - MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + override def getMetrics(): List[String] = { + List(config.duplicationTotalMetricsCount, config.duplicationSkippedEventMetricsCount, config.duplicationEventMetricsCount, config.duplicationProcessedEventMetricsCount) } override def open(parameters: Configuration): Unit = { @@ -39,25 +37,22 @@ class DeduplicationFunction(config: PipelinePreprocessorConfig) dedupEngine.closeConnectionPool() } - override def processElement(msg: mutable.Map[String, AnyRef], + override def processElement(dataset: Dataset, msg: mutable.Map[String, AnyRef], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { - metrics.incCounter(config.defaultDatasetID, config.duplicationTotalMetricsCount) - val datasetId = msg.get(config.CONST_DATASET) - val datasetOpt = DatasetRegistry.getDataset(datasetId.get.asInstanceOf[String]) - val dataset = datasetOpt.get + metrics.incCounter(dataset.id, config.duplicationTotalMetricsCount) val dedupConfig = dataset.dedupConfig if (dedupConfig.isDefined && dedupConfig.get.dropDuplicates.get) { val event = msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]] val eventAsText = JSONUtil.serialize(event) - val isDup = isDuplicate(dataset.id, dedupConfig.get.dedupKey, eventAsText, context, config)(dedupEngine) + val isDup = isDuplicate(dataset, dedupConfig.get.dedupKey, eventAsText, context) if (isDup) { metrics.incCounter(dataset.id, config.duplicationEventMetricsCount) - context.output(config.duplicateEventsOutputTag, markFailed(msg, ErrorConstants.DUPLICATE_EVENT_FOUND, "Deduplication")) + context.output(config.duplicateEventsOutputTag, markFailed(msg, ErrorConstants.DUPLICATE_EVENT_FOUND, Producer.dedup)) } else { metrics.incCounter(dataset.id, config.duplicationProcessedEventMetricsCount) - context.output(config.uniqueEventsOutputTag, markSuccess(msg, "Deduplication")) + context.output(config.uniqueEventsOutputTag, markSuccess(msg, Producer.dedup)) } } else { metrics.incCounter(dataset.id, config.duplicationSkippedEventMetricsCount) @@ -65,4 +60,21 @@ class DeduplicationFunction(config: PipelinePreprocessorConfig) } } + private def isDuplicate(dataset: Dataset, dedupKey: Option[String], event: String, + context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context): Boolean = { + try { + super.isDuplicate(dataset.id, dedupKey, event)(dedupEngine) + } catch { + case ex: ObsrvException => + val sysEvent = JSONUtil.serialize(SystemEvent( + EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.dedup)), dataset = Some(dataset.id), dataset_type = Some(dataset.datasetType)), + data = EData(error = Some(ErrorLog(pdata_id = Producer.dedup, pdata_status = StatusCode.skipped, error_type = FunctionalError.DedupFailed, error_code = ex.error.errorCode, error_message = ex.error.errorMsg, error_level = ErrorLevel.warn))) + )) + logger.warn("BaseDeduplication:isDuplicate() | Exception", ex) + context.output(config.systemEventsOutputTag, sysEvent) + false + } + } + } diff --git 
a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/EventValidationFunction.scala b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/EventValidationFunction.scala index 31e659ab..93cfefef 100644 --- a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/EventValidationFunction.scala +++ b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/functions/EventValidationFunction.scala @@ -5,109 +5,153 @@ import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.configuration.Configuration import org.apache.flink.streaming.api.functions.ProcessFunction import org.slf4j.LoggerFactory -import org.sunbird.obsrv.core.exception.ObsrvException -import org.sunbird.obsrv.core.model.ErrorConstants -import org.sunbird.obsrv.core.model.Models.{PData, SystemEvent} -import org.sunbird.obsrv.core.streaming.{BaseProcessFunction, Metrics, MetricsList} +import org.sunbird.obsrv.core.model.FunctionalError.FunctionalError +import org.sunbird.obsrv.core.model.Models._ +import org.sunbird.obsrv.core.model._ +import org.sunbird.obsrv.core.streaming.Metrics import org.sunbird.obsrv.core.util.JSONUtil import org.sunbird.obsrv.model.DatasetModels.Dataset +import org.sunbird.obsrv.model.{DatasetStatus, ValidationMode} import org.sunbird.obsrv.preprocessor.task.PipelinePreprocessorConfig -import org.sunbird.obsrv.preprocessor.util.SchemaValidator +import org.sunbird.obsrv.preprocessor.util.{SchemaValidator, ValidationMsg} import org.sunbird.obsrv.registry.DatasetRegistry +import org.sunbird.obsrv.streaming.BaseDatasetProcessFunction import scala.collection.mutable -class EventValidationFunction(config: PipelinePreprocessorConfig, - @transient var schemaValidator: SchemaValidator = null) - (implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]]) - extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) { +class EventValidationFunction(config: PipelinePreprocessorConfig)(implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]]) + extends BaseDatasetProcessFunction(config) { private[this] val logger = LoggerFactory.getLogger(classOf[EventValidationFunction]) - override def getMetricsList(): MetricsList = { - val metrics = List(config.validationTotalMetricsCount, config.validationFailureMetricsCount, - config.validationSuccessMetricsCount, config.validationSkipMetricsCount, config.eventFailedMetricsCount) - MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + @transient private var schemaValidator: SchemaValidator = null + override def getMetrics(): List[String] = { + List(config.validationTotalMetricsCount, config.validationFailureMetricsCount, config.validationSuccessMetricsCount, + config.validationSkipMetricsCount, config.eventIgnoredMetricsCount) } override def open(parameters: Configuration): Unit = { super.open(parameters) - if (schemaValidator == null) { - schemaValidator = new SchemaValidator(config) - schemaValidator.loadDataSchemas(DatasetRegistry.getAllDatasets(config.datasetType())) - } + schemaValidator = new SchemaValidator() + schemaValidator.loadDataSchemas(DatasetRegistry.getAllDatasets(config.datasetType())) } override def close(): Unit = { super.close() } - override def processElement(msg: mutable.Map[String, AnyRef], - context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, + override def processElement(dataset: Dataset, msg: mutable.Map[String, AnyRef], + 
ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { - metrics.incCounter(config.defaultDatasetID, config.validationTotalMetricsCount) - val datasetId = msg.get(config.CONST_DATASET) - if (datasetId.isEmpty) { - context.output(config.failedEventsOutputTag, markFailed(msg, ErrorConstants.MISSING_DATASET_ID, config.jobName)) - metrics.incCounter(config.defaultDatasetID, config.eventFailedMetricsCount) - return - } - val datasetOpt = DatasetRegistry.getDataset(datasetId.get.asInstanceOf[String]) - if (datasetOpt.isEmpty) { - context.output(config.failedEventsOutputTag, markFailed(msg, ErrorConstants.MISSING_DATASET_CONFIGURATION, config.jobName)) - metrics.incCounter(config.defaultDatasetID, config.eventFailedMetricsCount) - return - } - val dataset = datasetOpt.get - if (!super.containsEvent(msg)) { - metrics.incCounter(dataset.id, config.eventFailedMetricsCount) - context.output(config.failedEventsOutputTag, markFailed(msg, ErrorConstants.EVENT_MISSING, config.jobName)) + metrics.incCounter(dataset.id, config.validationTotalMetricsCount) + if (dataset.status != DatasetStatus.Live) { + metrics.incCounter(dataset.id, config.eventIgnoredMetricsCount) return } val validationConfig = dataset.validationConfig if (validationConfig.isDefined && validationConfig.get.validate.get) { - validateEvent(dataset, msg, context, metrics) + schemaValidator.loadDataSchema(dataset) + validateEvent(dataset, msg, ctx, metrics) } else { metrics.incCounter(dataset.id, config.validationSkipMetricsCount) - context.output(config.validEventsOutputTag, markSkipped(msg, "EventValidation")) + ctx.output(config.validEventsOutputTag, markSkipped(msg, Producer.validator)) } } private def validateEvent(dataset: Dataset, msg: mutable.Map[String, AnyRef], - context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, + ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { - val event = msg(config.CONST_EVENT).asInstanceOf[Map[String, AnyRef]] - try { - if (schemaValidator.schemaFileExists(dataset)) { - val validationReport = schemaValidator.validate(dataset.id, event) - if (validationReport.isSuccess) { - onValidationSuccess(dataset, msg, metrics, context) - } else { - onValidationFailure(dataset, msg, metrics, context, validationReport) + if (schemaValidator.schemaFileExists(dataset)) { + val validationReport = schemaValidator.validate(dataset.id, event) + onValidationResult(dataset, msg, metrics, ctx, validationReport) + } else { + metrics.incCounter(dataset.id, config.validationSkipMetricsCount) + ctx.output(config.validEventsOutputTag, markSkipped(msg, Producer.validator)) + } + } + + private def onValidationResult(dataset: Dataset, event: mutable.Map[String, AnyRef], metrics: Metrics, + ctx: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, + validationReport: ProcessingReport): Unit = { + if (validationReport.isSuccess) { + validationSuccess(dataset, event, metrics, ctx) + } else { + val validationFailureMsgs = schemaValidator.getValidationMessages(report = validationReport) + val validationFailureCount = validationFailureMsgs.size + val additionalFieldsCount = validationFailureMsgs.count(f => "additionalProperties".equals(f.keyword)) + if (validationFailureCount == additionalFieldsCount) { + dataset.validationConfig.get.mode.get match { + case ValidationMode.Strict => + validationFailure(dataset, event, metrics, ctx, 
validationFailureMsgs) + case ValidationMode.IgnoreNewFields => + validationSuccess(dataset, event, metrics, ctx) + case ValidationMode.DiscardNewFields => + // TODO: [P2] Write logic to discard the fields from the pipeline. Fields are anyway discarded from Druid but not from data lake + validationSuccess(dataset, event, metrics, ctx) } + } else { + validationFailure(dataset, event, metrics, ctx, validationFailureMsgs) } - } catch { - case ex: ObsrvException => - metrics.incCounter(dataset.id, config.validationFailureMetricsCount) - context.output(config.failedEventsOutputTag, markFailed(msg, ex.error, "EventValidation")) } } - private def onValidationSuccess(dataset: Dataset, event: mutable.Map[String, AnyRef], metrics: Metrics, - context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context): Unit = { + private def getSystemEvent(dataset: Dataset, functionalError: FunctionalError, failedCount: Int): String = { + JSONUtil.serialize(SystemEvent(EventID.METRIC, + ctx = ContextData(module = ModuleID.processing, pdata = PData(config.jobName, PDataType.flink, Some(Producer.validator)), dataset = Some(dataset.id), dataset_type = Some(dataset.datasetType)), + data = EData( + error = Some(ErrorLog(pdata_id = Producer.validator, pdata_status = StatusCode.failed, error_type = functionalError, error_code = ErrorConstants.SCHEMA_VALIDATION_FAILED.errorCode, error_message = ErrorConstants.SCHEMA_VALIDATION_FAILED.errorMsg, error_level = ErrorLevel.warn, error_count = Some(failedCount))), + pipeline_stats = None + ) + )) + } + + private def generateSystemEvents(dataset: Dataset, validationFailureMsgs: List[ValidationMsg], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context): Unit = { + + val reqFailedCount = validationFailureMsgs.count(f => "required".equals(f.keyword)) + val typeFailedCount = validationFailureMsgs.count(f => "type".equals(f.keyword)) + val addTypeFailedCount = validationFailureMsgs.count(f => "additionalProperties".equals(f.keyword)) + val unknownFailureCount = validationFailureMsgs.count(f => !List("type","required","additionalProperties").contains(f.keyword)) + if (reqFailedCount > 0) { + context.output(config.systemEventsOutputTag, getSystemEvent(dataset, FunctionalError.RequiredFieldsMissing, reqFailedCount)) + } + if (typeFailedCount > 0) { + context.output(config.systemEventsOutputTag, getSystemEvent(dataset, FunctionalError.DataTypeMismatch, typeFailedCount)) + } + if (addTypeFailedCount > 0) { + context.output(config.systemEventsOutputTag, getSystemEvent(dataset, FunctionalError.AdditionalFieldsFound, addTypeFailedCount)) + } + if (unknownFailureCount > 0) { + context.output(config.systemEventsOutputTag, getSystemEvent(dataset, FunctionalError.UnknownValidationError, unknownFailureCount)) + } + + // Log the validation failure messages + validationFailureMsgs.foreach(f => { + f.keyword match { + case "additionalProperties" => + logger.warn(s"SchemaValidator | Additional properties found | dataset=${dataset.id} | ValidationMessage=${JSONUtil.serialize(f)}") + case "required" => + logger.error(s"SchemaValidator | Required Fields Missing | dataset=${dataset.id} | ValidationMessage=${JSONUtil.serialize(f)}") + case "type" => + logger.error(s"SchemaValidator | Data type mismatch found | dataset=${dataset.id} | ValidationMessage=${JSONUtil.serialize(f)}") + case _ => + logger.warn(s"SchemaValidator | Unknown Validation errors found | dataset=${dataset.id} | ValidationMessage=${JSONUtil.serialize(f)}") + } + }) + } + 
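The additional-properties handling above boils down to: a failed validation report produces an invalid event only if it contains violations other than additionalProperties, or if all violations are additionalProperties and the dataset's validation mode is Strict; IgnoreNewFields and DiscardNewFields both let such events through (actually stripping the extra fields is still a TODO). A compact stand-alone sketch of that decision, with simplified stand-ins for ValidationMode and ValidationMsg:

// Stand-alone sketch; Mode and Msg are simplified placeholders for the project's ValidationMode and ValidationMsg.
object ValidationOutcomeSketch {
  sealed trait Mode
  case object Strict extends Mode
  case object IgnoreNewFields extends Mode
  case object DiscardNewFields extends Mode

  case class Msg(keyword: String) // keyword as reported by the json-schema validator, e.g. "required", "type", "additionalProperties"

  // Returns true when the event should be routed to the valid-events output, false when it is invalid.
  def isValid(mode: Mode, failures: List[Msg]): Boolean = {
    if (failures.isEmpty) true // validation report was successful
    else {
      val additionalOnly = failures.forall(_.keyword == "additionalProperties")
      if (!additionalOnly) false // required/type/unknown violations => invalid
      else mode match {
        case Strict => false                                  // new fields are rejected
        case IgnoreNewFields | DiscardNewFields => true       // new fields tolerated (discarding them is a TODO)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    assert(!isValid(Strict, List(Msg("additionalProperties"))))
    assert(isValid(IgnoreNewFields, List(Msg("additionalProperties"))))
    assert(!isValid(Strict, List(Msg("required"), Msg("additionalProperties"))))
    assert(isValid(Strict, Nil))
  }
}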
private def validationSuccess(dataset: Dataset, event: mutable.Map[String, AnyRef], metrics: Metrics, + context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context): Unit = { metrics.incCounter(dataset.id, config.validationSuccessMetricsCount) - context.output(config.validEventsOutputTag, markSuccess(event, "EventValidation")) + context.output(config.validEventsOutputTag, markSuccess(event, Producer.validator)) } - private def onValidationFailure(dataset: Dataset, event: mutable.Map[String, AnyRef], metrics: Metrics, - context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, - validationReport: ProcessingReport): Unit = { - val failedErrorMsg = schemaValidator.getInvalidFieldName(validationReport.toString) + private def validationFailure(dataset: Dataset, event: mutable.Map[String, AnyRef], metrics: Metrics, + context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, + validationFailureMsgs: List[ValidationMsg]): Unit = { metrics.incCounter(dataset.id, config.validationFailureMetricsCount) - context.output(config.invalidEventsOutputTag, markFailed(event, ErrorConstants.SCHEMA_VALIDATION_FAILED, "EventValidation")) - val systemEvent = SystemEvent(PData(config.jobName, "flink", "validation"), Map("error_code" -> ErrorConstants.SCHEMA_VALIDATION_FAILED.errorCode, "error_msg" -> failedErrorMsg)) - context.output(config.systemEventsOutputTag, JSONUtil.serialize(systemEvent)) + context.output(config.invalidEventsOutputTag, markFailed(event, ErrorConstants.SCHEMA_VALIDATION_FAILED, Producer.validator)) + generateSystemEvents(dataset, validationFailureMsgs, context) } } \ No newline at end of file diff --git a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorConfig.scala b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorConfig.scala index 23cc578c..784b92b7 100644 --- a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorConfig.scala +++ b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorConfig.scala @@ -20,36 +20,33 @@ class PipelinePreprocessorConfig(override val config: Config) extends BaseJobCon // Kafka Topic Configuration val kafkaInputTopic: String = config.getString("kafka.input.topic") - val kafkaFailedTopic: String = config.getString("kafka.output.failed.topic") val kafkaInvalidTopic: String = config.getString("kafka.output.invalid.topic") val kafkaUniqueTopic: String = config.getString("kafka.output.unique.topic") val kafkaDuplicateTopic: String = config.getString("kafka.output.duplicate.topic") // Validation & dedup Stream out put tag - val failedEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") val invalidEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("invalid-events") val validEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("valid-events") val uniqueEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("unique-events") val duplicateEventsOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("duplicate-events") // Validation job metrics - val validationTotalMetricsCount = "validation-total-event-count" - val validationSuccessMetricsCount = "validation-success-event-count" - val 
validationFailureMetricsCount = "validation-failed-event-count" - val eventFailedMetricsCount = "failed-event-count" - val validationSkipMetricsCount = "validation-skipped-event-count" + val validationTotalMetricsCount = "validator-total-count" + val validationSuccessMetricsCount = "validator-success-count" + val validationFailureMetricsCount = "validator-failed-count" + val validationSkipMetricsCount = "validator-skipped-count" + val eventIgnoredMetricsCount = "validator-ignored-count" - val duplicationTotalMetricsCount = "duplicate-total-count" - val duplicationEventMetricsCount = "duplicate-event-count" - val duplicationSkippedEventMetricsCount = "duplicate-skipped-event-count" - val duplicationProcessedEventMetricsCount = "duplicate-processed-event-count" + val duplicationTotalMetricsCount = "dedup-total-count" + val duplicationEventMetricsCount = "dedup-failed-count" + val duplicationSkippedEventMetricsCount = "dedup-skipped-count" + val duplicationProcessedEventMetricsCount = "dedup-success-count" // Consumers val validationConsumer = "validation-consumer" val dedupConsumer = "deduplication-consumer" // Producers - val failedEventProducer = "failed-events-sink" val invalidEventProducer = "invalid-events-sink" val duplicateEventProducer = "duplicate-events-sink" val uniqueEventProducer = "unique-events-sink" @@ -58,5 +55,7 @@ class PipelinePreprocessorConfig(override val config: Config) extends BaseJobCon override def inputConsumer(): String = validationConsumer + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") + override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = uniqueEventsOutputTag } diff --git a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorStreamTask.scala b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorStreamTask.scala index 04b66c8c..fa941d64 100644 --- a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorStreamTask.scala +++ b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/task/PipelinePreprocessorStreamTask.scala @@ -1,7 +1,6 @@ package org.sunbird.obsrv.preprocessor.task import com.typesafe.config.ConfigFactory -import org.apache.flink.api.common.eventtime.WatermarkStrategy import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.java.typeutils.TypeExtractor import org.apache.flink.api.java.utils.ParameterTool @@ -44,20 +43,21 @@ class PipelinePreprocessorStreamTask(config: PipelinePreprocessorConfig, kafkaCo /** * Sink for invalid events, duplicate events and system events */ - validStream.getSideOutput(config.failedEventsOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaFailedTopic)) + validStream.getSideOutput(config.failedEventsOutputTag()).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaFailedTopic)) .name(config.failedEventProducer).uid(config.failedEventProducer).setParallelism(config.downstreamOperatorsParallelism) - validStream.getSideOutput(config.systemEventsOutputTag).sinkTo(kafkaConnector.kafkaStringSink(config.kafkaSystemTopic)) + validStream.getSideOutput(config.systemEventsOutputTag).sinkTo(kafkaConnector.kafkaSink[String](config.kafkaSystemTopic)) .name(config.validationConsumer + "-" + config.systemEventsProducer).uid(config.validationConsumer + "-" + config.systemEventsProducer).setParallelism(config.downstreamOperatorsParallelism) - 
validStream.getSideOutput(config.invalidEventsOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaInvalidTopic)) + validStream.getSideOutput(config.invalidEventsOutputTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaInvalidTopic)) .name(config.invalidEventProducer).uid(config.invalidEventProducer).setParallelism(config.downstreamOperatorsParallelism) - uniqueStream.getSideOutput(config.duplicateEventsOutputTag).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaDuplicateTopic)) + uniqueStream.getSideOutput(config.duplicateEventsOutputTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaDuplicateTopic)) .name(config.duplicateEventProducer).uid(config.duplicateEventProducer).setParallelism(config.downstreamOperatorsParallelism) - uniqueStream.getSideOutput(config.systemEventsOutputTag).sinkTo(kafkaConnector.kafkaStringSink(config.kafkaSystemTopic)) + uniqueStream.getSideOutput(config.systemEventsOutputTag).sinkTo(kafkaConnector.kafkaSink[String](config.kafkaSystemTopic)) .name(config.dedupConsumer + "-" + config.systemEventsProducer).uid(config.dedupConsumer + "-" + config.systemEventsProducer).setParallelism(config.downstreamOperatorsParallelism) - uniqueStream.getSideOutput(config.successTag()).sinkTo(kafkaConnector.kafkaMapSink(config.kafkaUniqueTopic)) + uniqueStream.getSideOutput(config.successTag()).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaUniqueTopic)) .name(config.uniqueEventProducer).uid(config.uniqueEventProducer).setParallelism(config.downstreamOperatorsParallelism) + addDefaultSinks(uniqueStream, config, kafkaConnector) uniqueStream.getSideOutput(config.successTag()) } } diff --git a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/util/SchemaValidator.scala b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/util/SchemaValidator.scala index 9682ae71..6d725a1e 100644 --- a/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/util/SchemaValidator.scala +++ b/pipeline/preprocessor/src/main/scala/org/sunbird/obsrv/preprocessor/util/SchemaValidator.scala @@ -9,48 +9,60 @@ import org.sunbird.obsrv.core.exception.ObsrvException import org.sunbird.obsrv.core.model.ErrorConstants import org.sunbird.obsrv.core.util.JSONUtil import org.sunbird.obsrv.model.DatasetModels.Dataset -import org.sunbird.obsrv.preprocessor.task.PipelinePreprocessorConfig import java.io.IOException import scala.collection.mutable -class SchemaValidator(config: PipelinePreprocessorConfig) extends java.io.Serializable { +case class Schema(loadingURI: String, pointer: String) + +case class Instance(pointer: String) + +case class ValidationMsg(level: String, schema: Schema, instance: Instance, domain: String, keyword: String, message: String, allowed: Option[String], + found: Option[String], expected: Option[List[String]], unwanted: Option[List[String]], required: Option[List[String]], missing: Option[List[String]]) + +class SchemaValidator() extends java.io.Serializable { private val serialVersionUID = 8780940932759659175L private[this] val logger = LoggerFactory.getLogger(classOf[SchemaValidator]) private[this] val schemaMap = mutable.Map[String, (JsonSchema, Boolean)]() - def loadDataSchemas(datasets: List[Dataset]) = { + def loadDataSchemas(datasets: List[Dataset]): Unit = { datasets.foreach(dataset => { - if(dataset.jsonSchema.isDefined) { + if (dataset.jsonSchema.isDefined) { try { loadJsonSchema(dataset.id, dataset.jsonSchema.get) } catch { - case ex: ObsrvException => 
ex.printStackTrace() - schemaMap.put(dataset.id, (null, false)) + case _: ObsrvException => schemaMap.put(dataset.id, (null, false)) } } }) } + def loadDataSchema(dataset: Dataset): Any = { + if (!schemaMap.contains(dataset.id) && dataset.jsonSchema.isDefined) { + try { + loadJsonSchema(dataset.id, dataset.jsonSchema.get) + } catch { + case _: ObsrvException => schemaMap.put(dataset.id, (null, false)) + } + } + } + private def loadJsonSchema(datasetId: String, jsonSchemaStr: String) = { val schemaFactory = JsonSchemaFactory.byDefault try { val jsonSchema = schemaFactory.getJsonSchema(JsonLoader.fromString(jsonSchemaStr)) + jsonSchema.validate(JSONUtil.convertValue(Map("pqr" -> "value"))) // Test validate to check if Schema is valid schemaMap.put(datasetId, (jsonSchema, true)) } catch { case ex: Exception => - logger.error("SchemaValidator:loadJsonSchema() - Exception", ex) + logger.error(s"SchemaValidator:loadJsonSchema() - Unable to parse the schema json for dataset: $datasetId", ex) throw new ObsrvException(ErrorConstants.INVALID_JSON_SCHEMA) } } def schemaFileExists(dataset: Dataset): Boolean = { - - if (dataset.jsonSchema.isEmpty) { - throw new ObsrvException(ErrorConstants.JSON_SCHEMA_NOT_FOUND) - } - schemaMap.get(dataset.id).map(f => f._2).orElse(Some(false)).get + schemaMap.get(dataset.id).map(f => f._2).orElse(Some(false)).get } @throws[IOException] @@ -59,20 +71,13 @@ class SchemaValidator(config: PipelinePreprocessorConfig) extends java.io.Serial schemaMap(datasetId)._1.validate(JSONUtil.convertValue(event)) } - def getInvalidFieldName(errorInfo: String): String = { - val message = errorInfo.split("reports:") - val defaultValidationErrMsg = "Unable to obtain field name for failed validation" - if (message.length > 1) { - val fields = message(1).split(",") - if (fields.length > 2) { - val pointer = fields(3).split("\"pointer\":") - pointer(1).substring(0, pointer(1).length - 1) - } else { - defaultValidationErrMsg - } - } else { - defaultValidationErrMsg - } + def getValidationMessages(report: ProcessingReport): List[ValidationMsg] = { + val buffer = mutable.Buffer[ValidationMsg]() + report.forEach(processingMsg => { + buffer.append(JSONUtil.deserialize[ValidationMsg](JSONUtil.serialize(processingMsg.asJson()))) + }) + buffer.toList } + } // $COVERAGE-ON$ diff --git a/pipeline/preprocessor/src/test/resources/test.conf b/pipeline/preprocessor/src/test/resources/test.conf index dc5734f9..cc68631a 100644 --- a/pipeline/preprocessor/src/test/resources/test.conf +++ b/pipeline/preprocessor/src/test/resources/test.conf @@ -2,10 +2,9 @@ include "base-test.conf" kafka { input.topic = "flink.raw" - output.failed.topic = "flink.failed" - output.invalid.topic = "flink.invalid" + output.invalid.topic = "flink.failed" output.unique.topic = "flink.unique" - output.duplicate.topic = "flink.duplicate" + output.duplicate.topic = "flink.failed" groupId = "flink-pipeline-preprocessor-group" } diff --git a/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/PipelinePreprocessorStreamTestSpec.scala b/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/PipelinePreprocessorStreamTestSpec.scala index 99ec39ec..d111543b 100644 --- a/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/PipelinePreprocessorStreamTestSpec.scala +++ b/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/PipelinePreprocessorStreamTestSpec.scala @@ -8,8 +8,11 @@ import org.apache.flink.test.util.MiniClusterWithClientResource import 
org.apache.kafka.common.serialization.StringDeserializer import org.scalatest.Matchers._ import org.sunbird.obsrv.BaseMetricsReporter +import org.sunbird.obsrv.core.cache.RedisConnect +import org.sunbird.obsrv.core.model.ErrorConstants +import org.sunbird.obsrv.core.model.Models.SystemEvent import org.sunbird.obsrv.core.streaming.FlinkKafkaConnector -import org.sunbird.obsrv.core.util.FlinkUtil +import org.sunbird.obsrv.core.util.{FlinkUtil, JSONUtil, PostgresConnect} import org.sunbird.obsrv.preprocessor.fixture.EventFixtures import org.sunbird.obsrv.preprocessor.task.{PipelinePreprocessorConfig, PipelinePreprocessorStreamTask} import org.sunbird.obsrv.spec.BaseSpecWithDatasetRegistry @@ -49,6 +52,7 @@ class PipelinePreprocessorStreamTestSpec extends BaseSpecWithDatasetRegistry { super.beforeAll() BaseMetricsReporter.gaugeMetrics.clear() EmbeddedKafka.start()(embeddedKafkaConfig) + prepareTestData() createTestTopics() EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.VALID_EVENT) EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.INVALID_EVENT) @@ -57,11 +61,32 @@ class PipelinePreprocessorStreamTestSpec extends BaseSpecWithDatasetRegistry { EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.INVALID_DATASET_EVENT) EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.INVALID_EVENT_KEY) EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.VALID_EVENT_DEDUP_CONFIG_NONE) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.INVALID_EVENT_2) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.EVENT_WITH_ADDL_PROPS_STRICT_MODE) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.EVENT_WITH_ADDL_PROPS_ALLOW_MODE) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.EVENT_WITH_ADDL_PROPS_IGNORE_MODE) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.IGNORED_EVENT) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.EVENT_WITH_UNKNOWN_VALIDATION_ERR) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.EVENT_WITH_EMPTY_SCHEMA) + EmbeddedKafka.publishStringMessageToKafka(pConfig.kafkaInputTopic, EventFixtures.DEDUP_KEY_MISSING) flinkCluster.before() } + private def prepareTestData(): Unit = { + val postgresConnect = new PostgresConnect(postgresConfig) + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d3', 'dataset', '" + """{"$schema":"https://json-schema.org/draft/2020-12/schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"additionalProperties":false,"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"integer"},"deliveriesPromised":{"type":"integer"},"deliveriesDone":{"type":"integer"}},"additionalProperties":false}},"additionalProperties":false,"required":["id","vehicleCode","date"]}""" + "', '{\"validate\": true, \"mode\": \"Strict\"}', '{\"topic\":\"d2-events\"}', 
'{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Draft', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d4', 'dataset', '" + """{"$schema":"https://json-schema.org/draft/2020-12/schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"additionalProperties":false,"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"integer"},"deliveriesPromised":{"type":"integer"},"deliveriesDone":{"type":"integer"}},"additionalProperties":false}},"additionalProperties":false,"required":["id","vehicleCode","date"]}""" + "', '{\"validate\": true, \"mode\": \"Strict\"}', '{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d5', 'dataset', '" + """{"$schema":"https://json-schema.org/draft/2020-12/schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"additionalProperties":false,"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"integer"},"deliveriesPromised":{"type":"integer"},"deliveriesDone":{"type":"integer"}},"additionalProperties":false}},"additionalProperties":false,"required":["id","vehicleCode","date"]}""" + "', '{\"validate\": true, \"mode\": \"IgnoreNewFields\"}', '{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d6', 'dataset', '" + """{"$schema":"https://json-schema.org/draft/2020-12/schema","type":"object","properties":{"id":{"type":"string","maxLength":5},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"additionalProperties":false,"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"integer"},"deliveriesPromised":{"type":"integer"},"deliveriesDone":{"type":"integer"}},"additionalProperties":false}},"additionalProperties":false,"required":["id","vehicleCode","date"]}""" + "', '{\"validate\": true, \"mode\": \"DiscardNewFields\"}', '{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, router_config, dataset_config, status, created_by, updated_by, created_date, updated_date) values ('d7', 
'dataset', '"+EventFixtures.INVALID_SCHEMA+"', '{\"validate\": true, \"mode\": \"Strict\"}','{\"topic\":\"d2-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\"}', 'Live', 'System', 'System', now(), now());") + postgresConnect.execute("insert into datasets(id, type, data_schema, validation_config, extraction_config, dedup_config, router_config, dataset_config, status, data_version, created_by, updated_by, created_date, updated_date) values ('d8', 'dataset', '{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"id\":\"https://sunbird.obsrv.com/test.json\",\"title\":\"Test Schema\",\"description\":\"Test Schema\",\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"string\"},\"vehicleCode\":{\"type\":\"string\"},\"date\":{\"type\":\"string\"},\"dealer\":{\"type\":\"object\",\"properties\":{\"dealerCode\":{\"type\":\"string\"},\"locationId\":{\"type\":\"string\"},\"email\":{\"type\":\"string\"},\"phone\":{\"type\":\"string\"}},\"required\":[\"dealerCode\",\"locationId\"]},\"metrics\":{\"type\":\"object\",\"properties\":{\"bookingsTaken\":{\"type\":\"number\"},\"deliveriesPromised\":{\"type\":\"number\"},\"deliveriesDone\":{\"type\":\"number\"}}}},\"required\":[\"id\",\"vehicleCode\",\"date\",\"dealer\",\"metrics\"]}', '{\"validate\": false, \"mode\": \"Strict\"}', '{\"is_batch_event\": true, \"extraction_key\": \"events\", \"dedup_config\": {\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}}', '{\"drop_duplicates\": true, \"dedup_key\": \"id\", \"dedup_period\": 3}', '{\"topic\":\"d1-events\"}', '{\"data_key\":\"id\",\"timestamp_key\":\"date\",\"entry_topic\":\"ingest\",\"redis_db_host\":\"localhost\",\"redis_db_port\":"+config.getInt("redis.port")+",\"redis_db\":2}', 'Live', 2, 'System', 'System', now(), now());") + postgresConnect.closeConnection() + } + override def afterAll(): Unit = { + val redisConnection = new RedisConnect(pConfig.redisHost, pConfig.redisPort, pConfig.redisConnectionTimeout) + redisConnection.getConnection(config.getInt("redis.database.preprocessor.duplication.store.id")).flushAll() super.afterAll() flinkCluster.after() EmbeddedKafka.stop() @@ -69,8 +94,7 @@ class PipelinePreprocessorStreamTestSpec extends BaseSpecWithDatasetRegistry { def createTestTopics(): Unit = { List( - pConfig.kafkaInputTopic, pConfig.kafkaInvalidTopic, pConfig.kafkaSystemTopic, - pConfig.kafkaDuplicateTopic, pConfig.kafkaUniqueTopic + pConfig.kafkaInputTopic, pConfig.kafkaInvalidTopic, pConfig.kafkaSystemTopic, pConfig.kafkaDuplicateTopic, pConfig.kafkaUniqueTopic ).foreach(EmbeddedKafka.createCustomTopic(_)) } @@ -83,27 +107,114 @@ class PipelinePreprocessorStreamTestSpec extends BaseSpecWithDatasetRegistry { env.execute(pConfig.jobName) Thread.sleep(5000) } - //val extractorFailed = EmbeddedKafka.consumeNumberMessagesFrom[String](config.getString("kafka.input.topic"), 2, timeout = 60.seconds) - val uniqueEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaUniqueTopic, 1, timeout = 60.seconds) - uniqueEvents.foreach(Console.println("Event:", _)) + val outputEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaUniqueTopic, 5, timeout = 30.seconds) + val invalidEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaInvalidTopic, 7, timeout = 30.seconds) + val systemEvents = EmbeddedKafka.consumeNumberMessagesFrom[String](pConfig.kafkaSystemTopic, 8, timeout = 30.seconds) - val mutableMetricsMap = mutable.Map[String, Long](); - val metricsMap = 
BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + validateOutputEvents(outputEvents) + validateInvalidEvents(invalidEvents) + validateSystemEvents(systemEvents) - mutableMetricsMap(s"${pConfig.jobName}.ALL.${pConfig.validationTotalMetricsCount}") should be (7) - mutableMetricsMap(s"${pConfig.jobName}.ALL.${pConfig.eventFailedMetricsCount}") should be (2) - mutableMetricsMap(s"${pConfig.jobName}.ALL.${pConfig.duplicationTotalMetricsCount}") should be (3) + val mutableMetricsMap = mutable.Map[String, Long]() + BaseMetricsReporter.gaugeMetrics.toMap.mapValues(f => f.getValue()).map(f => mutableMetricsMap.put(f._1, f._2)) + Console.println("### PipelinePreprocessorStreamTestSpec:metrics ###", JSONUtil.serialize(getPrintableMetrics(mutableMetricsMap))) + validateMetrics(mutableMetricsMap) - mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.validationFailureMetricsCount}") should be (1) - mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.duplicationProcessedEventMetricsCount}") should be (1) - mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.duplicationEventMetricsCount}") should be (1) - mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.validationSuccessMetricsCount}") should be (2) + } - mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.duplicationSkippedEventMetricsCount}") should be (1) - mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.validationSkipMetricsCount}") should be (1) - mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.eventFailedMetricsCount}") should be (1) + private def validateOutputEvents(outputEvents: List[String]): Unit = { + outputEvents.size should be(5) + outputEvents.foreach(f => println("OutputEvent", f)) + /* + (OutputEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}},"obsrv_meta":{"flags":{"validator":"success","dedup":"success"},"syncts":1701772208183,"prevProcessingTime":1701772214928,"error":{},"processingStartTime":1701772214321,"timespans":{"validator":590,"dedup":17}},"dataset":"d1"}) + (OutputEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1235","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}},"obsrv_meta":{"flags":{"validator":"skipped"},"syncts":1701772208476,"prevProcessingTime":1701772215544,"error":{},"processingStartTime":1701772214544,"timespans":{"validator":1000}},"dataset":"d2"}) + (OutputEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}},"obsrv_meta":{"flags":{"validator":"success"},"syncts":1701772208577,"prevProcessingTime":1701772215613,"error":{},"processingStartTime":1701772214561,"timespans":{"validator":1052}},"dataset":"d5"}) + 
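+ Each unique event in this list carries an obsrv_meta envelope: the flags record whether the
+ validator and dedup stages succeeded or were skipped, and the timespans capture the per-stage
+ processing time alongside syncts, processingStartTime and prevProcessingTime.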
(OutputEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}},"obsrv_meta":{"flags":{"validator":"success"},"syncts":1701772208597,"prevProcessingTime":1701772215623,"error":{},"processingStartTime":1701772214562,"timespans":{"validator":1061}},"dataset":"d6"}) + (OutputEvent,{"event":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}},"obsrv_meta":{"flags":{"validator":"skipped"},"syncts":1701772208676,"prevProcessingTime":1701772215637,"error":{},"processingStartTime":1701772214563,"timespans":{"validator":1074}},"dataset":"d7"}) + */ + } + private def validateInvalidEvents(invalidEvents: List[String]): Unit = { + invalidEvents.size should be(7) + /* + (invalid,{"event":"{\"event\":{\"id\":\"1234\",\"date\":\"2023-03-01\",\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"d1\"}","obsrv_meta":{"flags":{"validator":"failed"},"syncts":1701429101820,"prevProcessingTime":1701429108259,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"validator"},"error_code":"ERR_PP_1013","error_msg":"Event failed the schema validation"},"processingStartTime":1701429107624,"timespans":{"validator":635}},"dataset":"d1"}) + (invalid,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"1234\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"d1\"}","obsrv_meta":{"flags":{"validator":"success","dedup":"failed"},"syncts":1701429101860,"prevProcessingTime":1701429108501,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"dedup"},"error_code":"ERR_PP_1010","error_msg":"Duplicate event found"},"processingStartTime":1701429107625,"timespans":{"validator":873,"dedup":3}},"dataset":"d1"}) + (invalid,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"1234\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}}}","obsrv_meta":{"flags":{"validator":"failed"},"syncts":1701429101886,"prevProcessingTime":1701429108528,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"validator"},"error_code":"ERR_EXT_1004","error_msg":"Dataset Id is missing from the data"},"processingStartTime":1701429107625,"timespans":{"validator":903}}}) + 
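+ The error codes in these expected events map to the messages shown beside them: ERR_PP_1013 is a
+ schema validation failure, ERR_PP_1010 a duplicate event, ERR_EXT_1004 a missing dataset id,
+ ERR_EXT_1005 a missing dataset configuration and ERR_EXT_1006 an event missing from the batch.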
(invalid,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"1234\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"dX\"}","obsrv_meta":{"flags":{"validator":"failed"},"syncts":1701429101927,"prevProcessingTime":1701429108583,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"validator"},"error_code":"ERR_EXT_1005","error_msg":"Dataset configuration is missing"},"processingStartTime":1701429107626,"timespans":{"validator":957}},"dataset":"dX"}) + (invalid,{"event1":{"dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"vehicleCode":"HYUN-CRE-D6","id":"1234","date":"2023-03-01","metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}},"event":"{\"event1\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"1234\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"d2\"}","obsrv_meta":{"flags":{"validator":"failed"},"syncts":1701429101961,"prevProcessingTime":1701429108586,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"validator"},"error_code":"ERR_EXT_1006","error_msg":"Event missing in the batch event"},"processingStartTime":1701429107627,"timespans":{"validator":959}},"dataset":"d2"}) + (invalid,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":[\"HYUN-CRE-D6\"],\"id\":1234,\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19}},\"dataset\":\"d4\"}","obsrv_meta":{"flags":{"validator":"failed"},"syncts":1701429102063,"prevProcessingTime":1701429108633,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"validator"},"error_code":"ERR_PP_1013","error_msg":"Event failed the schema validation"},"processingStartTime":1701429107631,"timespans":{"validator":1002}},"dataset":"d4"}) + (invalid,{"event":"{\"event\":{\"dealer\":{\"dealerCode\":\"KUNUnited\",\"locationId\":\"KUN1\",\"email\":\"dealer1@gmail.com\",\"phone\":\"9849012345\"},\"vehicleCode\":\"HYUN-CRE-D6\",\"id\":\"1234\",\"date\":\"2023-03-01\",\"metrics\":{\"bookingsTaken\":50,\"deliveriesPromised\":20,\"deliveriesDone\":19,\"deliveriesRejected\":1}},\"dataset\":\"d4\"}","obsrv_meta":{"flags":{"validator":"failed"},"syncts":1701429102092,"prevProcessingTime":1701429108661,"error":{"src":{"enumClass":"org.sunbird.obsrv.core.model.Producer","value":"validator"},"error_code":"ERR_PP_1013","error_msg":"Event failed the schema validation"},"processingStartTime":1701429107638,"timespans":{"validator":1023}},"dataset":"d4"}) + */ } + private def validateSystemEvents(systemEvents: List[String]): Unit = { + systemEvents.size should be(8) + + systemEvents.foreach(se => { + val event = JSONUtil.deserialize[SystemEvent](se) + val error = event.data.error + if (event.ctx.dataset.getOrElse("ALL").equals("ALL")) + event.ctx.dataset_type should be(None) + else if (error.isDefined) { + val errorCode = error.get.error_code + if (errorCode.equals(ErrorConstants.MISSING_DATASET_ID.errorCode) || + 
errorCode.equals(ErrorConstants.MISSING_DATASET_CONFIGURATION.errorCode) || + errorCode.equals(ErrorConstants.EVENT_MISSING.errorCode)) { + event.ctx.dataset_type should be(None) + } + } + else + event.ctx.dataset_type should be(Some("dataset")) + }) + /* + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"d1", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"RequiredFieldsMissing","error_code":"ERR_PP_1013","error_message":"Event failed the schema validation","error_level":"warn","error_count":1}},"ets":1701428460664}) + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"ALL"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"MissingDatasetId","error_code":"ERR_EXT_1004","error_message":"Dataset Id is missing from the data","error_level":"critical","error_count":1},"pipeline_stats":{"validator_status":"failed","validator_time":874}},"ets":1701428460889}) + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"dX", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"MissingDatasetId","error_code":"ERR_EXT_1005","error_message":"Dataset configuration is missing","error_level":"critical","error_count":1},"pipeline_stats":{"validator_status":"failed","validator_time":924}},"ets":1701428460927}) + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"d2", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"MissingEventData","error_code":"ERR_EXT_1006","error_message":"Event missing in the batch event","error_level":"critical","error_count":1},"pipeline_stats":{"validator_status":"failed","validator_time":925}},"ets":1701428460935}) + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"d4", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"DataTypeMismatch","error_code":"ERR_PP_1013","error_message":"Event failed the schema validation","error_level":"warn","error_count":2}},"ets":1701428460987}) + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"d4", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"AdditionalFieldsFound","error_code":"ERR_PP_1013","error_message":"Event failed the schema validation","error_level":"warn","error_count":0}},"ets":1701428461010}) + (SysEvent:,{"etype":"METRIC","ctx":{"module":"processing","pdata":{"id":"PipelinePreprocessorJob","type":"flink","pid":"validator"},"dataset":"d6", "dataset_type": "dataset"},"data":{"error":{"pdata_id":"validator","pdata_status":"failed","error_type":"AdditionalFieldsFound","error_code":"ERR_PP_1013","error_message":"Event failed the schema validation","error_level":"warn","error_count":0}},"ets":1701428461064}) + */ + } + + private def validateMetrics(mutableMetricsMap: mutable.Map[String, Long]): Unit = { + mutableMetricsMap(s"${pConfig.jobName}.ALL.${pConfig.eventFailedMetricsCount}") should be(1) + 
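+ // The assertions below all follow the pattern "<jobName>.<datasetId>.<metricName> should be <count>".
+ // A minimal sketch of a helper that could remove the repetition (assertMetric is a hypothetical
+ // name; it assumes pConfig and mutableMetricsMap remain in scope):
+ //   def assertMetric(datasetId: String, metric: String, expected: Long): Unit =
+ //     mutableMetricsMap(s"${pConfig.jobName}.$datasetId.$metric") should be(expected)
+ //   assertMetric("d1", pConfig.validationSuccessMetricsCount, 2L)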
mutableMetricsMap(s"${pConfig.jobName}.dX.${pConfig.eventFailedMetricsCount}") should be(1) + + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.validationFailureMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.duplicationProcessedEventMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.duplicationEventMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.validationSuccessMetricsCount}") should be(2) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.validationTotalMetricsCount}") should be(3) + mutableMetricsMap(s"${pConfig.jobName}.d1.${pConfig.duplicationTotalMetricsCount}") should be(2) + + mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.duplicationSkippedEventMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.validationSkipMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.eventFailedMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.validationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d2.${pConfig.duplicationTotalMetricsCount}") should be(1) + + mutableMetricsMap(s"${pConfig.jobName}.d3.${pConfig.validationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d3.${pConfig.eventIgnoredMetricsCount}") should be(1) + + mutableMetricsMap(s"${pConfig.jobName}.d4.${pConfig.validationTotalMetricsCount}") should be(2) + mutableMetricsMap(s"${pConfig.jobName}.d4.${pConfig.validationFailureMetricsCount}") should be(2) + + mutableMetricsMap(s"${pConfig.jobName}.d5.${pConfig.validationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d5.${pConfig.validationSuccessMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d5.${pConfig.duplicationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d5.${pConfig.duplicationSkippedEventMetricsCount}") should be(1) + + mutableMetricsMap(s"${pConfig.jobName}.d6.${pConfig.validationTotalMetricsCount}") should be(2) + mutableMetricsMap(s"${pConfig.jobName}.d6.${pConfig.validationSuccessMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d6.${pConfig.validationFailureMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d6.${pConfig.duplicationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d6.${pConfig.duplicationSkippedEventMetricsCount}") should be(1) + + mutableMetricsMap(s"${pConfig.jobName}.d8.${pConfig.validationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d8.${pConfig.validationSkipMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d8.${pConfig.duplicationTotalMetricsCount}") should be(1) + mutableMetricsMap(s"${pConfig.jobName}.d8.${pConfig.duplicationProcessedEventMetricsCount}") should be(1) + } -} +} \ No newline at end of file diff --git a/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/TestSchemaValidator.scala b/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/TestSchemaValidator.scala index d5e5e336..0ba13d65 100644 --- a/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/TestSchemaValidator.scala +++ b/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/TestSchemaValidator.scala @@ -2,9 +2,9 @@ package org.sunbird.obsrv.preprocessor import com.typesafe.config.{Config, ConfigFactory} import org.scalatest.{FlatSpec, Matchers} -import 
org.sunbird.obsrv.core.exception.ObsrvException import org.sunbird.obsrv.core.util.JSONUtil import org.sunbird.obsrv.model.DatasetModels.{Dataset, DatasetConfig, RouterConfig} +import org.sunbird.obsrv.model.DatasetStatus import org.sunbird.obsrv.preprocessor.fixture.EventFixtures import org.sunbird.obsrv.preprocessor.task.PipelinePreprocessorConfig import org.sunbird.obsrv.preprocessor.util.SchemaValidator @@ -13,12 +13,12 @@ class TestSchemaValidator extends FlatSpec with Matchers { val config: Config = ConfigFactory.load("test.conf") val pipelineProcessorConfig = new PipelinePreprocessorConfig(config) - val schemaValidator = new SchemaValidator(pipelineProcessorConfig) + val schemaValidator = new SchemaValidator() "SchemaValidator" should "return a success report for a valid event" in { - val dataset = Dataset("d1", "dataset", None, None, None, Option(EventFixtures.VALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id","date","ingest"), "Active") - schemaValidator.loadDataSchemas(List(dataset)) + val dataset = Dataset("d1", "dataset", None, None, None, Option(EventFixtures.VALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id","date","ingest"), DatasetStatus.Live) + schemaValidator.loadDataSchema(dataset) val event = JSONUtil.deserialize[Map[String, AnyRef]](EventFixtures.VALID_SCHEMA_EVENT) val report = schemaValidator.validate("d1", event) @@ -27,25 +27,73 @@ class TestSchemaValidator extends FlatSpec with Matchers { it should "return a failed validation report for a invalid event" in { - val dataset = Dataset("d1", "dataset", None, None, None, Option(EventFixtures.VALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id","date","ingest"), "Active") - schemaValidator.loadDataSchemas(List(dataset)) + val dataset = Dataset("d1", "dataset", None, None, None, Option(EventFixtures.VALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id","date","ingest"), DatasetStatus.Live) + schemaValidator.loadDataSchema(dataset) - val event = JSONUtil.deserialize[Map[String, AnyRef]](EventFixtures.INVALID_SCHEMA_EVENT) - val report = schemaValidator.validate("d1", event) - assert(!report.isSuccess) - assert(report.toString.contains("error: object has missing required properties ([\"vehicleCode\"])")) + val event1 = JSONUtil.deserialize[Map[String, AnyRef]](EventFixtures.INVALID_SCHEMA_EVENT) + val report1 = schemaValidator.validate("d1", event1) + val messages1 = schemaValidator.getValidationMessages(report1) + assert(!report1.isSuccess) + assert(messages1.size == 1) + messages1.head.message should be("object has missing required properties ([\"vehicleCode\"])") + messages1.head.keyword should be("required") + messages1.head.missing.get.head should be ("vehicleCode") - val invalidFieldName = schemaValidator.getInvalidFieldName(report.toString) - invalidFieldName should be ("Unable to obtain field name for failed validation") - } + val event2 = JSONUtil.deserialize[Map[String, AnyRef]](EventFixtures.INVALID_SCHEMA_EVENT2) + val report2 = schemaValidator.validate("d1", event2) + val messages2 = schemaValidator.getValidationMessages(report2) + assert(!report2.isSuccess) + assert(messages2.size == 2) + messages2.foreach(f => { + f.found.get match { + case "integer" => + f.message should be("instance type (integer) does not match any allowed primitive type (allowed: [\"string\"])") + f.instance.pointer should be("/id") + case "array" => + f.message should be("instance type (array) does not match any allowed primitive type (allowed: [\"string\"])") + f.instance.pointer should be ("/vehicleCode") + } + 
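+ // INVALID_SCHEMA_EVENT2 carries two type violations (an integer id and an array vehicleCode),
+ // while INVALID_SCHEMA_EVENT3 below pairs a type violation on /id with an additionalProperties
+ // violation on /metrics caused by the unexpected deliveriesRejected field.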
}) - it should "validate the negative scenarios" in { - val dataset = Dataset("d1", "dataset", None, None, None, Option(EventFixtures.INVALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id","date","ingest"), "Active") - schemaValidator.loadDataSchemas(List(dataset)) + val event3 = JSONUtil.deserialize[Map[String, AnyRef]](EventFixtures.INVALID_SCHEMA_EVENT3) + val report3 = schemaValidator.validate("d1", event3) + val messages3 = schemaValidator.getValidationMessages(report3) + assert(!report3.isSuccess) + assert(messages3.size == 2) + messages3.foreach(f => { + f.keyword match { + case "type" => + f.message should be("instance type (integer) does not match any allowed primitive type (allowed: [\"string\"])") + f.instance.pointer should be("/id") + f.found.get should be ("integer") + f.expected.get.head should be("string") + case "additionalProperties" => + f.message should be("object instance has properties which are not allowed by the schema: [\"deliveriesRejected\"]") + f.instance.pointer should be("/metrics") + f.unwanted.get.head should be("deliveriesRejected") + } + }) + } - val dataset2 = Dataset("d1", "dataset", None, None, None, None, None, RouterConfig(""), DatasetConfig("id","date","ingest"), "Active") - an[ObsrvException] should be thrownBy schemaValidator.schemaFileExists(dataset2) + it should "validate the negative and missing scenarios" in { + val dataset = Dataset("d4", "dataset", None, None, None, Option(EventFixtures.INVALID_SCHEMA_JSON), None, RouterConfig(""), DatasetConfig("id","date","ingest"), DatasetStatus.Live) + schemaValidator.loadDataSchema(dataset) schemaValidator.schemaFileExists(dataset) should be (false) + + schemaValidator.loadDataSchema(dataset) + schemaValidator.schemaFileExists(dataset) should be(false) + + val dataset2 = Dataset("d5", "dataset", None, None, None, None, None, RouterConfig(""), DatasetConfig("id","date","ingest"), DatasetStatus.Live) + schemaValidator.loadDataSchemas(List[Dataset](dataset2)) + schemaValidator.schemaFileExists(dataset2) should be (false) + + val dataset3 = Dataset("d6", "dataset", None, None, None, Option(EventFixtures.INVALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id", "date", "ingest"), DatasetStatus.Live) + + schemaValidator.loadDataSchemas(List[Dataset](dataset3)) + schemaValidator.schemaFileExists(dataset3) should be(false) + + val dataset4 = Dataset("d7", "dataset", None, None, None, Option(EventFixtures.INVALID_SCHEMA), None, RouterConfig(""), DatasetConfig("id", "date", "ingest"), DatasetStatus.Live) + schemaValidator.schemaFileExists(dataset4) should be (false) } } diff --git a/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/fixture/EventFixtures.scala b/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/fixture/EventFixtures.scala index ef26b06b..432757bd 100644 --- a/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/fixture/EventFixtures.scala +++ b/pipeline/preprocessor/src/test/scala/org/sunbird/obsrv/preprocessor/fixture/EventFixtures.scala @@ -2,20 +2,32 @@ package org.sunbird.obsrv.preprocessor.fixture object EventFixtures { - val VALID_SCHEMA = """{"$schema":"https://json-schema.org/draft/2020-12/schema","id":"https://sunbird.obsrv.com/test.json","title":"Test Schema","description":"Test 
Schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"number"},"deliveriesPromised":{"type":"number"},"deliveriesDone":{"type":"number"}}}},"required":["id","vehicleCode","date","dealer","metrics"]}""" - val INVALID_SCHEMA = """{"$schema":"https://json-schema.org/draft/2020-12/schema","id":"https://sunbird.obsrv.com/test.json","title":"Test Schema","description":"Test Schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"number"},"deliveriesPromised":{"type":"number"},"deliveriesDone":{"type":"number"}}}},"required":["id","vehicleCode","date","dealer","metrics"}""" + val VALID_SCHEMA = """{"$schema":"https://json-schema.org/draft/2020-12/schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"additionalProperties":false,"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"integer"},"deliveriesPromised":{"type":"integer"},"deliveriesDone":{"type":"integer"}},"additionalProperties":false}},"additionalProperties":false,"required":["id","vehicleCode","date"]}""" + val INVALID_SCHEMA = """{"$schema":"https://json-schema.org/draft/2020-12/schema","id":"https://sunbird.obsrv.com/test.json","title":"Test Schema","description":"Test Schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"number"},"deliveriesPromised":{"type":"number"},"deliveriesDone":{"type":"number"}}}},"required":["id","vehicleCode","date","dealer","metrics"],"additionalProperties":"false"}""" + val INVALID_SCHEMA_JSON = """{"$schema":"https://json-schema.org/draft/2020-12/schema","id":"https://sunbird.obsrv.com/test.json","title":"Test Schema","description":"Test Schema","type":"object","properties":{"id":{"type":"string"},"vehicleCode":{"type":"string"},"date":{"type":"string"},"dealer":{"type":"object","properties":{"dealerCode":{"type":"string"},"locationId":{"type":"string"},"email":{"type":"string"},"phone":{"type":"string"}},"required":["dealerCode","locationId"]},"metrics":{"type":"object","properties":{"bookingsTaken":{"type":"number"},"deliveriesPromised":{"type":"number"},"deliveriesDone":{"type":"number"}}}},"required":["id","vehicleCode","date","dealer","metrics"}""" val VALID_SCHEMA_EVENT = 
"""{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""" val INVALID_SCHEMA_EVENT = """{"id":"1234","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""" + val INVALID_SCHEMA_EVENT2 = """{"id":1234,"vehicleCode":["HYUN-CRE-D6"],"date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}""" + val INVALID_SCHEMA_EVENT3 = """{"id":1234,"vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}}""" val VALID_EVENT = """{"dataset":"d1","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val DEDUP_KEY_MISSING = """{"dataset":"d8","event":{"id1":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" val INVALID_EVENT = """{"dataset":"d1","event":{"id":"1234","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val INVALID_EVENT_2 = """{"dataset":"d4","event":{"id":1234,"vehicleCode":["HYUN-CRE-D6"],"date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val EVENT_WITH_ADDL_PROPS_STRICT_MODE = """{"dataset":"d4","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}}}""" + val EVENT_WITH_ADDL_PROPS_ALLOW_MODE = """{"dataset":"d5","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}}}""" + val EVENT_WITH_ADDL_PROPS_IGNORE_MODE = """{"dataset":"d6","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}}}""" + val EVENT_WITH_UNKNOWN_VALIDATION_ERR = 
"""{"dataset":"d6","event":{"id":"123456","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}}}""" + val EVENT_WITH_EMPTY_SCHEMA = """{"dataset":"d7","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19,"deliveriesRejected":1}}}""" + val IGNORED_EVENT = """{"dataset":"d3","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" val DUPLICATE_EVENT = """{"dataset":"d1","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" val MISSING_DATASET_EVENT = """{"event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" val INVALID_DATASET_EVENT = """{"dataset":"dX","event":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" val INVALID_EVENT_KEY = """{"dataset":"d2","event1":{"id":"1234","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" val VALID_EVENT_DEDUP_CONFIG_NONE = """{"dataset":"d2","event":{"id":"1235","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" + val VALID_EVENT_DRAFT_DATASET = """{"dataset":"d3","event":{"id":"1236","vehicleCode":"HYUN-CRE-D6","date":"2023-03-01","dealer":{"dealerCode":"KUNUnited","locationId":"KUN1","email":"dealer1@gmail.com","phone":"9849012345"},"metrics":{"bookingsTaken":50,"deliveriesPromised":20,"deliveriesDone":19}}}""" diff --git a/pipeline/transformer/pom.xml b/pipeline/transformer/pom.xml index 80d26b82..b695a812 100644 --- a/pipeline/transformer/pom.xml +++ b/pipeline/transformer/pom.xml @@ -62,9 +62,9 @@ tests - it.ozimov + com.github.codemonstur embedded-redis - 0.7.1 + 1.0.0 test diff --git a/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/functions/TransformerFunction.scala b/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/functions/TransformerFunction.scala index 51750256..fb0da96c 100644 --- a/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/functions/TransformerFunction.scala +++ b/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/functions/TransformerFunction.scala @@ -2,44 +2,40 @@ package org.sunbird.obsrv.transformer.functions import 
org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.streaming.api.functions.ProcessFunction -import org.sunbird.obsrv.core.streaming.{BaseProcessFunction, Metrics, MetricsList} +import org.sunbird.obsrv.core.model.Producer +import org.sunbird.obsrv.core.streaming.Metrics +import org.sunbird.obsrv.model.DatasetModels.Dataset import org.sunbird.obsrv.registry.DatasetRegistry +import org.sunbird.obsrv.streaming.BaseDatasetProcessFunction import org.sunbird.obsrv.transformer.task.TransformerConfig import scala.collection.mutable class TransformerFunction(config: TransformerConfig)(implicit val eventTypeInfo: TypeInformation[mutable.Map[String, AnyRef]]) - extends BaseProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]](config) { + extends BaseDatasetProcessFunction(config) { - - override def getMetricsList(): MetricsList = { - val metrics = List(config.totalEventCount, config.transformSuccessCount, - config.transformFailedCount, config.transformSkippedCount) - MetricsList(DatasetRegistry.getDataSetIds(config.datasetType()), metrics) + override def getMetrics(): List[String] = { + List(config.totalEventCount, config.transformSuccessCount, config.transformFailedCount, config.transformSkippedCount) } /** * Method to process the event transformations */ - override def processElement(msg: mutable.Map[String, AnyRef], + override def processElement(dataset: Dataset, msg: mutable.Map[String, AnyRef], context: ProcessFunction[mutable.Map[String, AnyRef], mutable.Map[String, AnyRef]]#Context, metrics: Metrics): Unit = { - val datasetId = msg(config.CONST_DATASET).asInstanceOf[String] // DatasetId cannot be empty at this stage - metrics.incCounter(datasetId, config.totalEventCount) - - val datasetTransformations = DatasetRegistry.getDatasetTransformations(datasetId) - if(datasetTransformations.isDefined) { + metrics.incCounter(dataset.id, config.totalEventCount) + val datasetTransformations = DatasetRegistry.getDatasetTransformations(dataset.id) + if (datasetTransformations.isDefined) { // TODO: Perform transformations - metrics.incCounter(datasetId, config.transformSuccessCount) - context.output(config.transformerOutputTag, markSuccess(msg, config.jobName)) + metrics.incCounter(dataset.id, config.transformSuccessCount) + context.output(config.transformerOutputTag, markSuccess(msg, Producer.transformer)) } else { - metrics.incCounter(datasetId, config.transformSkippedCount) - context.output(config.transformerOutputTag, markSkipped(msg, config.jobName)) + metrics.incCounter(dataset.id, config.transformSkippedCount) + context.output(config.transformerOutputTag, markSkipped(msg, Producer.transformer)) } - } -} - +} \ No newline at end of file diff --git a/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerConfig.scala b/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerConfig.scala index 24dc4292..797b3e56 100644 --- a/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerConfig.scala +++ b/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerConfig.scala @@ -15,12 +15,11 @@ class TransformerConfig(override val config: Config) extends BaseJobConfig[mutab implicit val mapTypeInfo: TypeInformation[mutable.Map[String, AnyRef]] = TypeExtractor.getForClass(classOf[mutable.Map[String, AnyRef]]) // Metric List - val totalEventCount = "total-event-count" + val totalEventCount = "transform-total-count" val transformSuccessCount = 
"transform-success-count" val transformFailedCount = "transform-failed-count" val transformSkippedCount = "transform-skipped-count" - val kafkaInputTopic: String = config.getString("kafka.input.topic") val kafkaTransformTopic: String = config.getString("kafka.output.transform.topic") val transformerFunction = "transformer-function" @@ -29,9 +28,11 @@ class TransformerConfig(override val config: Config) extends BaseJobConfig[mutab private val TRANSFORMER_OUTPUT_TAG = "transformed-events" val transformerOutputTag: OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]](TRANSFORMER_OUTPUT_TAG) - override def inputTopic(): String = kafkaInputTopic + override def inputTopic(): String = config.getString("kafka.input.topic") override def inputConsumer(): String = "transformer-consumer" override def successTag(): OutputTag[mutable.Map[String, AnyRef]] = transformerOutputTag + + override def failedEventsOutputTag(): OutputTag[mutable.Map[String, AnyRef]] = OutputTag[mutable.Map[String, AnyRef]]("failed-events") } diff --git a/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerStreamTask.scala b/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerStreamTask.scala index f14771bf..71e86581 100644 --- a/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerStreamTask.scala +++ b/pipeline/transformer/src/main/scala/org/sunbird/obsrv/transformer/task/TransformerStreamTask.scala @@ -1,7 +1,6 @@ package org.sunbird.obsrv.transformer.task import com.typesafe.config.ConfigFactory -import org.apache.flink.api.common.eventtime.WatermarkStrategy import org.apache.flink.api.common.typeinfo.TypeInformation import org.apache.flink.api.java.typeutils.TypeExtractor import org.apache.flink.api.java.utils.ParameterTool @@ -36,10 +35,10 @@ class TransformerStreamTask(config: TransformerConfig, kafkaConnector: FlinkKafk val transformedStream = dataStream.process(new TransformerFunction(config)).name(config.transformerFunction).uid(config.transformerFunction) .setParallelism(config.downstreamOperatorsParallelism) - transformedStream.getSideOutput(config.transformerOutputTag) - .sinkTo(kafkaConnector.kafkaMapSink(config.kafkaTransformTopic)) - .name(config.transformerProducer).uid(config.transformerProducer) - .setParallelism(config.downstreamOperatorsParallelism) + transformedStream.getSideOutput(config.transformerOutputTag).sinkTo(kafkaConnector.kafkaSink[mutable.Map[String, AnyRef]](config.kafkaTransformTopic)) + .name(config.transformerProducer).uid(config.transformerProducer).setParallelism(config.downstreamOperatorsParallelism) + + addDefaultSinks(transformedStream, config, kafkaConnector) transformedStream.getSideOutput(config.successTag()) } } diff --git a/pom.xml b/pom.xml index 86a0cc83..c8f53bd8 100644 --- a/pom.xml +++ b/pom.xml @@ -5,10 +5,6 @@ http://maven.apache.org/maven-v4_0_0.xsd"> 4.0.0 - - 3.0.0 - - org.sunbird.obsrv core 1.0