The main motivation is to try Apache Flink in the context of an ETL task and to understand the system's capabilities in general.
Consume events and prepare per-person archives for long-term storage and efficient extraction.
- Archives are organized by time range (weekly archives, for example)
- Archives have a main person (i.e. all in/out communication of that person for the date range)
- Events should be searchable, so that we know the participants and dates of the events we are potentially interested in. Having this, we can efficiently extract all communication of a specific employee for specific dates from deep archives like AWS Glacier.
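To make the bucketing described above concrete, here is a minimal sketch (not taken from the repo) of how an event could be mapped to a per-person, per-week archive key; the `ArchiveKey` class and the key format are illustrative assumptions.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.IsoFields;

// Every event is routed to a (person, week) bucket, so one archive holds all
// in/out communication of a single person for a single week.
public class ArchiveKey {

    public static String of(String person, Instant eventTime) {
        int year = eventTime.atZone(ZoneOffset.UTC).get(IsoFields.WEEK_BASED_YEAR);
        int week = eventTime.atZone(ZoneOffset.UTC).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        // e.g. "jeff.skilling@enron.com/2001-W33" (format is an assumption)
        return String.format("%s/%d-W%02d", person, year, week);
    }
}
```

In a Flink pipeline such a key could drive a `keyBy`, so that each (person, week) group ends up as its own archive object in Minio.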
This sample requires:
- Java (checked with 1.8)
- Docker, Docker Compose
- Gradle
- Clone the repo
- Download the Enron email dataset
- Open the project in an IDE like IntelliJ IDEA
- Run containers with `sudo docker-compose up -d` from the `docker` folder. This will start:
  - Minio - for result archives, Flink checkpoints and big payload storage; S3-compatible
  - Elasticsearch - for the data index
  - DejaVu - a viewer for Elasticsearch
- Run the CommsEtlJob class with the Java option `-DSOURCE_FOLDER=./enron` to target the directory with the extracted Enron dataset
- Check the Flink dashboard at http://localhost:8081/
- Check the built index at http://localhost:1359/?appname=data&url=http://localhost:9200&mode=edit
- Check prepared archives at http://localhost:9000/minio/archived/
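The exact wiring inside CommsEtlJob is not shown here, but a minimal sketch of how a `-DSOURCE_FOLDER` JVM option can be picked up as a system property might look like this (the helper class name and the fallback default are made up for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Reads the -DSOURCE_FOLDER option passed on the java command line and
// validates that it points to the extracted Enron dataset directory.
public class SourceFolderConfig {

    public static Path sourceFolder() {
        String folder = System.getProperty("SOURCE_FOLDER", "./enron");
        Path path = Paths.get(folder);
        if (!Files.isDirectory(path)) {
            throw new IllegalArgumentException(
                "SOURCE_FOLDER does not point to a directory: " + path.toAbsolutePath());
        }
        return path;
    }
}
```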
- Add tests. The implemented logic is basically very simple, but it would be great to have at least an integration test to check overall job correctness (a possible shape is sketched after this list).
- Introduce a simple dataset generator and test with more events in cluster mode
- Add more event types (with an event hierarchy and so on)
- Handle attachments
- Make heavy parts of the payload (like large attachments) lazy - fetch them from object storage only on request
- Add some analytics pipelines (try to detect anomalies, build a social graph, etc. - the dataset doesn't contain truly significant emails, but the reason it is public is the well-known Enron scandal)
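As a starting point for the testing TODO above, here is one possible shape for an integration-style test: a toy transformation stands in for the real job logic, and a static collecting sink captures the output of a locally executed Flink pipeline. Class names and the pipeline body are illustrative, not taken from the repo.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.junit.Assert;
import org.junit.Test;

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class PipelineIntegrationTest {

    // Collects results into a static list so the test can assert on them
    // after env.execute() finishes.
    public static class CollectSink implements SinkFunction<String> {
        static final List<String> VALUES = new CopyOnWriteArrayList<>();

        @Override
        public void invoke(String value) {
            VALUES.add(value);
        }
    }

    @Test
    public void jobProducesExpectedOutput() throws Exception {
        CollectSink.VALUES.clear();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // A toy transformation stands in for the real ETL logic of the job.
        env.fromElements("alice@enron.com", "bob@enron.com")
           .map(String::toUpperCase)
           .addSink(new CollectSink());

        env.execute("integration-test");

        Assert.assertEquals(2, CollectSink.VALUES.size());
        Assert.assertTrue(CollectSink.VALUES.contains("ALICE@ENRON.COM"));
    }
}
```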
The Enron dataset (about 1.7 GB unarchived) can be processed locally in less than 10 minutes, though that by itself doesn't say much.
Flink looks pretty good for this type of task. The fact that events are of arbitrary size can add complexity, but an approach like reference-based messaging solves this problem. Even complex, task-specific logic can be introduced (calling different storages during processing, creating many data flows with their own sinks, etc.), but one day it may become a source of serious performance and logical issues. On the other hand, the limitations of the paradigm force you to think twice before introducing complex, potentially unnecessary logic, and to keep out of the pipeline things that can be done outside its scope.
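A minimal sketch of the reference-based messaging idea mentioned above: the event carries only a pointer to a heavy payload stored in Minio (or any S3-compatible store), and the bytes are fetched only when something actually asks for them. The class name, bucket handling and use of the AWS S3 client are assumptions for illustration, not the repo's actual implementation.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.util.IOUtils;

import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;

// A lightweight reference that travels through the pipeline instead of the
// heavy payload itself (e.g. a large attachment).
public class PayloadRef implements Serializable {

    private final String bucket;
    private final String key;
    private final long sizeBytes;

    public PayloadRef(String bucket, String key, long sizeBytes) {
        this.bucket = bucket;
        this.key = key;
        this.sizeBytes = sizeBytes;
    }

    public long sizeBytes() {
        return sizeBytes;
    }

    /** Lazily loads the payload via any S3-compatible client (Minio included). */
    public byte[] load(AmazonS3 s3) throws IOException {
        try (InputStream in = s3.getObject(bucket, key).getObjectContent()) {
            return IOUtils.toByteArray(in);
        }
    }
}
```

Keeping only the reference inside Flink state and checkpoints keeps records small regardless of attachment size.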
Contributions are welcome: from refactoring and improvements to new jobs that analyze archived events.
Distributed under the MIT License. See LICENSE for more information.