Processing real-time data via Kinesis Data Analytics (Apache Flink)
YouTube videos
- Send Data to Kinesis from a Python Script
- Optional - Send Data to Kinesis from a KDA Notebook
- Create a Kinesis Data Analytics Studio and Upload a Notebook
- Running the Interactive Flink Zeppelin Notebook
- Deploy a Kinesis Data Analytics Studio Notebook
Note: if you want to get started without setting up a Kinesis Data Stream, loading data into the stream, or running a data simulator, use the sql_1.13_DataGen.zpln notebook. This Zeppelin notebook uses the Flink DataGen connector to generate data within the notebook, without needing a connection to Kinesis or Kafka.
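As a rough illustration, a DataGen-based notebook cell typically looks something like the following. This is a hedged sketch, not the contents of sql_1.13_DataGen.zpln: the table name, columns, and rates here are invented for illustration.

```sql
%flink.ssql
-- Illustrative only: generate synthetic taxi-ride-like rows with the
-- Flink DataGen connector, so no Kinesis or Kafka connection is needed.
CREATE TABLE taxi_rides_gen (
    ride_id      BIGINT,
    fare_amount  DOUBLE,
    event_time   AS PROCTIME()
) WITH (
    'connector' = 'datagen',
    'rows-per-second' = '10',
    'fields.ride_id.min' = '1',
    'fields.ride_id.max' = '100000',
    'fields.fare_amount.min' = '2.5',
    'fields.fare_amount.max' = '200.0'
);

SELECT * FROM taxi_rides_gen;
```

Because DataGen produces rows locally inside the Flink job, this pattern is handy for trying out SQL before any stream infrastructure exists.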
To get started with Apache Flink via Kinesis Data Analytics (KDA), a Kinesis Data Stream with sample data is required. The kinesis_data_producer folder provides two Python scripts that read the data from the CSV file yellow_tripdata_2020-01.csv in the data folder and stream each line in the file as a JSON record/message to the Kinesis Data Stream you specify.
Two variations of this Python data producer are provided. The two scripts are very similar; a few differences exist depending on whether you want to run the producer application from your local computer/laptop or from Cloud9.
For a step-by-step walkthrough, view the YouTube video Send Data to Kinesis from a Python Script.
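The core of such a producer can be sketched in a few lines. This is a minimal, hedged sketch of the pattern the scripts follow, not the repo's actual code: the stream name, partition-key choice, and file path are placeholders you would replace with your own.

```python
import csv
import json


def csv_rows_to_json(csv_text):
    """Convert CSV text into a list of JSON strings, one per data row."""
    reader = csv.DictReader(csv_text.splitlines())
    return [json.dumps(row) for row in reader]


def stream_file(csv_path, stream_name="yellow-taxi-trip-events"):
    """Send each CSV row to a Kinesis Data Stream as a JSON record.

    The stream name is a placeholder. boto3 is imported lazily so the
    pure helper above can be used without AWS dependencies installed.
    """
    import boto3  # requires AWS credentials (e.g. a Cloud9 instance role)

    kinesis = boto3.client("kinesis")
    with open(csv_path, newline="") as f:
        for record in csv_rows_to_json(f.read()):
            kinesis.put_record(
                StreamName=stream_name,
                Data=record.encode("utf-8"),
                # Vary the partition key so records spread across shards.
                PartitionKey=str(hash(record) % 100),
            )
```

A call such as `stream_file("data/yellow_tripdata_2020-01.csv")` would then stream the taxi file one JSON record at a time; stopping the script stops the producer, which matters for the start/stop exercises below.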
An alternative method to send sample data to a Kinesis Data Stream - without setting up the Python data producers described above - is to use the Nyc_Taxi_Produce_KDA_Zeppelin_Notebook.zpln notebook in KDA Studio. This notebook can be uploaded and includes instructions to send sample data from S3 to a Kinesis Data Stream.
To benefit the most from the sample Flink code and labs provided, it is important that you can easily start and stop a Python data producer.
The interactive_KDA_flink_zeppelin_notebook folder provides Zeppelin notebooks that are designed to work with Kinesis Data Analytics Studio. Deploy a Kinesis Data Analytics Studio instance and upload the Zeppelin (.zpln) notebook(s).
Note - within the interactive_KDA_flink_zeppelin_notebook folder are subfolders for each version of Flink your notebook may be configured to use. I would recommend using Flink v1.13.
Upload the notebook to your Studio environment. Once it is uploaded and opened in Zeppelin, run the notebook one cell at a time.
For a step-by-step walkthrough of the notebook in action, view the YouTube video Running the Interactive Flink Zeppelin Notebook.
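To give a sense of what the interactive cells do, a notebook cell that reads the taxi records off the stream might look roughly like this. This is an assumption-laden sketch, not a cell copied from the repo: the stream name, region, and column list are placeholders to adapt to your own setup.

```sql
%flink.ssql(type=update)
-- Illustrative only: read JSON taxi records from a Kinesis Data Stream.
-- Stream name, region, and columns are assumptions; match them to your setup.
CREATE TABLE yellow_taxi_trips (
    tpep_pickup_datetime  TIMESTAMP(3),
    tpep_dropoff_datetime TIMESTAMP(3),
    passenger_count       INT,
    trip_distance         DOUBLE,
    total_amount          DOUBLE
) WITH (
    'connector' = 'kinesis',
    'stream' = 'yellow-taxi-trip-events',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json'
);

-- A simple aggregation to confirm data is flowing from the producer.
SELECT passenger_count, COUNT(*) AS trips
FROM yellow_taxi_trips
GROUP BY passenger_count;
```

Running a cell like this while the Python producer is running, then stopping the producer, makes the start/stop behavior described above easy to observe.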
Kinesis Data Analytics Studio provides an excellent development environment. When you are ready to deploy your application, Kinesis Data Analytics Studio has a mechanism to build and deploy your notebook code as a long-running Kinesis Data Analytics application.
To deploy your notebook, ensure that when you created your notebook environment you configured the Deploy as application configuration - optional setting with a valid S3 bucket.
To access this configuration menu during the creation of your Studio notebook, select Create with custom settings instead of the default Quick create with sample code. Follow the setup prompts and, on Step 3 - Configure, select an S3 bucket for the Deploy as application configuration - optional setting.
With this configured, in your Zeppelin notebook select Build deployable and export to Amazon S3. Once the build is complete, select Deploy deployable as Kinesis Analytics application.
When the deployment is complete, you will see the application under the analytics applications section of Kinesis Data Analytics.
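Once the application appears there, it can also be started programmatically rather than from the console. A hedged sketch using boto3's kinesisanalyticsv2 client follows; the application name is a placeholder, and the helper that builds the request is separated out only so it can be shown without calling AWS.

```python
def start_request(app_name):
    """Build the kwargs for kinesisanalyticsv2 start_application.

    RunConfiguration is optional; an empty dict is shown as a placeholder.
    """
    return {"ApplicationName": app_name, "RunConfiguration": {}}


def start_application(app_name):
    """Start a deployed Kinesis Data Analytics application by name."""
    import boto3  # lazy import: needs AWS credentials and region configured

    client = boto3.client("kinesisanalyticsv2")
    return client.start_application(**start_request(app_name))
```

Calling `start_application("my-deployed-notebook-app")` with your own application name would move the application into the Running state, equivalent to pressing Run in the console.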
- YouTube video - DataGen-based interactive_KDA_flink_zeppelin_notebook: sql_1.13_DataGen.zpln
- Versioned Tables
- Examples for Managed Streaming for Kafka (MSK)