- Kinesis is a managed alternative to Apache Kafka
- It is a big data streaming tool that lets you stream application logs, metrics, IoT data, click streams, etc.
- Compatible with many streaming frameworks (Spark, NiFi, etc.)
- Data is automatically replicated across 3 AZs
- Kinesis offers 3 types of products:
- Kinesis Streams: low latency streaming ingest at scale
- Kinesis Analytics: perform real-time analytics on streams using SQL
- Kinesis Firehose: load streams into S3, Redshift, ElasticSearch
- Streams are divided into ordered Shards/Partitions
- For higher throughput we can increase the number of shards
- Data retention is 1 day by default, can go up to 7 days
- Kinesis Streams provides the ability to reprocess/replay the data
- Multiple applications can consume the same stream, which enables real-time processing at scale
- Kinesis is not a database: once data is inserted, it cannot be deleted (immutability)
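
A minimal sketch with boto3 of provisioning a stream and extending its retention (assuming configured AWS credentials; the stream name `my-stream` is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Create a stream with 2 shards (stream name is a placeholder)
kinesis.create_stream(StreamName="my-stream", ShardCount=2)

# Wait until the stream is ACTIVE before using it
kinesis.get_waiter("stream_exists").wait(StreamName="my-stream")

# Extend retention from the 1-day default up to 7 days (168 hours)
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=168,
)
```
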
- One stream is made of many different shards
- 1MB/s or 1000 messages/s at write PER SHARD
- 2MB/s read PER SHARD
- Billing is done per shard provisioned; we can have as many shards as we want as long as we accept the cost
- Ability to batch messages per call
- The number of shards can evolve over time (reshard/merge); see the resharding sketch below
- Records are ordered per shard!
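
A hedged sketch of resharding through the UpdateShardCount API (the lower-level alternative is individual SplitShard / MergeShards calls); the stream name is a placeholder:

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count; Kinesis performs the shard splits for us.
# UNIFORM_SCALING is currently the only supported scaling type.
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```
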
- Data is sent via the PutRecord(s) API together with a partition key
- Data with the same key goes to the same partition (helps with ordering for a specific key)
- Messages sent get a sequence number
- Partition key must be highly distributed in order to avoid hot partitions
- In order to reduce costs, we can use batching with the PutRecords API (see the sketch below)
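
A minimal producer sketch, batching records into a single PutRecords call (stream name and payloads are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Batch several records into one PutRecords call to reduce cost.
# Records sharing a partition key land on the same shard (ordered).
response = kinesis.put_records(
    StreamName="my-stream",
    Records=[
        {"Data": b'{"event": "click"}', "PartitionKey": "user-123"},
        {"Data": b'{"event": "view"}',  "PartitionKey": "user-456"},
    ],
)

# PutRecords can partially fail; check the per-batch failure count
if response["FailedRecordCount"] > 0:
    print("some records were throttled; retry them with back-off")
```
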
- If the limits are exceeded, we get a ProvisionedThroughputExceededException
- ProvisionedThroughputExceededException:
- Happens when the throughput sent to a shard exceeds its limit
- To avoid them, we have to make sure we don't have hot partitions
- Solutions:
- Retry with back-off
- Increase shards (scaling)
- Ensure the partition key is good
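
A sketch of the retry-with-back-off solution, catching the throttling exception on PutRecord (stream name is a placeholder):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, retries: int = 5):
    """Retry a throttled PutRecord with exponential back-off."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream",
                Data=data,
                PartitionKey=partition_key,
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("still throttled after retries")
```
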
- Consumers can use CLI or SDK, or the Kinesis Client Library (in Java, Node, Python, Ruby, .Net)
- Kinesis Client Library (KCL) uses DynamoDB to checkpoint offsets
- KCL uses DynamoDB to track other workers and share work amongst shards
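
The KCL handles checkpointing and worker coordination for us; with the raw SDK we iterate over shards by hand. A minimal sketch (placeholder stream name, a few polls per shard just for illustration):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Read from every shard, starting at the newest records (LATEST)
for shard in kinesis.list_shards(StreamName="my-stream")["Shards"]:
    iterator = kinesis.get_shard_iterator(
        StreamName="my-stream",
        ShardId=shard["ShardId"],
        ShardIteratorType="LATEST",
    )["ShardIterator"]

    for _ in range(10):  # bounded polling, for illustration only
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            print(record["SequenceNumber"], record["Data"])
        iterator = batch["NextShardIterator"]
        time.sleep(1)  # stay under the per-shard GetRecords limits
```
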
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Possibility to encrypt/decrypt data client side
- VPC Endpoints available for Kinesis to be accessed within VPCs
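
Encryption at rest can be switched on per stream; a sketch using the AWS-managed KMS key (stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Turn on server-side encryption at rest with the AWS-managed KMS key
kinesis.start_stream_encryption(
    StreamName="my-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",  # or a customer-managed CMK
)
```
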
- Kinesis Firehose is a fully managed service: no administration required, automatic scaling, basically serverless
- Used to load data into Redshift, S3, ElasticSearch and Splunk
- It is Near Real Time: 60 seconds latency minimum for non-full batches, or a minimum of 32 MB of data at a time
- Supports many data formats, conversions, transformations, and compression
- Pay for the amount of data going through Firehose
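
Producing into Firehose is a single API call; delivery to the destination is handled by the service. A sketch (delivery stream name is a placeholder):

```python
import boto3

firehose = boto3.client("firehose")

# Firehose buffers records and delivers them to the configured
# destination (e.g. S3) once the size/time buffer threshold is hit.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": b'{"event": "click"}\n'},
)
```
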
- Streams:
- Requires writing custom code (producer/consumer)
- Real time (~200 ms)
- Must manage scaling (shard splitting / merging)
- Stores data in the stream for 1 to 7 days
- Data can be read by multiple consumers
- Firehose:
- Fully managed, sends data to S3, Redshift, Splunk, ElasticSearch
- Serverless, data transformation can be done with Lambda
- Near real time
- Scales automatically
- It provides no data storage
- Kinesis Data Analytics can take data from Kinesis Data Streams and Kinesis Firehose and perform queries on it
- It can perform real-time analytics using SQL
- Kinesis Data Analytics properties:
- Automatically scales
- Managed: no servers to provision
- Continuous: analytics are done in real time
- Pricing: pay per consumption rate
- It can create streams out of real-time queries
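
One way this might look with the original SQL-based KinesisAnalytics API: a continuous query counting events per 1-minute tumbling window. All names and ARNs below are placeholders, and the input schema is a minimal assumption:

```python
import boto3

kda = boto3.client("kinesisanalytics")

# Continuous SQL: count events per 1-minute tumbling window
sql = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("event_count" INTEGER);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM COUNT(*)
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);
"""

kda.create_application(
    ApplicationName="clickstream-analytics",  # placeholder
    ApplicationCode=sql,
    Inputs=[{
        "NamePrefix": "SOURCE_SQL_STREAM",
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/my-stream",
            "RoleARN": "arn:aws:iam::123456789012:role/kda-read-role",
        },
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {
                    "JSONMappingParameters": {"RecordRowPath": "$"}
                },
            },
            "RecordColumns": [
                {"Name": "event", "SqlType": "VARCHAR(32)", "Mapping": "$.event"}
            ],
        },
    }],
)
```
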
- Data with the same partition key goes to the same shard
- Data is ordered per shard
- SQS:
- Consumers pull data
- Data is deleted after being consumed
- Can have as many consumers as we want
- No need to provision throughput
- No ordering guarantee in case of standard queues
- Capability to delay individual messages
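
A sketch of the SQS pull model and per-message delay (queue name is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="my-queue")["QueueUrl"]

# Delay an individual message by 30 seconds
sqs.send_message(QueueUrl=queue_url, MessageBody="hello", DelaySeconds=30)

# Consumers pull, then explicitly delete what they processed
msgs = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=10)
for msg in msgs.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```
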
- SNS:
- Pub/Sub: publish data to many subscribers
- We can have up to 10 million subscribers per topic
- Data is not persisted (it is lost if not delivered)
- Up to 10k topics per account
- No need to provision throughput
- Integrates with SQS for fan-out architecture
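
A sketch of the SNS-to-SQS fan-out wiring (topic and queue names are placeholders; the queue's access policy must separately allow SNS to send to it):

```python
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

topic_arn = sns.create_topic(Name="orders")["TopicArn"]
queue_url = sqs.create_queue(QueueName="orders-worker")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Fan-out: each subscribed queue gets a copy of every message
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
sns.publish(TopicArn=topic_arn, Message='{"order_id": 42}')
```
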
- Kinesis Data Streams:
- Consumers "pull data"
- We can have as many consumers as we want
- Possibility to replay data
- Recommended for real-time big data analytics and ETL
- Ordering happens at the shard level
- Data expires after X days
- Must provision throughput
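
Replay is what sets Kinesis apart from SQS here: a consumer can re-read from the oldest retained record instead of the tip of the stream. A sketch (stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Replay: start from the oldest record still within the retention window
shard_id = kinesis.list_shards(StreamName="my-stream")["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # oldest retained data, not LATEST
)["ShardIterator"]
```
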