AWS - Kinesis

February 19, 2022

Makes it easy to collect, process and analyse streaming data in real-time
Kinesis Data Streams: Capture, process and store data streams
Kinesis Data Firehose: Load data streams into AWS data stores
Kinesis Data Analysis: Analyze data streams with SQL or Apache flink

Kinesis Data Streams

Billing per shard provisioned, as many shards as you want
Retention between 1 day or 365 days
Ability to reprocess data
Once data is inserted into Kinesis, it cannot be deleted
Data that shares the same partition goes to the same shard

Consumers

Write your own consumer: Kinesis Client Library, AWS SDK
Managed by AWS: Lambda, Data firehose, Data analytics

Kinesis Data Stream Security

Control access/authentication using IAM
Encryption in flight using HTTPS endpoints
Encryption at rest using KMS
Can implement encryption/decryption on client side
Monitor API calls in Cloudtrail

Kinesis Producers

Puts data records into data streams
Data record consists of: Sequence no (unique per partition-key within shard)
Partition key (must specify)
Data blob (up to 1MB)
Can use AWS SDK, Kinesis Producer Library (KPL) or Kinesis Agent
Write throughput: 1MB/sec or 1000 records/sec per shard
Use PutRecord API
Use batching with PutRecord to reduce costs and increase throughput
Use highly distributed partition key to avoid ‘hot partition’
Can get ProvisionThroughputExceeded error. To solve: Use highly distributed partition key, retry with exponential backoff or increase shards (scaling)
Same partition key = same shard

Kinesis Consumers

Get data records from data streams and process them
Lambda, Firehose, Analytics, Custom Consumer (AWS SDK), Kinesis Client Library

Consumer types

Shared (classic) fan-out consumer - pull

Low no. of consuming applications
Read throughput: 2MB/sec per shard across all consumers
Max 5 GetRecords API calls/sec
Latency approx. 200ms
Minimize costs
Consumers poll data from Kinesis using GetRecords API call
Returns up to 10MB or up to 10000 records

Enhanced fan-out consumer - push

Multiple consuming applications for the same stream
2MB/sec per consumer per shard
Latency approx. 70 ms
Higher cost
Kinesis pushes data to consumers over HTTP/2 using SubscribeToShard API
Soft limit of 5 consumer applications (KCL) per data stream

Lambda

Supports both classic and enhanced
Reads records in batches
Can configure batch window and batch size
If error occurs, Lambda retries until successful or data expires
Can process up to 10 batches per shard simultaneously

Kinesis Client Library

Java library that helps read record from a Kinesis Data Stream with ditributed applications sharing the read workload
Each shard is to be read by only one KCL instance eg. 4 shards = 4 KCL instances
Progress check pointed into DynamoDB (needs IAM access)
Track other workers and share work amongst shards using DynamoDB
KCL can run on EC2, Elastic Beanstalk and on-premises
Version 1.x (supports shared consumer)
Version 2.x (supports shared and enhanced fan-out consumer)

Kinesis Operation - Shard splitting

Used to increase stream capacity (1MB/s data in shard)
Used to divide a hot shard
Old shard is closed and will be deleted once data is expired
No auto. scaling (manually increase/decrease capacity)
Can’t split into more than 2 shards in a single operation

Kinesis Operation - Merging shards

Decrease stream capacity and decrease costs
Can be used to group two shards with low traffic (cold shards)
Old shards are closed and will be deleted once data is expired
Can’t merge more than two shards in a single operation

Kinesis Data Firehose

Fully managed service, no admin, auto scaling and serverless
Pay for data going through firehose
Near real-time: 60 secs latency for non full batches, or minimum 32MB of data at a time
Supports many data formats, conversations, transformations, compression
Supports custom data transformations using AWS Lambda
Can send failed or all data to a backup S3 bucket

Kinesis Data streams vs Firehose

Data streams

Streaming service for ingest at scale
Write custom code (producer/consumer)
Real-time (approx. 200 ms)
Managed scaling (shard splitting/merging)
Data storage for 1-365 days
Supports replay capability

Firehose

Load streaming data into S3/Redshift/3rd Party
Fully managed
Near real-time (buffer time min. 60 secs)
Automatic scaling
No data storage
Does not support replay capability

Kinesis Data Analytics

Perform real-time analysis on Kinesis Streams using SQL
Fully managed, no servers to provision
Automatic scaling
Realtime analytics
Pay for consumption rate
Can create streams out of real-time queries
Use cases: time-series analysis, real-time dashboards, real-time metrics

Ordering data into Kinesis

The same key will always go to the same shard

Kinesis vs SQS ordering

Lets assume you have 100 users, 5 Kinesis shards and 1 SQS FIFO queue

Kinesis Data streams

On average you will have 20 users per shard
Users will have their data ordered within each shard
The max. amount of consumers we can have in parallel is 5
Can receive up to 5MB/s of data

SQS FIFO

You only have one SQS FIFO queue
Will have 100 group IDs
You can have up to 100 consumers, due to 100 group IDs
Can have up to 300 messages/sec or 3000 if using batching

SQS vs Kinesis vs SNS

SQS

Consumers ‘pull data’
Data deleted after consumed
Can have as many consumers as you need
No need to provision throughput
Ordering guarantees only FIFO queue
Individual message delay capability

SNS

Push data to many subscribers
Up to 12500000 subscribers
Data is not persisted (lost if not delivered)
Publish/Subscribe model
Up to 1000 topics
No need to provision throughput
Integrates with SQS for fan-out architecture approach
FIFO capability for SQS FIFO

Kinesis

Standard: pull data - 2MB per shard
Enhanced: fan-out - 2MB per shard per consumer
Possible to replay data
Meant for real-time big data, analytics and ETL
Ordering at shard level
Data expires after X days
Must provision throughput