What is Apache Flink?
Apache Flink is an open-source framework for unified stream and batch processing, built for distributed, high-throughput, always-on, and accurate streaming applications. It grew out of the Stratosphere research project at the Technical University of Berlin and was donated to the Apache Software Foundation in 2014. Flink excels at processing unbounded data streams with millisecond latencies and exactly-once processing guarantees.
Unlike traditional batch systems, Flink treats streaming as the primary computation model and batch as a special case: a bounded stream. It supports event-time processing and stateful computations and offers advanced windowing, making it well suited to real-time analytics, fraud detection, monitoring, and IoT applications.
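A minimal, bounded DataStream pipeline gives a feel for the API; the class name and input values here are illustrative, and the same code would run unchanged against an unbounded source such as Kafka.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LengthPipeline {
    // Builds and runs a tiny bounded pipeline and collects its results.
    public static List<Integer> run() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // keep output order deterministic for the demo
        List<Integer> out = new ArrayList<>();
        env.fromElements("flink", "streams")
           .map(String::length)
           .returns(Types.INT)  // type hint: Flink cannot always infer lambda result types
           .executeAndCollect()
           .forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // [5, 7]
    }
}
```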
Flink Core Concepts
DataStream API
High-level API for stream processing with transformations and windowing.
• Event-time processing
• Watermarks & windows
• Exactly-once semantics
• Stateful operators
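The stateful-operator bullet can be sketched as a keyed running count. `RunningCount` is a hypothetical name; the per-key `ValueState` it uses is managed by Flink and included in checkpoints.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits a running total per key; Flink keeps one ValueState<Long>
// per key and snapshots it during checkpoints.
public class RunningCount extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    private transient ValueState<Long> total;

    @Override
    public void open(Configuration parameters) {
        total = getRuntimeContext().getState(
            new ValueStateDescriptor<>("total", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = total.value();          // null on the first event for this key
        long next = (current == null ? 0L : current) + in.f1;
        total.update(next);
        out.collect(Tuple2.of(in.f0, next));
    }
}
```

Applied as `stream.keyBy(t -> t.f0).flatMap(new RunningCount())`, each key accumulates independently and survives failures via checkpointing.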
Checkpointing
Fault tolerance mechanism that snapshots operator state periodically.
• Barrier-based algorithm
• Incremental checkpoints
• Configurable intervals
• Fast recovery
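A plausible checkpoint configuration looks like the following; the interval, pause, and timeout values are illustrative, not prescriptive.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // inject a checkpoint barrier every 60 s
        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        cfg.setMinPauseBetweenCheckpoints(30_000);   // let the job make progress between checkpoints
        cfg.setCheckpointTimeout(120_000);           // abort checkpoints that take over 2 min
        cfg.setTolerableCheckpointFailureNumber(3);  // don't fail the job on an occasional slow checkpoint
        return env;
    }
}
```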
State Management
Managed state for stateful stream processing with different backends.
• RocksDB backend
• State TTL support
• Queryable state
• Schema evolution
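State TTL from the list above can be wired up roughly like this; the 24-hour TTL and the `lastSeen` descriptor name are assumptions for the sketch.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;

public class TtlStateSetup {
    public static ValueStateDescriptor<Long> lastSeenDescriptor() {
        // Expire per-key state 24 h after the last write so abandoned keys
        // don't accumulate in the backend (heap or RocksDB).
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(24))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .build();
        ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("lastSeen", Types.LONG);
        desc.enableTimeToLive(ttl);
        return desc;
    }
}
```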
Time Semantics
Advanced time handling for out-of-order and late-arriving events.
• Watermark generation
• Late event handling
• Processing time fallback
• Custom time extractors
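A typical watermark setup covering these bullets might look like the following; the `Event` class is a minimal stand-in, since the article's event type is not shown, and the 5-second bound is illustrative.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class Watermarks {
    // Hypothetical minimal event type with an embedded event-time timestamp.
    public static class Event {
        public final long timestamp;
        public Event(long timestamp) { this.timestamp = timestamp; }
        public long getTimestamp() { return timestamp; }
    }

    // Tolerate up to 5 s of out-of-order events; idle sources stop holding
    // back the watermark after 1 min so downstream windows can still fire.
    public static WatermarkStrategy<Event> strategy() {
        return WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            .withTimestampAssigner((event, previousTs) -> event.getTimestamp())
            .withIdleness(Duration.ofMinutes(1));
    }
}
```

The strategy is attached with `stream.assignTimestampsAndWatermarks(Watermarks.strategy())`.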
Flink Windowing Strategies
Tumbling Windows
Non-overlapping windows of fixed size, each event belongs to exactly one window.
DataStream<Event> stream = ...;
stream
.keyBy(event -> event.getUserId())
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new CountAggregator());
Sliding Windows
Overlapping windows that slide by a specified interval, useful for moving averages.
stream
.keyBy(event -> event.getDeviceId())
.window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(2)))
.aggregate(new AverageAggregator());
Session Windows
Dynamic windows that group events based on periods of activity, ideal for user sessions.
stream
.keyBy(event -> event.getUserId())
.window(EventTimeSessionWindows.withGap(Time.minutes(30)))
.aggregate(new SessionAnalyzer());
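The snippets above pass a `CountAggregator` that is never defined. A minimal sketch of what it might look like, typed to `Object` here since the article's `Event` class is not shown:

```java
import org.apache.flink.api.common.functions.AggregateFunction;

// Hypothetical implementation of the CountAggregator used above:
// the accumulator is simply a running Long count.
public class CountAggregator implements AggregateFunction<Object, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long add(Object event, Long acc) { return acc + 1; }
    @Override public Long getResult(Long acc) { return acc; }
    @Override public Long merge(Long a, Long b) { return a + b; } // required when session windows merge
}
```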
Real-World Apache Flink Implementations
Netflix
Uses Flink for real-time data processing and analytics across their streaming platform.
• Processes 2+ trillion events daily
• Real-time recommendations and personalization
• Stream processing for content delivery optimization
• A/B testing and experimentation analytics
Alibaba
Operates one of the world's largest Flink deployments for e-commerce analytics.
• 4000+ Flink applications in production
• Real-time fraud detection and risk control
• Live dashboard for Singles' Day shopping festival
• Stream processing for search and recommendation systems
Uber
Leverages Flink for real-time monitoring and analytics of their ride-sharing platform.
• Real-time pricing and surge detection
• Driver-rider matching optimization
• Fraud detection and safety monitoring
• Real-time metrics and business intelligence
Deutsche Bank
Uses Flink for real-time risk management and regulatory compliance in trading.
• Real-time market data processing
• Risk calculation and position monitoring
• Regulatory reporting and compliance
• Trade surveillance and anomaly detection
Flink Architecture Components
JobManager
Coordinates distributed execution and manages job lifecycle.
• Schedules tasks
• Coordinates checkpoints
• Manages recovery
• Resource allocation
TaskManager
Worker processes that execute the actual stream processing tasks.
• Executes operators
• Manages local state
• Handles data exchange
• Reports metrics
ResourceManager
Manages TaskManager slots and resource allocation across the cluster.
• Slot management
• Resource provisioning
• Cluster scaling
• YARN/K8s integration
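These components are sized and wired together through Flink's configuration file; a flink-conf.yaml sketch, where hostnames, paths, and sizes are placeholders to tune per workload:

```yaml
# flink-conf.yaml sketch (values are illustrative)
jobmanager.rpc.address: flink-jobmanager   # where TaskManagers find the JobManager
taskmanager.numberOfTaskSlots: 4           # parallel task slots per TaskManager
taskmanager.memory.process.size: 4096m     # total memory budget per TaskManager process
parallelism.default: 8                     # default job parallelism
state.backend: rocksdb                     # large-state backend from the section above
state.checkpoints.dir: s3://bucket/flink/checkpoints
```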
Apache Flink Best Practices
✅ Do
• Use event time for business-critical applications
• Configure checkpointing with appropriate intervals
• Use keyed state for scalable stateful processing
• Implement proper watermark strategies
• Monitor backpressure and tune parallelism
• Use RocksDB for large state applications
• Set appropriate TTL for managed state
• Implement idempotent sinks for exactly-once
❌ Don't
• Ignore watermark configuration in event time processing
• Use processing time when event time is needed
• Store large objects in operator state
• Forget to handle late events appropriately
• Use global windows without proper triggers
• Skip checkpoint interval tuning
• Ignore state backend performance characteristics
• Deploy without proper monitoring and alerting
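Several of the practices above (event time, watermarks, late-event handling, side outputs) come together in one pipeline. This sketch uses a hypothetical `Event` POJO and illustrative bounds; late records that miss even the allowed lateness are routed to a side output instead of being dropped silently.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateEventPipeline {

    // Minimal event POJO for this sketch (the article's Event class is not shown).
    public static class Event {
        public String userId;
        public long timestamp;
        public Event() {}
        public Event(String userId, long timestamp) { this.userId = userId; this.timestamp = timestamp; }
    }

    // Counts events per window (stand-in for the article's CountAggregator).
    public static class CountAgg implements AggregateFunction<Event, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(Event e, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    public static SingleOutputStreamOperator<Long> build(DataStream<Event> events, OutputTag<Event> lateTag) {
        return events
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    .withTimestampAssigner((e, ts) -> e.timestamp))
            .keyBy(e -> e.userId)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .allowedLateness(Time.minutes(1))  // keep window state one extra minute
            .sideOutputLateData(lateTag)       // anything later still is captured here
            .aggregate(new CountAgg());
    }
}
```

Downstream, `counts.getSideOutput(lateTag)` yields the too-late events for logging or reconciliation.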