| 1 | +# WebSocket Notification Scalability |
| 2 | + |
| 3 | +## Context |
| 4 | +The **Flight Tracker** project provides users with real-time flight information via **WebSocket** notifications. Currently, the application maintains active WebSocket connections and sends updates directly. As the number of users and simultaneous events increases, the current solution is reaching **scalability limits**, primarily due to in-memory session storage and sticky session dependencies on the load balancer. |
| 5 | + |
| 6 | +## Objective |
| 7 | +This document analyzes architectural alternatives for scaling WebSocket notification delivery while maintaining low latency, high reliability, and minimal disruption to the existing system. The goal is an evolutionary solution with fast time-to-market and a path toward future scalability. |
| 8 | + |
| 9 | +## Requirements |
| 10 | + |
| 11 | +### Non-functional Requirements |
| 12 | +- Scalability to thousands of simultaneous WebSocket connections |
| 13 | +- Low delivery latency (< 100 ms) |
| 14 | +- High availability and fault tolerance |
| 15 | +- Reliable message delivery |
| 16 | +- Minimal frontend impact (initially) |
| 17 | + |
| 18 | +### Technical Requirements |
| 19 | +- Integration with existing Kafka infrastructure |
| 20 | +- Support for component decoupling |
| 21 | +- Compatibility with WebSocket and STOMP |
| 22 | +- Easy monitoring and operations |
| 23 | + |
| 24 | +## Considered Alternatives |
| 25 | + |
| 26 | +### 1. Kafka + ThreadExecutors |
| 27 | +Decouple event dispatching via Kafka and parallelize delivery using thread pools. No frontend changes required. |
| 28 | + |
| 29 | +**Pros:** scalable, no changes to the client, uses existing Kafka |
| 30 | +**Cons:** requires implementing concurrent delivery and managing WebSocket sessions in application code |
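The idea behind alternative 1 can be sketched roughly as follows. This is a minimal, in-memory illustration and not the project's implementation: a `queue.Queue` stands in for the Kafka topic, and `FakeSession` and `deliver_all` are hypothetical names introduced only for this example.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class FakeSession:
    """Hypothetical stand-in for one open WebSocket connection."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.received = []
        self._lock = threading.Lock()

    def send(self, message):
        # A real session would write to the socket; here we only record it.
        with self._lock:
            self.received.append(message)


def deliver_all(events, sessions, pool):
    """Drain the queue and fan each event out to every session in parallel."""
    delivered = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return delivered
        futures = [pool.submit(s.send, event) for s in sessions]
        # Finish one event before starting the next so each session
        # sees events in the order they were published.
        wait(futures)
        delivered += sum(1 for f in futures if f.exception() is None)


events = queue.Queue()  # assumption: stands in for a Kafka topic
for update in ("AB123 boarding", "AB123 departed"):
    events.put(update)

sessions = [FakeSession(i) for i in range(3)]
with ThreadPoolExecutor(max_workers=4) as pool:
    delivered = deliver_all(events, sessions, pool)
```

Waiting for each event to complete keeps per-session ordering simple at the cost of some throughput. A real implementation would also need to decide what happens when a single slow connection holds up the batch.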
| 31 | + |
| 32 | +### 2. STOMP Broker (RabbitMQ/ActiveMQ) |
| 33 | +Use a STOMP-compatible message broker as an external relay. Frontends subscribe to STOMP topics via WebSocket. |
| 34 | + |
| 35 | +**Pros:** complete decoupling, mature pub/sub model |
| 36 | +**Cons:** requires frontend refactoring and broker setup |
| 37 | + |
| 38 | +### 3. Redis Streams/PubSub |
| 39 | +Use Redis for message publishing/subscribing or streams. Messages are distributed across WebSocket server instances. |
| 40 | + |
| 41 | +**Pros:** simple, fast, great for low latency |
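| 42 | +**Cons:** pub/sub doesn't guarantee delivery (messages published while a subscriber is disconnected are lost); streams require additional handling, such as tracking read positions and acknowledgments |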
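The trade-off between the two Redis modes can be illustrated with a small in-memory model. This is a sketch of the delivery semantics only and not Redis client code: `PubSub`, `Stream`, and the node name `ws-node-1` are hypothetical names chosen for the example.

```python
class PubSub:
    """Fire-and-forget: a message reaches only subscribers connected right now."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, name):
        self.subscribers[name] = []

    def unsubscribe(self, name):
        return self.subscribers.pop(name)

    def publish(self, message):
        for inbox in self.subscribers.values():
            inbox.append(message)
        return len(self.subscribers)  # number of receivers


class Stream:
    """Append-only log: each reader tracks its position and can catch up."""

    def __init__(self):
        self.entries = []

    def add(self, message):
        self.entries.append(message)
        return len(self.entries)  # entry ID, starting at 1

    def read_after(self, last_id):
        return list(enumerate(self.entries[last_id:], start=last_id + 1))


# Pub/sub: the WebSocket node misses whatever is published while it is away.
bus = PubSub()
bus.subscribe("ws-node-1")
bus.publish("gate change")
inbox = bus.unsubscribe("ws-node-1")       # node restarts
receivers = bus.publish("delayed 20 min")  # nobody is listening
bus.subscribe("ws-node-1")                 # node is back with an empty inbox

# Stream: the node remembers the last ID it processed and replays the rest.
log = Stream()
last_seen = log.add("gate change")
log.add("delayed 20 min")
missed = log.read_after(last_seen)
```

The "additional handling" for streams is mostly bookkeeping like `last_seen`: each consumer has to persist its position (or acknowledge entries) so that it can resume after a restart.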
| 43 | + |
| 44 | +### 4. Optimizing the Current Architecture |
| 45 | +Local improvements with thread pools, async I/O, or sticky session-based load balancing. |
| 46 | + |
| 47 | +**Pros:** low cost, fast implementation |
| 48 | +**Cons:** doesn't solve horizontal scalability limitations |
| 49 | + |
| 50 | +## Comparative Analysis |
| 51 | + |
| 52 | +| Solution                          | Scalability | Latency  | Complexity | Reliability | Frontend Impact | |
| 53 | +|-----------------------------------|-------------|----------|------------|-------------|-----------------| |
| 54 | +| Kafka + ThreadExecutors           | High        | Medium   | Medium     | High        | None            | |
| 55 | +| STOMP Broker                      | High        | Low      | High       | High        | High            | |
| 56 | +| Redis Streams/PubSub              | Medium      | Very Low | Medium     | Medium/High | None            | |
| 57 | +| Optimizing Current Architecture   | Low         | Low      | Low        | Low         | None            | |
| 58 | + |
| 59 | +## Recommendation |
| 60 | + |
| 61 | +We recommend an **evolutionary approach in two phases**: |
| 62 | + |
| 63 | +### Phase 1: Refactoring with Kafka + Dedicated WebSocket Component |
| 64 | + |
| 65 | +Refactor the `PingEventPublisher` to publish events to a Kafka topic. A new (or existing) component will consume the topic and handle active WebSocket session management and message delivery. |
| 66 | + |
| 67 | +**Benefits:** |
| 68 | +- Fast time-to-market |
| 69 | +- Immediate scalability using current infrastructure |
| 70 | +- No frontend changes |
| 71 | +- Improves modularity and observability |
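As a rough sketch of the Phase 1 flow: the publisher appends serialized events to a topic, and a separate component consumes them and routes each one to the sessions that follow the flight. A plain list stands in for the Kafka topic, and `SessionRegistry`, `publish`, `consume`, and the flight numbers are hypothetical names for illustration, not the actual `PingEventPublisher` API.

```python
import json
from collections import defaultdict


class SessionRegistry:
    """Tracks which open sessions follow which flight (assumed routing key)."""

    def __init__(self):
        self._by_flight = defaultdict(set)

    def register(self, session_id, flight):
        self._by_flight[flight].add(session_id)

    def unregister(self, session_id):
        # Called when a connection closes, whatever flights it followed.
        for ids in self._by_flight.values():
            ids.discard(session_id)

    def sessions_for(self, flight):
        return sorted(self._by_flight.get(flight, ()))


def publish(topic, flight, status):
    """Publisher side: serialize the event and append it to the topic."""
    topic.append(json.dumps({"flight": flight, "status": status}))


def consume(topic, registry, send):
    """Consumer side: push each record to the sessions following that flight."""
    for record in topic:
        event = json.loads(record)
        for session_id in registry.sessions_for(event["flight"]):
            send(session_id, event["status"])


topic = []  # assumption: stands in for a Kafka topic
registry = SessionRegistry()
registry.register("s1", "AB123")
registry.register("s2", "AB123")
registry.register("s3", "CD456")
registry.unregister("s2")  # s2 disconnected before any event arrived

publish(topic, "AB123", "boarding")
publish(topic, "CD456", "delayed")

sent = []
consume(topic, registry, lambda sid, msg: sent.append((sid, msg)))
```

One design point to settle if the consuming component runs as several instances: Kafka consumers in the same group split a topic's partitions between them, so each instance would likely need its own consumer group (or some other routing scheme) for every event to reach the instance that holds the relevant session.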
| 72 | + |
| 73 | +### Phase 2: Evolve to STOMP Broker |
| 74 | + |
| 75 | +If scalability needs increase significantly, migrate to a **STOMP-based broker architecture** (RabbitMQ or ActiveMQ), enabling: |
| 76 | +- Topic-based subscriptions |
| 77 | +- Automatic message distribution by the broker |
| 78 | +- Event-driven backend/frontend |
| 79 | + |
| 80 | +This phase requires more effort and frontend changes, so it's reserved for future growth that justifies the investment. |
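For a sense of what topic-based subscriptions look like on the wire, the sketch below builds the STOMP frames a client and backend might exchange. The frame layout (command line, `key:value` headers, blank line, body, NUL terminator) follows the STOMP specification; the destination name `/topic/flights.AB123`, the subscription ID, and the `stomp_frame` helper are assumptions for illustration. Header escaping is omitted for brevity.

```python
def stomp_frame(command, headers, body=""):
    """Build a STOMP frame: command, headers, blank line, body, NUL byte."""
    lines = [command] + [f"{key}:{value}" for key, value in headers.items()]
    return "\n".join(lines) + "\n\n" + body + "\x00"


# Client side: subscribe to updates for a single flight.
subscribe = stomp_frame(
    "SUBSCRIBE",
    {"id": "sub-0", "destination": "/topic/flights.AB123"},
)

# Backend side: publish an update; the broker fans it out to subscribers.
send = stomp_frame(
    "SEND",
    {"destination": "/topic/flights.AB123", "content-type": "text/plain"},
    body="boarding",
)
```

With a broker relaying these frames, the backend no longer tracks which session follows which flight, which is the decoupling this phase is meant to buy.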
| 81 | + |
| 82 | +### Why this phased approach? |
| 83 | + |
| 84 | +- **Time-to-market**: quick delivery with low risk |
| 85 | +- **Low disruption**: avoids major changes to frontend/backend for now |
| 86 | +- **Preparation**: creates a foundation for future pub/sub migration |
| 87 | + |
| 88 | +The phased plan lets the **architecture evolve with minimal impact** while staying aligned with team capacity and business context. |
| 89 | + |
| 90 | +## References |
| 91 | + |
| 92 | +- [ADR 001 – WebSocket Scalability Strategy](../adrs/adr-001-websocket-scalability.md) |
| 93 | +- [ByteWise010, *“Scaling WebSockets with STOMP and RabbitMQ”*](https://medium.com/@bytewise010/scaling-websocket-messaging-with-spring-boot-e9877c80f027) |
| 94 | +- Internal experience benchmarking Kafka and Redis |