Robust Queue System - HLD


Introduction:

In modern microservices architectures, asynchronous communication plays a vital role, especially when services rely on external clients or systems that take an unpredictable amount of time to process a request. A common pattern involves multiple services, queues, and webhooks. In this blog, we will explore the design of a queue-based messaging system to handle asynchronous operations between two services, including webhook handling and message processing in a holding queue.

The system we are building involves two core services:

Service 1: Triggers the workflow.

Service 2: Handles the integration with an external client, processes webhooks, and passes data back to Service 1.

We will focus on how to ensure data integrity in scenarios where the process is asynchronous and webhooks may arrive out of order.

High-Level Design (HLD)

The HLD for the messaging queue system involves the following key components:

Overview of the Design Components:
  1. Service 1: Initiates the request and sends it to Service 2. Consumes the final processed data from a persistent queue once it is ready.
  2. Service 2: Receives requests from Service 1 and sends them to an external Integration Service. Temporarily holds the request in a Redis holding queue while waiting for the webhook. When the webhook arrives, Service 2 enriches the request with the webhook data and pushes it to the persistent queue.
  3. Integration Service: Acts as an intermediary to the external client, manages asynchronous API calls, and receives webhook callbacks. Forwards the webhook to Service 2 once it arrives.
  4. Webhook Listener: Receives the webhook from the Integration Service and updates the corresponding message in the holding queue, signaling that it’s ready for final processing.
  5. Queues:
     - Redis Holding Queue: Temporarily stores data from Service 2, indexed by a unique requestId, allowing random access and updates until the webhook arrives.
     - Persistent Queue (e.g., Kafka or RabbitMQ): Stores fully processed messages that are ready to be consumed by Service 1.

High-Level System Flow:

  1. Service 1 initiates a request and sends it to Service 2.
  2. Service 2 forwards the request to an Integration Service, which interacts with an external client asynchronously.
  3. In the meantime, Service 2 stores the initial request in a Redis holding queue for temporary storage.
  4. When the Integration Service receives a webhook from the external client, it forwards it to Service 2.
  5. Service 2 retrieves the specific message from Redis, updates it with the webhook data, and moves the message to the persistent queue.
  6. Service 1 consumes the enriched data from the persistent queue.
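The six steps above can be sketched end to end in Python. This is a minimal, self-contained sketch, not the real implementation: a plain dict stands in for the Redis holding queue (keyed by requestId), a list stands in for the persistent queue, and all function and field names are illustrative.

```python
import uuid

# Stand-ins for the real infrastructure (illustrative only):
# a dict keyed by requestId plays the Redis holding queue,
# and a list plays the persistent queue (Kafka/RabbitMQ).
holding_queue = {}
persistent_queue = []

def service2_handle_request(payload):
    """Service 2: hold the initial request while waiting for the webhook."""
    request_id = str(uuid.uuid4())
    holding_queue[request_id] = {"requestId": request_id, "initial": payload}
    # ...the request would be forwarded to the Integration Service here...
    return request_id

def service2_handle_webhook(request_id, webhook_data):
    """Service 2: enrich the held message and move it to the persistent queue."""
    message = holding_queue.pop(request_id, None)
    if message is None:
        raise KeyError(f"No held message for requestId={request_id}")
    message["webhook"] = webhook_data
    persistent_queue.append(message)

def service1_consume():
    """Service 1: consume the next enriched message, if any."""
    return persistent_queue.pop(0) if persistent_queue else None

# Example run through the flow:
rid = service2_handle_request({"order": 42})
service2_handle_webhook(rid, {"status": "completed"})
enriched = service1_consume()
```

Note that the webhook handler looks the message up by requestId rather than by queue position; this is exactly why a random-access store, not a FIFO queue, is used for the holding stage.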

Considerations and Notes:

In some cases, Service 2 may require specific data to process the webhook further, yet the data stored in the holding queue may include fields that were irrelevant to Service 2 when the request was initially triggered. This could happen, for example, if headers like requestId or agentId were stored in the queue but weren't needed for Service 2's immediate operations.

However, for Service 1, this data may be crucial. In such cases, the design we’ve outlined ensures that Service 2 can retrieve the relevant message from the holding queue, enrich it with any data received from the webhook or other sources, and pass it to Service 1 for further processing. This ensures that no data is lost, even if it wasn’t immediately necessary for the webhook processing.

Handling Data Overflow in the Queue: One potential issue with storing large amounts of data in the queue (e.g., both initial request data and webhook data) is queue overflow. In-memory stores like Redis (which we use for the holding queue) are bounded by available memory, making them susceptible to overflow under load.

Solutions to Handle Overflow:

  1. Queue Size Limits and Eviction Policies: Implement a maximum size limit for the queue. When the limit is reached, older or lower-priority messages can be evicted using an eviction policy like LRU (Least Recently Used).
  2. Persistent Storage: Instead of relying solely on in-memory storage, consider persisting critical data in a durable data store (e.g., a database or a persistent message queue like Kafka). This ensures that even if the queue overflows or Redis goes down, critical data will not be lost. Note: While losing headers (like requestId) may not be a critical issue, losing an entire message payload due to queue overflow could severely impact the system. Using a database or durable message broker in such cases is a robust solution.
  3. Redis Persistence: If using Redis, you could enable Redis persistence (AOF or RDB snapshots) to ensure that data is recoverable even if the Redis server fails.
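Solutions 1 and 3 are both configured in redis.conf. The directives below sketch what that could look like; the specific values are illustrative starting points to tune, not recommendations:

```
# redis.conf — persistence and eviction settings (illustrative values)

# Append-only file: log every write for near real-time durability
appendonly yes
appendfsync everysec        # fsync once per second (balances safety and speed)

# RDB snapshots: dump to disk if >=1 key changed in 900s, or >=10 in 300s
save 900 1
save 300 10

# Bound memory and evict least-recently-used keys when the limit is hit
maxmemory 256mb
maxmemory-policy allkeys-lru
```

One caveat: with allkeys-lru, Redis may evict held messages themselves once memory is full, which is why solution 2 recommends keeping a durable copy of anything that cannot be lost.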

Scenarios:

  1. Queue Limitations and Random Access: Traditional queues (Kafka/RabbitMQ) enforce FIFO, limiting access to messages in the middle of the queue. Here, we use Redis hashes instead of a strict FIFO queue to store the holding messages. Redis hashes allow random access, enabling retrieval and updates to specific messages based on requestId when the webhook arrives.
  2. Redis Persistence for Durability: While Redis is an in-memory store, enabling AOF (Append Only File) or RDB (point-in-time snapshots) ensures messages in the holding queue are stored on disk, minimizing data loss if Redis crashes. AOF logs every write operation, which is ideal for near real-time persistence. Redis persistence combined with periodic AOF rewrites or snapshots ensures that critical data can be recovered if Redis goes down.
  3. Data Consistency and Potential Overflow: Storing a large number of messages in Redis may lead to memory overflow. To handle this:
     - Eviction Policies: Set Redis eviction policies to remove old or low-priority messages when Redis reaches memory limits.
     - Hybrid Strategy: If overflow is a concern, store minimal message data (e.g., requestId) in Redis, and push fully processed messages directly to Kafka or RabbitMQ, reducing the load on Redis.
  4. Why Not a Database?: If the data is not relevant to Service 2 after processing, adding a database would add unnecessary complexity. Using Redis as a holding queue provides fast, in-memory access and persistence without the need for a separate database.
  5. Persistent Queue for Reliable Delivery: Once a message is processed and enriched with webhook data, it is pushed to a persistent queue (Kafka/RabbitMQ). This ensures reliable delivery and persistence until consumed by Service 1, maintaining the durability and fault tolerance needed for critical workflows.
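The random access of scenario 1 combined with the size cap and LRU eviction of scenario 3 can be sketched in-process. Here an OrderedDict stands in for the Redis hash (the class, capacity, and key names are illustrative, not a real Redis client API): entries stay addressable by requestId, and the least recently touched entry is evicted once the cap is hit.

```python
from collections import OrderedDict

class HoldingStore:
    """Random-access holding store with a size cap and LRU eviction.

    An in-process stand-in for a Redis hash running under an LRU
    eviction policy: entries are keyed by requestId, any entry can be
    read or updated directly, and the least recently used entry is
    evicted when the capacity limit is exceeded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def put(self, request_id, message):
        if request_id in self._entries:
            self._entries.move_to_end(request_id)  # refresh recency
        self._entries[request_id] = message
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, request_id):
        message = self._entries.get(request_id)
        if message is not None:
            self._entries.move_to_end(request_id)  # reads refresh recency too
        return message

    def pop(self, request_id):
        """Remove and return an entry, e.g. when its webhook arrives."""
        return self._entries.pop(request_id, None)

# Usage: a cap of 2 forces the least recently touched entry out.
store = HoldingStore(capacity=2)
store.put("req-1", {"state": "waiting"})
store.put("req-2", {"state": "waiting"})
store.get("req-1")                        # touch req-1; req-2 becomes LRU
store.put("req-3", {"state": "waiting"})  # evicts req-2
```

In production the same access pattern maps onto Redis hash or key-per-request commands, with maxmemory and an eviction policy enforcing the cap server-side.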