# Caching Architecture

This documentation outlines the caching strategy employed by the new Data Execution System. It details the infrastructure, storage mechanisms, and workflows used to ensure high performance, data security, and efficient resource management for live queries.

### 1. Infrastructure & Storage Overview

The caching layer leverages a combination of S3 object storage and in-memory data stores hosted on our infrastructure.

* **Storage Backend**: S3-compatible object storage
* **Hosting Location**: France
* **Orchestration & State**: Dragonfly (hosted on internal servers)
* **Data Format**: Apache Arrow IPC Streaming
* **Cache TTL**: 300 seconds (5 minutes)
* **Encryption**: Yes (SSE-C at rest)

### 2. Live Cache Workflow

The "Live Cache" handles the temporary storage of query results to reduce latency for frequent requests.

#### 2.1 Storage Mechanism

* Location: Query results are stored in an S3 bucket hosted in France.
* Encryption: All data in S3 is encrypted at rest using Server-Side Encryption with our own customer-provided key (SSE-C).
* Retention: The default Time-To-Live (TTL) for cached live queries is 300 seconds (5 minutes).
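To illustrate the SSE-C requirement, the sketch below builds the encryption parameters an S3 PUT request needs: the AES-256 key and its MD5 digest, both base64-encoded as the S3 API expects on the wire. This is a hedged example: the `ssec_headers` helper and the random key are illustrative, not part of the actual service code (some SDKs, such as boto3, perform the base64 encoding for you).

```python
import base64
import hashlib
import secrets

def ssec_headers(key: bytes) -> dict:
    """Build SSE-C parameters for an S3 PUT request (illustrative helper).

    S3 expects the 256-bit key base64-encoded, plus a base64-encoded MD5
    digest of the raw key so the server can verify the key's integrity.
    """
    assert len(key) == 32, "SSE-C requires a 256-bit (32-byte) key"
    return {
        "SSECustomerAlgorithm": "AES256",
        "SSECustomerKey": base64.b64encode(key).decode(),
        "SSECustomerKeyMD5": base64.b64encode(hashlib.md5(key).digest()).decode(),
    }

# A throwaway key for demonstration; the real service uses its own managed key.
params = ssec_headers(secrets.token_bytes(32))
```

Because the key is customer-provided, S3 never stores it: the same parameters must be supplied again on every GET for the cached object.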

#### 2.2 Data Serialization

To ensure high-performance data transfer, the Data Service API caches and retrieves query results using the Apache Arrow IPC Streaming format.

* Ref: [Arrow Columnar Format - IPC Streaming](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format)

#### 2.3 Retrieval Process

When a query is requested:

1. Lookup: The system checks Dragonfly to determine whether a cached result exists for the query.
2. Hit: If the data is cached, the Arrow IPC file is streamed directly from S3 to the Frontend.
3. Miss: If the data is not cached, the query is executed, the results are written to S3 (streamed as IPC), and then streamed to the client.
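The lookup/hit/miss flow above can be sketched as follows. This is a simplified model: plain dictionaries stand in for Dragonfly (the cache index) and S3 (the result store), and `fake_execute` stands in for real query execution.

```python
import hashlib

# In-memory stand-ins for Dragonfly (cache index) and S3 (result store).
dragonfly = {}   # cache_key -> S3 object key
s3_bucket = {}   # S3 object key -> serialized Arrow IPC bytes

def cache_key(query: str) -> str:
    """Derive a deterministic cache key from the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def get_result(query: str, execute) -> bytes:
    """Return the cached IPC payload, executing the query on a miss."""
    key = cache_key(query)
    object_key = dragonfly.get(key)       # 1. Lookup in Dragonfly
    if object_key is not None:
        return s3_bucket[object_key]      # 2. Hit: stream from S3
    payload = execute(query)              # 3. Miss: run the query,
    object_key = f"live-cache/{key}.arrow"
    s3_bucket[object_key] = payload       #    write the IPC stream to S3,
    dragonfly[key] = object_key           #    and register it in Dragonfly.
    return payload

# Usage: the first call executes the query, the second is served from cache.
calls = []
def fake_execute(q):
    calls.append(q)
    return b"arrow-ipc-bytes"

first = get_result("SELECT 1", fake_execute)
second = get_result("SELECT 1", fake_execute)
```

In production the TTL is enforced by Dragonfly key expiry, so a stale index entry simply disappears and the next request falls through to the miss path.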

> Note on Preview Data: Data used for preview flows is never cached; it is streamed directly to the client.

### 3. Technology Roles

To maintain distinct responsibilities within the architecture, different services handle specific aspects of the caching and execution lifecycle.

#### Dragonfly

Dragonfly acts as the primary interface for cache management.

* It maintains the index of available cached results for each query.
* It handles high-performance caching operations.
* Distributed Locks: It manages locks to prevent race conditions during concurrent query execution.
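Dragonfly is Redis-compatible, so the distributed lock typically follows the Redis `SET key token NX EX ttl` pattern: acquisition succeeds only if the key is unheld, and release is a compare-and-delete on the token. The sketch below emulates that pattern with an in-memory dictionary; the function names and the lock key are illustrative, not the service's actual API.

```python
import uuid

# In-memory stand-in for Dragonfly's key space (illustrative only).
lock_store = {}

def acquire_lock(lock_key: str):
    """Emulate SET key token NX: succeed only if no one holds the lock."""
    if lock_key in lock_store:
        return None                      # another worker holds the lock
    token = uuid.uuid4().hex             # unique token identifies the owner
    lock_store[lock_key] = token
    return token

def release_lock(lock_key: str, token: str) -> bool:
    """Release only if we still own the lock (compare-and-delete)."""
    if lock_store.get(lock_key) == token:
        del lock_store[lock_key]
        return True
    return False                         # lock expired or owned by someone else

# Usage: only one worker executes a given query; others wait or re-check cache.
token = acquire_lock("lock:query:abc123")
blocked = acquire_lock("lock:query:abc123")
released = release_lock("lock:query:abc123", token)
```

In the real deployment the lock key would also carry a TTL so a crashed worker cannot hold the lock forever, mirroring the cache's own expiry behavior.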
