Caching Architecture
This documentation outlines the caching strategy employed by the new Data Execution System. It details the infrastructure, storage mechanisms, and workflows used to ensure high performance, data security, and efficient resource management for live queries.
1. Infrastructure & Storage Overview
The caching layer leverages a combination of S3 object storage and in-memory data stores hosted on our infrastructure.
Storage Backend: S3 Object Storage (S3-compatible)
Hosting Location: France.
Orchestration & State: Dragonfly (hosted on internal servers).
Data Format: Apache Arrow IPC Streaming.
Cache Duration (TTL): 5 minutes (300 seconds).
Encryption: Yes (at rest, SSE-C).
2. Live Cache Workflow
The "Live Cache" handles the temporary storage of query results to reduce latency for frequent requests.
2.1 Storage Mechanism
Location: Query results are stored in an S3 bucket hosted in France.
Encryption: All data in S3 is encrypted at rest using Server-Side Encryption with our own key (SSE-C).
Retention: The default Time-To-Live (TTL) for cached live queries is 300 seconds (5 minutes).
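As an illustration of what SSE-C involves at the request level, the sketch below builds the three headers that the S3 API expects when the caller supplies its own encryption key. The helper name sse_c_headers and the randomly generated key are assumptions made for this example; the actual key management used by the system is not described here, and S3-compatible providers may differ in the details they support.

```python
import base64
import hashlib
import os


def sse_c_headers(key: bytes) -> dict:
    """Build the SSE-C request headers for a caller-supplied key.

    Hypothetical helper for illustration; not the system's actual code.
    """
    if len(key) != 32:
        raise ValueError("SSE-C with AES256 requires a 256-bit (32-byte) key")
    return {
        # The only algorithm S3 accepts for SSE-C.
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        # The key itself, base64-encoded.
        "x-amz-server-side-encryption-customer-key":
            base64.b64encode(key).decode("ascii"),
        # Base64-encoded MD5 digest of the key, used as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(hashlib.md5(key).digest()).decode("ascii"),
    }


if __name__ == "__main__":
    example_key = os.urandom(32)  # placeholder key for the demo only
    for name, value in sse_c_headers(example_key).items():
        print(f"{name}: {value}")
```

With SSE-C the same headers must accompany every read of the object, so a cached result can only be retrieved by a service that holds the key.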
2.2 Data Serialization
To ensure high-performance data transfer, the Data Service API caches and retrieves query results using the Apache Arrow IPC Streaming format.
2.3 Retrieval Process
When a query is requested:
Lookup: The system asks Dragonfly whether a cache entry is available for the query.
Hit: If the data is cached, the Arrow IPC stream is sent directly from S3 to the client.
Miss: If the data is not cached, the query is executed, the results are written to S3 (streamed as IPC), and then streamed to the client.
Note on Preview Data: Data used for preview flows is never cached; it is streamed directly to the client.
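The lookup, hit, miss, and preview paths described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: CacheIndex and ObjectStore are in-memory stand-ins for Dragonfly and the S3 bucket, and the names cache_key, fetch, and the "live-cache/" object prefix are hypothetical. Deriving the cache key from a hash of the query text is also an assumption.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

TTL_SECONDS = 300  # default live-cache TTL stated in this document


def cache_key(query_text: str) -> str:
    """Assumed key derivation: SHA-256 of the query text."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


class CacheIndex:
    """In-memory stand-in for Dragonfly: query key -> (object key, expiry)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def lookup(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        object_key, expires_at = entry
        if self._clock() >= expires_at:  # entry outlived its TTL
            del self._entries[key]
            return None
        return object_key

    def register(self, key: str, object_key: str,
                 ttl: float = TTL_SECONDS) -> None:
        self._entries[key] = (object_key, self._clock() + ttl)


class ObjectStore:
    """In-memory stand-in for the S3 bucket holding Arrow IPC streams."""

    def __init__(self) -> None:
        self._objects: Dict[str, List[bytes]] = {}

    def write(self, object_key: str, chunks: Iterable[bytes]) -> None:
        self._objects[object_key] = list(chunks)

    def read(self, object_key: str) -> Iterator[bytes]:
        yield from self._objects[object_key]


def fetch(key: str, execute: Callable[[], Iterable[bytes]],
          index: CacheIndex, store: ObjectStore,
          preview: bool = False) -> Iterator[bytes]:
    """Yield result chunks, following the lookup / hit / miss steps."""
    if preview:
        # Preview data bypasses the cache entirely.
        yield from execute()
        return
    object_key = index.lookup(key)
    if object_key is None:
        # Miss: run the query, write the result to storage, record it.
        object_key = f"live-cache/{key}.arrows"
        store.write(object_key, execute())
        index.register(key, object_key)
    # Hit (or freshly written miss): stream from storage to the client.
    yield from store.read(object_key)
```

In this sketch the miss path writes the full result before streaming it back, which mirrors the order given in the Miss step; a real service might tee the stream instead.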
3. Technology Roles
To maintain distinct responsibilities within the architecture, different services handle specific aspects of the caching and execution lifecycle.
Dragonfly
Dragonfly acts as the primary interface for cache management.
It keeps the record of which query caches are currently available.
It serves cache lookups from memory, which keeps them fast.
Distributed Locks: Manages locks to prevent race conditions during query execution.
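To show how a lock can prevent two workers from executing the same query at once, here is a minimal sketch of the acquire and release semantics commonly used with Redis-compatible stores such as Dragonfly: set a key only if it does not exist, with an expiry, and delete it only if the caller still owns it. LockStore is an in-memory stand-in written for this example; the lock names, TTL, and token scheme are assumptions, and this document does not specify how the system actually implements its locks.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class LockStore:
    """In-memory stand-in mimicking `SET name token NX PX ttl` semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._locks: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def acquire(self, name: str, ttl_seconds: float = 30.0) -> Optional[str]:
        """Return an ownership token, or None if the lock is already held."""
        now = self._clock()
        held = self._locks.get(name)
        if held is not None and held[1] > now:
            return None  # another worker holds an unexpired lock
        token = uuid.uuid4().hex
        self._locks[name] = (token, now + ttl_seconds)
        return token

    def release(self, name: str, token: str) -> bool:
        """Delete the lock only if `token` still owns it."""
        held = self._locks.get(name)
        if held is None:
            return False
        if held[1] <= self._clock():
            del self._locks[name]  # expired: nobody owns it any more
            return False
        if held[0] != token:
            return False  # lock was re-acquired by someone else
        del self._locks[name]
        return True


if __name__ == "__main__":
    locks = LockStore()
    token = locks.acquire("query:example")
    if token is not None:
        try:
            pass  # execute the query and populate the cache here
        finally:
            locks.release("query:example", token)
```

The expiry guards against a worker that crashes while holding the lock, and the token check stops a slow worker from releasing a lock that has since passed to another worker.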