# Caching Architecture

This documentation outlines the caching strategy employed by the new Data Execution System. It details the infrastructure, storage mechanisms, and workflows used to ensure high performance, data security, and efficient resource management for live queries.

### 1. Infrastructure & Storage Overview

The caching layer leverages a combination of S3 object storage and in-memory data stores hosted on our infrastructure.

* **Storage Backend**: S3-compatible object storage
* **Hosting Location**: France
* **Orchestration & State**: Dragonfly (hosted on internal servers)
* **Data Format**: Apache Arrow IPC Streaming
* **Cache TTL**: 300 seconds (5 minutes)
* **Encryption**: Yes (SSE-C at rest)

### 2. Live Cache Workflow

The "Live Cache" handles the temporary storage of query results to reduce latency for frequent requests.

#### 2.1 Storage Mechanism

* Location: Query results are stored in an S3 bucket hosted in France.
* Encryption: All data in S3 is encrypted at rest using Server-Side Encryption with our own customer-provided key (SSE-C).
* Retention: The default Time-To-Live (TTL) for cached live queries is 300 seconds (5 minutes).
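To illustrate the SSE-C requirement, the sketch below builds the encryption parameters an S3 PUT request needs: the AES-256 key and its MD5 digest, both base64-encoded as the S3 API expects on the wire. This is a hedged example: the `ssec_headers` helper and the random key are illustrative, not part of the actual service code (some SDKs, such as boto3, perform the base64 encoding for you).

```python
import base64
import hashlib
import secrets

def ssec_headers(key: bytes) -> dict:
    """Build SSE-C parameters for an S3 PUT request (illustrative helper).

    S3 expects the 256-bit key base64-encoded, plus a base64-encoded MD5
    digest of the raw key so the server can verify the key's integrity.
    """
    assert len(key) == 32, "SSE-C requires a 256-bit (32-byte) key"
    return {
        "SSECustomerAlgorithm": "AES256",
        "SSECustomerKey": base64.b64encode(key).decode(),
        "SSECustomerKeyMD5": base64.b64encode(hashlib.md5(key).digest()).decode(),
    }

# A throwaway key for demonstration; the real service uses its own managed key.
params = ssec_headers(secrets.token_bytes(32))
```

Because the key is customer-provided, S3 never stores it: the same parameters must be supplied again on every GET for the cached object.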

#### 2.2 Data Serialization

To ensure high-performance data transfer, the Data Service API caches and retrieves query results using the Apache Arrow IPC Streaming format.

* Ref: [Arrow Columnar Format - IPC Streaming](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format)

#### 2.3 Retrieval Process

When a query is requested:

1. Lookup: The system checks Dragonfly to determine whether a cached result exists for the query.
2. Hit: If the data is cached, the Arrow IPC file is streamed directly from S3 to the Frontend.
3. Miss: If the data is not cached, the query is executed, the results are written to S3 (streamed as IPC), and then streamed to the client.
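The lookup/hit/miss flow above can be sketched as follows. This is a simplified model: plain dictionaries stand in for Dragonfly (the cache index) and S3 (the result store), and `fake_execute` stands in for real query execution.

```python
import hashlib

# In-memory stand-ins for Dragonfly (cache index) and S3 (result store).
dragonfly = {}   # cache_key -> S3 object key
s3_bucket = {}   # S3 object key -> serialized Arrow IPC bytes

def cache_key(query: str) -> str:
    """Derive a deterministic cache key from the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def get_result(query: str, execute) -> bytes:
    """Return the cached IPC payload, executing the query on a miss."""
    key = cache_key(query)
    object_key = dragonfly.get(key)       # 1. Lookup in Dragonfly
    if object_key is not None:
        return s3_bucket[object_key]      # 2. Hit: stream from S3
    payload = execute(query)              # 3. Miss: run the query,
    object_key = f"live-cache/{key}.arrow"
    s3_bucket[object_key] = payload       #    write the IPC stream to S3,
    dragonfly[key] = object_key           #    and register it in Dragonfly.
    return payload

# Usage: the first call executes the query, the second is served from cache.
calls = []
def fake_execute(q):
    calls.append(q)
    return b"arrow-ipc-bytes"

first = get_result("SELECT 1", fake_execute)
second = get_result("SELECT 1", fake_execute)
```

In production the TTL is enforced by Dragonfly key expiry, so a stale index entry simply disappears and the next request falls through to the miss path.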

> Note on Preview Data: Data used for preview flows is never cached; it is streamed directly to the client.

### 3. Technology Roles

To maintain distinct responsibilities within the architecture, different services handle specific aspects of the caching and execution lifecycle.

#### Dragonfly

Dragonfly acts as the primary interface for cache management.

* It maintains the index of available cached results for each query.
* It handles high-performance caching operations.
* Distributed Locks: It manages locks to prevent race conditions during concurrent query execution.
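Dragonfly is Redis-compatible, so the distributed lock typically follows the Redis `SET key token NX EX ttl` pattern: acquisition succeeds only if the key is unheld, and release is a compare-and-delete on the token. The sketch below emulates that pattern with an in-memory dictionary; the function names and the lock key are illustrative, not the service's actual API.

```python
import uuid

# In-memory stand-in for Dragonfly's key space (illustrative only).
lock_store = {}

def acquire_lock(lock_key: str):
    """Emulate SET key token NX: succeed only if no one holds the lock."""
    if lock_key in lock_store:
        return None                      # another worker holds the lock
    token = uuid.uuid4().hex             # unique token identifies the owner
    lock_store[lock_key] = token
    return token

def release_lock(lock_key: str, token: str) -> bool:
    """Release only if we still own the lock (compare-and-delete)."""
    if lock_store.get(lock_key) == token:
        del lock_store[lock_key]
        return True
    return False                         # lock expired or owned by someone else

# Usage: only one worker executes a given query; others wait or re-check cache.
token = acquire_lock("lock:query:abc123")
blocked = acquire_lock("lock:query:abc123")
released = release_lock("lock:query:abc123", token)
```

In the real deployment the lock key would also carry a TTL so a crashed worker cannot hold the lock forever, mirroring the cache's own expiry behavior.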
