# Management of datasets stored under the new system

### Overview <a href="#overview" id="overview"></a>

{% hint style="info" %}
Refresh data jobs are supported since [v0.4.0](/additional-ressources/latest-releases/2025-releases.md#september-23-2025-v154) of our Data Execution System.
{% endhint %}

The *Refresh Data* feature in Toucan enables updating datasets stored in the Toucan data store. This document explains how this mechanism works with the new **Data Execution System (HADES)**, and outlines the main differences from the legacy execution engine.

The transition to a **multi-tenant job execution system** provides significant improvements in resource management efficiency, job stability, and overall performance.

***

### Changes Introduced by the New Data Execution System <a href="#changes-introduced-by-the-new-data-execution-syste" id="changes-introduced-by-the-new-data-execution-syste"></a>

Our new system restructures the way *refresh data jobs* are processed. Compared to the legacy model, it changes:

* How data processing jobs are managed
* The efficiency and reliability of those jobs
* The size and structure of stored data files

These changes collectively improve performance and scalability across multiple tenants.

***

### Legacy Data Execution System <a href="#legacy-data-execution-system" id="legacy-data-execution-system"></a>

In the previous system:

* Each **workspace** had a fixed amount of RAM for handling refresh jobs.
* These jobs were processed by a **dedicated backend worker** specific to the Toucan stack.
* Each backend worker managed all the refresh jobs for a single workspace.
* Jobs could be **interrupted arbitrarily** if the backend worker reached its memory limits.

This setup restricted concurrency and stability when multiple refresh operations competed for memory within the same workspace.

***

### How we handle refresh data jobs with our new Multi-Tenant Execution System <a href="#hades-new-multi-tenant-execution-system" id="hades-new-multi-tenant-execution-system"></a>

With our new multi-tenant execution system:

* Job processing is **multi-tenant** and shared across customers.
* Each refresh job runs in an **isolated Unix process**, ensuring fault isolation.
* Each **worker pod in Kubernetes** has a fixed maximum RAM limit.
* Memory is dynamically managed through a **custom allocator**, preventing any single job from consuming excessive memory.
* Previews are executed **synchronously** within HADES to ensure consistency between preview and execution runs.

This architecture ensures fair resource allocation: one customer’s job cannot “steal” memory from another.

<figure><img src="/files/IobjfxvSiPjCoIRyurDs" alt=""><figcaption><p>Data Execution Service - September 2025</p></figcaption></figure>
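
The fault isolation that comes from running each job in its own process can be illustrated with a minimal Python sketch. This is not HADES code: `run_job`, the tenant names, and the job snippets are hypothetical, and the memory failure is simulated with an exception rather than enforced by an allocator.

```python
import subprocess
import sys


def run_job(job_source: str) -> bool:
    """Run one job in its own OS process and report whether it succeeded.

    Hypothetical helper for illustration only.
    """
    completed = subprocess.run(
        [sys.executable, "-c", job_source],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means the job failed; the parent keeps running.
    return completed.returncode == 0


# One job that fails (simulated out-of-memory) and one that succeeds.
jobs = {
    "tenant-a": "raise MemoryError('simulated memory limit reached')",
    "tenant-b": "print(sum(range(10)))",
}

results = {tenant: run_job(source) for tenant, source in jobs.items()}
print(results)
```

The failing job only terminates its own process, so the second job still runs to completion.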

***

### Memory Allocation for Refresh Jobs <a href="#memory-allocation-for-refresh-jobs" id="memory-allocation-for-refresh-jobs"></a>

Every refresh job has a **defined memory limit**. For stored datasets, the workflow typically involves:

1. **Data loading:** fetching the dataset into memory from storage.
2. **In-memory processing:** applying the defined transformations.
3. **Result output:** writing the processed data back to the Toucan data store.

Memory is allocated progressively during processing; the full limit is not preallocated (e.g., no initial `malloc(4GiB)`). However, because most jobs load data early, a job that exceeds its limit will typically hit it near the beginning.

Unlike the legacy engine, HADES **fails fast**: a job stops as soon as it reaches its memory limit instead of being interrupted later in processing.
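
Progressive allocation with a fail-fast check can be sketched as a simple byte counter. This is an illustration under assumptions, not the actual allocator: `MemoryBudget`, `MemoryLimitExceeded`, the stage names, and the sizes are all hypothetical, and the class only counts bytes without allocating anything.

```python
class MemoryLimitExceeded(Exception):
    """Raised as soon as a job asks for more memory than its limit allows."""


class MemoryBudget:
    """Tracks progressive allocations against a fixed per-job limit."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def allocate(self, size_bytes: int) -> None:
        # Fail fast: reject the request before the limit is crossed.
        if self.used_bytes + size_bytes > self.limit_bytes:
            raise MemoryLimitExceeded(
                f"requested {size_bytes} bytes with {self.used_bytes} of "
                f"{self.limit_bytes} already in use"
            )
        self.used_bytes += size_bytes


GIB = 1024 ** 3
budget = MemoryBudget(limit_bytes=4 * GIB)

# Nothing is reserved up front; usage grows as each stage requests memory.
stages = [("load", 3 * GIB), ("join", 2 * GIB), ("write", 1 * GIB)]
completed = []
try:
    for name, size in stages:
        budget.allocate(size)
        completed.append(name)
except MemoryLimitExceeded as error:
    print(f"job failed during '{name}': {error}")
```

In this made-up run the job fails at the `join` stage, right after loading, which mirrors the early saturation described above.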

### Factors affecting memory consumption

* **Column types:** more complex data types (e.g., nested or string-heavy fields) consume more memory.
* **Transformation complexity:** joins, aggregations, and sorts increase memory usage.
* **Column cardinality:** columns with many unique values increase computational load.
* **Early filtering:** applying filters earlier reduces downstream workload.
* **Column selection:** operating on fewer columns minimizes memory usage.

{% hint style="info" %}
Our service applies **lazy evaluation** and builds an **optimized execution plan** before allocating memory. This ensures efficient use of resources and predictable runtime behavior across concurrent jobs.
{% endhint %}
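
The benefit of lazy evaluation combined with early filtering and column selection can be sketched with plain Python generators. This is a standard-library illustration of the idea, not the engine's implementation; the sample rows and the helpers `scan`, `keep_country`, and `select` are made up for the example.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan(rows: Iterable[Row], counter: List[int]) -> Iterator[Row]:
    """Yield rows one at a time and count how many were read."""
    for row in rows:
        counter[0] += 1
        yield row


def keep_country(rows: Iterable[Row], country: str) -> Iterator[Row]:
    """Filter step: nothing is evaluated until a consumer iterates."""
    return (row for row in rows if row["country"] == country)


def select(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Projection step: keep only the columns later steps need."""
    return ({name: row[name] for name in columns} for row in rows)


source = [
    {"country": "FR", "amount": 10, "comment": "x" * 100},
    {"country": "US", "amount": 20, "comment": "y" * 100},
    {"country": "FR", "amount": 30, "comment": "z" * 100},
]

rows_read = [0]
# Building the plan does not read any row yet.
plan = select(keep_country(scan(source, rows_read), "FR"), ["amount"])
reads_before = rows_read[0]

# Only now are rows pulled through the filter and the projection.
result = list(plan)
print(reads_before, rows_read[0], result)
```

Because the filter and the projection run before anything is materialized, the wide `comment` column and the non-matching row never reach the final result.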

***

### Data Processing Workflow <a href="#data-processing-workflow" id="data-processing-workflow"></a>

Each refresh job follows the three standard data processing stages in Toucan:

### 1. Planning Extraction

During this step, our system plans everything needed to create the dataset.

### 2. Fetching Data

Data is retrieved from the source system (SQL query, remote file, or Toucan data store).\
Requirements:

* Source must be **available**
* Query must be **processable** within the source system’s constraints

The behavior depends on the pipeline type: for a **NativeSQL pipeline**, all processing happens in the datasource before the result is transferred to Toucan.

### 3. In-memory execution

The extracted data is loaded into memory, and transformations are applied in sequence.

* **Hybrid pipeline:** early steps run in the datasource, later ones in the Toucan engine.
* **Toucan-only pipeline:** all steps are processed in Toucan when incompatibilities exist (e.g., datasource not supporting SQL or hybrid chaining).

After transformations, the processed result is written to Toucan’s internal storage.
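
The relationship between the three pipeline types can be sketched as a split of the step list. The rule used here is an assumption for illustration (once one step cannot run in the datasource, every later step runs in Toucan), and `split_pipeline` and the step names are hypothetical.

```python
from typing import List, Tuple


def split_pipeline(
    steps: List[Tuple[str, bool]],
) -> Tuple[str, List[str], List[str]]:
    """Split steps into a datasource part and a Toucan part.

    Each step is (name, datasource_can_run_it).
    """
    in_datasource: List[str] = []
    in_toucan: List[str] = []
    for name, supported in steps:
        # A step stays in the datasource only if no earlier step left it.
        if supported and not in_toucan:
            in_datasource.append(name)
        else:
            in_toucan.append(name)

    if not in_toucan:
        kind = "NativeSQL"
    elif not in_datasource:
        kind = "Toucan-only"
    else:
        kind = "Hybrid"
    return kind, in_datasource, in_toucan


steps = [
    ("filter", True),
    ("join", True),
    ("custom formula", False),
    ("sort", True),
]
print(split_pipeline(steps))
```

With these made-up steps, `filter` and `join` stay in the datasource while `custom formula` and `sort` run in Toucan, so the pipeline is classified as Hybrid.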

***

### Performance and Efficiency Results <a href="#performance-and-efficiency-results" id="performance-and-efficiency-results"></a>

Performance comparisons show **large gains** with HADES over the legacy system:

| Scenario                   | Legacy System (Laputa) | HADES       | Performance Gain |
| -------------------------- | ---------------------- | ----------- | ---------------- |
| Simple Transformations     | Up to 6 minutes        | <10 seconds | 97% faster       |
| Complex Joins/Aggregations | 2–3 minutes            | <30 seconds | \~90% faster     |

***

### Data Storage Architecture <a href="#data-storage-architecture" id="data-storage-architecture"></a>

Under the new architecture:

* Customer data is stored in **dedicated S3 buckets**, one per workspace.
* The system is **multi-tenant**, processing jobs for all clients via a shared service layer.
* Processed data is **not persisted** in the execution layer — only in S3.

In the future, we plan to allow customers to **provide their own S3 buckets**, maintaining full control of their data while Toucan writes transformation outputs directly there.

***

### Data File Efficiency <a href="#disk-usage-and-data-file-efficiency" id="disk-usage-and-data-file-efficiency"></a>

The new storage layer uses **compressed, columnar formats** optimized for analytical workflows.\
Stored files occupy significantly less disk space compared to previous builds.

* Files downloaded by users remain **CSV-formatted**, while storage files use a more efficient binary format.
* Average disk usage per dataset has **decreased considerably**, resulting in faster reads and improved storage density.
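
The difference between a row-oriented CSV file and a compressed, column-oriented file can be illustrated with the standard library. The actual storage format is not specified here: JSON plus `zlib` merely stands in for a compressed columnar file, and the sample data is made up.

```python
import csv
import io
import json
import zlib

# Made-up sample data: 1,000 rows with a low-cardinality "country" column.
header = ["id", "country", "amount"]
rows = [(i, "FR" if i % 2 else "US", i % 10) for i in range(1000)]


def to_csv_bytes(header, rows):
    """Serialize rows as CSV, similar to what a download contains."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


csv_bytes = to_csv_bytes(header, rows)

# Column-oriented layout: values of one column sit next to each other,
# which gives the compressor long runs of similar values.
columns = {
    name: [row[index] for row in rows] for index, name in enumerate(header)
}
stored_bytes = zlib.compress(json.dumps(columns).encode("utf-8"))

# A download can still be rebuilt as CSV from the stored columns.
restored = json.loads(zlib.decompress(stored_bytes))
restored_rows = list(zip(*(restored[name] for name in header)))
download_bytes = to_csv_bytes(header, restored_rows)

print(len(csv_bytes), len(stored_bytes), download_bytes == csv_bytes)
```

The stored representation is smaller than the CSV, yet the CSV can be regenerated from it on demand.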

***

### Q&A: Refresh Data under the new Data Execution System <a href="#maintenance-and-freshness-signals" id="maintenance-and-freshness-signals"></a>

### What is the *Refresh Data* feature in Toucan?

The *Refresh Data* feature updates datasets stored in the Toucan data store. It ensures your dashboards and metrics always reflect the most recent information from your sources.

### What has changed with the new Data Execution System (HADES)?

Toucan has migrated from a legacy, single-tenant system to **HADES**, a multi-tenant execution engine designed for improved job isolation, higher efficiency, and faster refresh times.

***

### Architecture and System Behavior <a href="#architecture-and-system-behavior" id="architecture-and-system-behavior"></a>

### How does the HADES system manage data refresh jobs?

Each refresh job runs as an isolated Unix process within a Kubernetes worker pod. Memory usage is tightly controlled by a **custom allocator**, which prevents one job from consuming another’s resources.

### What are the main differences between HADES and the legacy system?

| Aspect              | Legacy System            | HADES                            |
| ------------------- | ------------------------ | -------------------------------- |
| Job Scope           | One worker per workspace | Multi-tenant shared workers      |
| Memory Isolation    | Limited per workspace    | Full per-job isolation           |
| Failure Behavior    | Random job interruptions | Controlled memory-bound failures |
| Resource Efficiency | Fixed per workspace      | Dynamic balancing per job        |

### How does HADES prevent resource conflicts between tenants?

Each worker pod has a predefined memory limit. If a job tries to allocate more memory than allowed, it fails quickly, protecting other tenants from performance degradation.

***

### Memory Management and Performance <a href="#memory-management-and-performance" id="memory-management-and-performance"></a>

### How is memory allocated for refresh jobs?

Memory is allocated progressively during job execution. Large datasets trigger early allocations during data fetching, so failures occur sooner if limits are exceeded.

### What factors influence memory consumption?

* Data type complexity
* Transformation type (joins, aggregations, sorts)
* Column cardinality (number of unique values)
* Step ordering (early filtering is more efficient)
* Number of columns used in transformations

### Does Toucan use Polars for data processing?

Yes. Toucan uses **Polars**, which builds an optimized execution plan before allocating memory. This allows predictable, efficient use of resources across concurrent jobs.

### What happens if a job exceeds its memory limit?

Jobs exceeding their memory quota fail immediately rather than gradually degrading performance. This provides faster feedback and better resource stability.

***

### Data Processing Workflow <a href="#data-processing-workflow" id="data-processing-workflow"></a>

### What are the main stages of a refresh job?

Every refresh job follows three main steps:

1. **Data Extraction** – Fetching information from the datasource or files.
2. **Transformation** – Applying data processing steps (filtering, joins, aggregations).
3. **Loading** – Writing the processed output back to Toucan storage.

### What are the different pipeline types?

* **NativeSQL** – Entirely processed within the datasource.
* **Hybrid** – Early steps run in the datasource, later ones in Toucan.
* **Toucan-only** – Fully executed in Toucan (when datasource limitations exist).

***

### Data Storage and File Management <a href="#data-storage-and-file-management" id="data-storage-and-file-management"></a>

### How is customer data stored?

Each customer’s data is isolated in a **dedicated S3 bucket**. HADES uses a shared service architecture for job execution, but no customer data is persisted in the compute layer.

### Can customers use their own S3 buckets?

Not yet. We plan to let customers configure their own S3 storage in a future update, so that they retain full control over their data.

### How are files stored and downloaded?

* Files are stored in a compact, columnar format optimized for analytical queries.
* Downloads always return in **CSV format** for compatibility and ease of use.
* On average, stored files are significantly smaller than with legacy formats.

