🚃 Management of datasets stored under the new system
How the new Data Execution System handles stored datasets
Overview
The Refresh Data feature in Toucan enables updating datasets stored in the Toucan data store. This document explains how this mechanism works with the new Data Execution System (HADES), and outlines the main differences from the legacy execution engine.
The transition to a multi-tenant job execution system provides significant improvements in resource management efficiency, job stability, and overall performance.
Changes Introduced by the New Data Execution System
Our new system restructures the way refresh data jobs are processed. Compared to the legacy model, it changes:
How data processing jobs are managed
The efficiency and reliability of those jobs
The size and structure of stored data files
These changes collectively improve performance and scalability across multiple tenants.
Legacy Data Execution System
In the previous system:
Each workspace had a fixed amount of RAM for handling refresh jobs.
These jobs were processed by a dedicated backend worker specific to the Toucan stack.
Each backend worker managed all the refresh jobs for a single workspace.
Jobs could be interrupted arbitrarily if the backend worker reached its memory limits.
This setup restricted concurrency and stability when multiple refresh operations competed for memory within the same workspace.
How we handle refresh data jobs with our new Multi-Tenant Execution System
With our new multi-tenant execution system:
Job processing is multi-tenant and shared across customers.
Each refresh job runs in an isolated Unix process, ensuring fault isolation.
Each worker pod in Kubernetes has a fixed maximum RAM limit.
Memory is dynamically managed through a custom allocator, preventing any single job from consuming excessive memory.
Previews are executed synchronously within HADES to ensure consistency between preview and execution runs.
This architecture ensures fair resource allocation: one customer’s job cannot “steal” memory from another.
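To make the isolation model concrete, here is a minimal sketch (not Toucan's actual implementation) of running one refresh job in its own Unix process with a hard memory cap. The 4 GiB limit, the function, and the job identifier are illustrative only, and the resource module requires a Unix-like OS.

```python
import multiprocessing as mp
import resource

# Hypothetical per-job cap; the real limit is configured per worker pod.
JOB_MEMORY_LIMIT = 4 * 1024**3  # 4 GiB


def run_refresh_job(job_id: str) -> None:
    # Cap the address space of this child process only; sibling jobs and the
    # parent worker keep their own budgets.
    resource.setrlimit(resource.RLIMIT_AS, (JOB_MEMORY_LIMIT, JOB_MEMORY_LIMIT))
    # ... load the dataset, apply transformations, write the result ...
    print(f"job {job_id} finished within its memory budget")


if __name__ == "__main__":
    # One process per refresh job: a crash or memory failure in this job
    # cannot affect jobs belonging to other tenants.
    p = mp.Process(target=run_refresh_job, args=("workspace-42/refresh-1",))
    p.start()
    p.join()
```

In production the cap is enforced by the custom allocator and the Kubernetes pod limits described above; the sketch only shows the per-process principle.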

Memory Allocation for Refresh Jobs
Every refresh job has a defined memory limit. For stored datasets, the workflow typically involves:
Data loading: fetching the dataset into memory from storage.
In-memory processing: applying the defined transformations.
Result output: writing the processed data back to the Toucan data store.
Memory allocation occurs progressively during processing rather than preallocating the full limit (e.g., no initial malloc(4GiB)). However, since most jobs load data early, memory saturation will typically occur near the beginning if limits are exceeded.
HADES differs from the legacy engine by failing fast when memory limits are reached, rather than being interrupted unpredictably partway through processing.
Factors affecting memory consumption
Column types: more complex data types (e.g., nested or string-heavy fields) consume more memory.
Transformation complexity: joins, aggregations, and sorts increase memory usage.
Column cardinality: columns with many unique values increase computational load.
Early filtering: applying filters earlier reduces downstream workload.
Column selection: operating on fewer columns minimizes memory usage (see the sketch after this list).
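The impact of early filtering and column selection can be illustrated with a short Polars sketch. This is not Toucan's internal code, and the file path and column names are invented; the point is that a lazy query lets the engine push the filter and the column selection down to the scan, so only the needed rows and columns are ever loaded.

```python
import polars as pl

# Lazy query: nothing is loaded yet, so memory is allocated progressively
# as the query executes rather than reserved up front.
lazy = (
    pl.scan_parquet("dataset.parquet")          # hypothetical stored dataset
    .filter(pl.col("country") == "FR")          # early filtering
    .select(["country", "city", "revenue"])     # only the columns actually used
    .group_by("city")
    .agg(pl.col("revenue").sum())
)

# Predicate and projection pushdown mean only matching rows and the three
# selected columns are read from storage when the plan finally runs.
result = lazy.collect()
```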
Data Processing Workflow
Each refresh job follows the three standard data processing stages in Toucan:
1. Planning Extraction
During this step, the system plans everything needed to build the dataset before any data is fetched.
2. Fetching Data
Data is retrieved from the source system (SQL query, remote file, or Toucan data store). Requirements:
Source must be available
Query must be processable within the source system’s constraints
The pipeline type matters here: with a NativeSQL pipeline, all processing happens in the datasource before the data is transferred to Toucan.
3. In-memory execution
The extracted data is loaded into memory, and transformations are applied in sequence.
Hybrid pipeline: early steps run in the datasource, later ones in the Toucan engine.
Toucan-only pipeline: all steps are processed in Toucan when incompatibilities exist (e.g., a datasource that does not support SQL or hybrid chaining).
After transformations, the processed result is written to Toucan’s internal storage.
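Below is a simplified, hypothetical sketch of a Toucan-only refresh job expressed with Polars. The file names, columns, and output path are invented and Toucan's real internals differ, but the shape is the same: fetch the extracted data, apply the steps in memory, then write the result to internal storage.

```python
import polars as pl

# 2. Fetching: retrieve the extracted data (here, a hypothetical file export).
source = pl.read_csv("source_export.csv")

# 3. In-memory execution: apply the configured steps in sequence.
result = (
    source
    .filter(pl.col("status") == "active")                   # filter step
    .join(pl.read_csv("countries.csv"), on="country_code")  # join step
    .group_by("country_name")
    .agg(pl.col("amount").sum().alias("total_amount"))      # aggregation step
)

# Write the processed result back to internal storage
# (represented here by a local Parquet file).
result.write_parquet("refreshed_dataset.parquet")
```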
Performance and Efficiency Results
Performance comparisons show massive gains with HADES:
| Scenario | Legacy engine | HADES | Improvement |
| --- | --- | --- | --- |
| Simple transformations | Up to 6 minutes | <10 seconds | 97% faster |
| Complex joins/aggregations | 2–3 minutes | <30 seconds | ~90% faster |
Data Storage Architecture
Under the new architecture:
Customer data is stored in dedicated S3 buckets, one per workspace.
The system is multi-tenant, processing jobs for all clients via a shared service layer.
Processed data is not persisted in the execution layer — only in S3.
In the future, we plan to allow customers to provide their own S3 buckets, maintaining full control of their data while Toucan writes transformation outputs directly there.
Data File Efficiency
The new storage layer uses compressed, columnar formats optimized for analytical workflows. Stored files occupy significantly less disk space than under the legacy engine.
Files downloaded by users remain CSV-formatted, while storage files use a more efficient binary format.
Average disk usage per dataset has decreased considerably, resulting in faster reads and improved storage density.
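As a rough, self-contained illustration of the difference between the storage and download formats (the data and file names are made up), the same frame can be written both ways and the resulting file sizes compared:

```python
import os
import polars as pl

df = pl.DataFrame({
    "city": ["Paris", "Lyon", "Nantes"] * 10_000,
    "revenue": list(range(30_000)),
})

df.write_parquet("dataset.parquet", compression="zstd")  # compressed, columnar storage
df.write_csv("dataset.csv")                              # the format users download

print("parquet:", os.path.getsize("dataset.parquet"), "bytes")
print("csv:    ", os.path.getsize("dataset.csv"), "bytes")
```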
Q&A: Refresh Data under the new Data Execution System
What is the Refresh Data feature in Toucan?
The Refresh Data feature updates datasets stored in the Toucan data store. It ensures your dashboards and metrics always reflect the most recent information from your sources.
What has changed with the new Data Execution System (HADES)?
Toucan has migrated from a legacy, single-tenant system to HADES, a multi-tenant execution engine designed for improved job isolation, higher efficiency, and faster refresh times.
Architecture and System Behavior
How does the HADES system manage data refresh jobs?
Each refresh job runs as an isolated Unix process within a Kubernetes worker pod. Memory usage is tightly controlled by a custom allocator, which prevents one job from consuming another’s resources.
What are the main differences between HADES and the legacy system?
| Aspect | Legacy system | HADES |
| --- | --- | --- |
| Job scope | One worker per workspace | Multi-tenant shared workers |
| Memory isolation | Limited per workspace | Full per-job isolation |
| Failure behavior | Random job interruptions | Controlled memory-bound failures |
| Resource efficiency | Fixed per workspace | Dynamic balancing per job |
How does HADES prevent resource conflicts between tenants?
Each worker pod has a predefined memory limit. If a job tries to allocate more memory than allowed, it fails quickly, protecting other tenants from performance degradation.
Memory Management and Performance
How is memory allocated for refresh jobs?
Memory is allocated progressively during job execution. Large datasets trigger early allocations during data fetching, so failures occur sooner if limits are exceeded.
What factors influence memory consumption?
Data type complexity
Transformation type (joins, aggregations, sorts)
Column cardinality (number of unique values)
Step ordering (early filtering is more efficient)
Number of columns used in transformations
Does Toucan use Polars for data processing?
Yes. Toucan uses Polars, which builds an optimized execution plan before allocating memory. This allows predictable, efficient use of resources across concurrent jobs.
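As a small, hypothetical illustration (the file and column names are invented): a lazy Polars query builds and optimizes its plan before any data is loaded, and the plan can be inspected without allocating memory for the dataset.

```python
import polars as pl

plan = (
    pl.scan_csv("sales.csv")             # lazy scan: no data is read yet
    .filter(pl.col("year") >= 2024)
    .select(["year", "amount"])
    .group_by("year")
    .agg(pl.col("amount").sum())
)

# Print the optimized query plan (with predicate/projection pushdown applied)
# without executing it; data is only loaded when .collect() is called.
print(plan.explain())
```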
What happens if a job exceeds its memory limit?
Jobs exceeding their memory quota fail immediately rather than gradually degrading performance. This provides faster feedback and better resource stability.
Data Processing Workflow
What are the main stages of a refresh job?
Every refresh job follows three main steps:
Data Extraction – Fetching information from the datasource or files.
Transformation – Applying data processing steps (filtering, joins, aggregations).
Loading – Writing the processed output back to Toucan storage.
What are the different pipeline types?
NativeSQL – Entirely processed within the datasource.
Hybrid – Early steps run in the datasource, later ones in Toucan (see the sketch after this list).
Toucan-only – Fully executed in Toucan (when datasource limitations exist).
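Here is a simplified sketch of the hybrid case; the SQL text, table, and columns are invented, and the fetched result is stubbed with a literal frame. Early steps are pushed down to the datasource as SQL, and the remaining steps run in Toucan.

```python
import polars as pl

# Early steps expressed as SQL and executed inside the customer's datasource:
PUSHED_DOWN_SQL = """
SELECT city, revenue
FROM sales
WHERE year >= 2024
"""

# Result of the pushed-down query once fetched into memory (stubbed here):
fetched = pl.DataFrame({"city": ["Paris", "Lyon", "Paris"], "revenue": [120, 80, 60]})

# Later steps that the datasource cannot express run in the Toucan engine:
result = fetched.group_by("city").agg(pl.col("revenue").sum())
```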
Data Storage and File Management
How is customer data stored?
Each customer’s data is isolated in a dedicated S3 bucket. HADES uses a shared service architecture for job execution, but no customer data is persisted in the compute layer.
Can customers use their own S3 buckets?
Not yet, but we plan to add this in a future update, allowing customers to configure their own S3 storage so they retain full control over their data.
How are files stored and downloaded?
Files are stored in a compact, columnar format optimized for analytical queries.
Downloads always return in CSV format for compatibility and ease of use.
On average, stored file sizes are significantly smaller than with legacy formats.