🚃 Management of datasets stored under the new system
How the new Data Execution System handles stored datasets
Overview
The Refresh Data feature in Toucan enables updating datasets stored in the Toucan data store. This document explains how this mechanism works with the new Data Execution System (HADES), and outlines the main differences from the legacy execution engine.
The transition to a multi-tenant job execution system provides significant improvements in resource management efficiency, job stability, and overall performance.
Changes Introduced by the New Data Execution System
Our new system restructures the way refresh data jobs are processed. Compared to the legacy model, it changes:
How data processing jobs are managed
The efficiency and reliability of those jobs
The size and structure of stored data files
These changes collectively improve performance and scalability across multiple tenants.
Legacy Data Execution System
In the previous system:
Each workspace had a fixed amount of RAM for handling refresh jobs.
These jobs were processed by a dedicated backend worker specific to the Toucan stack.
Each backend worker managed all the refresh jobs for a single workspace.
Jobs could be interrupted arbitrarily if the backend worker reached its memory limits.
This setup restricted concurrency and stability when multiple refresh operations competed for memory within the same workspace.
How we handle refresh data jobs with our new Multi-Tenant Execution System
With our new multi-tenant execution system:
Job processing is multi-tenant and shared across customers.
Each refresh job runs in an isolated Unix process, ensuring fault isolation.
Each worker pod in Kubernetes has a fixed maximum RAM limit.
Memory is dynamically managed through a custom allocator, preventing any single job from consuming excessive memory.
Previews are executed synchronously within HADES to ensure consistency between preview and execution runs.
This architecture ensures fair resource allocation: one customer’s job cannot “steal” memory from another.
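To make the isolation model concrete, here is a minimal sketch (not Toucan's actual implementation) of running one refresh job in its own Unix process with a hard memory cap. The 4 GiB limit, the function, and the job identifier are illustrative only, and the resource module requires a Unix-like OS.

```python
import multiprocessing as mp
import resource

# Hypothetical per-job cap; the real limit is configured per worker pod.
JOB_MEMORY_LIMIT = 4 * 1024**3  # 4 GiB


def run_refresh_job(job_id: str) -> None:
    # Cap the address space of this child process only; sibling jobs and the
    # parent worker keep their own budgets.
    resource.setrlimit(resource.RLIMIT_AS, (JOB_MEMORY_LIMIT, JOB_MEMORY_LIMIT))
    # ... load the dataset, apply transformations, write the result ...
    print(f"job {job_id} finished within its memory budget")


if __name__ == "__main__":
    # One process per refresh job: a crash or memory failure in this job
    # cannot affect jobs belonging to other tenants.
    p = mp.Process(target=run_refresh_job, args=("workspace-42/refresh-1",))
    p.start()
    p.join()
```

In production the cap is enforced by the custom allocator and the Kubernetes pod limits described above; the sketch only shows the per-process principle.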

Memory Allocation for Refresh Jobs
Every refresh job has a defined memory limit. For stored datasets, the workflow typically involves:
Data loading: fetching the dataset into memory from storage.
In-memory processing: applying the defined transformations.
Result output: writing the processed data back to the Toucan data store.
Memory allocation occurs progressively during processing rather than preallocating the full limit (e.g., no initial malloc(4GiB)). However, since most jobs load data early, memory saturation will typically occur near the beginning if limits are exceeded.
HADES differs from the legacy engine by failing fast when memory limits are reached, rather than being interrupted unpredictably partway through processing.
Factors affecting memory consumption
Column types: more complex data types (e.g., nested or string-heavy fields) consume more memory.
Transformation complexity: joins, aggregations, and sorts increase memory usage.
Column cardinality: columns with many unique values increase computational load.
Early filtering: applying filters earlier reduces downstream workload.
Column selection: operating on fewer columns minimizes memory usage (see the sketch after this list).
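The impact of early filtering and column selection can be illustrated with a short Polars sketch. This is not Toucan's internal code, and the file path and column names are invented; the point is that a lazy query lets the engine push the filter and the column selection down to the scan, so only the needed rows and columns are ever loaded.

```python
import polars as pl

# Lazy query: nothing is loaded yet, so memory is allocated progressively
# as the query executes rather than reserved up front.
lazy = (
    pl.scan_parquet("dataset.parquet")          # hypothetical stored dataset
    .filter(pl.col("country") == "FR")          # early filtering
    .select(["country", "city", "revenue"])     # only the columns actually used
    .group_by("city")
    .agg(pl.col("revenue").sum())
)

# Predicate and projection pushdown mean only matching rows and the three
# selected columns are read from storage when the plan finally runs.
result = lazy.collect()
```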
Data Processing Workflow
Each refresh job follows the three standard data processing stages in Toucan:
1. Planning Extraction
During this step, the system plans everything needed to build the dataset before any data is fetched.
2. Fetching Data
Data is retrieved from the source system (SQL query, remote file, or Toucan data store). Requirements:
Source must be available
Query must be processable within the source system’s constraints
The pipeline type matters here: with a NativeSQL pipeline, all processing happens in the datasource before the data is transferred to Toucan.
3. In-memory execution
The extracted data is loaded into memory, and transformations are applied in sequence.
Hybrid pipeline: early steps run in the datasource, later ones in the Toucan engine.
Toucan-only pipeline: all steps are processed in Toucan when incompatibilities exist (e.g., a datasource that does not support SQL or hybrid chaining).
After transformations, the processed result is written to Toucan’s internal storage.
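Below is a simplified, hypothetical sketch of a Toucan-only refresh job expressed with Polars. The file names, columns, and output path are invented and Toucan's real internals differ, but the shape is the same: fetch the extracted data, apply the steps in memory, then write the result to internal storage.

```python
import polars as pl

# 2. Fetching: retrieve the extracted data (here, a hypothetical file export).
source = pl.read_csv("source_export.csv")

# 3. In-memory execution: apply the configured steps in sequence.
result = (
    source
    .filter(pl.col("status") == "active")                   # filter step
    .join(pl.read_csv("countries.csv"), on="country_code")  # join step
    .group_by("country_name")
    .agg(pl.col("amount").sum().alias("total_amount"))      # aggregation step
)

# Write the processed result back to internal storage
# (represented here by a local Parquet file).
result.write_parquet("refreshed_dataset.parquet")
```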
Performance and Efficiency Results
Performance comparisons show massive gains with HADES:
| Scenario | Legacy engine | HADES | Improvement |
| --- | --- | --- | --- |
| Simple transformations | Up to 6 minutes | <10 seconds | 97% faster |
| Complex joins/aggregations | 2–3 minutes | <30 seconds | ~90% faster |
Data Storage Architecture
Under the new architecture:
Customer data is stored in dedicated S3 buckets, one per workspace.
The system is multi-tenant, processing jobs for all clients via a shared service layer.
Processed data is not persisted in the execution layer — only in S3.
In the future, we plan to allow customers to provide their own S3 buckets, maintaining full control of their data while Toucan writes transformation outputs directly there.
Data File Efficiency
The new storage layer uses compressed, columnar formats optimized for analytical workflows. Stored files occupy significantly less disk space than under the legacy engine.
Files downloaded by users remain CSV-formatted, while storage files use a more efficient binary format.
Average disk usage per dataset has decreased considerably, resulting in faster reads and improved storage density.
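As a rough, self-contained illustration of the difference between the storage and download formats (the data and file names are made up), the same frame can be written both ways and the resulting file sizes compared:

```python
import os
import polars as pl

df = pl.DataFrame({
    "city": ["Paris", "Lyon", "Nantes"] * 10_000,
    "revenue": list(range(30_000)),
})

df.write_parquet("dataset.parquet", compression="zstd")  # compressed, columnar storage
df.write_csv("dataset.csv")                              # the format users download

print("parquet:", os.path.getsize("dataset.parquet"), "bytes")
print("csv:    ", os.path.getsize("dataset.csv"), "bytes")
```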
Q&A: Refresh Data under the new Data Execution System
What is the Refresh Data feature in Toucan?
The Refresh Data feature updates datasets stored in the Toucan data store. It ensures your dashboards and metrics always reflect the most recent information from your sources.
What has changed with the new Data Execution System (HADES)?
Toucan has migrated from a legacy, single-tenant system to HADES, a multi-tenant execution engine designed for improved job isolation, higher efficiency, and faster refresh times.
Architecture and System Behavior
How does the HADES system manage data refresh jobs?
Each refresh job runs as an isolated Unix process within a Kubernetes worker pod. Memory usage is tightly controlled by a custom allocator, which prevents one job from consuming another’s resources.
What are the main differences between HADES and the legacy system?
| Aspect | Legacy system | HADES |
| --- | --- | --- |
| Job scope | One worker per workspace | Multi-tenant shared workers |
| Memory isolation | Limited per workspace | Full per-job isolation |
| Failure behavior | Random job interruptions | Controlled memory-bound failures |
| Resource efficiency | Fixed per workspace | Dynamic balancing per job |
How does HADES prevent resource conflicts between tenants?
Each worker pod has a predefined memory limit. If a job tries to allocate more memory than allowed, it fails quickly, protecting other tenants from performance degradation.
Memory Management and Performance
How is memory allocated for refresh jobs?
Memory is allocated progressively during job execution. Large datasets trigger early allocations during data fetching, so failures occur sooner if limits are exceeded.
What factors influence memory consumption?
Data type complexity
Transformation type (joins, aggregations, sorts)
Column cardinality (number of unique values)
Step ordering (early filtering is more efficient)
Number of columns used in transformations
Does Toucan use Polars for data processing?
Yes. Toucan uses Polars, which builds an optimized execution plan before allocating memory. This allows predictable, efficient use of resources across concurrent jobs.
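As a small, hypothetical illustration (the file and column names are invented): a lazy Polars query builds and optimizes its plan before any data is loaded, and the plan can be inspected without allocating memory for the dataset.

```python
import polars as pl

plan = (
    pl.scan_csv("sales.csv")             # lazy scan: no data is read yet
    .filter(pl.col("year") >= 2024)
    .select(["year", "amount"])
    .group_by("year")
    .agg(pl.col("amount").sum())
)

# Print the optimized query plan (with predicate/projection pushdown applied)
# without executing it; data is only loaded when .collect() is called.
print(plan.explain())
```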
What happens if a job exceeds its memory limit?
Jobs exceeding their memory quota fail immediately rather than gradually degrading performance. This provides faster feedback and better resource stability.
Data Processing Workflow
What are the main stages of a refresh job?
Every refresh job follows three main steps:
Data Extraction – Fetching information from the datasource or files.
Transformation – Applying data processing steps (filtering, joins, aggregations).
Loading – Writing the processed output back to Toucan storage.
What are the different pipeline types?
NativeSQL – Entirely processed within the datasource.
Hybrid – Early steps run in the datasource, later ones in Toucan (see the sketch after this list).
Toucan-only – Fully executed in Toucan (when datasource limitations exist).
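Here is a simplified sketch of the hybrid case; the SQL text, table, and columns are invented, and the fetched result is stubbed with a literal frame. Early steps are pushed down to the datasource as SQL, and the remaining steps run in Toucan.

```python
import polars as pl

# Early steps expressed as SQL and executed inside the customer's datasource:
PUSHED_DOWN_SQL = """
SELECT city, revenue
FROM sales
WHERE year >= 2024
"""

# Result of the pushed-down query once fetched into memory (stubbed here):
fetched = pl.DataFrame({"city": ["Paris", "Lyon", "Paris"], "revenue": [120, 80, 60]})

# Later steps that the datasource cannot express run in the Toucan engine:
result = fetched.group_by("city").agg(pl.col("revenue").sum())
```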
Data Storage and File Management
How is customer data stored?
Each customer’s data is isolated in a dedicated S3 bucket. HADES uses a shared service architecture for job execution, but no customer data is persisted in the compute layer.
Can customers use their own S3 buckets?
Not yet, but we plan to add this in a future update, allowing customers to configure their own S3 storage so they retain full control over their data.
How are files stored and downloaded?
Files are stored in a compact, columnar format optimized for analytical queries.
Downloads always return in CSV format for compatibility and ease of use.
On average, stored file sizes are significantly smaller than with legacy formats.