Hybrid pipeline

Introduction

Hybrid Pipeline is a Toucan feature that optimizes the execution of data transformation pipelines by combining two modes: NativeSQL execution, where the pipeline is translated into SQL and executed by the data source, and in-memory processing, where the steps are executed in memory by the Toucan engine.

Execution Engines

Toucan supports several execution modes for YouPrep steps.

NativeSQL

Steps are translated into SQL queries and executed directly by the connected database, which can be one of the following:

PostgreSQL
Google BigQuery
Snowflake
Amazon Redshift
Amazon Athena
MySQL (with the new execution system)
MsSQL and Azure SQL (with the new execution system)
ClickHouse (with the new execution system)

Toucan engine

The Toucan engine executes data transformation steps in the Toucan backend. It covers:

In-memory: Toucan executes transformations in RAM.
Toucan Storage data store: For data loaded in "load" mode.

Data transformation pipeline

A data transformation pipeline is a sequence of YouPrep steps that transforms an input, which can be a single dataset or a combination of datasets, into an output dataset.

How It Works

The data transformation pipeline is executed in NativeSQL mode as long as the steps are compatible. See here for more information.
When an incompatible step is encountered, execution switches to in-memory mode for the rest of the pipeline.
Execution can be shared between the data source and Toucan.

For example, suppose you insert a statistics step, which is not NativeSQL compatible (it cannot be translated into SQL), into a data transformation pipeline whose other steps are all NativeSQL compatible. From that step onwards, the statistics step and all following steps are executed in Toucan's in-memory engine.
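The switching rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Toucan's implementation: the names `SQL_INCOMPATIBLE`, `ExecutionPlan` and `split_pipeline` are hypothetical, and the step labels other than "statistics" are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical set of step names that cannot be translated into SQL.
# Only "statistics" is taken from the example above.
SQL_INCOMPATIBLE = {"statistics"}


@dataclass
class ExecutionPlan:
    native_sql: list[str]  # steps executed by the data source
    in_memory: list[str]   # steps executed by the Toucan engine in memory


def split_pipeline(steps: list[str]) -> ExecutionPlan:
    """Split at the first incompatible step; it and every later step run in memory."""
    for index, step in enumerate(steps):
        if step in SQL_INCOMPATIBLE:
            return ExecutionPlan(native_sql=steps[:index], in_memory=steps[index:])
    # No incompatible step: the whole pipeline stays in NativeSQL.
    return ExecutionPlan(native_sql=list(steps), in_memory=[])


plan = split_pipeline(["filter", "aggregate", "statistics", "sort"])
print(plan.native_sql)  # ['filter', 'aggregate']
print(plan.in_memory)   # ['statistics', 'sort']
```

Note that "sort" ends up in memory even though it could be translated into SQL on its own: once execution has switched, it does not switch back.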

Specific Rules

Append and Join Operations

If both source pipelines are in NativeSQL, the operation is performed at the data source level.
Otherwise, the operation is performed in-memory.

The rules above replace the rules stated here.
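As a minimal sketch, the append and join rule amounts to a simple conjunction. The `Engine` enum and the `combine_engine` function below are hypothetical names used for illustration only.

```python
from enum import Enum


class Engine(Enum):
    NATIVE_SQL = "NativeSQL"
    IN_MEMORY = "in-memory"


def combine_engine(left: Engine, right: Engine) -> Engine:
    """Return where an append or join of two source pipelines is executed."""
    # The data source handles the operation only if both sides are NativeSQL.
    if left is Engine.NATIVE_SQL and right is Engine.NATIVE_SQL:
        return Engine.NATIVE_SQL
    # Any in-memory side forces the operation into memory.
    return Engine.IN_MEMORY


print(combine_engine(Engine.NATIVE_SQL, Engine.NATIVE_SQL).value)  # NativeSQL
print(combine_engine(Engine.NATIVE_SQL, Engine.IN_MEMORY).value)   # in-memory
```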

Child Datasets

Consider a child dataset coming from a NativeSQL data source. Execution is done on Toucan's side if any of the following is true:

The parent was NativeSQL compatible but no longer is (for example, because of an incompatible dataset in its pipeline).
The parent is fully NativeSQL (all its steps can be executed in NativeSQL) but is stored.
The dataset contains a step that is not compatible with NativeSQL.

In all other cases, the child dataset is compatible with NativeSQL pipeline execution.
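The child dataset conditions can be read as a checklist. The sketch below assumes that any single condition is enough to move execution to Toucan's side; the `Parent` class, its fields and the `child_runs_in_native_sql` function are hypothetical and exist only to illustrate the rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Parent:
    native_sql_compatible: bool  # False once its pipeline contains something incompatible
    stored: bool                 # True when the parent dataset is stored


def child_runs_in_native_sql(parent: Parent, child_steps_compatible: list[bool]) -> bool:
    """Return True if the child dataset can be executed as a NativeSQL pipeline."""
    if not parent.native_sql_compatible:
        return False  # the parent is no longer NativeSQL compatible
    if parent.stored:
        return False  # the parent is fully NativeSQL but stored
    if not all(child_steps_compatible):
        return False  # at least one step of the child is not compatible
    return True


live_parent = Parent(native_sql_compatible=True, stored=False)
print(child_runs_in_native_sql(live_parent, [True, True]))   # True
print(child_runs_in_native_sql(live_parent, [True, False]))  # False
```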

Benefits of using hybrid pipelines

The hybrid pipeline feature:

Increases flexibility in creating complex pipelines.
Automates performance optimization.
Allows combining various data sources and steps.

Limitations of in-memory processing

For some steps (JOIN or APPEND), RAM consumption can be significant, and performance depends on the underlying engine (the database or the Toucan workspace).

Last updated 3 months ago
