Hybrid pipeline

Introduction

Hybrid Pipeline is a Toucan feature that optimizes the execution of data transformation pipelines by intelligently combining two modes: NativeSQL execution, where the data transformation pipeline is translated into SQL and executed by the data source, and in-memory processing, where the steps are executed in memory by the Toucan engine.

Execution Engines

Toucan supports several execution modes for YouPrep steps.

NativeSQL

Steps are translated into SQL queries and executed directly by the connected database, which can be one of the following:

  • PostgreSQL

  • Google BigQuery

  • Snowflake

  • Amazon Redshift

  • Amazon Athena

Toucan engine

The Toucan engine refers to the execution of data transformation steps in the Toucan backend:

  • In-memory: Toucan executes transformations in RAM.

  • Toucan Storage data store: For data loaded in "load" mode.

Data transformation pipeline

A data transformation pipeline is a sequence of YouPrep steps that transforms an input, which can be a single dataset or a combination of datasets, into an output dataset.

How It Works

  1. The data transformation pipeline is executed in NativeSQL mode as long as the steps are compatible. See here for more information.

  2. When an incompatible step is encountered, execution switches to in-memory mode for the rest of the pipeline.

  3. Execution can be shared between the data source and Toucan.

For example, suppose you insert a statistics step, which is not NativeSQL compatible because it cannot be translated into SQL, into a data transformation pipeline whose other steps are all NativeSQL compatible. From that step onwards, all steps are executed in Toucan's in-memory engine.
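
The switch described above can be sketched as follows. This is a minimal illustration and not Toucan's actual implementation: the `Step` class, the `split_pipeline` function, and the `SQL_INCOMPATIBLE` set are hypothetical names, and the step names other than statistics are placeholders chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical set of step names that cannot be translated into SQL.
# Only "statistics" is taken from the text; the set itself is an assumption.
SQL_INCOMPATIBLE = {"statistics"}


@dataclass
class Step:
    name: str


def split_pipeline(steps):
    """Split a pipeline into (native_sql_steps, in_memory_steps).

    Every step before the first incompatible one runs in NativeSQL.
    The first incompatible step and everything after it run in memory.
    """
    for index, step in enumerate(steps):
        if step.name in SQL_INCOMPATIBLE:
            return steps[:index], steps[index:]
    # No incompatible step found: the whole pipeline runs in NativeSQL.
    return steps, []


# Placeholder pipeline: "filter", "aggregate" and "sort" are illustrative names.
pipeline = [Step("filter"), Step("aggregate"), Step("statistics"), Step("sort")]
sql_part, memory_part = split_pipeline(pipeline)
```

In this sketch, a step that would be SQL compatible on its own (here the last one) still runs in memory once it follows an incompatible step, which matches rule 2 above.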

Figure: Decision process for pipeline execution

Specific Rules

Append and Join Operations

  • If both source pipelines are in NativeSQL, the operation is performed at the data source level.

  • Otherwise, the operation is performed in-memory.

Figure: JOIN/APPEND decision process for execution
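
The append and join rule can be expressed as a small function. This is only a sketch of the rule as stated in this section: the function name `combine_engine` and the returned labels are hypothetical, not part of any Toucan API.

```python
def combine_engine(left_is_native_sql: bool, right_is_native_sql: bool) -> str:
    """Pick where an append or join between two source pipelines runs.

    Hypothetical helper: the operation is pushed to the data source only
    when both source pipelines are in NativeSQL, otherwise it runs in memory.
    """
    if left_is_native_sql and right_is_native_sql:
        return "native_sql"
    return "in_memory"


both_native = combine_engine(True, True)
one_in_memory = combine_engine(True, False)
```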

Child Datasets

Consider a child dataset coming from a NativeSQL data source. Execution is done on Toucan's side if any of the following is true:

  • The parent was NativeSQL compatible but no longer is (for example, because of an incompatible dataset in its pipeline).

  • The parent is fully NativeSQL (all its steps can be executed in NativeSQL) but is stored.

  • The dataset itself contains a step that is not compatible with NativeSQL.

In all other cases, the child dataset is compatible with NativeSQL pipeline execution.
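
The three conditions for child datasets can be summarized in code. This is a sketch under stated assumptions: the `Parent` class, its fields, and the `child_engine` function are hypothetical names introduced for illustration, and the child is assumed to come from a NativeSQL data source.

```python
from dataclasses import dataclass


@dataclass
class Parent:
    # Hypothetical flags describing the parent dataset.
    native_sql_compatible: bool  # all parent steps can currently run in NativeSQL
    stored: bool                 # the parent's data is stored rather than queried live


def child_engine(parent: Parent, child_steps_compatible: bool) -> str:
    """Return where a child dataset of a NativeSQL data source is executed."""
    if not parent.native_sql_compatible:
        return "toucan"  # the parent is no longer NativeSQL compatible
    if parent.stored:
        return "toucan"  # the parent is fully NativeSQL but stored
    if not child_steps_compatible:
        return "toucan"  # the child has a step that is not NativeSQL compatible
    return "native_sql"  # all other cases


live_parent = Parent(native_sql_compatible=True, stored=False)
stored_parent = Parent(native_sql_compatible=True, stored=True)
```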

Benefits of using hybrid pipelines

The hybrid pipeline feature:

  • Increases flexibility in creating complex pipelines.

  • Automates performance optimization.

  • Allows combining various data sources and steps.

Limitations of in-memory processing

For some steps (such as JOIN or APPEND), RAM consumption can be significant, and performance depends on the underlying engine (the database or the Toucan workspace).
