🧑‍🍳Hybrid pipeline

Introduction

Hybrid Pipeline is an advanced feature of Toucan that optimizes the execution of data transformation pipelines by combining NativeSQL execution and in-memory processing, choosing the engine for each step automatically.

Key Concepts

Execution Engines

Toucan supports several execution modes for YouPrep steps:

  • NativeSQL: Steps are translated into SQL queries and executed directly by the connected database.

  • On Toucan's side:

    • In-memory: Toucan executes transformations in RAM.

    • Toucan Storage Database: For data loaded in "load" mode.

Pipeline

A pipeline is a sequence of YouPrep steps that transforms an input, which can be a single dataset or a combination of datasets, into an output dataset.

Hybrid Pipeline

This feature allows Toucan to automatically determine the best engine to execute each step of a pipeline, prioritizing NativeSQL execution when possible.

How It Works

  1. The pipeline is executed in NativeSQL mode as long as its steps are compatible. See the NativeSQL documentation for more information.

  2. When an incompatible step is encountered, execution switches to in-memory mode for the rest of the pipeline.

  3. Execution can be shared between the data source and Toucan.
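The switching behaviour described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Toucan's implementation: the `Engine` enum, the `plan_engines` function, the step names, and the `SQL_INCOMPATIBLE_STEPS` set are all hypothetical.

```python
from enum import Enum


class Engine(Enum):
    NATIVE_SQL = "NativeSQL"
    IN_MEMORY = "in-memory"


# Hypothetical set of step names that cannot be translated to SQL.
# The real compatibility list depends on Toucan and on the connected database.
SQL_INCOMPATIBLE_STEPS = {"custom_step"}


def plan_engines(steps):
    """Assign an engine to each step of a pipeline.

    Steps run in NativeSQL until the first incompatible step is met;
    from that step onward everything runs in memory.
    """
    plan = []
    engine = Engine.NATIVE_SQL
    for step in steps:
        if step in SQL_INCOMPATIBLE_STEPS:
            engine = Engine.IN_MEMORY  # one-way switch for the rest of the pipeline
        plan.append((step, engine))
    return plan


plan = plan_engines(["filter", "aggregate", "custom_step", "sort"])
for step, engine in plan:
    print(f"{step}: {engine.value}")
```

In this sketch, `sort` runs in memory even though it could be expressed in SQL, because it comes after the step that triggered the switch.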

Specific Rules

Append and Join Operations

  • If both source pipelines are in NativeSQL, the operation is performed at the data source level.

  • Otherwise, the operation is performed in-memory.

The rules above replace the rules previously documented for append and join operations.
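As a minimal sketch of the append and join rule, assuming each source pipeline is summarized by the engine it ends up on (the `combine_engine` function and the string constants are hypothetical names used for illustration):

```python
NATIVE_SQL = "NativeSQL"
IN_MEMORY = "in-memory"


def combine_engine(left_engine: str, right_engine: str) -> str:
    """Return where an append or join of two source pipelines is performed."""
    if left_engine == NATIVE_SQL and right_engine == NATIVE_SQL:
        return NATIVE_SQL  # both sides are NativeSQL: done at the data source
    return IN_MEMORY  # any other combination is done in memory


print(combine_engine(NATIVE_SQL, NATIVE_SQL))
print(combine_engine(NATIVE_SQL, IN_MEMORY))
```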

Child Datasets

Consider a child dataset that comes from a NativeSQL data source. Its execution is done on Toucan's side if any of the following is true:

  • The parent was NativeSQL-compatible but no longer is (for example, because of an incompatible dataset in its pipeline).

  • The parent is fully NativeSQL (all of its steps can be executed in NativeSQL) but is stored.

  • The child dataset contains a step that is not compatible with NativeSQL.

In all other cases, the child dataset is compatible with NativeSQL pipeline execution.
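The child dataset conditions can be read as a single decision function. The sketch below is an assumption-laden illustration: the `Parent` dataclass, its fields, and `child_runs_in_native_sql` are hypothetical names, not part of Toucan.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Parent:
    fully_native_sql: bool  # every step of the parent can run in NativeSQL
    stored: bool            # the parent's output is stored on Toucan's side


def child_runs_in_native_sql(parent: Parent, child_steps_compatible: Iterable[bool]) -> bool:
    """Return True if the child dataset can be executed in NativeSQL."""
    if not parent.fully_native_sql:
        return False  # parent is not (or is no longer) NativeSQL-compatible
    if parent.stored:
        return False  # parent is fully NativeSQL but stored
    # every step of the child must itself be NativeSQL-compatible
    return all(child_steps_compatible)


live_parent = Parent(fully_native_sql=True, stored=False)
stored_parent = Parent(fully_native_sql=True, stored=True)

print(child_runs_in_native_sql(live_parent, [True, True]))
print(child_runs_in_native_sql(stored_parent, [True, True]))
print(child_runs_in_native_sql(live_parent, [True, False]))
```

Only the first call returns `True`; the other two fall back to execution on Toucan's side.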

Conclusion

The hybrid pipeline feature:

  • Increases flexibility in creating complex pipelines.

  • Automates performance optimization.

  • Allows combining various data sources and steps.

Limitations

For some steps (JOIN or APPEND), RAM consumption can be significant, and performance depends on the underlying engine (the database or the Toucan workspace).
