Troubleshoot: DataHub

This page is intended to help you troubleshoot issues that can occur when you try to access or transform data.

Introduction

This document provides a comprehensive overview of data sources within Toucan, focusing on connectors, datasets, and query configurations. It aims to guide users through the process of connecting to various data sources, understanding dataset structures, and effectively configuring queries for optimal data retrieval and transformation. By understanding these concepts, users can leverage Toucan to gain valuable insights from their data efficiently.

Data Sources in Toucan

Connectors

Regarding connections, we have two main types of connectors:

  • Connectors where we perform a connection checklist. These connectors have a "Test connection" button.

  • Other connectors where we do not perform a connection checklist. For example, the HTTP API connector, where it's not always possible to reach a first endpoint to ensure that your configuration is correct.

We will focus on connectors with a "Test connection" button.

The checklist is different for each connector, but let's take PostgreSQL as an example.

PostgreSQL Connection Checklist

Host Resolved: Toucan can reach the IP or host you have set in the input field.

Port Opened: The port filled in the connection form is an open port on your PostgreSQL cluster.

Connected to PostgreSQL: The host:port pair corresponds to a PostgreSQL instance, not, for example, a MySQL cluster.

Authenticated: The login/password provided allows us to connect to the database you want to reach on your PostgreSQL cluster, and the user has sufficient rights to access the cluster.

Default Database Connection: The default database field refers to an existing, accessible database associated with the user configured in the connection form.

Data source connection succeeded

Example:

Let's consider that you have the following settings to access your database:

  • host: ***-postgre-*******.net

  • port: 2******1

  • user: username

  • password: ******

  • default database: postgres

Host is not resolved

If your host is not reachable or the value you entered has a spelling mistake, the test connection will stop at its first step.

host can be resolved

Incorrect port

Port: 2***0 instead of 2***1

port is not opened

Incorrect user/password

Username: You entered my_dear instead of username

user password combination does not correspond to an existing one

Incorrect default database

Default database: You entered an incorrect database name, postdre instead of postgres

default database does not exist
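For reference, the same checklist can be reproduced by hand outside Toucan. Below is a minimal Python sketch of the checks, assuming the psycopg2 driver is installed; the host, credentials, and port value are placeholders standing in for the example settings above, not real connection details.

```python
import socket

import psycopg2  # PostgreSQL driver, assumed installed

HOST = "***-postgre-*******.net"
PORT = 5432  # replace with the port from your connection form
USER = "username"
PASSWORD = "******"
DEFAULT_DATABASE = "postgres"

# Step 1: Host Resolved -- can the hostname be resolved to an IP?
try:
    socket.gethostbyname(HOST)
except socket.gaierror as exc:
    raise SystemExit(f"Host not resolved: {exc}")

# Step 2: Port Opened -- does anything listen on host:port?
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        pass
except OSError as exc:
    raise SystemExit(f"Port not opened: {exc}")

# Steps 3-5: Connected to PostgreSQL, Authenticated, Default Database.
# psycopg2 exercises all three at once: it fails if the server does not
# speak the PostgreSQL protocol, if the credentials are wrong, or if the
# default database does not exist.
try:
    conn = psycopg2.connect(
        host=HOST, port=PORT, user=USER,
        password=PASSWORD, dbname=DEFAULT_DATABASE,
    )
    conn.close()
    print("Connection succeeded")
except psycopg2.OperationalError as exc:
    raise SystemExit(f"Connection failed: {exc}")
```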

Datasets

The Dataset service manages all objects related to datasets and some of their business logic. It handles datasets and data providers.

To get your data, it executes a plan, calls a data provider, and applies a series of data transformation steps to output a data table.

The execution can be done:

  • On your database (see native SQL for more explanation). In this case, an SQL request is sent to your database at the end of the pipeline.

  • On our in-memory engine.

  • In a combination of the two, depending on the position and compatibility of the steps.

The execution is mainly driven by YouPrep and the query configuration part.

Query configuration is the step after you configure your data source and before you reach YouPrep.

Whichever execution strategy applies, the data source must be available for the plan to execute.

Each step must be executable, so its output should be a data table.
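To make the hybrid case concrete, here is a minimal, illustrative sketch of splitting a pipeline between SQL execution and in-memory steps. It is not Toucan's actual engine: sqlite3 stands in for your database, pandas stands in for the in-memory engine, and the table and column names are invented.

```python
import sqlite3  # stand-in for your SQL database

import pandas as pd

# Build a tiny stand-in database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_data (date TEXT, identity_id INT, value_1 REAL)")
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?)",
    [("2024-01-01", 1, 5.0), ("2024-01-02", 1, 20.0), ("2024-01-03", 2, 30.0)],
)

# Hybrid pipeline sketch: the SQL-compatible prefix of the step list is
# compiled into a single query and executed on the database...
sql = "SELECT date, identity_id, value_1 FROM sample_data WHERE value_1 > 10"
df = pd.read_sql_query(sql, conn)

# ...and the remaining, non-translatable steps run on the in-memory
# engine (pandas here), producing the final data table.
result = df.groupby("identity_id", as_index=False)["value_1"].sum()
print(result)
```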

Query Configuration

When you set up a data source of a connector type and try to fetch data, Toucan interprets your settings and queries the data source. The returned data is then post-processed by Toucan. We limit SQL queries for performance reasons.

This part can vary depending on the connector you're using.

Code Mode

In this specific query configuration setup, you can create an SQL query with placeholders. We'll insert the needed variables and send it to your database.

If your query is valid, a data table will appear on the right. If there's an issue, an error message from your database, reformulated by Toucan, will be displayed in the data table panel.

When using this mode, knowing how to craft SQL can be helpful. Alternatively, for some connectors, you can use our simplified query mode.
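For illustration, here is a rough sketch of what happens to a placeholdered query before it reaches your database. The double-brace placeholder syntax, the user_id variable, and the render function are assumptions made for the example, not Toucan's actual templating engine.

```python
# A query as you might write it in Code Mode, with a placeholder.
QUERY_TEMPLATE = (
    "SELECT date, identity_id, value_1 "
    'FROM "public".sample_data '
    "WHERE identity_id = {{ user_id }}"
)

def render(template: str, variables: dict) -> str:
    """Naive placeholder substitution, for illustration only."""
    for name, value in variables.items():
        template = template.replace("{{ " + name + " }}", str(value))
    return template

# The needed variables are inserted, then the final SQL is sent to
# your database.
print(render(QUERY_TEMPLATE, {"user_id": 42}))
# SELECT date, identity_id, value_1 FROM "public".sample_data WHERE identity_id = 42
```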

Example:

Here, we are trying to send an SQL query to a table containing the following columns: date, identity_id, value_2, value_1. But the SQL statement we wrote is SELECT value_3, identity_id FROM "public".sample_data, so the query fails with an error.

SQL query error

Explanation

The PostgreSQL database has no column named value_3 in the sample_data table and returns this information to us. Selecting an existing column instead, for example SELECT value_1, identity_id FROM "public".sample_data, resolves the error.

YouPrep

In YouPrep, you can configure a series of steps to be applied to your data table. In this mode, each step is applied when you click on it.

Each step is applied to a preview computed on 10,000 rows by default, allowing you to see what your data table will look like after each step.

The data computation can be done:

  • On the data source you have connected (for Native SQL connectors).

  • In our in-memory engine (for non-Native SQL connectors or flat files).

  • In a combination of Native SQL execution followed by in-memory execution (hybrid pipeline).

A step takes a data table as input, applies a transformation based on its configuration, and outputs a data table.
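Conceptually, each step behaves like a function from data table to data table, chained one after another. The following pandas sketch mirrors that contract; the step functions and column names are illustrative, not Toucan's internal step format.

```python
import pandas as pd

# Each "step" is a function: data table in, data table out.
def filter_step(df: pd.DataFrame, column: str, value) -> pd.DataFrame:
    return df[df[column] > value]

def keep_columns_step(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    return df[columns]

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "identity_id": [1, 2],
    "value_1": [5.0, 20.0],
})

# Steps are chained: the output of one is the input of the next.
df = filter_step(df, "value_1", 10)
df = keep_columns_step(df, ["date", "value_1"])
print(df)

# Applying a step to a column that does not exist fails -- the kind of
# error YouPrep surfaces for the offending step.
try:
    keep_columns_step(df, ["value_3"])
except KeyError as exc:
    print(f"Step error: {exc}")
```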

If a step fails, the error corresponding to that step is displayed in the right-hand tab.

If an error occurs, check that:

  • You are applying the step to the right data type.

  • The column you're trying to transform already exists at that point in the pipeline.

  • You have read the helper text in the configuration panel.

Step applied on a column that does not exist

The errors returned depend on how the dataset's plan is executed: if the dataset is executed on the data source, as with Native SQL, the error message comes from the database.

Error from a NativeSQL pipeline

You can check the exhaustive YouPrep documentation here

Timeout and Optimization

To ensure optimal query performance and dashboard speed, follow the guidelines below. Remember that the preview output is limited to 30 seconds.

Structuring your steps in the right order is crucial and will improve the performance of your data queries.

Optimize the order of steps in your dataset pipeline

Filter

Start by filtering your data as early as possible. This step reduces the size of your dataset, which in turn decreases memory usage and speeds up subsequent operations.

Keep columns

After filtering, select only the columns you need for your analysis.

Text and Date transformations steps

Perform text and date transformations early in your pipeline. These operations can be computationally expensive, so doing them on a reduced dataset is more efficient.

Conditional column

Create conditional columns after text and date transformations but before aggregations. This allows you to use the newly created columns in subsequent operations.

Unique values

If you need to find unique values, do this before aggregations or joins. It can help reduce the dataset size for subsequent operations.

Sort

Sorting should be done after filtering and transformations but before aggregations. This can improve the performance of subsequent group-by operations.

Group By and Aggregate

Perform group-by and aggregate operations after the dataset has been reduced and transformed. This minimizes the amount of data being processed in these computationally intensive operations.

Join

Perform joins after you've reduced and aggregated your datasets. This minimizes the amount of data being joined, which can significantly improve performance.

Append

Append operations should generally be done near the end of your pipeline, after you've processed and transformed your individual datasets.

Pivot/Unpivot

Pivot and unpivot operations are typically done as final reshaping steps after all other transformations.
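As an illustration of this ordering, here is a small pandas sketch of a pipeline that filters first, keeps only the needed columns, transforms, then aggregates and joins last. The data and column names are invented for the example; it is not a Toucan pipeline definition.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-02-01", "2024-02-15"],
    "region": ["north", "south", "south"],
    "amount": [100.0, 250.0, 80.0],
    "comment": ["a", "b", "c"],  # never used downstream
})
regions = pd.DataFrame({"region": ["north", "south"], "manager": ["Ana", "Bo"]})

# 1. Filter as early as possible to shrink the dataset.
df = sales[sales["amount"] > 90]

# 2. Keep only the columns the analysis needs.
df = df[["date", "region", "amount"]]

# 3. Text/date transformations, now on a reduced dataset.
df = df.assign(month=pd.to_datetime(df["date"]).dt.to_period("M").astype(str))

# 4. Group by and aggregate once the data is small and clean.
df = df.groupby(["region", "month"], as_index=False)["amount"].sum()

# 5. Join last, on the already-aggregated table.
df = df.merge(regions, on="region", how="left")
print(df)
```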

Reduce the amount of rows that the service treats

You may have difficulty getting a result within 30 seconds, for example when working on a refresh dataset that will be executed asynchronously.

source rows subset

You can run the preview computation on fewer rows. By default, the value is 10,000, but it can be reduced so that the preview transforms less data, allowing you to get a result before the timeout.

Troubleshooting Infinite Loops in Data Preparation

Issue Description

Infinite loops can severely impact the functionality of an application, depending on the affected dataset, and may prevent the application from being published. If an infinite loop affects a dataset, it will also affect all child datasets.

infinite loop preventing a home from loading

Identifying the Problem

To resolve this issue, you need to identify the affected dataset. When opening the dataset, you may encounter an error message similar to:

Infinite loop error message in YouPrep
RecursiveQueryError: Recursive query detected {{uuid}} was already visited

Root Cause

This error occurs when a filter is used to filter the same dataset that the filter is based on, creating an infinite loop (hence the RecursiveQueryError message). It can also happen when a dataset references itself.

Example Scenario

All filters on the home page use a dataset called {{dataset_name}}. If this dataset is filtered by these same filters, it creates a recursive loop.

For instance, your filter {{filter_name}} uses the dataset {{dataset_name}}. This dataset is constructed in a way that it's filtered by a filter with the ID {{uuid_dataset}}, which is actually {{filter_name}} itself.

Additionally, you're using the filter {{filter_name2}}, which is filtered by {{filter_name}} and is also a child of the dataset that {{filter_name2}} is based on. This loop prevents the visualization of both datasets.
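The wording of the error ("{{uuid}} was already visited") suggests cycle detection with a visited set while resolving dataset dependencies. Here is a minimal, illustrative sketch of that idea; the dependency graph, dataset names, and RecursiveQueryError class are assumptions made for the example, not Toucan's actual code.

```python
class RecursiveQueryError(Exception):
    pass

# Hypothetical dependency graph: each dataset lists the datasets it
# depends on. "home_filters" depends on itself through "kpis",
# forming a loop.
DEPENDENCIES = {
    "home_filters": ["kpis"],
    "kpis": ["home_filters"],
    "sales": [],
}

def resolve(dataset: str, visited=None):
    """Walk dependencies, raising if a dataset is reached twice."""
    visited = set() if visited is None else visited
    if dataset in visited:
        raise RecursiveQueryError(
            f"Recursive query detected: {dataset} was already visited"
        )
    visited.add(dataset)
    for parent in DEPENDENCIES[dataset]:
        resolve(parent, visited)

resolve("sales")  # no loop: resolves fine

try:
    resolve("home_filters")  # loop: raises
except RecursiveQueryError as exc:
    print(exc)
```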

Locating the Dataset

Each dataset has a unique ID. You can find the dataset using the following URL structure:

https://{{workspace_name}}.example.com/{{app_name}}/datahub/datasets/edit-dataset/{{uuid_dataset}}?stage=staging

Resolution Steps

To correct this issue, you have two options:

  1. Remove the problematic filtering step

  2. Use different source datasets for the filters

By implementing one of these solutions, you should be able to resolve the infinite loop and restore proper functionality to your application.

Error Categories

Below is a list of error categories you might encounter while using Toucan. Each category includes a definition and, if possible, an example.

PlanExecutionError: Raised when a plan execution fails because one of your step configurations prevents its execution. Check the configuration of your steps for type errors and non-existent columns.

StepInUnsupportedPosition: A step is in a position that prevents its execution.

DatasourceNotFound: The data source needed to perform the steps has been removed, preventing pipeline execution. Example: a data source has been deleted from the data sources list.

QueryNotFoundError: Raised when a query cannot be found.

RecursiveQueryError: Raised when a recursive query is detected. Example: a filter has been used to filter the dataset used by that same filter, creating an infinite loop.

DatasetServiceError: Raised when the dataset service encounters an error while handling a query. Example: a dataset cannot be found, a step is not supported, or a connector is not supported.

ValueError: Raised, for example, when no data provider name is given.

TypeError: The value is a formula or a string literal that can't be parsed. Example: user attributes in response data were not understood.

ValidationError: Raised when a dataset cannot be validated.

Conclusion

By understanding these concepts related to data sources, connectors, datasets, and query configurations, you can effectively leverage Toucan to gain valuable insights from your data. Remember to optimize your steps for better performance and to handle errors effectively.

If necessary, please reach out to our support team for more information or assistance. We're here to help you make the most of Toucan.
