Troubleshoot: DataHub
This page helps you troubleshoot issues that can occur when you try to access or transform data.
Introduction
This document provides a comprehensive overview of data sources within Toucan, focusing on connectors, datasets, and query configurations. It aims to guide users through the process of connecting to various data sources, understanding dataset structures, and effectively configuring queries for optimal data retrieval and transformation. By understanding these concepts, users can leverage Toucan to gain valuable insights from their data efficiently.
Data Sources in Toucan
Connectors
Regarding connections, we have two main types of connectors:
Connectors where we perform a connection checklist. These connectors have a "Test connection" button.
Other connectors where we do not perform a connection checklist. For example, the HTTP API connector, where it's not always possible to reach a first endpoint to ensure that your configuration is correct.
We will focus on connectors with a "Test connection" button.
The checklist is different for each connector, but let's take PostgreSQL as an example.
PostgreSQL Connection Checklist
Host Resolved
Toucan can reach the IP or host you have set in the input field.
Port Opened
The port filled in the connection form is an open port on your PostgreSQL cluster.
Connected to PostgreSQL
The host:port corresponds to a PostgreSQL instance, not a MySQL cluster, for example.
Authenticated
The login/password provided allows us to connect to the database you want to reach on your PostgreSQL cluster. The user has sufficient rights to access the cluster.
Default Database Connection
The default database field refers to an existing, accessible database associated with the user configured in the connection form.
Example:
Let's consider that you have the following settings to access your database:
host: ***-postgre-*******.net
port: 2******1
user: username
password: ******
default database: postgres
Host is not resolved (not reachable)
If your host is unreachable or the value you entered contains a spelling mistake, the test connection stops at this first step.
Incorrect port
Port: you filled 2***0 instead of 2***1.
Incorrect user/password
Username: you filled my_dear instead of username.
Incorrect default database
Default database: you filled an incorrect database name, postdre, instead of postgres.
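To make these stages concrete, here is a minimal Python sketch that mimics the same checklist outside Toucan. This is not Toucan's implementation; it assumes the psycopg2 driver and uses hypothetical connection values.

```python
import socket
import psycopg2  # assumed PostgreSQL driver; not Toucan's internal code

settings = {
    "host": "example-postgre.example.net",  # hypothetical values
    "port": 5432,
    "user": "username",
    "password": "secret",
    "dbname": "postgres",
}

# 1. Host resolved: the DNS lookup must succeed.
socket.getaddrinfo(settings["host"], settings["port"])

# 2. Port opened: a TCP connection must be accepted.
socket.create_connection((settings["host"], settings["port"]), timeout=5).close()

# 3 + 4. Connected, authenticated, and default database reachable:
# psycopg2 raises OperationalError on protocol, authentication,
# or unknown-database failures.
with psycopg2.connect(connect_timeout=5, **settings) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT current_database();")
        print(cur.fetchone())  # -> ('postgres',)
```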
Datasets
The Dataset service manages all objects related to datasets and some of their business logic. It handles datasets and data providers.
To get your data, it executes a plan, calls a data provider, and applies a series of data transformation steps to output a data table.
The execution can be done:
On your database (see native SQL for more explanation). In this case, an SQL request is sent to your database at the end of the pipeline.
On our in-memory engine.
In a combination of the two, depending on the position and compatibility of the steps.
The execution is mainly done through YouPrep and the query configuration.
Query configuration is the step between configuring your data source and reaching YouPrep.
Whatever the execution mode, the data source must be available for the plan to run, and each step must be executable: its output must be a data table.
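Conceptually, each step is a function from one data table to the next, and a plan is an ordered list of such steps. Here is a hedged in-memory sketch using pandas as a stand-in for the engine (hypothetical step names; not Toucan's actual code):

```python
import pandas as pd

# A step takes a data table, applies its configuration, and returns a data table.
def filter_step(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["value_1"] > 10]

def keep_columns_step(df: pd.DataFrame) -> pd.DataFrame:
    return df[["date", "identity_id", "value_1"]]

def execute(df: pd.DataFrame, steps) -> pd.DataFrame:
    """Execute a plan: chain each step's output into the next step's input."""
    for step in steps:
        df = step(df)  # each step must output a valid data table
    return df

plan = [filter_step, keep_columns_step]
# result = execute(source_table, plan)
```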
Query Configuration
When you set up a data source of a connector type and try to fetch data, Toucan interprets your settings, queries the data source, and post-processes the returned data. We limit SQL queries for performance reasons.
This part can vary depending on the connector you're using.
Code Mode
In this specific query configuration setup, you can create an SQL query with placeholders. We'll insert the needed variables and send it to your database.
If your response is valid, a data table will appear on the right. If there's an issue, an error message from your database, reformulated by Toucan, will be displayed in the data table panel.
When using this mode, some knowledge of SQL is helpful. Alternatively, for some connectors, you can use our simplified query mode.
Example:
Here, we are trying to send an SQL query to a table containing the following columns: date, identity_id, value_2, value_1. But the SQL statement we wrote is SELECT value_3, identity_id FROM "public"."sample_data", and the query ends up in an error.
Explanation
The PostgreSQL database has no column named value_3 in the sample_data table, and it returns this information to us.
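A corrected statement selects only columns that exist. As an aside, Code Mode's placeholder idea can be sketched with ordinary driver parameter binding; the snippet below is a hypothetical Python illustration, not Toucan's templating engine, and Toucan's own placeholder syntax may differ.

```python
# Corrected query: value_1 exists in sample_data, value_3 does not.
# The %(start_date)s placeholder stands in for a variable injected at
# execution time (hypothetical illustration).
query_template = """
    SELECT value_1, identity_id
    FROM "public"."sample_data"
    WHERE date >= %(start_date)s
"""
variables = {"start_date": "2024-01-01"}  # hypothetical variable value

# With a driver such as psycopg2, the variable is bound safely:
#   cur.execute(query_template, variables)
```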
YouPrep
In YouPrep, you can configure a series of steps to be applied to your data table. In this mode, each step is applied when you click on it.
Each step is applied to a preview computation of 10,000 rows by default, giving you a view of what your data table will look like after each step.
The data computation can be done:
On the data source you have connected (for Native SQL connectors).
In our in-memory engine (for non-Native SQL connectors or flat files).
In a combination of Native SQL execution followed by in-memory execution (hybrid pipeline).
A step takes a data table as input, applies a transformation based on its configuration, and outputs a data table.
If a step is in error, you will see the corresponding error in the right-hand tab.
If an error occurs, check that:
You are applying the step to the right data type.
The column you're trying to transform already exists in the dataset at that point in the pipeline.
You have read the helper text in the configuration panel.
The errors returned can depend on where the plan is executed: if the dataset is executed on the data source, as in Native SQL mode, the error comes back from the database itself (a minimal sketch of such checks is shown below).
You can find an exhaustive reference for YouPrep steps in the YouPrep documentation.
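To illustrate the checklist above, here is a hedged Python sketch of pre-step validation: confirming that a column exists and has the expected type before a step runs. The function and names are hypothetical, not Toucan's validation code.

```python
import pandas as pd

def check_step_input(df: pd.DataFrame, column: str, expected_dtype: str) -> None:
    """Fail early with a readable message instead of a mid-pipeline error."""
    if column not in df.columns:
        raise ValueError(f"Column '{column}' does not exist at this point in the pipeline")
    if str(df[column].dtype) != expected_dtype:
        raise TypeError(f"Column '{column}' is {df[column].dtype}, expected {expected_dtype}")

df = pd.DataFrame({"date": pd.to_datetime(["2024-01-01"]), "value_1": [12]})
check_step_input(df, "value_1", "int64")    # passes
check_step_input(df, "value_3", "float64")  # raises: column missing
```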
Timeout and Optimization
To keep your queries and dashboards fast, follow the steps below. Remember that the preview output is limited to 30 seconds, so structuring your steps in the right order is crucial. The recommended order is listed below, followed by a short sketch.
Optimize the order of steps in your dataset pipeline
Filter
Start by filtering your data as early as possible. This step reduces the size of your dataset, which in turn decreases memory usage and speeds up subsequent operations.
Keep columns
After filtering, select only the columns you need for your analysis.
Text and date transformation steps
Perform text and date transformations early in your pipeline. These operations can be computationally expensive, so doing them on a reduced dataset is more efficient.
Conditional column
Create conditional columns after text and date transformations but before aggregations. This allows you to use the newly created columns in subsequent operations.
Unique values
If you need to find unique values, do this before aggregations or joins. It can help reduce the dataset size for subsequent operations.
Sort
Sorting should be done after filtering and transformations but before aggregations. This can improve the performance of subsequent group-by operations.
Group By and Aggregate
Perform group-by and aggregate operations after the dataset has been reduced and transformed. This minimizes the amount of data being processed in these computationally intensive operations.
Join
Perform joins after you've reduced and aggregated your datasets. This minimizes the amount of data being joined, which can significantly improve performance.
Append
Append operations should generally be done near the end of your pipeline, after you've processed and transformed your individual datasets.
Pivot/Unpivot
Pivot and unpivot operations are typically done as final reshaping steps after all other transformations.
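Here is a hedged pandas sketch of this ordering with hypothetical file and column names. Toucan generates the equivalent operations for you; the point is only the shape of an efficient pipeline: filter, then keep columns, then transform, then aggregate, then join.

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical source table

result = (
    df[df["date"] >= "2024-01-01"]                   # 1. filter as early as possible
      [["date", "region", "amount"]]                 # 2. keep only the needed columns
      .assign(month=lambda d: d["date"].str[:7])     # 3. date transformation on reduced data
      .groupby(["region", "month"], as_index=False)  # 4. aggregate once data is small
      .agg(total=("amount", "sum"))
)

# 5. join after aggregation, so far fewer rows have to be matched
regions = pd.read_csv("regions.csv")  # hypothetical lookup table
result = result.merge(regions, on="region", how="left")
```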
Reduce the number of rows the service processes
If you have trouble getting a result within 30 seconds, for example when working on a refreshed dataset that will be executed asynchronously, you can lower the number of rows used for the preview computation. The default is 10,000, but reducing it means less data to transform in the preview, allowing you to get a result before the timeout.
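In pandas terms, the idea is simply to run the steps on a sample rather than the full table (a sketch, not Toucan's engine):

```python
import pandas as pd

PREVIEW_ROWS = 1_000  # lowered from the 10,000 default for a faster preview

df = pd.read_csv("sales.csv")       # hypothetical source table
preview_df = df.head(PREVIEW_ROWS)  # only this sample goes through the steps
```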
Troubleshooting Infinite Loops in Data Preparation
Issue Description
Infinite loops can severely impact the functionality of an application, depending on the affected dataset, and may prevent the application from being published. If an infinite loop affects a dataset, it will also affect all child datasets.
Identifying the Problem
To resolve this issue, you need to identify the affected dataset. When opening the dataset, you may encounter an error message mentioning a RecursiveQueryError.
Root Cause
This error occurs when a filter is used to filter the same dataset that the filter is based on, creating an infinite loop (hence the RecursiveQueryError message). It can also happen when a dataset references itself.
Example Scenario
All filters on the home page use a dataset called {{dataset_name}}. If this dataset is filtered by these same filters, it creates a recursive loop.
For instance, your filter {{filter_name}} uses the dataset {{dataset_name}}. This dataset is constructed in a way that it's filtered by a filter with the ID {{uuid_dataset}}, which is actually {{filter_name}} itself.
Additionally, you're using the filter {{filter_name2}}, which is filtered by {{filter_name}} and is also a child of the dataset that originates {{filter_name2}}. This prevents the visualization of both datasets due to the loop formation.
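Such loops are ordinary cycles in the dependency graph between datasets and filters. The following Python sketch (hypothetical names, not Toucan's detector) shows how a depth-first search surfaces them:

```python
# Dependency graph mirroring the scenario above: the filter and the dataset
# it is based on depend on each other (all names are hypothetical).
depends_on = {
    "filter_name": ["dataset_name"],
    "dataset_name": ["filter_name"],  # the dataset is filtered by its own filter
    "filter_name2": ["filter_name"],
}

def find_cycle(graph):
    """Depth-first search returning one dependency cycle, or None."""
    def visit(node, path):
        if node in path:
            return path[path.index(node):] + [node]
        for dep in graph.get(node, []):
            cycle = visit(dep, path + [node])
            if cycle:
                return cycle
        return None

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

print(find_cycle(depends_on))  # ['filter_name', 'dataset_name', 'filter_name']
```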
Locating the Dataset
Each dataset has a unique ID, and the URL of the dataset page contains that ID; use it to locate the affected dataset.
Resolution Steps
To correct this issue, you have two options:
Remove the problematic filtering step
Use different source datasets for the filters
By implementing one of these solutions, you should be able to resolve the infinite loop and restore proper functionality to your application.
Error Categories
Below is a list of error categories you might encounter while using Toucan. Each category includes a definition and, if possible, an example.
PlanExecutionError
Exception raised when a plan execution fails: one of your step configurations prevents its execution.
Check the configuration of your steps for type errors and non-existent columns.
StepInUnsupportedPosition
A step is in a position that prevents its execution.
DatasourceNotFound
The data source needed to perform the steps has been removed, preventing pipeline execution.
For example, a data source has been deleted from your data sources.
QueryNotFoundError
Exception raised when a query cannot be found.
RecursiveQueryError
Exception raised when a recursive query is detected.
A filter has been used to filter the dataset used by that same filter, creating an infinite loop.
DatasetServiceError
Raised when the dataset service encounters an error while handling a query.
For example: a dataset cannot be found, a step is not supported, or a connector is not supported.
ValueError
No data provider name.
TypeError
The value is a formula or a string literal that can't be parsed.
User attributes in response data were not understood.
ValidationError
Raised when a dataset cannot be validated.
Conclusion
By understanding these concepts related to data sources, connectors, datasets, and query configurations, you can effectively leverage Toucan to gain valuable insights from your data. Remember to optimize your steps for better performance and to handle errors effectively.
If necessary, please reach out to our support team for more information or assistance. We're here to help you make the most of Toucan.