📁Adding and combining remote files in Toucan

Sometimes your data files are hosted on remote servers, and it is simpler to keep them there. In this case, just use the file's URL as the file path. We will read the file over a network connection according to the given URL scheme.

Adding a remote file in Toucan

In order to add a remote file in Toucan, we will use the code mode of file settings.

Follow these steps to access it:

  • Create an empty csv file on your computer

  • Upload your empty file (or any random csv file) within Toucan

  • Switch to code mode within the configuration interface

  • Replace the code block according to your remote file server and its configuration. Refer to the sections below to determine the code you should use.

  • Save your file and confirm. A new file should appear in the list of files (in datasources). A dataset will also be created automatically.

For example, this is how you can read a CSV file directly from Dropbox or your FTP server:

domain: 'my_remote_data'
type: 'csv'
file: 'https://www.dropbox.com/s/9yu9ekfjk8kmjlm/fake_data.csv?dl=1'

We support ftp (as well as sftp and ftps), http (and https), S3 and a long list of other schemes (‘mms’, ‘hdl’, ‘telnet’, ‘rsync’, ‘gopher’, ‘prospero’, ‘shttp’, ‘ws’, ‘rtsp’, ‘nfs’, ‘rtspu’, ‘svn’, ‘git’, ‘sip’, ‘snews’, ‘tel’, ‘nntp’, ‘wais’, ‘svn+ssh’, ‘file’, ‘sips’, ‘git+ssh’, ‘imap’, ‘wss’).
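To make the role of the URL scheme concrete, here is a minimal Python sketch that splits a `file` value into its scheme, host, port and path. It only illustrates how a URL is structured; the helper name `describe_source` is hypothetical and this is not Toucan's own code.

```python
from urllib.parse import urlsplit


def describe_source(file_value):
    """Split a 'file' value into the parts a reader would need (illustrative helper)."""
    parts = urlsplit(file_value)
    return {
        # An empty scheme means the value is a plain file name, not a URL.
        "scheme": parts.scheme or "local",
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


print(describe_source("https://www.dropbox.com/s/9yu9ekfjk8kmjlm/fake_data.csv?dl=1"))
print(describe_source("ftps://ftps.toucantoco.com:990/my_db.csv"))
print(describe_source("my_local_file.csv"))
```

The scheme (`https`, `ftps`, `s3`, ...) is the part that tells a reader which protocol to use to fetch the file.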

FTP Server

When your data files are too big to be transferred via the studio data upload interface, you can store them on an FTP server. The FTP server can be either on the Toucan Toco side (ask support to set it up) or on your side.

Tutorial: Product Corporation

  • You need access to your FTP server.

  • Connect to your FTP server and look at which files are available.

  • Right-click on the one you want to use

  • Select “Copy URL(s) to clipboard”

  • Add your password to the generated URL

Important

💡 If you don’t want to write the password in your etl_config file, contact us via help@toucantoco.com or your Delivery contact to set up a hidden password.

  • Paste the URL into your datasource block

  • Your datasource block is now ready

domain: 'db_test'
type: 'csv'
file: 'ftps://<login>:<password>@ftps.toucantoco.com:990/my_db.csv'
separator: ";"

Toucan Toco FTP Server

You can send data to the Toucan Toco FTP server with the following credentials:

  • Host: ftps.toucantoco.com

  • Port: 990 (for the connection) and range 64000-64321 (for data transfer)

  • Protocol: FTPS (if you use FileZilla it’s implicit FTP over TLS)

  • Mode: Passive Mode

  • User: Given by the Toucan Toco Team

  • Password: Given by the Toucan Toco Team

S3 Bucket

The access key and secret key for your data files hosted on S3 buckets can be configured this way:

s3://<access key>:<secret key>@mybucket/filename

For example:

domain: 'my_data'
type: 'csv'
file: 's3://<access key>:<secret key>@mybucket/my_data.csv'
separator: ";"

Note

If your access key or secret key contains special characters such as “/”, “@” or “:”, you have to URL-encode them. URL encoding converts special characters into a format that can be transmitted safely inside a URL. More information on this topic, as well as automatic encoders, is available online.

Toucan Toco can provide an S3 bucket with a dedicated AWS IAM user related to your instance.

You can then configure your datasource block as follows:
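As an illustration of the encoding step, the following Python sketch builds an S3 URI from an access key and a secret key using the standard library's `urllib.parse.quote`. The helper name `s3_uri` and the sample keys are made up for the example; only the resulting string matters for the datasource block.

```python
from urllib.parse import quote


def s3_uri(access_key, secret_key, bucket, filename):
    """Build an s3:// URI with URL-encoded credentials (illustrative helper)."""
    # safe="" forces quote() to encode "/" too, which it leaves untouched by default.
    user = quote(access_key, safe="")
    password = quote(secret_key, safe="")
    return f"s3://{user}:{password}@{bucket}/{filename}"


# Fake credentials containing "/", "@" and ":".
print(s3_uri("AKIAEXAMPLE", "abc/def@ghi:jkl", "mybucket", "my_data.csv"))
# prints: s3://AKIAEXAMPLE:abc%2Fdef%40ghi%3Ajkl@mybucket/my_data.csv
```

The printed value is what would go in the `file` option.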

domain: 'my_data'
type: 'csv'
file: "{{ secrets.extra.s3.s3_uri_auth_encoded }}/my_data.csv"
separator: ";"

Note

If you are using a custom domain name for your S3 bucket (with MinIO, for example), use the following syntax:

domain: 'my_data'
type: 'csv'
file: 's3://<access key>:<secret key>@mybucket/my_data.csv'
separator: ";"
fetcher_kwargs:
    client_kwargs:
        endpoint_url: "https://endpoint.mydomain.com:9000"

Combining remote files in Toucan

On the previous page, we saw how to add remote files in Toucan. Read it first before going further with this one.

On this page, we will see how to combine several remote files into one file.

You can load multiple files, whether uploaded on our server or on an FTP/S3 server, into a single file with the option match: true. The dataset created from the file will contain a __filename__ column indicating the file each row came from.
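To picture the result of combining files, here is a small Python sketch that concatenates the rows of several CSV files and tags each row with a `__filename__` value. It is a conceptual illustration under assumed inputs (hypothetical file names and contents held in memory), not a description of how Toucan implements the option.

```python
import csv
import io


def combine_csv(files):
    """Concatenate rows from several CSV texts, tagging each row with its source file."""
    rows = []
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["__filename__"] = name  # remember which file the row came from
            rows.append(row)
    return rows


# Hypothetical files, given here as in-memory text.
files = {
    "january.csv": "product,sales\nA,10\n",
    "february.csv": "product,sales\nA,12\nB,7\n",
}

for row in combine_csv(files):
    print(row)
```

Every row keeps its original columns and gains one extra column naming its source file, which makes it possible to filter or group by file later.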

Tutorial

Your corporation now receives a new data file each month: data-product-corporation-201801.csv, data-product-corporation-201802.csv, and so on. You want them all loaded into a single domain called data-product-corpo.

  • Find the regular expression (regex) that matches your files with regex101.com.

data-product-corporation-\d{6}\.csv

  • Don’t forget to use ‘^’ and ‘$’ to be more restrictive.

^data-product-corporation-\d{6}\.csv$

  • Add a backslash to escape each backslash.

^data-product-corporation-\\d{6}\\.csv$

  • Copy your regular expression into the "file" option of your datasource block

  • Add the option match: true

domain: 'data-product-corpo'
file: '^data-product-corporation-\\d{6}\\.csv$'
skip_rows: 0
separator: ','
encoding: 'utf-8'
type: 'csv'
match: true
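The regular expression built in the steps above can also be checked locally before saving the datasource block. The sketch below uses Python's `re` module, assuming its regex flavor behaves like the one used by Toucan for this pattern; the list of file names is invented for the test.

```python
import re

# Pattern as tested on regex101 (single backslashes).
PATTERN = r"^data-product-corporation-\d{6}\.csv$"
# Same pattern without anchors, to show why '^' and '$' matter.
UNANCHORED = r"data-product-corporation-\d{6}\.csv"
# Text to paste in the "file" option: every backslash is doubled.
CONFIG_VALUE = PATTERN.replace("\\", "\\\\")

# Hypothetical file names found on the server.
names = [
    "data-product-corporation-201801.csv",
    "data-product-corporation-201802.csv",
    "data-product-corporation-2018.csv",       # only 4 digits
    "old-data-product-corporation-201801.csv",  # extra prefix
    "data-product-corporation-201801.csv.bak",  # extra suffix
]

matched = [n for n in names if re.search(PATTERN, n)]
loosely_matched = [n for n in names if re.search(UNANCHORED, n)]

print(matched)
print(loosely_matched)
print(CONFIG_VALUE)
```

With the anchors, only the two monthly files are selected; without them, the prefixed and suffixed names slip through as well.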

Example of content for FTP (with authentication):

domain: 'db_test'
type: 'csv'
file: 'ftps://<login>:<password>@ftps.toucantoco.com:990/^data-product-corporation-\\d{6}\\.csv$'
separator: ";"
match: true

Example of content for S3 (with authentication):

domain: 'my_data'
type: 'csv'
file: 's3://<access key>:<secret key>@mybucket/^data-product-corporation-\\d{6}\\.csv$'
separator: ";"
match: true
fetcher_kwargs:
    client_kwargs:
        endpoint_url: "https://endpoint.mydomain.com:9000"
