📁Adding and combining remote files in Toucan

Toucan can read files over a network connection using a wide variety of protocols.

Adding a remote file in Toucan

To add a remote file in Toucan, use the code mode of the file settings.

Follow these steps to access it:

  • Upload an empty CSV file (or any small CSV file) to Toucan

  • Switch to code mode within the configuration interface

  • Replace the fields of the code block according to the remote file server and its configuration. Refer to the sections below for the fields to fill in.

  • Save the file settings. A new file should appear in the list of files (in datasources). A dataset will also be created automatically.

Example with a CSV file on Dropbox.

domain: 'my_remote_data'
type: 'csv'
file: 'https://www.dropbox.com/s/9yu9ekfjk8kmjlm/fake_data.csv?dl=1'

We support:

  • ftp (as well as sftp or ftps),

  • http (and https),

  • S3 and

  • a long list of other schemes (‘mms’, ‘hdl’, ‘telnet’, ‘rsync’, ‘gopher’, ‘prospero’, ‘shttp’, ‘ws’, ‘https’, ‘http’, ‘sftp’, ‘rtsp’, ‘nfs’, ‘rtspu’, ‘svn’, ‘git’, ‘sip’, ‘snews’, ‘tel’, ‘nntp’, ‘wais’, ‘svn+ssh’, ‘ftp’, ‘ftps’, ‘file’, ‘sips’, ‘git+ssh’, ‘imap’, ‘wss’).

FTP Server

  • Required: access to an FTP server and to Toucan staging mode on your workspace

  • Open Filezilla or any FTP client

  • Copy the URL of the file's location on the FTP server. It should look like this:

ftp://user:password@example.com/pub/file.txt

"ftp" is the protocol, "user" and "password" are the login credentials, "example.com" is the domain of the server, and "/pub/file.txt" is the full path to the file on the server.

Important

💡Contact us via help@toucantoco.com or your Delivery contact to set up a hidden password in the URL

  • Paste this URL in the file field and modify the other configuration fields as explained above in Adding a remote file in Toucan

  • A new file should appear in the list of files (in datasources). A dataset will also be created automatically.

domain: 'db_test'
type: 'csv'
file: 'ftps://<login>:<password>@ftps.toucantoco.com:990/my_db.csv'
separator: ";"
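
As a quick check of the URL anatomy described in this section, the sketch below splits an FTP URL into its parts with Python's standard library. It is only an illustration and not part of Toucan: the second URL uses hypothetical placeholder credentials ("login" and "secret") in place of the `<login>` and `<password>` fields.

```python
from urllib.parse import urlsplit

# Example URL from the text above; the credentials are placeholders.
parts = urlsplit("ftp://user:password@example.com/pub/file.txt")

print(parts.scheme)    # protocol
print(parts.username)  # login
print(parts.password)  # password
print(parts.hostname)  # domain of the server
print(parts.path)      # full path to the file on the server

# Same idea with an explicit port; "login" and "secret" are hypothetical values.
ftps_parts = urlsplit("ftps://login:secret@ftps.toucantoco.com:990/my_db.csv")
print(ftps_parts.hostname, ftps_parts.port, ftps_parts.path)
```

If one of the parts prints as `None` or lands in the wrong field, the URL is probably missing a separator such as ":" or "@".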

Toucan Toco FTP Server

You can send data to the Toucan Toco FTP server with the following credentials:

  • Host: ftps.toucantoco.com

  • Port: 990 (for the connection) and range 64000-64321 (for data transfer)

  • Protocol: FTPS (if you use FileZilla it’s implicit FTP over TLS)

  • Mode: Passive Mode

  • User: Given by the Toucan Toco Team

  • Password: Given by the Toucan Toco Team

S3 Bucket

The access key and secret key for your data files hosted on S3 buckets can be configured this way:

s3://<access key>:<secret key>@mybucket/filename

For example:

domain: 'my_data'
type: 'csv'
file: 's3://<access key>:<secret key>@mybucket/my_data.csv'
separator: ";"

Note

If your access key or secret key contains special characters such as “/”, “@” or “:”, you have to URL-encode them. URL encoding converts special characters into a format that can be transmitted over the Internet. You will find more information about this topic here (as well as an automatic encoder).
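
The encoding step can also be done locally. Below is a minimal sketch using Python's standard library. The helper name `build_s3_uri` and the sample keys are hypothetical and only serve the illustration; the URI layout is the one shown above.

```python
from urllib.parse import quote


def build_s3_uri(access_key: str, secret_key: str, bucket: str, filename: str) -> str:
    """Build an s3:// URI with URL-encoded credentials (illustrative helper)."""
    # safe="" makes quote() encode "/" as well, which it leaves as-is by default.
    encoded_access = quote(access_key, safe="")
    encoded_secret = quote(secret_key, safe="")
    return f"s3://{encoded_access}:{encoded_secret}@{bucket}/{filename}"


# Hypothetical credentials containing "/", "@" and ":".
uri = build_s3_uri("AKIAEXAMPLE", "ab/cd@ef:gh", "mybucket", "my_data.csv")
print(uri)
```

With these sample values, "/" becomes `%2F`, "@" becomes `%40` and ":" becomes `%3A`, so the credentials can no longer be confused with the separators of the URI itself.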

Toucan Toco can provide an S3 bucket with a dedicated AWS IAM user tied to your instance.

You can then configure your datasources block with a dedicated configuration, as follows:

domain: 'my_data'
type: 'csv'
file: "{{ secrets.extra.s3.s3_uri_auth_encoded }}/my_data.csv"
separator: ";"

Note

If you are using a custom domain name for your S3 bucket (with MinIO, for example), here is the syntax you should use:

domain: 'my_data'
type: 'csv'
file: 's3://<access key>:<secret key>@mybucket/my_data.csv'
separator: ";"
fetcher_kwargs:
    client_kwargs:
        endpoint_url: "https://endpoint.mydomain.com:9000"

Combining a remote file in Toucan

On the previous page, we saw how to add remote files in Toucan. Read that page first before going further with this one.

On this page, we will see how to combine several remote files into one file.

You can load multiple files, whether uploaded to our server or hosted on an FTP/S3 server, into a single file with the option match: true. The dataset created from the file will contain a __filename__ column indicating the file each row came from.
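
To make the result concrete, here is a minimal sketch of the idea in plain Python: rows from every file whose name matches a pattern are concatenated, and each row records its file of origin in a `__filename__` column. This is an illustration only, not Toucan's implementation; the function name `combine_csv`, the file names and the file contents are hypothetical.

```python
import csv
import io
import re


def combine_csv(files: dict[str, str], pattern: str) -> list[dict[str, str]]:
    """Concatenate the rows of every file whose name matches `pattern`."""
    regex = re.compile(pattern)
    rows = []
    for name in sorted(files):
        if not regex.search(name):
            continue  # skip files that do not match the pattern
        for row in csv.DictReader(io.StringIO(files[name])):
            row["__filename__"] = name  # remember where the row came from
            rows.append(row)
    return rows


# Hypothetical in-memory files: name -> CSV content.
files = {
    "sales-201801.csv": "product,amount\nA,10\n",
    "sales-201802.csv": "product,amount\nA,12\nB,7\n",
    "notes.txt": "not a data file\n",
}

combined = combine_csv(files, r"^sales-\d{6}\.csv$")
for row in combined:
    print(row)
```

The two monthly files contribute three rows in total, while `notes.txt` is ignored because its name does not match the pattern.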

Tutorial

Your corporation now receives a new data file each month: data-product-corporation-201801.csv, data-product-corporation-201802.csv … You want them to be loaded into a single domain called data-product-corpo.

  • Use regex101.com to find a regular expression (regex) that matches your files.

data-product-corporation-\d{6}\.csv

  • Don’t forget to use ‘^’ and ‘$’ to be more restrictive: they anchor the expression to the start and the end of the filename.

^data-product-corporation-\d{6}\.csv$

  • Add a backslash to escape each backslash.

^data-product-corporation-\\d{6}\\.csv$

  • Copy your regular expression into the "file" option of your datasource block

  • Add the option match: true

domain: 'data-product-corpo'
file: '^data-product-corporation-\\d{6}\\.csv$'
skip_rows: 0
separator: ','
encoding: 'utf-8'
type: 'csv'
match: true
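
The regular expression from the tutorial can also be checked locally before saving the configuration. The sketch below uses Python's `re` module on a few hypothetical candidate filenames and compares the anchored and unanchored versions. Toucan's own matching may differ in details, so treat it as a sanity check only.

```python
import re

# Anchored and unanchored versions of the tutorial's expression.
anchored = re.compile(r"^data-product-corporation-\d{6}\.csv$")
unanchored = re.compile(r"data-product-corporation-\d{6}\.csv")

# Hypothetical filenames, some of which should be rejected.
candidates = [
    "data-product-corporation-201801.csv",
    "data-product-corporation-201802.csv",
    "data-product-corporation-2018.csv",        # only four digits
    "old-data-product-corporation-201801.csv",  # extra prefix, rejected by ^
    "data-product-corporation-201801.csv.bak",  # extra suffix, rejected by $
]

matched = [name for name in candidates if anchored.search(name)]
loosely_matched = [name for name in candidates if unanchored.search(name)]

print(matched)
print(loosely_matched)

# In a regular (non-raw) Python string, each backslash is doubled,
# in the same way as in the escaped expression of the tutorial.
escaped = "^data-product-corporation-\\d{6}\\.csv$"
print(escaped == anchored.pattern)
```

Only the first two filenames pass the anchored expression, whereas the unanchored one also accepts the prefixed and suffixed names.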

Example of content for FTP (with authentication):

domain: 'db_test'
type: 'csv'
file: 'ftps://<login>:<password>@ftps.toucantoco.com:990/^data-product-corporation-\\d{6}\\.csv$'
separator: ";"
match: true

Example of content for S3 (with authentication):

domain: 'my_data'
type: 'csv'
file: 's3://<access key>:<secret key>@mybucket/^data-product-corporation-\\d{6}\\.csv$'
separator: ";"
match: true
fetcher_kwargs:
    client_kwargs:
        endpoint_url: "https://endpoint.mydomain.com:9000"
