# Setting up an HTTP API connector

## Connector features

You can use the Toucan HTTP API connector to connect to:

* a REST API
* a GraphQL API
* an Elasticsearch instance
* any service that exposes an HTTP API

With an HTTP API connection, you can fetch data from your API to fill your charts and dashboards.

{% hint style="info" %}
**Changelog**\
**November 2024**

* This connector supports pagination

**February 2025**

* Enhanced support for OAuth 2.0 authentication
  {% endhint %}

This type of data source combines the features of Python’s [requests](http://docs.python-requests.org/) library, to get data from any API, with the filtering language [jq](https://stedolan.github.io/jq/), for flexible transformations of the responses. Optionally, an [xpath](https://developer.mozilla.org/en-US/docs/Web/XPath) string can be provided to parse an XML response first; the jq filter is then applied to get the data in tabular format.

## Configuring the connector

{% hint style="warning" %}
To configure this connector, you will need the documentation of the API you want to connect to.
{% endhint %}

### **Responsetype**

The type of response the connector has to expect from the queried API.

{% hint style="warning" %}
Make sure you use the correct `responsetype`, based on the queried API’s documentation. Currently JSON & XML are supported, the default being JSON.
{% endhint %}

### **Retrypolicy**

Defines how the connector should behave when the network is unreachable:

* `MAX ATTEMPTS`: number of attempts to make before aborting the connection
* `MAX DELAY`: total time to wait before aborting the connection
* `WAIT TIME`: time to wait between each attempt
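
The three settings can be pictured as a simple retry loop. The sketch below is only an illustration, not the connector's implementation: `call_with_retry` and its parameter names are hypothetical, and it assumes that `MAX DELAY` bounds the total time spent across all attempts.

```python
import time


def call_with_retry(func, max_attempts=3, max_delay=10.0, wait_time=1.0):
    """Call func until it succeeds, attempts run out, or max_delay is exceeded.

    Hypothetical helper: the parameters mirror the MAX ATTEMPTS, MAX DELAY and
    WAIT TIME settings, but the real connector may behave differently.
    """
    start = time.monotonic()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ConnectionError as error:
            last_error = error
        # Stop if waiting once more would exceed the total time budget.
        if time.monotonic() - start + wait_time > max_delay:
            break
        if attempt < max_attempts:
            time.sleep(wait_time)
    raise last_error


# Usage: a fake call that fails twice before succeeding.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network unreachable")
    return "ok"


result = call_with_retry(flaky_call, max_attempts=5, max_delay=1.0, wait_time=0.01)
```

With these values the third attempt succeeds, so the two remaining attempts are never used.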

### **Certificate**

If the connector must use a certificate to establish the connection, you can provide the path to the certificate.

### **Auth**

The authentication method that the connector should use to query the data. `AUTHTYPE` can be:

* `basic`: username and password. You can provide them as:
  * `positional arguments`: input your *username* and *password*, in that order
  * `named arguments`: input them this way: `{"username": "myusername", "password": "mypassword"}`
* `digest`: same as above
* `oAuth1`:
  * `positional arguments`: input *client\_id* (sometimes named *client\_key*) and *client\_secret*. Both are provided by the service you are trying to access
  * `named arguments`: input `{"client_id": "your_client_id", "client_secret": "your_client_secret"}`.
* `oAuth2`: *(deprecated)*
  * `positional arguments`: enter one by one, in this order, the URL of the authentication endpoint (e.g. <https://login.mywebsite.com/oauth2/token>), the *client\_id* (sometimes named *client\_key*) and the *client\_secret*. This information is provided by the service you are trying to access
  * `named arguments`: input `{"client_id": "your_client_id", "client_secret": "your_client_secret"}`.
* `CustomTokenServer`: provides a flexible mechanism for authenticating API requests using a custom token server. The token you get is then sent in the `Authorization` header, prefixed with `Bearer` (that is, `Bearer {{your_token}}`). In the `named arguments` section, provide as a JSON dict the elements required to get your token:
  * `method`: The HTTP method to use when requesting the token (e.g., 'GET', 'POST').
  * `url`: The URL of the token server.
  * `params` (optional): Query parameters to include in the token request.
  * `data` (optional): Form data to include in the token request body.
  * `headers` (optional): Additional headers to include in the token request.
  * `json` (optional): JSON payload to include in the token request body.
  * `token_header_name`: lets you override the default `Authorization` header name.
  * `filter` (optional): A JQ-style filter to extract the token from the response. Defaults to "." (root of the JSON response).
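
To make the `CustomTokenServer` flow more concrete, here is a minimal sketch of the last step: turning the token server's response into the header sent with each request. It rests on several assumptions: `extract_token` and `build_auth_header` are hypothetical names, the dotted-path lookup is only a stand-in for a real jq filter, and the `Bearer` prefix is assumed to be kept when `token_header_name` is overridden.

```python
def extract_token(response_body, path="."):
    """Follow a dotted path such as ".data.access_token" in a parsed JSON body.

    Stand-in for the jq `filter` option: only plain key lookups are handled.
    """
    value = response_body
    for key in path.strip(".").split("."):
        if key:  # "." alone means "the whole response"
            value = value[key]
    return value


def build_auth_header(response_body, path=".", token_header_name="Authorization"):
    """Build the header sent with each API request from the token response."""
    token = extract_token(response_body, path)
    return {token_header_name: f"Bearer {token}"}


# Hypothetical token server response.
token_response = {"data": {"access_token": "abc123", "expires_in": 3600}}
header = build_auth_header(token_response, path=".data.access_token")
```

Here `header` holds a single `Authorization` entry carrying the extracted token.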

### Authentication

We have added a dedicated section to manage OAuth 2.0 authentication for REST APIs. This authentication method enables users to authenticate with a third-party service (an OAuth 2.0 provider). Upon request from our backend, the provider issues a token with a specific scope. This token is then used in the `Authorization` header with the `Bearer` scheme to authenticate and access your data on the API. For more detailed information, please refer to the [OAuth2.0 standards](https://oauth.net/2/).

For now, we only support the `Grant Type: Authorization Code`. This section outlines the fields available for configuring this method:

* `Configuration Type` (dropdown list): `AuthorizationCodeOauth2` (only option available for now)
  * `Authentication URL` (mandatory): The URL used to initiate the OAuth2.0 authorization process. For example: `https://auth.api-acme.com/oauth/authorize`
  * `Token URL` (mandatory): The URL used to exchange the authorization code for an access token. For example: `https://auth.api-acme.com/oauth/token`
  * `Scope` (mandatory): The permissions requested from the OAuth2.0 provider. For example: `read write profile`
  * `Additional authentication params` (optional): A JSON object containing additional URL parameters to be included in the authentication request. For example: `{"add_param1": "value_1", "add_param2": "value_2"}`
  * `Client Id` (mandatory): The unique identifier for your application, provided by the OAuth2.0 service. For example: `client_abc123`
  * `Client Secret` (mandatory): The secret key associated with your client ID. For example: `secret_xyz789`

{% hint style="info" %}
**Redirect URL**

Make sure you authorize this URL in your OAuth2.0 provider. It is required to complete the OAuth2.0 exchange.

[**https://api-{{my-workspace}}.toucantoco.guru/{{my-app}}/connectors/http/authentication/redirect**](https://api-{{myworkspace}}.toucantoco.com/%7B%7Bmy-app%7D%7D/connectors/http/authentication/redirect)
{% endhint %}

{% hint style="info" %}
**Additional authentication params**

**Some OAuth2.0 providers can ask for additional parameters in the request. By default, we only send the following fields:**

* client\_id (for example `client_abc123`)
* redirect\_uri (for example **`https://api-{{my-workspace}}.toucantoco.guru/{{my-app}}/connectors/http/authentication/redirect`**)
* response\_type (for example `code`)
* scope (for example `read write profile`)
* state (for example `xyz123securestate` which is a random string for CSRF protection)

Google, as an [OAuth2.0 provider](https://developers.google.com/identity/protocols/oauth2/web-server#creatingclient), requires other parameters. To access a Google API that uses OAuth2.0 as a means of authentication, fill in the Additional authentication params field with the following JSON:

`{"prompt": "consent", "access_type": "offline", "response_type": "code"}`
{% endhint %}
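
As a rough illustration of how the default fields and the additional authentication params end up in the authorization request, the sketch below builds such a URL with the standard library. It is not Toucan's code: `build_authorization_url` is a hypothetical helper, the redirect URL is a placeholder, and it assumes additional params are simply merged on top of the defaults.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit


def build_authorization_url(authentication_url, client_id, redirect_uri, scope,
                            additional_params=None):
    """Return the URL the user is sent to, plus the CSRF state value."""
    state = secrets.token_urlsafe(16)  # random string for CSRF protection
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": scope,
        "state": state,
    }
    # Assumption: additional params are merged on top of the defaults.
    params.update(additional_params or {})
    return f"{authentication_url}?{urlencode(params)}", state


url, state = build_authorization_url(
    "https://auth.api-acme.com/oauth/authorize",
    client_id="client_abc123",
    redirect_uri="https://example.com/redirect",  # placeholder redirect URL
    scope="read write profile",
    additional_params={"prompt": "consent", "access_type": "offline"},
)
query = parse_qs(urlsplit(url).query)
```

Parsing the query string back, as done in the last line, is a convenient way to check which parameters the provider will actually receive.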

### **Template**

You can use this object to avoid repetition in data sources. The values of the attributes below are used, or overridden, by all data sources using this connector.

* `json`: a JSON object of parameters to send in the **body** of every HTTP request made using the configured connector. *Example:* `{"offset": 100, "limit": 50}`
* `headers`: a JSON object of parameters to send in the **header** of every HTTP request made using the configured connector. *Example:* `{"content-type": "application/xml"}`
* `params`: a JSON object of parameters to send in the **query string** of every HTTP request made using the configured connector. *Example:* `{"offset": 100, "limit": 50}`
* `proxies`: a JSON object expressing a mapping of protocol or host to the corresponding proxy. *Example:* `{"http": "foo.bar:3128", "http://host.name": "foo.bar:4012"}`
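
One way to picture how a template and a data source combine is a key-by-key merge in which the data source wins. This is an assumption made for illustration (the connector may replace whole objects instead), and `apply_template` is a hypothetical helper.

```python
def apply_template(template, data_source):
    """Merge connector-level defaults with data source values, key by key."""
    merged = {}
    for section in ("json", "headers", "params", "proxies"):
        # Values from the data source override those from the template.
        values = {**template.get(section, {}), **data_source.get(section, {})}
        if values:
            merged[section] = values
    return merged


template = {
    "headers": {"content-type": "application/xml"},
    "params": {"offset": 100, "limit": 50},
}
data_source = {"params": {"limit": 10}}

request_options = apply_template(template, data_source)
```

In this example the data source only redefines `limit`, so `offset` and the headers still come from the template.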

## **Selecting data from the API**

**Endpoint URL**

* `url`: The API endpoint you want to query. It is appended to the baseroute URL defined in the connector. ⚠️ This field cannot be empty: if the API has no endpoint, split the baseroute URL defined in the connector and put the last part in the data source. For example: `https://example.com/API` in the connector and `/v1` in the data source

#### **Endpoint parameters**

* `Method`: Defines the HTTP method you want the data source to perform: GET, POST or PUT. Default is GET. You can find the method you need in the documentation of the API you want to query
* `headers`: a JSON object of parameters to send in the **header** of every HTTP request made using the configured connector. *Example:* `{"content-type": "application/xml"}`. Overwrites the `headers` parameter in Template
* `URL params`: a JSON object of parameters to send in the **query string** of every HTTP request made using the configured connector. *Example:* `{"offset": 100, "limit": 50}`. Overwrites the `params` parameter in Template
* `Body`: a JSON object of parameters to send in the **body** of every HTTP request made using the configured connector. *Example:* `{"data": "my_parameters"}`.

**Advanced**

* `parameters`: A JSON object that will be used for variable interpolation in the query string. For testing purposes only. In production mode, it should be left blank, as variable interpolation is handled by the app requester.
* `json`: a JSON object of parameters to send in the **body** of every HTTP request made using the configured connector. *Example:* `{"offset": 100, "limit": 50}`. Overwrites the `json` parameter in Template
* `proxies`: a JSON object expressing a mapping of protocol or host to the corresponding proxy. *Example:* `{"http": "foo.bar:3128", "http://host.name": "foo.bar:4012"}`. Overwrites the `proxies` parameter in Template
* `flatten column`: optional field where you can specify the name of a column that contains nested rows. If specified, the nested rows are flattened into separate columns in the resulting data frame, and the new column names are prefixed with the original column name. To flatten several columns, separate their names with a `,` delimiter. *Example: if you have a column orders: \[{"id": 3, "product": "Notebook", "price": 5.99}], the results will be separated into orders\_id, orders\_product and orders\_price*
* `data`: Two options: Type1 for a simple string, Type2 for a JSON field. 💡 You can send XML data with the Type1 option
* `xpath`: If the reply from the API contains XML data you can parse it with an xpath string. See documentation: [xpath](https://developer.mozilla.org/en-US/docs/Web/XPath) Example:

  ```
  <?xml version="1.0" encoding="UTF-8"?>
  <result>
  <bookstore>
      <book>
          <title>Harry Potter</title>
          <price>29.99</price>
      </book>
      <book>
          <title>Learning XML</title>
          <price>39.95</price>
      </book>
  </bookstore>
  </result>
  ```

In the connector we’ll have a response like this:

```
{"bookstore": {"book": [{"title":"Harry Potter", "price": "29.99"}, {"title": "Learning XML", "price":"39.95"}]}}
```

And we can then apply a:

* `filter`: String containing a jq filter applied to the data to get them in tabular format. See documentation: [jq](https://stedolan.github.io/jq/) Example:

  ```
  filter: ".bookstore.book[]"
  ```

Applied to the JSON response shown above, the filter `.bookstore.book[]` extracts every entry of the `book` list from `bookstore`. So we end up with a table of results looking like this:

| title        | price |
| ------------ | ----- |
| Harry Potter | 29.99 |
| Learning XML | 39.95 |
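
The same transformation can be reproduced in a few lines of Python, which may help when checking what a filter should return. This is only a sketch using the standard library: the connector itself relies on xpath and jq, and the list comprehension below merely plays the role of the `.bookstore.book[]` filter. The XML declaration line is left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET

xml_response = """<result>
<bookstore>
    <book>
        <title>Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title>Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</result>"""

root = ET.fromstring(xml_response)

# Equivalent of the jq filter ".bookstore.book[]": one row per <book> element.
rows = [
    {child.tag: child.text for child in book}
    for book in root.findall("./bookstore/book")
]
```

Each dictionary in `rows` corresponds to one line of the table, with the element names as column names.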

Note: the reason to have a `filter` option is to allow you to take any API response and transform it into something that fits into a column based data frame.
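
The `flatten column` option listed under Advanced performs a related reshaping. A minimal sketch of the idea, on plain dictionaries instead of a data frame, could look like this; `flatten_column` is a hypothetical helper, and producing one output row per nested row is an assumption.

```python
def flatten_column(rows, column):
    """Expand a column holding a list of nested rows into prefixed columns."""
    flattened = []
    for row in rows:
        base = {key: value for key, value in row.items() if key != column}
        # Rows without nested data are kept as they are.
        for nested in row.get(column) or [{}]:
            prefixed = {f"{column}_{key}": value for key, value in nested.items()}
            flattened.append({**base, **prefixed})
    return flattened


rows = [
    {"customer": "Ada",
     "orders": [{"id": 3, "product": "Notebook", "price": 5.99}]},
]
flat = flatten_column(rows, "orders")
```

The nested `orders` entry becomes the columns `orders_id`, `orders_product` and `orders_price`, next to the untouched `customer` column.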

### Pagination

This section presents the pagination support of Toucan. Pagination options let you set up a configuration that repeats a query, page by page, until all results are retrieved.

{% hint style="warning" %}
Throttling and large datasets\
**Throttling**

We do not support *throttling*: there is no rate-limiting feature when we request an API, so we cannot control how quickly requests are sent. As a result, if too many requests are made too quickly, the API may return an error saying it is overloaded.\
\
**Large datasets**\
Toucan execution preview calls are synchronous, which means that we only have 30 seconds to fetch and transform data. Depending on the query, this can be an issue if you are working on live data; in that case, prefer stored datasets.
{% endhint %}

#### Pagination configuration types

**Offset Limit (OffsetLimitPaginationConfig)**

This configuration type implements the offset/limit pagination pattern.

**Parameters**

* `offset_name`: (string) Parameter name for offset (default: `offset`)
* `limit_name`: (string) Parameter name for limit (default: `limit`)
* `limit`: (int) **mandatory** Number of items per request
* `data_filter`: (string) **mandatory** JQ filter selecting the part of the response whose length is used to count the items returned

Use case: APIs using offset/limit style pagination.

<details>

<summary>offset-limit example</summary>

Let's take the following configuration

* "offset\_name": "custom\_offset"
* "limit\_name": "custom\_limit"
* "limit": 50
* "data\_filter": ".items"

![](https://1809014303-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZxYYf1KpgarKMgMsDCrw%2Fuploads%2Fgit-blob-85f6f830f599409039ea2e239ffadb8e807b5fb3%2FCapture%20d%E2%80%99e%CC%81cran%202024-11-14%20a%CC%80%2011.00.03.png?alt=media)

We will perform the following calls:

* `https://my-api.com/data?custom_limit=50&custom_offset=0`
* `https://my-api.com/data?custom_limit=50&custom_offset=50`
* `https://my-api.com/data?custom_limit=50&custom_offset=100`
* `https://my-api.com/data?custom_limit=50&custom_offset=150`\
  ...

until there are no more pages to fetch.

</details>
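
The offset/limit loop can be sketched as follows. This is an illustration under stated assumptions, not the connector's code: `fetch_all_offset_limit` and `fake_fetch` are hypothetical, each offset is assumed to advance by `limit`, and a page shorter than `limit` is taken as the signal that everything has been retrieved.

```python
def fetch_all_offset_limit(fetch_page, limit, offset_name="offset", limit_name="limit"):
    """Request pages until one comes back with fewer than `limit` items."""
    items, offset = [], 0
    while True:
        page = fetch_page({offset_name: offset, limit_name: limit})
        items.extend(page)
        if len(page) < limit:  # short page: nothing left to fetch
            return items
        offset += limit


# Fake API holding 120 records, standing in for the ".items" part of a response.
RECORDS = list(range(120))
requested = []


def fake_fetch(params):
    requested.append(params)
    start = params["custom_offset"]
    return RECORDS[start:start + params["custom_limit"]]


all_items = fetch_all_offset_limit(
    fake_fetch, limit=50, offset_name="custom_offset", limit_name="custom_limit"
)
```

With 120 records and a limit of 50, three requests are made; the last one returns only 20 items and ends the loop.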

**Page-based pagination (PageBasedPaginationConfig)**

This configuration implements page-based pagination.

**Parameters**:

* `page_name`: (string) Parameter name for the page (default: `page`)
* `page`: (int) **mandatory** Current page number
* `per_page_name`: (string) Parameter name for items per page
* `per_page`: (int) Number of items per page
* `max_page_filter`: (string) JQ filter to extract maximum page number
* `can_raise_not_found`: (boolean) Whether 404 errors should be treated as the end of pagination. Must be set if no `max_page_filter` is available

**Use case**: Traditional APIs using page numbers where the information can be found in the response body.

<details>

<summary>page-based example</summary>

Let's take the following configuration

* "page\_name": "custom\_page"
* "page": 1
* "per\_page\_name": "custom\_per\_page"
* "per\_page": 100
* "max\_page\_filter": ".infos.last\_page"
* "can\_raise\_not\_found": False

![](https://1809014303-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZxYYf1KpgarKMgMsDCrw%2Fuploads%2Fgit-blob-df047df36d3ae4fe802a0543b19c1639b7485a5f%2FCapture%20d%E2%80%99e%CC%81cran%202024-10-29%20a%CC%80%2017.12.39.png?alt=media)

We will perform the following calls:

* `https://my-api.com/data?custom_page=1&custom_per_page=100`
* `https://my-api.com/data?custom_page=2&custom_per_page=100`

Until we reach the last page indicated in `max_page_filter` and stop the data fetching.

For an API where there is no per\_page parameter to set and no information about the last page in the response body, the configuration will look like this:

* "page\_name": "page"
* "page": 1
* "per\_page\_name": ""
* "per\_page":
* "max\_page\_filter": ""
* "can\_raise\_not\_found": True

![](https://1809014303-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZxYYf1KpgarKMgMsDCrw%2Fuploads%2Fgit-blob-871f5e54b16725af8f075adf201a57bf5c25a707%2FCapture%20d%E2%80%99e%CC%81cran%202024-10-29%20a%CC%80%2017.10.49.png?alt=media)

We will perform the following calls:

* `https://my-api.com/data?page=1`
* `https://my-api.com/data?page=2`

Until a page returns a 404, at which point we stop fetching data.

</details>
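
Both stopping conditions of page-based pagination fit in one small loop. The sketch below is hypothetical (`fetch_all_pages` and `fake_fetch` are illustrative names) and simplifies one point: the maximum page is passed in directly, whereas the connector extracts it from the response with `max_page_filter`.

```python
def fetch_all_pages(fetch_page, first_page=1, max_page=None):
    """Fetch pages in order, stopping at max_page or at the first 404."""
    items, page = [], first_page
    while max_page is None or page <= max_page:
        status, data = fetch_page(page)
        if status == 404:  # plays the role of can_raise_not_found
            break
        items.extend(data)
        page += 1
    return items


# Fake API with three pages; any other page number answers 404.
PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page):
    return (200, PAGES[page]) if page in PAGES else (404, [])


until_404 = fetch_all_pages(fake_fetch)
until_max = fetch_all_pages(fake_fetch, max_page=2)
```

The first call stops on the 404 returned for page 4; the second stops after page 2 because a maximum page is known.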

**Cursor based pagination (CursorBasedPaginationConfig)**

This configuration implements cursor-based pagination.

**Parameters**:

* `cursor_name`: (string) **mandatory** Parameter name for the cursor (default: `cursor`)
* `cursor_filter`: (string) **mandatory** JQ filter to extract next cursor

**Use case**: APIs using cursors/tokens for pagination.

<details>

<summary>cursor-based example</summary>

Let's take the following configuration

* "cursor\_name": "token"
* "cursor\_filter": ".metadata.next\_cursor"

![](https://1809014303-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZxYYf1KpgarKMgMsDCrw%2Fuploads%2Fgit-blob-65615fdcc4041c7933e0db9485d42622c43b546a%2FCapture%20d%E2%80%99e%CC%81cran%202024-10-29%20a%CC%80%2017.20.44.png?alt=media)

We will perform the following calls:

* `https://my-api.com/data`, whose response carries the next cursor: `{"data": [...], "metadata": {"next_cursor": "abcde12345"}}`
* `https://my-api.com/data?token=abcde12345`

and so on, until the next cursor is null.

</details>
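
In code, cursor-based pagination amounts to feeding each response's cursor into the next request. The following sketch assumes the response shape used in the example (`data` and `metadata.next_cursor`); `fetch_all_with_cursor` and `fake_fetch` are hypothetical names.

```python
def fetch_all_with_cursor(fetch_page, cursor_name="cursor"):
    """Follow the cursor returned by each response until it is null."""
    items, params = [], {}
    while True:
        body = fetch_page(params)
        items.extend(body["data"])
        # Equivalent of the jq filter ".metadata.next_cursor".
        next_cursor = body["metadata"]["next_cursor"]
        if next_cursor is None:
            return items
        params = {cursor_name: next_cursor}


# Fake API: the first call has no token, the second uses the returned cursor.
RESPONSES = {
    None: {"data": [1, 2], "metadata": {"next_cursor": "abcde12345"}},
    "abcde12345": {"data": [3], "metadata": {"next_cursor": None}},
}


def fake_fetch(params):
    return RESPONSES[params.get("token")]


all_items = fetch_all_with_cursor(fake_fetch, cursor_name="token")
```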

**Hyper Media Pagination (HyperMediaPaginationConfig)**

This configuration implements HATEOAS-style pagination using next links.

{% hint style="warning" %}
For this pagination type, all URLs need to have the same `base_url` configured. If the configured `base_url` is `https://my-api.com/data`, then all next page URLs must start with it, for example `https://my-api.com/data/_whatever`
{% endhint %}

**Parameters**:

* `next_link_filter`: **mandatory (string)** JQ filter to extract next page URL
* `next_link`: **mandatory (string)** field which bears the next link URL

**Use case**: RESTful APIs following HATEOAS principles.

<details>

<summary>Hyper Media pagination example</summary>

Let's take the following configuration:

* "next\_link\_filter": ".metadata.next\_page"
* "next\_link": "next"

![](https://1809014303-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZxYYf1KpgarKMgMsDCrw%2Fuploads%2Fgit-blob-d0332da235a4f8b26da80ab95a00d2a767d38448%2FCapture%20d%E2%80%99e%CC%81cran%202024-10-29%20a%CC%80%2017.22.32.png?alt=media)

We will perform the following calls:

* `GET https://my-api.com/data`, whose response carries the next page URL: `{"data": [...], "metadata": {"next_page": "https://my-api.com/data/next/page/2?auth_token=4321"}}`
* `GET https://my-api.com/data/next/page/2?auth_token=4321`

and so on, until the next page URL is null.

</details>
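
Following next links can be sketched the same way. The helper names below are hypothetical, and the response shape (`data` and `metadata.next_page`) is taken from the example above.

```python
def fetch_all_with_next_link(fetch_url, first_url):
    """Follow the next-page URL found in each response until it is null."""
    items, url = [], first_url
    while url is not None:
        body = fetch_url(url)
        items.extend(body["data"])
        url = body["metadata"]["next_page"]  # ".metadata.next_page"
    return items


# Fake API keyed by full URL.
RESPONSES = {
    "https://my-api.com/data": {
        "data": ["a"],
        "metadata": {"next_page": "https://my-api.com/data/next/page/2?auth_token=4321"},
    },
    "https://my-api.com/data/next/page/2?auth_token=4321": {
        "data": ["b"],
        "metadata": {"next_page": None},
    },
}
visited = []


def fake_fetch(url):
    visited.append(url)
    return RESPONSES[url]


all_items = fetch_all_with_next_link(fake_fetch, "https://my-api.com/data")
```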

## Example of connection to Open Data Paris

### Setting up the connection to Open Data Paris

```
name: open-data-paris
baseroute: https://opendata.paris.fr/api/
```

### Selecting data from Open Data Paris

```
Dataset: books
Method: GET
URL: records/1.0/search/
URL params: {"dataset": "les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret", "facet": "auteur"}
Filter: .records[].fields
```

The JSON response looks like this:

{% code overflow="wrap" %}

```json
{
  "nhits": 1000,
  "parameters": { ... },
  "records": [
    {
      "datasetid": "les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret",
      "recordid": "4b950c1ac5459379633d74ed2ef7f1c7f5cc3a10",
      "fields": {
        "nombre_de_reservations": 1094,
        "url_de_la_fiche_de_l_oeuvre": "https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1009613",
        "url_de_la_fiche_de_l_auteur": "https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1009613",
        "support": "indéterminé",
        "auteur": "Enders, Giulia",
        "titre": "Le charme discret de l'intestin [Texte imprimé] : tout sur un organe mal aimé"
      },
      "record_timestamp": "2017-01-26T11:17:33+00:00"
    },
    {
      "datasetid": "les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret",
      "recordid": "3df76bd20ab5dc902d0c8e5219dbefe9319c5eef",
      "fields": {
        "nombre_de_reservations": 746,
        "url_de_la_fiche_de_l_oeuvre": "https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1016593",
        "url_de_la_fiche_de_l_auteur": "https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1016593",
        "support": "Bande dessinée pour adulte",
        "auteur": "Sattouf, Riad",
        "titre": "L'Arabe du futur [Texte imprimé]. 2. Une jeunesse au Moyen-Orient, 1984-1985"
      },
      "record_timestamp": "2017-01-26T11:17:33+00:00"
    },
    ...
  ]
}
```

{% endcode %}

We apply the filter `.records[].fields`, which means that for every entry in the `records` property, it extracts all the properties of the `fields` object. So we end up with a table of results looking like this (some columns are skipped in this example):

| nombre\_de\_reservations | auteur         | skipped columns… |
| ------------------------ | -------------- | ---------------- |
| 1094                     | Enders, Giulia | …                |
| 746                      | Sattouf, Riad  | …                |
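
For readers who want to check the request and the filter outside Toucan, here is a small sketch using only the standard library. It assumes that the dataset name and the facet are sent as query-string parameters, works on a trimmed copy of the response above, and uses a list comprehension in place of the jq filter.

```python
from urllib.parse import urlencode

baseroute = "https://opendata.paris.fr/api/"
url = "records/1.0/search/"
params = {
    "dataset": "les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret",
    "facet": "auteur",
}

# The endpoint is appended to the baseroute, then the query string is added.
request_url = f"{baseroute.rstrip('/')}/{url.lstrip('/')}?{urlencode(params)}"

# Trimmed copy of the JSON response, keeping two columns per record.
response = {
    "nhits": 1000,
    "records": [
        {"fields": {"nombre_de_reservations": 1094, "auteur": "Enders, Giulia"}},
        {"fields": {"nombre_de_reservations": 746, "auteur": "Sattouf, Riad"}},
    ],
}

# Equivalent of the jq filter ".records[].fields".
rows = [record["fields"] for record in response["records"]]
```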

{% hint style="info" %}
**Note**: the reason to have a `filter` option is to allow you to take any API response and transform it into something that fits into a column-based data frame. jq is designed to be concise and easy for simple tasks, but if you dig a little deeper, you’ll find a full-featured functional programming language hiding underneath.
{% endhint %}

{% hint style="warning" %}
**Performance**\
If the HTTP API connector is used in a live context, make sure that the API is performant enough to return data quickly. For suitable performance, retrieve a limited amount of data, since it needs additional transformation to unnest it (in the case of a JSON response).
{% endhint %}

{% hint style="success" %}
After selecting data from your connector you will be able to create a dataset thanks to [YouPrep](https://docs-v3.toucantoco.com/data-management-in-datahub/datasets-in-toucan/preparing-data/overview-of-youprep-tm) using the selection as "source step".
{% endhint %}
