🔌 Setting up a Databricks Connector

Databricks provides an SQL API to read & extract data from clusters hosted on the main cloud providers. This connector provides an interface to execute SQL queries against this API.

This connector supports ‘on-demand’ clusters, i.e. self-stopping clusters. Make sure to tick the ON DEMAND parameter on the connector’s configuration form to handle queries on a stopped cluster.

Live datasets might not work properly when the cluster has stopped itself.

The relevant driver must be installed and configured on your Toucan Toco workspace.

Configuring the Databricks connector in Toucan

Retrieve ODBC connection information from Databricks as described here

Fill connection parameters:

NAME: name given to your connector
HOST: usually in the format my-databricks-cluster.cloudproviderdatabricks.net, you can retrieve it from your cluster’s configuration
PORT: default is 443
HTTPPATH: sql/protocol/v1/o/xxx/yyy, you can retrieve it from the ‘ODBC’ section of the cluster’s configuration in the Databricks UI
PWD: your access token (generated from Databricks UI in user settings), usually in this format dapixxxxxx
ON DEMAND: if your cluster is self-stopping, make sure to tick this option. With this option, the connector will try to start the cluster, if it is stopped, before running any query
Then hit the TEST CONNECTION button.

If the cluster is stopped, the connection test might fail, but you can SAVE the configuration anyway.
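
To sanity-check the values outside Toucan, the sketch below shows how the form fields could map onto an ODBC connection string. It is not the connector's actual implementation: the helper name `build_connection_string`, the driver path, and the `ThriftTransport`, `SSL`, `AuthMech` and `UID=token` settings follow common Simba Spark ODBC driver usage and are assumptions to check against the driver installed on your side.

```python
def build_connection_string(host, http_path, pwd, port=443,
                            driver="/opt/simba/spark/lib/64/libsparkodbc_sb64.so"):
    """Assemble an ODBC connection string from the connector's form fields.

    The driver path and the fixed settings below are assumptions, not values
    taken from the Toucan connector.
    """
    parts = {
        "Driver": driver,        # assumed install path of the ODBC driver
        "Host": host,            # HOST field
        "Port": port,            # PORT field, 443 by default
        "HTTPPath": http_path,   # HTTPPATH field
        "ThriftTransport": 2,    # assumed: HTTP transport
        "SSL": 1,                # assumed: TLS enabled
        "AuthMech": 3,           # assumed: user name / password authentication
        "UID": "token",          # assumed: literal "token" when PWD is an access token
        "PWD": pwd,              # PWD field (access token)
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# Placeholder values copied from the field descriptions above.
conn_str = build_connection_string(
    host="my-databricks-cluster.cloudproviderdatabricks.net",
    http_path="sql/protocol/v1/o/xxx/yyy",
    pwd="dapixxxxxx",
)
```

If `pyodbc` and the driver are available, the resulting string could be passed to `pyodbc.connect(conn_str, autocommit=True)` to reproduce what TEST CONNECTION checks.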

After successfully configuring the connector, you will be able to find it in the Connector section of the DataHub "Datasource" tab.

Selecting data from Databricks

⚠ Please note that when the cluster is shut down, the query preview and live queries might fail with the current implementation. In such situations, the connector tries to start the cluster and waits for it to be started. If you plan to use the connector in an ‘on-demand’ fashion (i.e. with self-stopping clusters), use it only with stored datasets.

To create a dataset from Databricks, click on the "create from icon". You will then be able to fill in:

QUERY: the SQL query you want to run
PARAMETERS (optional): a dict that lets you parameterize the query
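
As an illustration of what parameterizing a query with a dict can look like, here is a minimal sketch that turns named placeholders into the positional `?` markers used by ODBC and returns the values in matching order. The `%(name)s` placeholder syntax, the table and column names, and the helper `to_positional` are assumptions made for the example; the syntax Toucan expects in the QUERY field may differ.

```python
import re

# Matches placeholders such as %(country)s (assumed syntax, see above).
_PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def to_positional(query, parameters):
    """Replace %(name)s placeholders with '?' and collect values in order."""
    values = []

    def _swap(match):
        # Raises KeyError if the query uses a name missing from `parameters`.
        values.append(parameters[match.group(1)])
        return "?"

    return _PLACEHOLDER.sub(_swap, query), values


# Hypothetical QUERY and PARAMETERS values.
query = "SELECT * FROM sales WHERE country = %(country)s AND year >= %(year)s"
parameters = {"country": "FR", "year": 2020}

sql, values = to_positional(query, parameters)
```

Binding the values separately from the SQL text, rather than formatting them into the string, leaves quoting to the driver and avoids SQL injection.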

We specifically designed this connector to handle DATA REFRESH from on-demand clusters. During this process, the connector will try to start the cluster and wait for it to be ready before running queries.
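
The start-and-wait behaviour described above can be sketched as a simple polling loop. This is not the connector's code: the function `wait_until_running` and its parameters are hypothetical, and the cluster is reached through two injected callables so the example runs without a network. In a real setup they could wrap the Databricks Clusters REST API (`clusters/get`, which reports a `state` such as `PENDING`, `RUNNING` or `TERMINATED`, and `clusters/start`); whether the connector relies on that API is an assumption.

```python
import time


def wait_until_running(get_state, start_cluster, timeout=600, poll_interval=10,
                       sleep=time.sleep, clock=time.monotonic):
    """Start a stopped cluster if needed, then poll until it reports RUNNING."""
    state = get_state()
    if state == "TERMINATED":
        start_cluster()
    deadline = clock() + timeout
    while state != "RUNNING":
        if clock() > deadline:
            raise TimeoutError(f"cluster still {state} after {timeout}s")
        sleep(poll_interval)
        state = get_state()
    return state


# Demo with fake callables standing in for the real API calls.
states = iter(["TERMINATED", "PENDING", "PENDING", "RUNNING"])
started = []

result = wait_until_running(
    get_state=lambda: next(states),
    start_cluster=lambda: started.append(True),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
```

Starting a cluster can take several minutes, which is consistent with the recommendation to use on-demand clusters with stored datasets rather than live queries.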

After selecting data from your connector, you will be able to create a dataset with YouPrep, using the selection as the "source step".
