🔌Add a Databricks connector
Databricks provides an SQL API to read & extract data from clusters hosted on the main cloud providers. This connector provides an interface to execute SQL queries against this API.
This connector supports ‘on-demand’ clusters, i.e. self-stopping clusters. Make sure to tick the ON DEMAND parameter on the connector’s configuration form to handle queries on a stopped cluster.
Live datasets might not work properly with a self-stopped cluster.
The relevant driver must be installed and configured on your Toucan Toco workspace
Configuring a Databricks connection in Toucan
Retrieve ODBC connection information from Databricks as described here
Name (mandatory)
String
Use it to identify your connection
MyDatabricksConnection
Host (mandatory)
String
Hostname of the Databricks cluster; it can be found in the cluster configuration
my-databricks-cluster.cloudproviderdatabricks.net
Port (mandatory)
Integer
The listening port of your Databricks cluster
443 (default)
Http Path (mandatory)
String
URL of the Databricks compute resource; it can be retrieved from the cluster’s configuration in the Databricks UI, in the ‘ODBC’ section
sql/protocol/v1/o/xxx/yyy
User (mandatory)
String
"token" if you use a personal access token (PAT), or your username if you connect by username/password (deprecated since July 2024)
databricks_user
Password (mandatory)
String
Access token, generated from the Databricks UI in the user settings. It will be stored as a secret.
dapixxxxxx
ANSI
Boolean
Enforce compliance with the ANSI SQL standard for SQL operations and behaviors
On Demand
Boolean
If your cluster is self-stopping, make sure to tick this option. With it enabled, the connector tries to start a stopped cluster before running any query.
Retry Policy (optional)
Boolean
Allows you to configure a retry policy if the connection is flaky:
max_attempts: maximum number of retries before giving up
max_delay: delay in seconds above which the connection is dropped
wait_time: time in seconds between each retry
Slow Queries' Cache Expiration Time
Integer
Slow queries' cache expiration time
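To make the three retry policy parameters more concrete, here is a minimal Python sketch of how such a policy could behave. It is an illustration only, not the connector's actual implementation: the RetryPolicy class, the run_with_retry helper and the default values are hypothetical, and the sketch assumes that max_attempts counts every attempt (including the first one) and that max_delay bounds the total time spent retrying.

```python
import time
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Hypothetical container mirroring the three fields of the form."""
    max_attempts: int = 3      # assumed to count every attempt, first included
    max_delay: float = 60.0    # seconds; assumed to bound the total retry time
    wait_time: float = 5.0     # seconds between two attempts


def run_with_retry(action, policy, sleep=time.sleep, clock=time.monotonic):
    """Call `action` until it succeeds or the policy is exhausted."""
    started = clock()
    last_error = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return action()
        except ConnectionError as exc:
            last_error = exc
        # Stop if this was the last attempt, or if waiting again
        # would exceed the overall delay budget.
        elapsed = clock() - started
        if attempt == policy.max_attempts or elapsed + policy.wait_time > policy.max_delay:
            break
        sleep(policy.wait_time)
    if last_error is None:
        raise ValueError("max_attempts must be at least 1")
    raise last_error
```

With max_attempts=3 and wait_time=5, a query that keeps failing would be tried three times, with five seconds between tries, before the error is reported.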
Click on the TEST CONNECTION button, then SAVE the connection.
After successfully configuring the connector, you will be able to find it in the Connector section of the DataHub "Datasource" tab
If the cluster is stopped, the connection test might fail, but you can SAVE the configuration anyway.
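For readers who want to see how the Host, Port, Http Path, User and Password fields fit together, the following Python sketch assembles them into an ODBC connection string. It is a sketch under assumptions, not a description of what Toucan does internally: the build_odbc_connection_string function is hypothetical, the driver name depends on how the ODBC driver is installed on your system, and the host value is a placeholder. The keys (Host, Port, HTTPPath, SSL, ThriftTransport, AuthMech, UID, PWD) are the ones used by the Databricks ODBC driver for token authentication.

```python
def build_odbc_connection_string(host, port, http_path, user, password,
                                 driver="Simba Spark ODBC Driver"):
    """Assemble an ODBC connection string from the connector's form fields.

    The default driver name is an assumption; use the name or path
    under which the driver is registered on your machine.
    """
    params = {
        "Driver": driver,
        "Host": host,
        "Port": port,
        "HTTPPath": http_path,
        "SSL": 1,              # encrypted connection
        "ThriftTransport": 2,  # HTTP transport
        "AuthMech": 3,         # user/password style auth, used for tokens
        "UID": user,           # "token" when using a personal access token
        "PWD": password,       # the access token itself
    }
    return ";".join(f"{key}={value}" for key, value in params.items())


conn_str = build_odbc_connection_string(
    host="my-databricks-cluster.example.net",  # placeholder host
    port=443,
    http_path="sql/protocol/v1/o/xxx/yyy",
    user="token",
    password="dapixxxxxx",
)
# With pyodbc installed, the string could then be used as:
# connection = pyodbc.connect(conn_str, autocommit=True)
```

Building the string from a dictionary keeps each form field visible as one key, which makes it easier to compare against the values shown in the ‘ODBC’ section of the Databricks UI.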
Create a dataset from a Databricks connection
Please note that with a shut-down cluster, the query preview and live queries might be broken in the current state of the implementation. In such situations, the connector tries to start the cluster and waits for it to be started. If you plan to use the connector in an ‘on-demand’ fashion (i.e. with self-stopping clusters), use it only with stored datasets.
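The start-and-wait behaviour described above can be pictured with the following Python sketch. It is only an illustration of the general polling pattern, not the connector's code: ensure_cluster_running, get_state and start_cluster are hypothetical names, the timeout and polling interval are arbitrary, and the "RUNNING" state name is borrowed from the cluster states reported by Databricks.

```python
import time


def ensure_cluster_running(get_state, start_cluster, timeout=600.0,
                           poll_interval=10.0, sleep=time.sleep,
                           clock=time.monotonic):
    """Start the cluster if needed, then wait until it reports RUNNING.

    `get_state` and `start_cluster` are hypothetical callables standing
    in for whatever API is used to talk to the cluster.
    """
    if get_state() == "RUNNING":
        return  # nothing to do, queries can run immediately
    start_cluster()
    deadline = clock() + timeout
    while get_state() != "RUNNING":
        if clock() >= deadline:
            raise TimeoutError("cluster did not start within the timeout")
        sleep(poll_interval)
```

Because a cold start can take several minutes, a wait of this kind is acceptable for stored datasets that refresh in the background, but it is too slow for a query preview or a live query.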
To create a dataset from Databricks, click on the "create from icon"; you will then be able to configure:
QUERY: the SQL query you want to run
PARAMETERS (optional): dict, allows you to parameterize the query.
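As an illustration of how a QUERY and a PARAMETERS dict can work together, the Python sketch below turns named placeholders into positional "?" markers, the parameter style commonly used by ODBC drivers. The {{ name }} placeholder syntax, the to_qmark helper, and the table and column names are assumptions made for this example; check the query editor for the syntax your workspace actually expects.

```python
import re
from typing import Any

# Assumed placeholder syntax for this sketch: {{ name }}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def to_qmark(query: str, parameters: dict) -> tuple:
    """Replace each {{ name }} with "?" and collect the values in order."""
    values: list = []

    def _replace(match: "re.Match") -> str:
        # A missing key raises KeyError, which surfaces typos early.
        values.append(parameters[match.group(1)])
        return "?"

    return PLACEHOLDER.sub(_replace, query), values


# Hypothetical table and column names, for illustration only.
query = "SELECT * FROM sales WHERE country = {{ country }} AND year >= {{ min_year }}"
parameters: dict = {"country": "FR", "min_year": 2022}

sql, values = to_qmark(query, parameters)
# sql    -> "SELECT * FROM sales WHERE country = ? AND year >= ?"
# values -> ["FR", 2022]
```

Passing the values separately from the SQL text, instead of pasting them into the query string, lets the driver handle quoting and avoids SQL injection.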
After selecting data from your connector you will be able to create a dataset thanks to YouPrep using the selection as "source step".