Add some data validation steps to detect errors in your datasources sooner.
When a new data file is dropped through the studio or extracted during preprocess, it can be validated. You can check the data types of the columns, the number of rows in a dataset, the presence of required columns, the absence of duplicated rows, etc.
The list of validation rules for a data file is defined inside the etl_config.cson file, under the validation key of the corresponding data source.
    domain: "market-breakdown"
    type: "excel"
    file: "data-market-breakdown.xls"
    sheetname: "Data"
    validation: [
      type: "data_type"
      expected:
        "breakdown": "string"
        "pays": "string"
        "part": "number"
        "clients": "number"
    ,
      type: "pattern"
      expected: "[0-9]+"
      params:
        columns: ['metric']
    ]

Check the number of expected rows.
Keys:
expected: (number) number of expected rows
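To make the rule concrete, here is a minimal Python sketch of what a row-count check could look like. It is an illustration only, not the studio's implementation: the function name check_row_count is hypothetical, the dataset is modelled as a list of dicts, and it assumes the rule requires an exact match.

```python
def check_row_count(rows, expected):
    """Return True when the dataset has exactly `expected` rows.

    Assumption: the rule requires an exact match, not a minimum.
    """
    return len(rows) == expected


# Hypothetical dataset: one dict per row.
data = [
    {"pays": "France", "part": 0.4},
    {"pays": "Spain", "part": 0.6},
]
row_count_ok = check_row_count(data, expected=2)
```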
Ensure that a given list of columns is a subset of the dataset’s columns.
Keys:
expected: (list(str)) columns you expect to find
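The subset check above can be sketched in Python as follows. This is a hedged illustration, not the studio's code: find_missing_columns is a hypothetical name and the dataset is modelled as a list of dicts keyed by column name.

```python
def find_missing_columns(rows, expected):
    """Return the expected columns that are absent from the dataset.

    An empty result means `expected` is a subset of the dataset's columns.
    """
    present = set()
    for row in rows:
        present.update(row.keys())
    return [column for column in expected if column not in present]


# Hypothetical dataset: one dict per row.
data = [{"breakdown": "A", "pays": "France", "part": 0.4}]
missing = find_missing_columns(data, expected=["pays", "part", "clients"])
```

Returning the list of missing columns, rather than a plain boolean, makes the error message easier to act on.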
Ensure that the list of unique values of a given column corresponds exactly to a list of expected values.
Keys:
expected: (list) unique values
column: (string) column name
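A minimal sketch of this exact-match check, under the same assumptions (hypothetical function name, dataset as a list of dicts). Both sides are compared as sets, so an unexpected value and a missing expected value both fail.

```python
def check_unique_values(rows, column, expected):
    """Return True when the distinct values of `column` are exactly `expected`."""
    return {row[column] for row in rows} == set(expected)


# Hypothetical dataset: one dict per row.
data = [
    {"pays": "France"},
    {"pays": "Spain"},
    {"pays": "France"},
]
unique_ok = check_unique_values(data, column="pays", expected=["France", "Spain"])
```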
Duplicated rows can be assessed based on all the columns or only a subset of columns.
Keys:
columns: (list or string) list of columns to use or ‘all’
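The sketch below illustrates how a duplicate check could handle both forms of the columns key: a list of columns or the string 'all'. The name find_duplicates is hypothetical and the logic is an assumption about the behaviour, not the studio's implementation.

```python
from collections import Counter


def find_duplicates(rows, columns="all"):
    """Return the value tuples that appear in more than one row.

    `columns` is a list of column names, a single column name, or 'all'.
    """
    if columns == "all":
        keys = sorted({key for row in rows for key in row})
    elif isinstance(columns, str):
        keys = [columns]
    else:
        keys = list(columns)
    counts = Counter(tuple(row.get(key) for key in keys) for row in rows)
    return [values for values, count in counts.items() if count > 1]


# Hypothetical dataset: one dict per row.
data = [
    {"pays": "France", "part": 0.4},
    {"pays": "France", "part": 0.6},
    {"pays": "Spain", "part": 0.6},
]
dupes_on_pays = find_duplicates(data, columns=["pays"])
dupes_on_all = find_duplicates(data, columns="all")
```

Here the rows are unique across all columns, but two of them share the same value for pays, so only the subset check reports a duplicate.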
Check the value of a column (one value only). If the query returns more than one row, only the first row is used.
Keys:
expected: (string or number) expected value
column: (string) in which to check the value
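As an illustration of the "first row only" behaviour, here is a small hedged sketch. The name check_value is hypothetical, and treating an empty result as a failure is an assumption that the source does not confirm.

```python
def check_value(rows, column, expected):
    """Compare the value of `column` in the first row with `expected`.

    Rows after the first are ignored. Assumption: an empty result fails.
    """
    if not rows:
        return False
    return rows[0][column] == expected


# Hypothetical query result: only the first row is inspected.
data = [{"clients": 120}, {"clients": 999}]
value_ok = check_value(data, column="clients", expected=120)
```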
Check column data types. Four types are possible: number, string, date, or category.
Keys:
expected: (object) mapping of each column name to its expected type: ‘string’, ‘number’, ‘date’, or ‘category’
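The following Python sketch shows one way a type check driven by such a mapping could work. It is only an illustration: check_data_types and TYPE_CHECKS are hypothetical names, and the category type is left out because its exact semantics are not described here.

```python
from datetime import date, datetime
from numbers import Number

# Hypothetical predicates for three of the four documented types.
TYPE_CHECKS = {
    "string": lambda value: isinstance(value, str),
    "number": lambda value: isinstance(value, Number) and not isinstance(value, bool),
    "date": lambda value: isinstance(value, (date, datetime)),
}


def check_data_types(rows, expected):
    """Return the (column, type) pairs for which at least one value fails."""
    failures = []
    for column, type_name in expected.items():
        is_valid = TYPE_CHECKS[type_name]
        if not all(is_valid(row[column]) for row in rows):
            failures.append((column, type_name))
    return failures


# Hypothetical dataset: the second row stores `part` as text.
data = [
    {"breakdown": "A", "part": 0.4},
    {"breakdown": "B", "part": "0.6"},
]
type_failures = check_data_types(data, {"breakdown": "string", "part": "number"})
```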
Check that string values match a defined pattern.
Keys:
expected: pattern/regex as a string
params: object with a columns key: the list of columns to check.
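To show how a pattern such as "[0-9]+" could be applied to the listed columns, here is a minimal sketch using Python's re module. The name find_pattern_mismatches is hypothetical, and requiring the whole value to match (re.fullmatch) is an assumption; the studio may instead look for a partial match.

```python
import re


def find_pattern_mismatches(rows, expected, columns):
    """Return (row index, column, value) for each value that does not match.

    Assumption: the whole value must match the pattern.
    """
    regex = re.compile(expected)
    return [
        (index, column, row[column])
        for index, row in enumerate(rows)
        for column in columns
        if not regex.fullmatch(str(row[column]))
    ]


# Hypothetical dataset: the second metric contains a letter.
data = [{"metric": "123"}, {"metric": "12a"}]
mismatches = find_pattern_mismatches(data, expected="[0-9]+", columns=["metric"])
```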
Check that the given columns contain no null values.
Keys:
params: object with a columns key: list of columns.
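A hedged sketch of a null check over a list of columns. The names is_null and find_nulls are hypothetical, and treating both None and NaN as null is an assumption about what counts as a null value.

```python
import math


def is_null(value):
    """Treat None and float NaN as null (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_nulls(rows, columns):
    """Return (row index, column) for every null value in the given columns."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column in columns
        if is_null(row.get(column))
    ]


# Hypothetical dataset: the second row has no value for `clients`.
data = [
    {"pays": "France", "clients": 120},
    {"pays": "Spain", "clients": None},
]
nulls = find_nulls(data, columns=["pays", "clients"])
```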
Tutorial: Product Corporation
You can download the CSV file for our tutorial.
data-product-corporation.csv
Add validation for your datasource in etl_config.cson.
Drag and drop your new etl_config.cson in the CONFIG FILES page.
Go to your ‘DATA SOURCES’ page and drop your datasource.
Validation should pass. If it does not, the file is not uploaded.