🗃️Set validation rules

Add some data validation steps to detect errors in your datasources sooner.

When a new data file is dropped through the studio or extracted during preprocess, it can be validated. You can check the data types of the columns, the number of rows of a data set, the presence of some required columns, that there is no duplicated rows, etc.

The list of validation rules for data files are defined inside the etl_config.cson file under the validation key.

Example

domain: "market-breakdown"
type: "excel"
file: "data-market-breakdown.xls"
sheetname: "Data"
validation:[
  type: "data_type"
  expected:
    "breakdown": "string"
    "pays": "string"
    "part": "number"
    "clients": "number"
,
  type: "pattern"
  expected: "[0-9]+"
  params:
    columns: ['metric']
]

Rows

Check the number of expected rows.

Keys:

  • type: ‘rows’

  • expected: (number) number of expected rows

Columns

Ensure that a given list of columns is a subset of the dataset’s columns

Keys:

  • type: ‘columns’

  • expected: (list(str)) columns you expected to find

Unique values

Ensure that the list of unique values of a given column corresponds exactly to a list of expected values.

Keys:

  • type: ‘unique_values’

  • expected: (list) unique values

  • params:

  • column: (string) column name

No duplicates

Duplicated rows can be assessed based on all the columns or only a subset of columns.

Keys:

  • type: ‘no_duplicates’

  • params:

  • columns: (list or string) list of columns to use or ‘all’

Value

Check the value of a column (one value only). If the query returned more than one row, only the first one will be used.

Keys:

  • type: ‘value’

  • expected: (string or number) expected value

  • params:

  • column: (string) in which to check the value

Data type

Check column data types. Three possible types: number, string, date, or category.

Keys:

  • type: ‘data_type’

  • expected: : <’string’, ‘number’, ‘date’, or ‘category’>

Pattern

Check if string values correspond to a defined pattern

Keys:

  • type: ‘pattern’

  • expected: pattern/regex as a string

  • params: object with a columns key: the list of columns to check.

Not null

Check if some columns don’t have null value.

Keys:

  • type: ‘not_null’

  • params: object with a columns key: list of columns.

Tutorial : Product Corporation

You can download the CSV file for our tutorial.

data-product-corporation.csv

  • Add validation for your datasource in etl_config.cson

  • Drag and drop your new etl_config.cson in the CONFIG FILES page

  • Go to your ‘DATA SOURCES’ page and drop your datasource.

  • Validation should be ok. If not, the file is not uploaded.

Last updated

Was this helpful?