🗃️Set validation rules

Add some data validation steps to detect errors in your datasources sooner.

When a new data file is dropped through the studio or extracted during preprocess, it can be validated. You can check the data types of the columns, the number of rows of a data set, the presence of some required columns, that there is no duplicated rows, etc.

The list of validation rules for data files are defined inside the etl_config.cson file under the validation key.

Example

domain: "market-breakdown"
type: "excel"
file: "data-market-breakdown.xls"
sheetname: "Data"
validation:[
  type: "data_type"
  expected:
    "breakdown": "string"
    "pays": "string"
    "part": "number"
    "clients": "number"
,
  type: "pattern"
  expected: "[0-9]+"
  params:
    columns: ['metric']
]

Rows

Check the number of expected rows.

Keys:

type: ‘rows’
expected: (number) number of expected rows

Columns

Ensure that a given list of columns is a subset of the dataset’s columns

Keys:

type: ‘columns’
expected: (list(str)) columns you expected to find

type: 'columns'
expected: ['metric', 'target']

Unique values

Ensure that the list of unique values of a given column corresponds exactly to a list of expected values.

Keys:

type: ‘unique_values’
expected: (list) unique values
params:
column: (string) column name

type: 'unique_values'
expected: ["France", "Germany", "Spain"]
params:
    column: 'target'

No duplicates

Duplicated rows can be assessed based on all the columns or only a subset of columns.

Keys:

type: ‘no_duplicates’
params:
columns: (list or string) list of columns to use or ‘all’

type: 'no_duplicates'
params:
    columns: ['metric']

Value

Check the value of a column (one value only). If the query returned more than one row, only the first one will be used.

Keys:

type: ‘value’
expected: (string or number) expected value
params:
column: (string) in which to check the value

type: 'value'
expected: 30
params:
    column: 'target'

Data type

Check column data types. Three possible types: number, string, date, or category.

Keys:

type: ‘data_type’
expected: : <’string’, ‘number’, ‘date’, or ‘category’>

type: 'data_type'
expected:
    'metric': 'number'
    'target': 'number'

Pattern

Check if string values correspond to a defined pattern

Keys:

type: ‘pattern’
expected: pattern/regex as a string
params: object with a columns key: the list of columns to check.

type: 'pattern'
expected: '[0-9]+'
params:
    columns: ['metric']
]

Not null

Check if some columns don’t have null value.

Keys:

type: ‘not_null’
params: object with a columns key: list of columns.

type: 'not_null'
params:
    columns : ['zone']

Tutorial : Product Corporation

You can download the CSV file for our tutorial.
data-product-corporation.csv
Add validation for your datasource in etl_config.cson
DATA_SOURCES: [
 {
   domain: 'data-product-corpo'
   file: 'data-product-corporation.csv'
   skip_rows: 0
   separator: ','
   encoding: 'utf-8'
   type: 'csv'
   validation:[
  type: 'rows'
  expected: 46
   ,
  type: 'columns'
  expected: ['brand', 'country', 'product_line']
   ]
 }
]
Drag and drop your new etl_config.cson in the CONFIG FILES page
Go to your ‘DATA SOURCES’ page and drop your datasource.
Validation should be ok. If not, the file is not uploaded.

Last updated 2 years ago

Was this helpful?

hashtagExample

hashtagRows

hashtagColumns

hashtagUnique values

hashtagNo duplicates

hashtagValue

hashtagData type

hashtagPattern

hashtagNot null

Example

Rows

Columns

Unique values

No duplicates

Value

Data type

Pattern

Not null