Validating CSV data with the Tabular validator
The Tabular validator checks tabular data — a table of typed rows — against rules you define. CSV is the file type it reads today (TSV, Excel, and Parquet are planned), which is why it's called "Tabular" rather than "CSV". A typical use is gating a delivery: a researcher sends a Darwin Core species export, a contractor sends a building energy meter file, and you want to confirm every column has the right type and every row makes sense before the data moves on.
The validator describes your columns with an open standard — the Frictionless Table Schema — the same way our SHACL validator uses SHACL shapes and our JSON validator uses JSON Schema. If you're new to Table Schema, the standard's documentation is a short, readable reference for the field types and constraints.
What you'll need
- A Validibot account with permission to author workflows.
- A workflow whose allowed file types include CSV (create one if you don't have it).
- Either a sample CSV to infer a schema from, or a Table Schema descriptor you already have.
How you describe your columns
A Table Schema is a small, plain-JSON description of your columns: each field's name, its type, and any constraints. A minimal descriptor looks like this:
{
  "fields": [
    { "name": "site_id", "type": "string", "constraints": { "required": true } },
    { "name": "depth_m", "type": "number", "constraints": { "minimum": 0, "maximum": 11000 } },
    { "name": "recorded", "type": "date" },
    { "name": "status", "type": "string", "constraints": { "enum": ["ok", "flagged"] } }
  ],
  "primaryKey": ["site_id"]
}
You don't have to write this by hand. When you set up the step you supply the schema one of two ways:
- Infer it from a sample file (the quick path). Upload a representative CSV and Validibot reads the header, works out the delimiter, and guesses each column's type from its values. Treat the result as a starting point — it picks types only, so you then add the ranges, allowed values, and "required" flags that matter to you.
- Paste or import a descriptor. If you already have a Table Schema descriptor (many open-data portals and research data packages ship one), paste it in and the column rules populate directly.
Because Validibot reads and exports the standard descriptor, a schema you build here is portable: you can export it and reuse it elsewhere, or bring one in from another tool.
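To make the "infer it from a sample file" path concrete, here is a minimal Python sketch of the general idea: sniff the delimiter, read the header, and guess one type per column. It is not Validibot's inference code. The helper names (guess_type, infer_schema) and the candidate delimiter list are assumptions made for illustration, and it only guesses four types.

```python
import csv
import io
import json
import re
from datetime import date

# Hypothetical helpers: a sketch of type inference, not Validibot's implementation.
_INT = re.compile(r"[+-]?[0-9]+")
_NUM = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def _is_iso_date(text):
    """True if the text parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Pick the narrowest type that fits every non-empty value."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"
    if all(_INT.fullmatch(v) for v in present):
        return "integer"
    if all(_NUM.fullmatch(v) for v in present):
        return "number"
    if all(_is_iso_date(v) for v in present):
        return "date"
    return "string"


def infer_schema(sample_text):
    """Sniff the delimiter, read the header, and guess a type per column."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=",;\t|")
    rows = list(csv.reader(io.StringIO(sample_text), dialect))
    header, body = rows[0], rows[1:]
    fields = [
        {"name": name, "type": guess_type([r[i] for r in body if i < len(r)])}
        for i, name in enumerate(header)
    ]
    return {"fields": fields}


sample = (
    "site_id;depth_m;recorded;status\n"
    "A1;12.5;2024-03-01;ok\n"
    "B2;300;2024-03-02;flagged\n"
)
print(json.dumps(infer_schema(sample), indent=2))
```

The output holds names and types only, which is why the inferred schema is a starting point: ranges, allowed values, and "required" flags still have to be added by hand.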
Setting up a Tabular step
- Open the workflow editor and click Add step.
- Pick Tabular Validator from the library.
- Give the step a name like "Occurrence file check" and a one-line description.
- Set the dialect — the delimiter (comma by default) and whether the file has a header row. If you're unsure, Validibot can sniff the delimiter from a sample.
- Supply the column schema by either uploading a sample CSV to infer one, or pasting a Table Schema descriptor.
- Review the inferred or imported columns and tighten the constraints — add ranges, allowed values, patterns, and which columns are required or unique.
- Click Save step.
- Open the saved step and click Add assertion to add any cross-field row rules (see below).
Make sure CSV is enabled in the workflow's allowed file types, or the Tabular validator won't be selectable on the step.
Two kinds of rule
A Tabular step checks your data in two complementary ways, and both feed the same findings list.
Column rules come straight from the schema and run automatically — required, type, numeric range, string length, regex pattern, an allowed set of values, and uniqueness (including composite primary keys). These are the per-column checks a schema is good at.
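As a rough illustration of what per-column checks involve, the sketch below applies a few Table Schema constraints to cells read as text. It is not how Validibot implements them. The function names (check_cell, duplicate_keys) and the rule labels they return are hypothetical, and only the number type, "required", "minimum", "maximum", "enum", "pattern", and primary-key uniqueness are covered.

```python
import re
from decimal import Decimal

# Illustrative only: a hypothetical checker for a handful of constraints.
_NUMBER = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def check_cell(field, text):
    """Return the names of the column rules this cell fails."""
    rules = field.get("constraints", {})
    if text == "":
        # An empty cell only fails when the column is required.
        return ["required"] if rules.get("required") else []
    failures = []
    if field.get("type") == "number":
        if not _NUMBER.fullmatch(text):
            return ["type"]
        value = Decimal(text)
        if "minimum" in rules and value < rules["minimum"]:
            failures.append("minimum")
        if "maximum" in rules and value > rules["maximum"]:
            failures.append("maximum")
    if "enum" in rules and text not in rules["enum"]:
        failures.append("enum")
    if "pattern" in rules and not re.fullmatch(rules["pattern"], text):
        failures.append("pattern")
    return failures


def duplicate_keys(rows, key_columns):
    """Return 1-based data row numbers whose key repeats an earlier row."""
    seen, repeats = set(), []
    for number, row in enumerate(rows, start=1):
        key = tuple(row[c] for c in key_columns)  # composite keys work too
        if key in seen:
            repeats.append(number)
        seen.add(key)
    return repeats


depth = {"name": "depth_m", "type": "number",
         "constraints": {"minimum": 0, "maximum": 11000}}
print(check_cell(depth, "12000"))
print(duplicate_keys([{"site_id": "A1"}, {"site_id": "B2"}, {"site_id": "A1"}],
                     ["site_id"]))
```

Each check looks at one column (or one key) in isolation, which is the boundary between column rules and the row rules described next.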
Row rules are for everything a column-by-column schema can't express: comparisons across columns, or conditional logic. You write them as CEL expressions on the step, using the "row." prefix. For example:
row.min_depth <= row.max_depth // one column must not exceed another
row.recorded <= now() // a date can't be in the future
row.count == 0 ? row.status == "absent" : true // conditional rule
A row rule may only reference a column you've declared in the schema, so a mistyped or missing column name is caught when you save the step — not on the next run.
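The save-time check can be pictured roughly as follows. This is a simplified Python sketch, assuming column references always take the form row.<name>. A real implementation would most likely walk the parsed CEL expression instead of using a regular expression, and undeclared_columns is a hypothetical name.

```python
import re

# Matches "row.<identifier>". A simplification: it does not skip string
# literals or comments the way a real CEL parser would.
_ROW_REF = re.compile(r"\brow\.([A-Za-z_][A-Za-z0-9_]*)")


def undeclared_columns(expression, schema):
    """Return the referenced column names that the schema does not declare."""
    declared = {field["name"] for field in schema["fields"]}
    referenced = set(_ROW_REF.findall(expression))
    return sorted(referenced - declared)


schema = {"fields": [{"name": "min_depth", "type": "number"},
                     {"name": "max_depth", "type": "number"}]}

print(undeclared_columns("row.min_depth <= row.max_depth", schema))
print(undeclared_columns("row.min_depht <= row.max_depth", schema))
```

An empty list means the rule can be saved; anything else names the columns to fix before the step is accepted.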
Reviewing a failed run
When a submission fails, the run detail page lists the findings. Tabular findings are summarised, not per-row: a single failed check produces one finding carrying the total number of failures plus a sample of the offending row numbers. So a column with a million bad cells gives you one readable finding ("1,204,883 values failed depth_m range, e.g. rows 12, 88, 415…"), not a million lines to scroll.
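The summarising behaviour can be sketched as a simple grouping step. The Python below is an assumption about the shape of the idea, not Validibot's findings format: the field names (count, sample_rows) and the sample size of three are invented for the example.

```python
from collections import defaultdict

SAMPLE_SIZE = 3  # hypothetical cap on how many row numbers a finding keeps


def summarise(failures):
    """Collapse (row_number, column, rule) failures into one finding per check."""
    grouped = defaultdict(lambda: {"count": 0, "sample_rows": []})
    for row_number, column, rule in failures:
        finding = grouped[(column, rule)]
        finding["count"] += 1
        # Keep only the first few row numbers as a sample.
        if len(finding["sample_rows"]) < SAMPLE_SIZE:
            finding["sample_rows"].append(row_number)
    return [
        {"column": column, "rule": rule, **finding}
        for (column, rule), finding in grouped.items()
    ]


failures = [
    (12, "depth_m", "range"),
    (88, "depth_m", "range"),
    (415, "depth_m", "range"),
    (9001, "depth_m", "range"),
]
print(summarise(failures))
```

However many cells fail a given check, the result stays at one entry per column and rule, with a running count and a short sample.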
Pass the findings back to whoever sent the file, with the specific column and the sample rows, and ask for a corrected delivery.
Determinism
The validator reads every cell as text and coerces it the same way on every machine — no locale-dependent number parsing, no silent "NaN" substitution, and dates are ISO 8601 only. Two people validating the same file get the same result, which is what lets a downstream signed credential stand behind it.
A consequence worth knowing: the validator is strict by design. A ragged row or an unbalanced quote fails the read with a clear message rather than being quietly repaired. It's a quality gate, not a data cleaner.
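For a feel of what strict, locale-independent reading means in practice, here is a small Python sketch using only the standard library. It illustrates the principle, not Validibot's reader: the function names are hypothetical, and the accepted number format (optional sign, digits, optional "." fraction) is an assumption.

```python
import csv
import io
import re
from datetime import date
from decimal import Decimal

# "." is the only decimal mark accepted; "NaN", "1,5", and padded values are rejected.
_NUMBER = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_number(text):
    """Parse a number the same way everywhere, or raise ValueError."""
    if not _NUMBER.fullmatch(text):
        raise ValueError(f"not a number: {text!r}")
    return Decimal(text)


def parse_date(text):
    """Accept YYYY-MM-DD only; anything else raises ValueError."""
    if not _ISO_DATE.fullmatch(text):
        raise ValueError(f"not an ISO 8601 date: {text!r}")
    return date.fromisoformat(text)


def read_strict(text, delimiter=","):
    """Return (header, rows), raising instead of repairing bad input."""
    # strict=True makes the csv module raise csv.Error on malformed quoting.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, strict=True)
    header = next(reader)
    rows = []
    for row in reader:
        if len(row) != len(header):
            raise ValueError(
                f"line {reader.line_num}: expected {len(header)} cells, got {len(row)}"
            )
        rows.append(row)
    return header, rows


header, rows = read_strict("site_id,depth_m\nA1,12.50\n")
print(header, [parse_number(r[1]) for r in rows])
```

In this sketch a ragged row raises ValueError and an unbalanced quote raises csv.Error, so a malformed file stops the read and is never silently patched.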
Where to learn more
- Table Schema standard — field types and constraint keywords are at datapackage.org.
- CEL expressions — see CEL Expressions for writing row rules.
- Running validations — see Run Validations for how to submit a file once your workflow is set up.