Machine-readable output format of record validation #116

nichtich · 2021-08-18T08:57:44Z

Validation results should be available in a machine-readable format so that they can be processed further.

What should be encoded in record validation output?

There are syntax errors, raised when a record cannot be parsed (e.g. broken character encoding, pseudo-tags or subfield codes, garbage; for instance a field with PICA tag 037l, a typo of 037I or 037L), and validation errors.

Questions:

Which format? TAP is not verbose enough. JUnit XML cannot be streamed and is not strictly specified, but it is the most common format nevertheless.
What information must be recorded?
- Which record (number and/or record identifier)
- Which field(s) and subfield(s)
- Type of error

Some error types can be found here. Do we need an identifier for each error type? Here is a (possibly incomplete) list of error types:

syntax errors
- syntax error in record (full record could not be parsed)
- syntax error in field (field could not be parsed)
- syntax error in annotation (annotation could not be parsed but rest of field is ok)
fields
- unknown field (record contains an undefined field)
- non-repeatable field is repeated
- required field is missing
- custom field error (e.g. inconsistency between subfields): see Avram: allow custom rules dini-ag-kim/avram#14
subfields
- unknown subfield (field contains an undefined subfield)
- empty subfield value (subfield value should not be the empty string)
- invalid subfield order
- non-repeatable subfield is repeated
- missing required subfield
- subfield value (optionally at a position) does not match expected pattern
- subfield value (optionally at a position) does not match expected code
- custom subfield error: see Avram: allow custom rules dini-ag-kim/avram#14
annotations
- unknown annotation (anything but +, -, space, ?, !)
- missing annotation (if annotation expected)
- unexpected annotation (if no annotation was expected)

Error types can be grouped into categories:

malformed (record / field / subfield / annotation) for syntax errors
unknown (field / subfield / annotation / subfield value) for elements not defined in a schema
missing (field / subfield / annotation) for required elements that do not occur
repeated (field / subfield) for elements that should not be repeated
disordered (field / subfield) for elements that should occur in a different order
invalid (record / field / subfield value) for custom validation errors (e.g. dependencies between fields)
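If each error type gets an identifier, one option is to derive it from the category and the element it applies to. The following is only a sketch: the `element_category` naming scheme, the `Category` enum and the `error_type` helper are assumptions for illustration, not part of any existing implementation.

```python
from enum import Enum


class Category(Enum):
    """Error categories as listed above."""
    MALFORMED = "malformed"
    UNKNOWN = "unknown"
    MISSING = "missing"
    REPEATED = "repeated"
    DISORDERED = "disordered"
    INVALID = "invalid"


# Elements each category applies to, copied from the list above.
APPLIES_TO = {
    Category.MALFORMED: ("record", "field", "subfield", "annotation"),
    Category.UNKNOWN: ("field", "subfield", "annotation", "subfield_value"),
    Category.MISSING: ("field", "subfield", "annotation"),
    Category.REPEATED: ("field", "subfield"),
    Category.DISORDERED: ("field", "subfield"),
    Category.INVALID: ("record", "field", "subfield_value"),
}


def error_type(element: str, category: Category) -> str:
    """Build an identifier such as 'field_unknown' (hypothetical naming scheme)."""
    if element not in APPLIES_TO[category]:
        raise ValueError(f"category {category.value!r} does not apply to {element!r}")
    return f"{element}_{category.value}"


# Every identifier that the categories above allow.
ALL_ERROR_TYPES = [error_type(e, c) for c in Category for e in APPLIES_TO[c]]
```

A fixed set of identifiers like this would make it possible to count or filter errors by type without parsing human-readable messages.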

How to encode record validation output

The current format (TAP) is fine for human inspection and basic analysis (just use grep). Another popular format is JUnit XML, but it is rather verbose and still does not cover error types. Someone came up with a less verbose encoding of JUnit XML in JSON. In JUnit XML there are:

testsuites (each with a required name)
testcases (each with a required name and a required classname)
errors (each with a message, type and description)

This does not match the information needed to analyse validation results (fields, subfields, error types...), so JUnit XML is not a good fit. Data quality analysis differs from unit testing.

nichtich · 2021-09-23T07:47:43Z

Schema validators for other schema languages might be of help, e.g. ajv JSON Schema error objects. Similarly, an error object could consist of:

record: PPN or record counter
path: PICA field and optional subfield (e.g. 003@, 021A$a...); this could also be split into field and subfield
error: error type (field_unknown, subfield_missing...)
message: human readable error message
data: the full record, field or subfield value that caused an error
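A minimal sketch of such an error object, serialized as one JSON object per line so that results can be streamed. The class name `ValidationError`, the helper `split_path`, the newline-delimited JSON encoding and the sample values are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class ValidationError:
    record: str                 # PPN or record counter
    path: str                   # PICA field and optional subfield, e.g. "021A$a"
    error: str                  # error type, e.g. "subfield_missing"
    message: str                # human-readable error message
    data: Optional[str] = None  # record, field or subfield value that caused the error

    def to_json_line(self) -> str:
        """Serialize as a single line of JSON (hypothetical output format)."""
        return json.dumps(asdict(self), ensure_ascii=False)


def split_path(path: str) -> Tuple[str, Optional[str]]:
    """Split a path such as '021A$a' into field and optional subfield code."""
    field, sep, subfield = path.partition("$")
    return field, (subfield if sep else None)


if __name__ == "__main__":
    # Made-up sample error, using a record counter instead of a PPN.
    err = ValidationError(
        record="1",
        path="021A$a",
        error="subfield_missing",
        message="missing required subfield a in field 021A",
    )
    print(err.to_json_line())
```

Keeping `path` as one string stays close to the ajv style, while `split_path` shows that field and subfield can still be recovered when needed.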

jorol · 2021-09-23T15:07:41Z

👍++

I would like to keep the default output on the command-line simple, maybe just tab-separate the values in a specific order.
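Something like the following could work for the simple default, sketched here with Python's `csv` module. The column order in `COLUMNS` and the function name `write_tsv` are just assumptions; errors are plain dicts with the keys proposed above.

```python
import csv
import sys

# Hypothetical fixed column order for the default command-line output.
COLUMNS = ("record", "path", "error", "message")


def write_tsv(errors, out=sys.stdout):
    """Write one tab-separated line per error; missing keys become empty cells."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for err in errors:
        writer.writerow([err.get(col, "") for col in COLUMNS])


if __name__ == "__main__":
    # Made-up sample error for demonstration.
    write_tsv([
        {
            "record": "1",
            "path": "003@",
            "error": "field_missing",
            "message": "required field is missing",
        }
    ])
```

Output like this stays easy to read and works with grep and cut, while the more verbose format remains available for further processing.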

nichtich · 2021-09-29T11:27:44Z

Also relevant: the definition of validation rules in Avram (see https://format.gbv.de/schema/avram/specification#validation-rules and dini-ag-kim/avram#5).

nichtich mentioned this issue Sep 20, 2021

Avram schema validation deutsche-nationalbibliothek/pica-rs#288
