What should be encoded in record validation output?
There are two kinds of errors: syntax errors, when a record could not be parsed (e.g. broken character encoding, pseudo-tags or pseudo subfield codes, garbage; for instance a field with PICA tag 037l, a typo of 037I or 037L), and validation errors.
Questions:
Which format? The TAP format is not verbose enough. JUnit XML cannot be streamed and is not strictly specified, but it is nevertheless the most common.
What information must be recorded?
Which record (number and/or record identifier)
Which field(s) and subfield(s)
Type of error
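As a starting point for discussion, the three items above could be captured in one small structure per error. This is only a sketch: the class name `ValidationError` and all field names are assumptions made for illustration, not an existing API or an agreed format.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ValidationError:
    """One validation or syntax error (hypothetical structure)."""
    record_number: int               # position of the record in the input, starting at 1
    error_type: str                  # type of error, e.g. "malformed field"
    record_id: Optional[str] = None  # record identifier, if one could be extracted
    field: Optional[str] = None      # affected field, e.g. a PICA tag
    subfield: Optional[str] = None   # affected subfield code, if any
    message: str = ""                # human-readable description


# The pseudo-tag 037l mentioned above, reported as a syntax error in a field.
error = ValidationError(
    record_number=3,
    error_type="malformed field",
    field="037l",
    message="037l is not a valid PICA tag",
)
error_as_dict = asdict(error)
```

Making the record identifier optional reflects that a record which cannot be parsed may not yield an identifier, so the record number is the only value that is always available.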
Some error types can be found here. Do we need an identifier for each error type? Here is a (possibly incomplete) list of error types:
syntax errors
syntax error in record (full record could not be parsed)
syntax error in field (field could not be parsed)
syntax error in annotation (annotation could not be parsed but rest of field is ok)
fields
unknown field (record contains an undefined field)
unexpected annotation (if no annotation was expected)
Error types can be grouped into categories:
malformed (record / field / subfield / annotation) for syntax errors
unknown (field / subfield / annotation / subfield value) for elements not defined in a schema
missing (field* / subfield / annotation) for required elements that do not occur
repeated (field / subfield) for elements that should not be repeated
disordered (field / subfield) for elements that should occur in a different order
invalid (record / field / subfield value) for custom validation errors (e.g. dependencies between fields)
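If an identifier per error type is wanted, one option is to derive it from the category and the affected element. The sketch below copies the combinations from the list above; the function name `error_id` and the `category-element` naming scheme are assumptions made for illustration only.

```python
# Which elements each category applies to, as listed above.
ALLOWED = {
    "malformed": {"record", "field", "subfield", "annotation"},
    "unknown": {"field", "subfield", "annotation", "subfield value"},
    "missing": {"field", "subfield", "annotation"},
    "repeated": {"field", "subfield"},
    "disordered": {"field", "subfield"},
    "invalid": {"record", "field", "subfield value"},
}


def error_id(category: str, element: str) -> str:
    """Build an identifier such as 'unknown-subfield-value' (hypothetical scheme)."""
    if element not in ALLOWED.get(category, set()):
        raise ValueError(f"no error type for {category!r} / {element!r}")
    return f"{category}-{element.replace(' ', '-')}"


# All identifiers this scheme would produce, in a stable order.
all_ids = sorted(error_id(c, e) for c, elements in ALLOWED.items() for e in elements)
```

With the combinations listed above this yields 18 identifiers, which are easy to filter with grep or to group by category prefix.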
How to encode record validation output
The current format (TAP) is fine for human inspection and basic analysis (just use grep). Another popular format is JUnit XML, but it is rather verbose and still does not cover error types. Someone came up with a less verbose encoding of JUnit XML in JSON. In JUnit XML there are:
testsuites (each with a required name)
testcases (each with a required name and a required classname)
errors (each with a message, type and description)
This structure does not match the information needed to analyse validation results (fields, subfields, error types, ...), so JUnit XML is not a good fit. Data quality analysis differs from unit testing.
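One alternative that can be streamed and still processed with grep is to write one JSON object per line. Newline-delimited JSON is an assumption here, not a format this issue has settled on, and the function name `write_errors` and the keys in the example are hypothetical.

```python
import io
import json
from typing import Iterable, TextIO


def write_errors(errors: Iterable[dict], out: TextIO) -> int:
    """Write each error as one JSON line as soon as it is available."""
    count = 0
    for error in errors:
        # One object per line keeps the output streamable and line-oriented.
        out.write(json.dumps(error, ensure_ascii=False, sort_keys=True) + "\n")
        count += 1
    return count


# Usage with an in-memory buffer; a real run would pass sys.stdout or a file.
buffer = io.StringIO()
written = write_errors(
    [
        {"record": 1, "field": "037l", "type": "malformed field"},
        {"record": 2, "field": "021A", "subfield": "x", "type": "unknown subfield"},
    ],
    buffer,
)
lines = buffer.getvalue().splitlines()
```

Because every line is a complete object, a consumer can start processing before validation has finished, which is not possible with a single enclosing XML document.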
To further process validation results.