We classify data quality issues into 7 Quality Metrics, defined as follows:

Quality Metric | Description |
---|---|
COMPLETENESS | Refers to data that is incomplete or completely missing. For example, whether some text data is truncated or the content is empty. |
EFFECTIVENESS | Refers to whether the data is meaningful, suitable for a specific task, and conforms to the expected format or standard. For example, whether the text content contains garbled characters. |
FLUENCY | Refers to whether the data is fluent, grammatically correct, and reads naturally. For example, whether sentences conform to grammatical rules. |
RELEVANCE | Refers to whether the data contains content that is irrelevant to the task. For example, a text that describes medical knowledge but has unrelated advertising inserted into it. |
SECURITY | Refers to whether the data contains sensitive or private information, and whether it conforms to the culture and values of the countries involved (both the other party's values and our own). |
SIMILARITY | Refers to whether the data content is duplicated or closely resembles other content. |
UNDERSTANDABILITY | Refers to whether the data is easy to understand and interpret. For example, whether the data is clear, unambiguous, and meaningful in context. |
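
To make the classification concrete, the seven metrics can be modeled as an enumeration that individual checks report against. The following is only a minimal sketch: the names `QualityMetric` and `check_text` are hypothetical and not taken from this project, and the two toy checks (empty content, and the Unicode replacement character as a sign of garbled text) are simplified stand-ins for real rules.

```python
from enum import Enum
from typing import Set


class QualityMetric(Enum):
    """The seven quality metrics from the table above (hypothetical type)."""
    COMPLETENESS = "COMPLETENESS"
    EFFECTIVENESS = "EFFECTIVENESS"
    FLUENCY = "FLUENCY"
    RELEVANCE = "RELEVANCE"
    SECURITY = "SECURITY"
    SIMILARITY = "SIMILARITY"
    UNDERSTANDABILITY = "UNDERSTANDABILITY"


def check_text(text: str) -> Set[QualityMetric]:
    """Return the metrics that a piece of text violates.

    Illustrative only: covers two toy rules, not a full rule set.
    """
    issues: Set[QualityMetric] = set()
    # Empty or whitespace-only content counts as a COMPLETENESS issue.
    if not text.strip():
        issues.add(QualityMetric.COMPLETENESS)
    # U+FFFD (replacement character) usually indicates a decoding failure,
    # treated here as garbled text, i.e. an EFFECTIVENESS issue.
    if "\ufffd" in text:
        issues.add(QualityMetric.EFFECTIVENESS)
    return issues
```

Returning a set of metrics rather than a single value lets one sample be flagged under several metrics at once, for example a text that is both garbled and truncated.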