From 74db7898d71b974d637e5de088e1f0ccdfa9ea4d Mon Sep 17 00:00:00 2001 From: Pedro Holanda Date: Wed, 8 May 2024 16:50:19 +0200 Subject: [PATCH 1/4] new rejects --- docs/data/csv/reading_faulty_csv_files.md | 159 +++++++++++++++++----- 1 file changed, 127 insertions(+), 32 deletions(-) diff --git a/docs/data/csv/reading_faulty_csv_files.md b/docs/data/csv/reading_faulty_csv_files.md index a2dd96740b7..ef0ac6fb269 100644 --- a/docs/data/csv/reading_faulty_csv_files.md +++ b/docs/data/csv/reading_faulty_csv_files.md @@ -3,11 +3,79 @@ layout: docu title: Reading Faulty CSV Files redirect_from: --- +CSV files can come in all shapes and forms, with some presenting many errors that make the process of cleanly reading them inherently difficult. To help users read these files, DuckDB supports detailed error messages, the ability to skip faulty lines, and the possibility of storing faulty lines in a temporary table to assist users with a data cleaning step. -Reading erroneous CSV files is possible by utilizing the `ignore_errors` option. With that option set, rows containing data that would otherwise cause the CSV Parser to generate an error will be ignored. +## Structural Errors +DuckDB supports the detection and skipping of several different structural errors. In this section, we will go over each error with an example. +For the examples, consider the following table: + +```sql +CREATE TABLE people (name VARCHAR, birth_date DATE); +``` + +* CAST: Casting errors occur when a column in the CSV file cannot be cast to the expected schema value. For example, the line `Pedro,The 90s` would cause an error since the string `The 90s` cannot be cast to a date. +* MISSING COLUMNS: This error occurs if a line in the CSV file has fewer columns than expected. In our example, we expect two columns; therefore, a row with just one value, e.g., `Pedro`, would cause this error. +* TOO MANY COLUMNS: This error occurs if a line in the CSV has more columns than expected. In our example, any line with more than two columns would cause this error, e.g., `Pedro,01-01-1992,pdet`. +* UNQUOTED VALUE: Quoted values in CSV lines must always be unquoted at the end; if a quoted value remains quoted throughout, it will cause an error. For example, assuming our scanner uses `quote="`, the line `"pedro"holanda, 01-01-1992` would present an unquoted value error. +* LINE SIZE OVER MAXIMUM: DuckDB has a parameter that sets the maximum line size a CSV file can have, which by default is set to `2,097,152`bytes. Assuming our scanner is set to `max_line_size = 25`, the line `Pedro Holanda, 01-01-1992` would produce an error, as it exceeds 25 bytes. +* INVALID UNICODE: DuckDB only supports UTF-8 strings; thus, lines containing non-UTF-8 characters will produce an error. For example, the line `pedro\xff\xff, 01-01-1992` would be problematic. + +### Anatomy of a CSV error +By default, when performing a CSV read, if any structural errors are encountered, the scanner will immediately stop the scanning process and throw the error to the user. +These errors are designed to provide as much information as possible to allow users to evaluate them directly in their CSV file. + +This is the full error message: +``` +Conversion Error: CSV Error on Line: 5648 +Original Line: Pedro,The 90s +Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY) + +Column date is being converted as type DATE +This type was auto-detected from the CSV file. +Possible solutions: +* Override the type for this column manually by setting the type explicitly, e.g. types={'birth_date': 'VARCHAR'} +* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1 +* Use a COPY statement to automatically derive types from an existing table. + + file= people.csv + delimiter = , (Auto-Detected) + quote = " (Auto-Detected) + escape = " (Auto-Detected) + new_line = \r\n (Auto-Detected) + header = true (Auto-Detected) + skip_rows = 0 (Auto-Detected) + date_format = (DD-MM-YYYY) (Auto-Detected) + timestamp_format = (Auto-Detected) + null_padding=0 + sample_size=20480 + ignore_errors=false + all_varchar=0 +``` + +The first block provides us with information regarding where the error occurred, including the line number, the original CSV line, and which field was problematic. +``` +Conversion Error: CSV Error on Line: 5648 +Original Line: Pedro,The 90s +Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY) +``` + +The second block provides us with potential solutions. +``` +Column date is being converted as type DATE +This type was auto-detected from the CSV file. +Possible solutions: +* Override the type for this column manually by setting the type explicitly, e.g. types={'birth_date': 'VARCHAR'} +* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1 +* Use a COPY statement to automatically derive types from an existing table. +``` +Since the type of this field was auto-detected, it suggests defining the field as a `VARCHAR` or fully utilizing the dataset for type detection. + +Finally, the last block presents some of the options used in the scanner that can cause errors, indicating whether they were auto-detected or manually set by the user. ## Using the `ignore_errors` Option +There are cases where CSV files may have multiple structural errors, and users simply wish to skip these and read the correct data. Reading erroneous CSV files is possible by utilizing the `ignore_errors` option. With this option set, rows containing data that would otherwise cause the CSV parser to generate an error will be ignored. In our example, we will demonstrate a CAST error, but note that any of the errors described in our Structural Error section would cause the faulty line to be skipped. + For example, consider the following CSV file, [`faulty.csv`](/data/faulty.csv): ```csv @@ -59,21 +127,51 @@ Outputs: ## Retrieving Faulty CSV Lines -Being able to read faulty CSV files is important, but for many data cleaning operations, it is also necessary to know exactly which lines are corrupted and what errors the parser discovered on them. For scenarios like these, it is possible to use DuckDB's CSV Rejects Table feature. It is important to note that the Rejects Table can only be used when `ignore_errors` is set, and currently, only stores casting errors and does not save errors when the number of columns differ. +Being able to read faulty CSV files is important, but for many data cleaning operations, it is also necessary to know exactly which lines are corrupted and what errors the parser discovered on them. For scenarios like these, it is possible to use DuckDB's CSV Rejects Table feature. +By default, this feature creates two temporary tables. +1. `reject_scans`: Stores information regarding the parameters of the CSV Scanner +2. `reject_errors`: Stores information regarding each CSV faulty line and in which CSV Scanner they happened. -The CSV Rejects Table returns the following information: +### Reject Scans + +The CSV Reject Scans Table returns the following information:
| Column name | Description | Type | |:--|:-----|:-| -| `file` | File path. | `VARCHAR` | -| `line` | Line number, from the CSV File, where the error occured. | `INTEGER` | -| `column` | Column number, from the CSV File, where the error occured. | `INTEGER` | -| `column_name` | Column name, from the CSV File, where the error occured. | `VARCHAR` | -| `parsed_value` | The value, where the casting error happened, in a string format. | `VARCHAR` | -| `recovery_columns` | An optional primary key of the CSV File. | `STRUCT {NAME: VALUE}` | -| `error` | Exact error encountered by the parser. | `VARCHAR` | +| `scan_id` | The internal ID used in DuckDB to represent that scanner | `UBIGINT` | +| `file_id` | A scanner might happen over multiple files, so the file_id represents a unique file in a scanner | `UBIGINT` | +| `file_path` | The file path | `VARCHAR` | +| `delimiter` | The delimiter used e.g., ; | `VARCHAR` | +| `quote` | The quote used e.g., " | `VARCHAR` | +| `escape` | The quote used e.g., " | `VARCHAR` | +| `newline_delimiter` | The newline delimiter used e.g., \r\n | `VARCHAR` | +| `skip_rows` | If any rows were skipped from the top of the file | `UINTEGER` | +| `has_header` | If the file has a header | `BOOLEAN` | +| `columns` | The schema of the file (i.e., all column names and types) | `VARCHAR` | +| `date_format` | The format used for date types | `VARCHAR` | +| `timestamp_format` | The format used for timestamp types| `VARCHAR` | +| `user_arguments` | Any extra scanner parameters manually set by the user | `VARCHAR` | + +### Reject Errors + +The CSV Reject Errors Table returns the following information: + +
+ +| Column name | Description | Type | +|:--|:-----|:-| +| `scan_id` | The internal ID used in DuckDB to represent that scanner, used to join with reject scans tables | `UBIGINT` | +| `file_id` | The file_id represents a unique file in a scanner, used to join with reject scans tables | `UBIGINT` | +| `line` | Line number, from the CSV File, where the error occured. | `UBIGINT` | +| `line_byte_position` | Byte Position of the start of the line, where the error occured. | `UBIGINT` | +| `byte_position` | Byte Position where the error occured. | `UBIGINT` | +| `column_idx` | If the error happens in a specific column, the index of the column. | `UBIGINT` | +| `column_name` | If the error happens in a specific column, the name of the column. | `VARCHAR` | +| `error_type` | The type of the error that happened. | `ENUM` | +| `csv_line` | The original CSV line. | `VARCHAR` | +| `error_message` | The error message produced by DuckDB. | `VARCHAR` | ## Parameters @@ -83,21 +181,24 @@ The parameters listed below are used in the `read_csv` function to configure the | Name | Description | Type | Default | |:--|:-----|:-|:-| -| `rejects_table` | Name of a temporary table where the information of the faulty lines of a CSV file are stored. | `VARCHAR` | (empty) | +| `store_rejects` | If set to true, any errors in the file will be skipped and stored in the default rejects temporary tables.| `BOOLEAN` | False | +| `rejects_scan` | Name of a temporary table where the information of the scan information of faulty CSV file are stored. | `VARCHAR` | reject_scans | +| `rejects_table` | Name of a temporary table where the information of the faulty lines of a CSV file are stored. | `VARCHAR` | reject_errors | | `rejects_limit` | Upper limit on the number of faulty records from a CSV file that will be recorded in the rejects table. 0 is used when no limit should be applied. | `BIGINT` | 0 | -To store the information of the faulty CSV lines in a rejects table, the user must simply provide the rejects table name in the`rejects_table` option. For example: + + +To store the information of the faulty CSV lines in a rejects table, the user must simply set the `store_rejects` option to true. For example: ```sql FROM read_csv( 'faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'}, - rejects_table = 'rejects_table', - ignore_errors = true + store_rejects = true ); ``` -You can then query the `rejects_table` table, to retrieve information about the rejected tuples. For example: +You can then query both the `rejects_scans` and `reject_errors` tables, to retrieve information about the rejected tuples. For example: ```sql FROM rejects_table; @@ -107,28 +208,22 @@ Outputs:
-| file | line | column | column_name | parsed_value | error | -|------------|------|--------|-------------|--------------|------------------------------------------------| -| faulty.csv | 2 | 1 | age | three | Could not convert string ' three' to 'INTEGER' | +| scan_id | file_id | file_path | delimiter | quote | escape | newline_delimiter | skip_rows | has_header | columns | date_format | timestamp_format | user_arguments | +|---------|---------|-----------------------------------|-----------|-------|--------|-------------------|-----------|-----------:|--------------------------------------|-------------|------------------|--------------------| +| 5 | 0 | faulty.csv | , | " | " | \n | 0 | false | {'name': 'VARCHAR','age': 'INTEGER'} | | | store_rejects=true | + - +| scan_id | file_id | line | line_byte_position | byte_position | column_idx | column_name | error_type | csv_line | error_message | +|---------|---------|------|--------------------|---------------|------------|-------------|------------|---------------------|------------------------------------------------------------------------------------| +| 5 | 0 | 2 | 10 | 23 | 2 | age | CAST | Oogie Boogie, three | Error when converting column "age". Could not convert string " three" to 'INTEGER' | + + From 69dd58de85259a4701faf331d9d30372895b9935 Mon Sep 17 00:00:00 2001 From: Pedro Holanda Date: Wed, 8 May 2024 16:52:59 +0200 Subject: [PATCH 2/4] add little note --- docs/data/csv/reading_faulty_csv_files.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/data/csv/reading_faulty_csv_files.md b/docs/data/csv/reading_faulty_csv_files.md index ef0ac6fb269..49195324448 100644 --- a/docs/data/csv/reading_faulty_csv_files.md +++ b/docs/data/csv/reading_faulty_csv_files.md @@ -132,6 +132,8 @@ By default, this feature creates two temporary tables. 1. `reject_scans`: Stores information regarding the parameters of the CSV Scanner 2. `reject_errors`: Stores information regarding each CSV faulty line and in which CSV Scanner they happened. +Note that any of the errors described in our Structural Error section will be stored in the rejects tables. Also, if a line has multiple errors, multiple entries will be stored for the same line, one for each error. + ### Reject Scans The CSV Reject Scans Table returns the following information: From f33d89318d11ecf84f40f3d7291e134712e5f0b1 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Wed, 8 May 2024 17:19:46 +0200 Subject: [PATCH 3/4] Formatting adjustments --- docs/data/csv/reading_faulty_csv_files.md | 35 ++++++++++++++--------- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/docs/data/csv/reading_faulty_csv_files.md b/docs/data/csv/reading_faulty_csv_files.md index 49195324448..64160028327 100644 --- a/docs/data/csv/reading_faulty_csv_files.md +++ b/docs/data/csv/reading_faulty_csv_files.md @@ -3,9 +3,11 @@ layout: docu title: Reading Faulty CSV Files redirect_from: --- + CSV files can come in all shapes and forms, with some presenting many errors that make the process of cleanly reading them inherently difficult. To help users read these files, DuckDB supports detailed error messages, the ability to skip faulty lines, and the possibility of storing faulty lines in a temporary table to assist users with a data cleaning step. ## Structural Errors + DuckDB supports the detection and skipping of several different structural errors. In this section, we will go over each error with an example. For the examples, consider the following table: @@ -13,19 +15,23 @@ For the examples, consider the following table: CREATE TABLE people (name VARCHAR, birth_date DATE); ``` -* CAST: Casting errors occur when a column in the CSV file cannot be cast to the expected schema value. For example, the line `Pedro,The 90s` would cause an error since the string `The 90s` cannot be cast to a date. -* MISSING COLUMNS: This error occurs if a line in the CSV file has fewer columns than expected. In our example, we expect two columns; therefore, a row with just one value, e.g., `Pedro`, would cause this error. -* TOO MANY COLUMNS: This error occurs if a line in the CSV has more columns than expected. In our example, any line with more than two columns would cause this error, e.g., `Pedro,01-01-1992,pdet`. -* UNQUOTED VALUE: Quoted values in CSV lines must always be unquoted at the end; if a quoted value remains quoted throughout, it will cause an error. For example, assuming our scanner uses `quote="`, the line `"pedro"holanda, 01-01-1992` would present an unquoted value error. -* LINE SIZE OVER MAXIMUM: DuckDB has a parameter that sets the maximum line size a CSV file can have, which by default is set to `2,097,152`bytes. Assuming our scanner is set to `max_line_size = 25`, the line `Pedro Holanda, 01-01-1992` would produce an error, as it exceeds 25 bytes. -* INVALID UNICODE: DuckDB only supports UTF-8 strings; thus, lines containing non-UTF-8 characters will produce an error. For example, the line `pedro\xff\xff, 01-01-1992` would be problematic. +DuckDB detects the following error types: + +* `CAST`: Casting errors occur when a column in the CSV file cannot be cast to the expected schema value. For example, the line `Pedro,The 90s` would cause an error since the string `The 90s` cannot be cast to a date. +* `MISSING COLUMNS`: This error occurs if a line in the CSV file has fewer columns than expected. In our example, we expect two columns; therefore, a row with just one value, e.g., `Pedro`, would cause this error. +* `TOO MANY COLUMNS`: This error occurs if a line in the CSV has more columns than expected. In our example, any line with more than two columns would cause this error, e.g., `Pedro,01-01-1992,pdet`. +* `UNQUOTED VALUE`: Quoted values in CSV lines must always be unquoted at the end; if a quoted value remains quoted throughout, it will cause an error. For example, assuming our scanner uses `quote='"'`, the line `"pedro"holanda, 01-01-1992` would present an unquoted value error. +* `LINE SIZE OVER MAXIMUM`: DuckDB has a parameter that sets the maximum line size a CSV file can have, which by default is set to `2,097,152` bytes. Assuming our scanner is set to `max_line_size = 25`, the line `Pedro Holanda, 01-01-1992` would produce an error, as it exceeds 25 bytes. +* `INVALID UNICODE`: DuckDB only supports UTF-8 strings; thus, lines containing non-UTF-8 characters will produce an error. For example, the line `pedro\xff\xff, 01-01-1992` would be problematic. + +### Anatomy of a CSV Error -### Anatomy of a CSV error By default, when performing a CSV read, if any structural errors are encountered, the scanner will immediately stop the scanning process and throw the error to the user. These errors are designed to provide as much information as possible to allow users to evaluate them directly in their CSV file. -This is the full error message: -``` +This is an example for a full error message: + +```console Conversion Error: CSV Error on Line: 5648 Original Line: Pedro,The 90s Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY) @@ -52,15 +58,17 @@ Possible solutions: all_varchar=0 ``` -The first block provides us with information regarding where the error occurred, including the line number, the original CSV line, and which field was problematic. -``` +The first block provides us with information regarding where the error occurred, including the line number, the original CSV line, and which field was problematic: + +```console Conversion Error: CSV Error on Line: 5648 Original Line: Pedro,The 90s Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY) ``` -The second block provides us with potential solutions. -``` +The second block provides us with potential solutions: + +```console Column date is being converted as type DATE This type was auto-detected from the CSV file. Possible solutions: @@ -68,6 +76,7 @@ Possible solutions: * Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1 * Use a COPY statement to automatically derive types from an existing table. ``` + Since the type of this field was auto-detected, it suggests defining the field as a `VARCHAR` or fully utilizing the dataset for type detection. Finally, the last block presents some of the options used in the scanner that can cause errors, indicating whether they were auto-detected or manually set by the user. From 1d16d961b58932bd0b4e053251b5e62840c5e4c4 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Wed, 8 May 2024 17:41:50 +0200 Subject: [PATCH 4/4] Fixes --- docs/data/csv/reading_faulty_csv_files.md | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/docs/data/csv/reading_faulty_csv_files.md b/docs/data/csv/reading_faulty_csv_files.md index 64160028327..ee295577bc6 100644 --- a/docs/data/csv/reading_faulty_csv_files.md +++ b/docs/data/csv/reading_faulty_csv_files.md @@ -136,8 +136,9 @@ Outputs: ## Retrieving Faulty CSV Lines -Being able to read faulty CSV files is important, but for many data cleaning operations, it is also necessary to know exactly which lines are corrupted and what errors the parser discovered on them. For scenarios like these, it is possible to use DuckDB's CSV Rejects Table feature. +Being able to read faulty CSV files is important, but for many data cleaning operations, it is also necessary to know exactly which lines are corrupted and what errors the parser discovered on them. For scenarios like these, it is possible to use DuckDB's CSV Rejects Table feature. By default, this feature creates two temporary tables. + 1. `reject_scans`: Stores information regarding the parameters of the CSV Scanner 2. `reject_errors`: Stores information regarding each CSV faulty line and in which CSV Scanner they happened. @@ -197,8 +198,6 @@ The parameters listed below are used in the `read_csv` function to configure the | `rejects_table` | Name of a temporary table where the information of the faulty lines of a CSV file are stored. | `VARCHAR` | reject_errors | | `rejects_limit` | Upper limit on the number of faulty records from a CSV file that will be recorded in the rejects table. 0 is used when no limit should be applied. | `BIGINT` | 0 | - - To store the information of the faulty CSV lines in a rejects table, the user must simply set the `store_rejects` option to true. For example: ```sql @@ -223,8 +222,6 @@ Outputs: |---------|---------|-----------------------------------|-----------|-------|--------|-------------------|-----------|-----------:|--------------------------------------|-------------|------------------|--------------------| | 5 | 0 | faulty.csv | , | " | " | \n | 0 | false | {'name': 'VARCHAR','age': 'INTEGER'} | | | store_rejects=true | - - ```sql FROM rejects_table; ``` @@ -236,5 +233,3 @@ Outputs: | scan_id | file_id | line | line_byte_position | byte_position | column_idx | column_name | error_type | csv_line | error_message | |---------|---------|------|--------------------|---------------|------------|-------------|------------|---------------------|------------------------------------------------------------------------------------| | 5 | 0 | 2 | 10 | 23 | 2 | age | CAST | Oogie Boogie, three | Error when converting column "age". Could not convert string " three" to 'INTEGER' | - -