ydb-platform · evanevanevanevannnn · Jan 24, 2025 · dorooleg
@@ -6,17 +6,17 @@ An example query to read data:
 
 ```yql
 SELECT
-    *
+  *
 FROM
  object_storage.`*.tsv`
 WITH
 (
- FORMAT = "tsv_with_names",
-    SCHEMA =
- (
- ts Uint32,
-        action Utf8
- )
+  FORMAT = "tsv_with_names",
+  SCHEMA =
+  (
+    ts Uint32,
+    action Utf8
+  )
 );
 ```
 
@@ -32,9 +32,9 @@ SELECT
 FROM
   <object_storage_connection_name>.`<file_path>`
 WITH(
- FORMAT = "<file_format>",
-  SCHEMA = (<schema_definition>),
-  COMPRESSION = "<compression>")
+  FORMAT = "<file_format>",
+  COMPRESSION = "<compression>",
+  SCHEMA = (<schema_definition>))
 WHERE
   <filter>;
 ```
@@ -44,8 +44,8 @@ Where:
 * `object_storage_connection_name` — the name of the external data source leading to the S3 bucket ({{ objstorage-full-name }}).
 * `file_path` — the path to the file or files inside the bucket. Wildcards `*` are supported; more details [in the section](#path_format).
 * `file_format` — the [data format](formats.md#formats) in the files.
-* `schema_definition` — the [schema definition](#schema) of the data stored in the files.
 * `compression` — the [compression format](formats.md#compression_formats) of the files.
+* `schema_definition` — the [schema definition](#schema) of the data stored in the files.
 
 ### Data schema description {#schema}
 
@@ -61,7 +61,33 @@ For example, the data schema below describes a schema field named `Year` of type
 Year Int32 NOT NULL
 ```
 
-If a data field is marked as required (`NOT NULL`) but this field is missing in the processed file, the processing of such a file will result in an error. If a field is marked as optional (`NULL`), no error will occur in the absence of the field in the processed file, but the field will take the value `NULL`.
+If a data field is marked as required (`NOT NULL`) but this field is missing in the processed file, the processing of such a file will result in an error. If a field is marked as optional (`NULL`), no error will occur in the absence of the field in the processed file, but the field will take the value `NULL`. The `NULL` keyword may be omitted for optional fields.
+
+### Schema inference {#inferring}
+
+Schema inference is available for all [data formats](formats.md#formats) except `raw` and `json_as_string`. It is useful when the schema contains a large number of fields. To avoid entering those fields manually, use the `WITH_INFER` parameter:
+
+```yql
+SELECT
+  <expression>
+FROM
+  <object_storage_connection_name>.`<file_path>`
+WITH(
+  FORMAT = "<file_format>",
+  COMPRESSION = "<compression>",
+  WITH_INFER = "true")
+WHERE
+  <filter>;
+```
+
+Where:
+
+* `object_storage_connection_name` — the name of the external data source leading to the S3 bucket ({{ objstorage-full-name }}).
+* `file_path` — the path to the file or files inside the bucket. Wildcards `*` are supported; more details [in the section](#path_format).
+* `file_format` — the [data format](formats.md#formats) in the files. All formats except for `raw` and `json_as_string` are supported.
+* `compression` — the [compression format](formats.md#compression_formats) of the files.
+
+When such a query is executed, the field names and types are inferred automatically.
 
 ### Data path formats {#path_format}
 
@@ -75,18 +101,18 @@ Example query to read data from S3 ({{ objstorage-full-name }}):
 
 ```yql
 SELECT
-    *
+  *
 FROM
-    connection.`folder/filename.csv`
+  connection.`folder/filename.csv`
 WITH(
- FORMAT = "csv_with_names",
-    SCHEMA =
- (
-        Year Int32,
- Manufacturer Utf8,
- Model Utf8,
- Price Double
- )
+  FORMAT = "csv_with_names",
+  SCHEMA =
+  (
+    Year Int32,
+    Manufacturer Utf8,
+    Model Utf8,
+    Price Double
+  )
 );
 ```
 

@@ -6,17 +6,17 @@
 
 ```yql
 SELECT
-    *
+  *
 FROM
-    object_storage.`*.tsv`
+  object_storage.`*.tsv`
 WITH
 (
-    FORMAT = "tsv_with_names",
-    SCHEMA =
-    (
-        ts Uint32,
-        action Utf8
-    )
+  FORMAT = "tsv_with_names",
+  SCHEMA =
+  (
+    ts Uint32,
+    action Utf8
+  )
 );
 ```
 
@@ -33,8 +33,8 @@ FROM
   <object_storage_connection_name>.`<file_path>`
 WITH(
   FORMAT = "<file_format>",
-  SCHEMA = (<schema_definition>),
-  COMPRESSION = "<compression>")
+  COMPRESSION = "<compression>",
+  SCHEMA = (<schema_definition>))
 WHERE
   <filter>;
 ```
@@ -44,8 +44,8 @@ WHERE
 * `object_storage_connection_name` — the name of the external data source leading to the S3 bucket ({{ objstorage-full-name }}).
 * `file_path` — the path to the file or files inside the bucket. Wildcards `*` are supported; more details [in the section](#path_format).
 * `file_format` — the [data format](formats.md#formats) in the files.
-* `schema_definition` — the [schema definition](#schema) of the data stored in the files.
 * `compression` — the [compression format](formats.md#compression_formats) of the files.
+* `schema_definition` — the [schema definition](#schema) of the data stored in the files.
 
 ### Data schema description {#schema}
 
@@ -61,7 +61,33 @@ WHERE
 Year Int32 NOT NULL
 ```
 
-If a data field is marked, as required (`NOT NULL`), but this field is missing in the processed file, processing of that file ends with an error. If a field is marked as optional (`NULL`), no error occurs when the field is missing in the processed file, but the field takes the value `NULL`.
+If a data field is marked as required (`NOT NULL`) but this field is missing in the processed file, processing of that file ends with an error. If a field is marked as optional (`NULL`), no error occurs when the field is missing in the processed file, but the field takes the value `NULL`. The `NULL` keyword may be omitted for optional fields.
+
+### Schema inference {#inferring}
+
+Schema inference is available for all [data formats](formats.md#formats) except `raw` and `json_as_string`. It is convenient when the schema contains a large number of fields. To avoid entering these fields manually, use the `WITH_INFER` parameter:
+
+```yql
+SELECT
+  <expression>
+FROM
+  <object_storage_connection_name>.`<file_path>`
+WITH(
+  FORMAT = "<file_format>",
+  COMPRESSION = "<compression>",
+  WITH_INFER = "true")
+WHERE
+  <filter>;
+```
+
+Where:
+
+* `object_storage_connection_name` — the name of the external data source leading to the S3 bucket ({{ objstorage-full-name }}).
+* `file_path` — the path to the file or files inside the bucket. Wildcards `*` are supported; more details [in the section](#path_format).
+* `file_format` — the [data format](formats.md#formats) in the files. All formats except `raw` and `json_as_string` are supported.
+* `compression` — the [compression format](formats.md#compression_formats) of the files.
+
+When such a query is executed, the field names and types are inferred automatically.
 
 ### Data path formats {#path_format}
 
@@ -75,18 +101,18 @@ Year Int32 NOT NULL
 
 ```yql
 SELECT
-    *
+  *
 FROM
-    connection.`folder/filename.csv`
+  connection.`folder/filename.csv`
 WITH(
-    FORMAT = "csv_with_names",
-    SCHEMA =
-    (
-        Year Int32,
-        Manufacturer Utf8,
-        Model Utf8,
-        Price Double
-    )
+  FORMAT = "csv_with_names",
+  SCHEMA =
+  (
+    Year Int32,
+    Manufacturer Utf8,
+    Model Utf8,
+    Price Double
+  )
 );
 ```
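
For illustration only, the schema-inference template added in this change could be filled in as follows. This is a sketch, not text from the PR: it reuses the `connection` data source, the `folder/filename.csv` path, and the `csv_with_names` format from the example above, and it assumes that `COMPRESSION` can be omitted for uncompressed files, as it is in that example.

```yql
-- Sketch: same file as in the explicit-schema example above,
-- but with the SCHEMA block replaced by WITH_INFER.
-- The connection name and file path are reused from the documentation example.
SELECT
  *
FROM
  connection.`folder/filename.csv`
WITH(
  FORMAT = "csv_with_names",
  WITH_INFER = "true"
);
```

Compared with the explicit-schema query, the field names and types (`Year`, `Manufacturer`, `Model`, `Price` in the example above) are not listed by hand. The types actually inferred depend on the file contents and may differ from those declared in the explicit schema.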