This discussion was initially started in our Slack channel, but I wanted to move it here before we got too far along. I've copied the response thread from Slack into this comment so that we don't lose context.
---
I've done some more thinking on the inputs needed by the parser CLI, and I think I've arrived at a workable concrete proposal. Feedback appreciated.

First, here's an attempt to enumerate all the inputs that the parser will need, broken down by what's known at different times:

- What the user provides as part of the connector configuration
- What the user provides as part of their collection
- What the connector provides at runtime

Of course, technically all those inputs are provided by the connector, but the connector-provided properties are special because they can often only be known by the connector. For example, if you capture from an S3 prefix, the connector won't know the filenames and content types until it sees the objects appear in the bucket.

A concrete proposal

So the thought here is that these three groups ought to each be provided to the parser CLI separately, to avoid the connectors having to do a bunch of munging to extract the relevant sections of their config and catalog. Here's a proposal for how that might work. Please feel free to give feedback or alternative ideas.

The "connector configuration" group is provided as a JSON object in the Flow catalog, as part of the connector configuration JSON, and is simply passed to the connector as part of its normal configuration. The connector then invokes the parser CLI, handing it that configuration along with the collection's projections and the metadata that's only known at runtime. Putting it all together, an invocation might look like the sketch below.
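As a purely illustrative sketch (the binary name, flags, and paths here are my assumptions, not part of the proposal):

```console
# Hypothetical parser invocation from inside a connector.
# --config:       parser configuration extracted from the user-provided
#                 connector configuration JSON
# --projections:  projections taken from the user's collection
# --content-type: runtime metadata that only the connector knows
parser parse \
  --config=/tmp/parser-config.json \
  --projections=/tmp/projections.json \
  --content-type=text/csv \
  /tmp/downloaded-s3-object
```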
Note on projections

Projections are an easy and powerful way for users to control the mapping between JSON and tabular formats, but they're also completely optional. In Flow, projections are automatically inferred from the JSON schema, and the intent is for the parser CLI to do exactly the same thing. The docs on projections can be found here.
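To make that concrete, here's a hypothetical set of projections a user might declare on their collection; the field names and pointers are made up for illustration. Each projection maps a tabular field name to a location within the JSON document:

```json
{
  "projections": {
    "order_id": "/id",
    "customer_name": "/customer/name"
  }
}
```

Given these, a CSV header named `order_id` would parse into the document location `/id` instead of a top-level `order_id` property.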
---
As we build connectors, one of the things that we'll need along with them is parsers. The Airbyte protocol seems pretty opinionated that parsing happens internally within each connector. For example, each connector is expected to output JSON documents, so if a CSV file is being imported, then the connector must be responsible for converting the CSV to JSON. This approach is pretty reasonable, but it does have some drawbacks.
Things to watch out for
The main drawback is the potential for inconsistency of parsers in different connectors. We want all our connectors to parse each file format in the same way, and we want the parser configuration to look more or less the same for each connector where it makes sense. That requires a factoring where all our connectors can use the same underlying code for parsing.
An easily forgotten point here is that error reporting should also be pretty consistent. Bad error reporting can have a profound negative impact on the user experience, and any capture that does parsing is relatively likely to encounter parse errors. For example, ask a user how their CSV formatter handles quote characters within strings. Does it double the characters to escape them (`"` becomes `""`), or does it use backslash escape sequences (`"` becomes `\"`)? Most people would say something like "uhh, I'll have to check". Users want to just give it a try with the defaults, and then see if it breaks. But that workflow requires that we consistently provide descriptive error messages with location information.
Also on the topic of parse errors, it can be extremely useful to have parsers available locally, apart from connectors, during development and debugging. A common use case is to pull a problematic file down locally, so you can test with different parser configurations and see what works. This doesn't necessarily require that parsers are decoupled from connectors, of course, since we could just have a "local disk" connector for use during development (this was Paxata's approach). But it's a pretty common use case that can have a pretty big impact on user experience, so I think the main takeaway is that the debugging experience should be pretty frictionless overall, however we approach it.
Suggested interface
In terms of an interface, I think we'd want something like the following:
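Something roughly like this, where the binary name and flags are placeholders rather than a settled design:

```console
# Sketch only: parse a file, writing one JSON document per line to stdout
# (the output behavior is an assumption on my part).
parser parse --format=csv --config=parser-config.json ./path/to/input.csv
```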
The `format` argument is optional, and allows users to manually specify the format of files. The way Fivetran handles this is, I think, pretty good: if the user specifies the format, then Fivetran won't try to detect it, and will just use whatever they specified. But if the format is blank, then they'll run detection on each file and try to do the right thing automatically. This works pretty well, since they also let you easily filter your files. So you could configure the connector to only import `.*.whatever` files and manually set the format to have them parsed however you want.
The `config` argument is a structure with configuration for all possible formats, for example:
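Something along these lines, say; all of the property names below are illustrative rather than a settled schema:

```json
{
  "csv": {
    "delimiter": ",",
    "quote": "\"",
    "escape": "\\",
    "headers": null
  },
  "jsonl": {},
  "compression": null
}
```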
Putting all the format configuration into one object works well for cases where we may want to capture files of different formats, and where we want to detect formats automatically where possible. We should have reasonable defaults for all the fields, which will oftentimes be inferred by inspecting the actual file contents. So I don't expect that most users will have to set a bunch of these options. But we should anticipate that users will have directories with multiple file types, and handle that automatically when possible.
In addition to actual parsing options, I think we'll also want some consistent options for automatically adding fields to the output JSON. For example, to add the original filename at `/_meta/sourceFilename`, or the line number at `/_meta/sourceLineNumber`. I could imagine use cases for adding other bits of information to each document, for example the name of the Kinesis stream, or something else to differentiate one capture source from another. This is probably functionality we can add later, as long as our design can accommodate it.
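For example, a parsed CSV row might come out looking something like this (the `/_meta` pointers match the examples above; everything else is invented for illustration):

```json
{
  "id": 7,
  "name": "an example row",
  "_meta": {
    "sourceFilename": "inbox/2021-06-01.csv",
    "sourceLineNumber": 42
  }
}
```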
Suggested implementation
An idea that I'm really liking is to implement our parsers as a CLI, probably in Rust. This makes it easy to have that consistency, even when connectors are written in different languages. And using Rust is appealing in part because it gives us easy access to our JSON schema inference code, and because it's generally very well suited to efficient and safe parsing.
Another potential benefit is that such a parser could theoretically also be adopted by Airbyte themselves, providing greater consistency across the entire connector ecosystem. A statically linked binary would be pretty easy to use with any language and docker image, so this could actually be fairly realistic to achieve if there's any interest from their team.
Putting it all together, we'd end up with a consistent parser config structure that could be embedded in any connector configuration structure. So our S3 configuration might look something like:
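Here's a rough, hypothetical sketch; the property names are illustrative, with the parser config from above embedded under a `parser` property:

```json
{
  "awsRegion": "us-east-1",
  "bucket": "example-bucket",
  "prefix": "incoming/",
  "matchKeys": ".*\\.csv",
  "parser": {
    "format": "csv",
    "csv": {
      "delimiter": ","
    }
  }
}
```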
Right now we really just have the S3 connector that will need to do parsing, so I think we'll want to keep things as simple as possible to start out. But having a very loose coupling between the connectors and the parser will make it easy to update and add support for new formats and options. That doesn't necessarily require the parser to be a CLI, but at this point that's looking like the most compelling option to me.