This discussion was initially started in our Slack channel, but I wanted to move it here before we got too far along. I've copied the response thread from Slack into this comment so that we don't lose context.
---
I've done some more thinking on the inputs needed by the parser CLI, and I think I've arrived at a workable concrete proposal. Feedback appreciated.

First, here's an attempt to enumerate all the inputs that the parser will need, broken down by what's known at different times:

- What the user provides as part of the connector configuration
- What the user provides as part of their collection
- What the connector provides at runtime

Of course, technically all those inputs are provided by the connector, but the connector-provided properties are special because they can often only be known by the connector. For example, if you capture from an S3 prefix, the connector won't know the filenames and content types until it sees the objects appear in the bucket.

A concrete proposal

So the thought here is that these three groups ought to each be provided to the parser CLI separately, to avoid the connectors having to do a bunch of munging to extract the relevant sections of their config and catalog. Here's a proposal for how that might work. Please feel free to give feedback or alternative ideas.

The "connector configuration" group is provided as a JSON object in the Flow catalog, as part of the connector configuration JSON, and is simply passed to the connector as part of its normal configuration. The connector then invokes the parser CLI, handing it that configuration along with the collection's projections and the metadata that's only known at runtime. Putting it all together, an invocation might look like the sketch below.
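As a purely illustrative sketch (the binary name, flags, and paths here are my assumptions, not part of the proposal):

```console
# Hypothetical parser invocation from inside a connector.
# --config:       parser configuration extracted from the user-provided
#                 connector configuration JSON
# --projections:  projections taken from the user's collection
# --content-type: runtime metadata that only the connector knows
parser parse \
  --config=/tmp/parser-config.json \
  --projections=/tmp/projections.json \
  --content-type=text/csv \
  /tmp/downloaded-s3-object
```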
Note on projections

Projections are an easy and powerful way for users to control the mapping between JSON and tabular formats, but they're also completely optional. In Flow, projections are automatically inferred from the JSON schema, and the intent is for the parser CLI to do exactly the same thing. The docs on projections can be found here.
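To make that concrete, here's a hypothetical set of projections a user might declare on their collection; the field names and pointers are made up for illustration. Each projection maps a tabular field name to a location within the JSON document:

```json
{
  "projections": {
    "order_id": "/id",
    "customer_name": "/customer/name"
  }
}
```

Given these, a CSV header named `order_id` would parse into the document location `/id` instead of a top-level `order_id` property.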
---
As we build connectors, one of the things that we'll need along with them is parsers. The Airbyte protocol seems pretty opinionated that parsing happens internally within each connector. For example, each connector is expected to output JSON documents, so if a CSV file is being imported, then the connector must be responsible for converting the CSV to JSON. This approach is pretty reasonable, but it does have some drawbacks.
Things to watch out for
The main drawback is the potential for inconsistency of parsers in different connectors. We want all our connectors to parse each file format in the same way, and we want the parser configuration to look more or less the same for each connector where it makes sense. That requires a factoring where all our connectors can use the same underlying code for parsing.
An easily forgotten point here is that error reporting should also be pretty consistent. Bad error reporting can have a profound negative impact on the user experience, and any capture that does parsing is relatively likely to encounter parse errors. For example, ask a user how their CSV formatter handles quote characters within strings. Does it double the characters to escape them (`"` becomes `""`), or does it use backslash escape sequences (`"` becomes `\"`)? Most people would say something like "uhh, I'll have to check". Users want to just give it a try with the defaults, and then see if it breaks. But that workflow requires that we consistently provide descriptive error messages with location information.
Also on the topic of parse errors, it can be extremely useful to have parsers available locally, apart from connectors, during development and debugging. A common use case is to pull a problematic file down locally, so you can test with different parser configurations and see what works. This doesn't necessarily require that parsers are decoupled from connectors, of course, since we could just have a "local disk" connector for use during development (this was Paxata's approach). But it's a pretty common use case that can have a pretty big impact on user experience, so I think the main takeaway is that the debugging experience should be pretty frictionless overall, however we approach it.
Suggested interface
In terms of an interface, I think we'd want something like the following:
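Something roughly like this, where the binary name and flags are placeholders rather than a settled design:

```console
# Sketch only: parse a file, writing one JSON document per line to stdout
# (the output behavior is an assumption on my part).
parser parse --format=csv --config=parser-config.json ./path/to/input.csv
```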
The `format` argument is optional, and allows users to manually specify the format of files. The way Fivetran handles this is, I think, pretty good: if the user specifies the format, then Fivetran won't try to detect it, and will just use whatever they specified. But if the format is blank, then they'll run detection on each file and try to do the right thing automatically. This works pretty well, since they also let you easily filter your files. So you could configure the connector to only import `.*.whatever` files and manually set the format to have them parsed however you want.
The `config` argument is a structure with configuration for all possible formats, for example:
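Something along these lines, say; all of the property names below are illustrative rather than a settled schema:

```json
{
  "csv": {
    "delimiter": ",",
    "quote": "\"",
    "escape": "\\",
    "headers": null
  },
  "jsonl": {},
  "compression": null
}
```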
Putting all the format configuration into one object works well for cases where we may want to capture files of different formats, and where we want to detect formats automatically where possible. We should have reasonable defaults for all the fields, which will oftentimes be inferred by inspecting the actual file contents. So I don't expect that most users will have to set a bunch of these options. But we should anticipate that users will have directories with multiple file types, and handle that automatically when possible.
In addition to actual parsing options, I think we'll also want some consistent options for automatically adding fields to the output JSON. For example, to add the original filename at `/_meta/sourceFilename`, or the line number at `/_meta/sourceLineNumber`. I could imagine use cases for adding other bits of information to each document, for example the name of the Kinesis stream, or something else to differentiate one capture source from another. This is probably functionality we can add later, as long as our design can accommodate it.
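For example, a parsed CSV row might come out looking something like this (the `/_meta` pointers match the examples above; everything else is invented for illustration):

```json
{
  "id": 7,
  "name": "an example row",
  "_meta": {
    "sourceFilename": "inbox/2021-06-01.csv",
    "sourceLineNumber": 42
  }
}
```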
Suggested implementation
An idea that I'm really liking is to implement our parsers as a CLI, probably in Rust. This makes it easy to have that consistency, even when connectors are written in different languages. And using Rust is appealing in part because it gives us easy access to our JSON schema inference code, and because it's generally very well suited to efficient and safe parsing.
Another potential benefit is that such a parser could theoretically also be adopted by Airbyte themselves, providing greater consistency across the entire connector ecosystem. A statically linked binary would be pretty easy to use with any language and docker image, so this could actually be fairly realistic to achieve if there's any interest from their team.
Putting it all together, we'd end up with a consistent parser config structure that could be embedded in any connector configuration structure. So our S3 configuration might look something like:
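Here's a rough, hypothetical sketch; the property names are illustrative, with the parser config from above embedded under a `parser` property:

```json
{
  "awsRegion": "us-east-1",
  "bucket": "example-bucket",
  "prefix": "incoming/",
  "matchKeys": ".*\\.csv",
  "parser": {
    "format": "csv",
    "csv": {
      "delimiter": ","
    }
  }
}
```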
Right now we really just have the S3 connector that will need to do parsing, so I think we'll want to keep things as simple as possible to start out. But having a very loose coupling between the connectors and the parser will make it easy to update and add support for new formats and options. That doesn't necessarily require the parser to be a CLI, but at this point that's looking like the most compelling option to me.