Given that
a. the number of options for dataset formatting will continue to grow, and having a CLI flag for each will get ungainly, and
b. this package will primarily be used by machines,
it seems like a good idea to accept a JSON blob containing all the bits. Indeed, we pretty much have to in order to accept schemas, anyway.
There are some options here:
1. switch out the `generate` interface entirely (i.e., remove the current one): `datalogistik generate '<blob>'`
2. add a separate interface in addition: `datalogistik generate --json '<blob>'` or `datalogistik generate-json '<blob>'`
3. accept blobs for certain parameters that can get complicated: `datalogistik generate -d fanniemae -f '<blob>'` or `datalogistik generate -d fanniemae --format-json '<blob>'`
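As a rough illustration of option 3, here is how a `--format-json` flag could parse a blob straight into a dict with `argparse`. This is a hypothetical sketch, not the actual datalogistik CLI; the flag names and the example format fields are assumptions.

```python
import argparse
import json

# Hypothetical CLI skeleton: parse a JSON blob for just the format options.
parser = argparse.ArgumentParser(prog="datalogistik")
subparsers = parser.add_subparsers(dest="command")
gen = subparsers.add_parser("generate")
gen.add_argument("-d", "--dataset")
# type=json.loads turns the blob into a dict at parse time,
# so a malformed blob fails fast with a usage error.
gen.add_argument("--format-json", type=json.loads, default={})

args = parser.parse_args(
    ["generate", "-d", "fanniemae", "--format-json", '{"name": "csv", "delimiter": "|"}']
)
print(args.dataset)       # fanniemae
print(args.format_json)   # {'name': 'csv', 'delimiter': '|'}
```

One nicety of this shape: unspecified format options simply don't appear in the dict, which fits point ii below about not forcing callers to spell out fields that don't matter.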
There are tradeoffs in maintenance burden and human usability. Regardless, JSON schemas should
i. be well documented in a fashion that will stay in sync as they evolve, and
ii. not require all fields to be specified where they don't matter (e.g., chunk size for CSVs) or where the defaults are fine (chunk size for Parquet, most of the time).
Given the new `Dataset` class in #62, it probably makes sense to accept part or all of its JSON-serialized form, so we could just unpack it with `Dataset(**<blob>)` or `Dataset(name="fanniemae", format=**<blob>)`.
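To make the unpacking idea concrete: if `Dataset` is (or wraps) something dataclass-like, `**`-splatting a decoded blob into the constructor gets partial-specification for free via field defaults. The fields below are stand-ins; the real class is defined in #62.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Stand-in for the Dataset class from #62; fields are illustrative only.
@dataclass
class Dataset:
    name: str
    format: str = "parquet"            # default satisfies point ii: callers may omit it
    compression: Optional[str] = None

# A partial blob: only the fields that matter are present.
blob = '{"name": "fanniemae", "format": "csv"}'
ds = Dataset(**json.loads(blob))
print(ds)  # Dataset(name='fanniemae', format='csv', compression=None)
```

A caveat with plain `**` unpacking is that unknown keys raise `TypeError`, which is arguably the right behavior for a machine-facing interface since it surfaces blob typos immediately.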
Before picking up work on this, we should decide which option we prefer.