CSV Output Broken #55

Closed
gabrielfior opened this issue Sep 1, 2023 · 11 comments · Fixed by #57

Comments

@gabrielfior

gabrielfior commented Sep 1, 2023

Not sure if this is the right forum, hopefully it is.

I'm just getting started with Cryo and want to download a set of transactions, for example all txs between blocks 17042712:17042714. I tried this with two different RPCs for ETH Mainnet (Infura, Quicknode) and with both I see the error below.
Input much appreciated.

$ cryo txs  --csv -r https://greatest-skilled-seed.discover.quiknode.pro/XXXX/ -b 17042712:17042714 --requests-per-second 1
cryo parameters
───────────────
- datatypes: transactions
- network: ethereum
- min block: 17,042,712
- max block: 17,042,713
- total blocks: 2
- block chunk size: 2
- chunks to collect: 1 / 1
- max concurrent chunks: 3
- output format: csv
- output dir: /home/gabrielfior/code/cryo
- report file: .cryo/reports/2023-09-01_12-52-40.json


schema for transactions
───────────────────────
- block_number: int32
- transaction_index: int32
- transaction_hash: hex
- nonce: int32
- from_address: hex
- to_address: hex
- value_string: string
- value_f64: float64
- value_binary: binary
- input: hex
- gas_limit: uint32
- gas_price: uint64
- transaction_type: uint32
- max_priority_fee_per_gas: uint64
- max_fee_per_gas: uint64
- chain_id: uint64

sorting transactions by: block_number, transaction_index

other available columns: gas_used


collecting data
─────────────────
...done


collection summary
──────────────────
- total duration: 2.657 seconds
- t_start: 2023-09-01 12:52:40.799
- t_end:   2023-09-01 12:52:43.456
- chunks errored:   1 / 1
- chunks skipped:   0 / 1
- chunks collected: 0 / 1
- blocks collected: 0
- blocks per second: 0
- blocks per minute: 0
- blocks per hour:   0
- blocks per day:    0
Error: Some chunks failed

Location:
    crates/cli/src/main.rs:19:42
@sslivkoff changed the title from “Getting started - how can I fetch latest 20 transactions for example?” to “CSV Output Broken” Sep 1, 2023
@sslivkoff
Member

it looks like there is a problem with csv output in the latest commit

your command seems to work with cryo v0.2.0

@kskalski
Contributor

kskalski commented Sep 1, 2023

The error comes from polars csv writer:
ComputeError(ErrString("datatype [binary data] cannot be written to csv"))
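
For reference, here is a minimal standalone sketch (not cryo code; assumes polars with the csv feature enabled) that reproduces the same error by writing a DataFrame containing a Binary column to CSV:

    use polars::prelude::*;

    fn main() -> PolarsResult<()> {
        // Build a one-column DataFrame whose dtype is Binary (here via a Utf8 -> Binary cast)
        let value_binary = Series::new("value_binary", &["0011", "ffee"]).cast(&DataType::Binary)?;
        let mut df = DataFrame::new(vec![value_binary])?;

        // The CSV writer rejects the Binary column with the ComputeError quoted above
        let mut buf: Vec<u8> = Vec::new();
        CsvWriter::new(&mut buf).finish(&mut df)?;
        Ok(())
    }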

@gabrielfior
Author

I can confirm the command above works when I remove the --csv flag; however, it also panics if I try to run with --json.
Might be worth opening a new ticket.

@sslivkoff
Member

The error comes from polars csv writer: ComputeError(ErrString("datatype [binary data] cannot be written to csv"))

this probably means that the Hex output is not getting used. under normal circumstances cryo should go into Hex binary mode whenever using csv or json

@0xninjatron

0xninjatron commented Sep 4, 2023

Hi, I'm able to reproduce the error. Looks like value_binary is the issue; if I remove it then it works for me.

My debug statements on the dataframe:

    for column in df.iter() {
        println!("Column name: {}, Data type: {:?}", column.name(), column.dtype());
    }

Shows this:

    Column name: block_number, Data type: UInt64
    Column name: transaction_index, Data type: UInt64
    Column name: transaction_hash, Data type: Utf8
    Column name: nonce, Data type: UInt64
    Column name: from_address, Data type: Utf8
    Column name: to_address, Data type: Utf8
    Column name: value_binary, Data type: Binary
    Column name: value_string, Data type: Utf8
    Column name: value_f64, Data type: Float64
    Column name: input, Data type: Utf8
    Column name: gas_limit, Data type: UInt32
    Column name: gas_price, Data type: UInt64
    Column name: transaction_type, Data type: UInt32
    Column name: max_priority_fee_per_gas, Data type: UInt64
    Column name: max_fee_per_gas, Data type: UInt64
    Column name: chain_id, Data type: UInt64

If I remove value_binary it works:

    let column_names: Vec<String> = df
        .iter()
        .filter(|column| matches!(column.dtype(), DataType::Binary))
        .map(|column| column.name().to_owned())
        .collect();

    println!("filtered column_names to drop: {:#?}", column_names);

    *df = df.drop_many(&column_names);

If you must use value_binary, perhaps convert it to something polars can handle; I do not think the CSV writer is set up to handle Binary.
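
As a sketch of that kind of conversion (a hypothetical helper, not cryo's actual code; assumes polars plus the hex crate), the Binary columns could be re-encoded as 0x-prefixed hex strings instead of being dropped:

    use polars::prelude::*;

    // Hypothetical helper: replace every Binary column with a 0x-prefixed hex string column
    fn binary_columns_to_hex(df: &mut DataFrame) -> PolarsResult<()> {
        let binary_cols: Vec<String> = df
            .iter()
            .filter(|s| matches!(s.dtype(), DataType::Binary))
            .map(|s| s.name().to_owned())
            .collect();
        for name in binary_cols {
            let mut hex_col: Utf8Chunked = df
                .column(&name)?
                .binary()?
                .into_iter()
                .map(|opt| opt.map(|bytes| format!("0x{}", hex::encode(bytes))))
                .collect();
            hex_col.rename(&name);
            df.replace(&name, hex_col.into_series())?;
        }
        Ok(())
    }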

@kskalski
Contributor

kskalski commented Sep 5, 2023

The problem comes from the fact that this code

    let u256_types = if let Some(raw_u256_types) = &args.u256_types {
        let mut u256_types: HashSet<U256Type> = HashSet::new();
        for raw in raw_u256_types.iter() {
            let u256_type = match raw.to_lowercase() {
                raw if raw == "binary" => U256Type::Binary,
                raw if raw == "string" => U256Type::String,
                raw if raw == "str" => U256Type::String,
                raw if raw == "f32" => U256Type::F32,
                raw if raw == "float32" => U256Type::F32,
                raw if raw == "f64" => U256Type::F64,
                raw if raw == "float64" => U256Type::F64,
                raw if raw == "float" => U256Type::F64,
                raw if raw == "u32" => U256Type::U32,
                raw if raw == "uint32" => U256Type::U32,
                raw if raw == "u64" => U256Type::U64,
                raw if raw == "uint64" => U256Type::U64,
                raw if raw == "decimal128" => U256Type::Decimal128,
                raw if raw == "d128" => U256Type::Decimal128,
                _ => return Err(ParseError::ParseError("bad u256 type".to_string())),
            };
            u256_types.insert(u256_type);
        }
        u256_types
    } else {
        HashSet::from_iter(vec![U256Type::Binary, U256Type::String, U256Type::F64])
    };
    let binary_column_format = match args.hex | (output_format != FileFormat::Parquet) {
        true => ColumnEncoding::Hex,
        false => ColumnEncoding::Binary,
    };
    let log_decoder = match args.event_signature {
        Some(ref sig) => LogDecoder::new(sig.clone()).ok(),
        None => None,
    };
    let sort = parse_sort(&args.sort, &datatypes)?;
    let schemas: Result<HashMap<Datatype, Table>, ParseError> = datatypes
        .iter()
        .map(|datatype| {
            datatype
                .table_schema(
                    &u256_types,
will just get the u256 encoding options (or the default trio of string, binary, f64) and pass them on to create a table schema, without checking whether the output file type supports each specific encoding.

This raises a couple of questions that I have about the introduction of u256:

  • is the new default to output 3 different encodings of the same data really a good idea? this inflates the output size unnecessarily by default
  • is generating multiple encoding types for the same column really useful / important?
  • adding type suffixes to the column names is both a breaking change and also a bit contrary to the existing design where column names try to follow origin data models / field names, while encoding of column contents is controlled separately - would it be better to adopt a global option for data encoding like with --hex command line arg?
  • in the current design, would (should?) combining --hex and --u256-types=binary fix the current issue by making the "binary" actually mean "hex"?

A backward compatible way would be to have:

  • instead of --u256-types, have something like --u256-encoding=string|binary|...
  • make it default to string
  • don't change the column names to follow the encoding type, but keep the old names (e.g. value in transactions)
  • can't decide whether it's better to also have hex as another u256 encoding type, or to just let --hex automatically modify the meaning of the binary type...

@kskalski
Contributor

kskalski commented Sep 5, 2023

Actually the simplest fix seems to be modifying the u256 macro to check for global column encoding (#57). The other questions about u256 design remain, but they actually belong to a different discussion.
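
Roughly, the idea (an illustrative sketch only, not the actual #57 patch) is that the expanded u256 columns consult the globally selected binary encoding instead of hard-coding Binary:

    // Stand-ins mirroring the types from the parsing code quoted above
    enum ColumnEncoding { Hex, Binary }
    enum U256Type { Binary, String, F64 }

    // When the global encoding is Hex (csv/json output or --hex), the u256 "binary"
    // column should materialize as a hex string, which the polars csv writer accepts
    fn u256_output_dtype(t: &U256Type, encoding: &ColumnEncoding) -> &'static str {
        match (t, encoding) {
            (U256Type::Binary, ColumnEncoding::Hex) => "hex",
            (U256Type::Binary, ColumnEncoding::Binary) => "binary",
            (U256Type::String, _) => "string",
            (U256Type::F64, _) => "float64",
        }
    }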

@sslivkoff
Member

sslivkoff commented Sep 5, 2023

... from @kskalski

these are good questions. the current solution is an attempt at an optimal compromise between the constraints below

  • sometimes it will be useful to have multiple encodings. e.g. float for aggregate analytical queries vs binary/string for precise forensic accounting. if we need multiple encodings then we probably also need suffixes. a float encoding is necessary for the parquet files to be used with many native analytical tools because the commonly supported int types are too small
  • the default settings should try to be useful in a large number of circumstances, and also to be useful for users who are not advanced power users. having a default of binary/string/float64 makes it so that value is useful in many different circumstances. this comes at the cost of increased storage, however power users that want to optimize file size can either 1) alter the binary encoding beforehand or 2) drop the undesired column from the parquet files after data collection
  • a situation we want to avoid is a noob user spending N hours/days collecting a dataset and then realizing afterward that it doesn't contain the data they want. postprocessing could be used to re-encode the column to what the user actually wants, but this might be too much for a noob user. probably easier to have some amount of batteries included by default, then let power users reject the batteries if they don't want them
  • --u256-encoding might be a better name for the input arg. might even be able to use --u256 for easier typing, and it's probably semantic enough to understand from context what it means. --u256 is about the same level of explicitness as using --hex
  • backward compatibility is always nice, but cryo only came out a few months ago. backwards compatibility will become a higher priority once the data schemas are in an optimal state (I think we are getting close to this)

what do you think?

@sslivkoff
Member

proposed changes

  • remove bytes from defaults, leaving only float64 and string. I don't currently use bytes for anything that I can't get out of float64 or string. this would cut down the file size slightly
  • rename --u256-types to --u256
  • if a user specifies binary in the u256 encoding list AND --hex is being used, then use --hex

@kskalski
Contributor

kskalski commented Sep 6, 2023

Those sound good. I'm not sure about the --u256 flag, it looks a bit like a boolean gate flag, especially to someone who hasn't used it before or when it sits next to --hex; some other ideas: --u256-cols=string or --u256-as=binary.
BTW, we should allow specifying repeated args as a CSV list using https://docs.rs/clap/latest/clap/struct.Arg.html#method.value_delimiter
so it would be cryo --u256-as binary,string,f64 blocks
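
Something like this with clap's derive API (hypothetical arg name, just to illustrate value_delimiter):

    use clap::Parser;

    #[derive(Parser)]
    struct Args {
        /// u256 encodings, accepted as a comma-separated list
        #[arg(long = "u256-as", value_delimiter = ',')]
        u256_as: Vec<String>,
    }

    fn main() {
        let args = Args::parse();
        println!("{:?}", args.u256_as); // --u256-as binary,string,f64 -> ["binary", "string", "f64"]
    }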

@gabrielfior
Author

I believe this issue should be re-opened @sslivkoff. I tried the following command (exporting to CSV) and still received an error (I'm using branch main, revision 7ca0073).

ETH_RPC_URL=https://mainnet.infura.io/v3/XXX cryo txs -b 16000000:16000005 -i transaction_hash --requests-per-second 20 --csv

cryo parameters
───────────────
- datatypes: transactions
- network: ethereum
- min block: 16,000,000
- max block: 16,000,004
- total blocks: 5
- block chunk size: 5
- chunks to collect: 1 / 1
- max concurrent chunks: 3
- output format: csv
- output dir: /home/gabrielfior/code/cryo
- report file: .cryo/reports/2023-09-06_19-40-10.json


schema for transactions
───────────────────────
- block_number: int32
- transaction_index: int32
- transaction_hash: hex
- nonce: int32
- from_address: hex
- to_address: hex
- value_binary: binary
- value_string: string
- value_f64: float64
- input: hex
- gas_limit: uint32
- gas_price: uint64
- transaction_type: uint32
- max_priority_fee_per_gas: uint64
- max_fee_per_gas: uint64
- chain_id: uint64

sorting transactions by: block_number, transaction_index

other available columns: gas_used


collecting data
─────────────────
...done


collection summary
──────────────────
- total duration: 2.214 seconds
- t_start: 2023-09-06 19:40:10.197
- t_end:   2023-09-06 19:40:12.412
- chunks errored:   1 / 1
- chunks skipped:   0 / 1
- chunks collected: 0 / 1
- blocks collected: 0
- blocks per second: 0
- blocks per minute: 0
- blocks per hour:   0
- blocks per day:    0
Error: Some chunks failed

Location:
    crates/cli/src/main.rs:19:42
