CSV Output Broken #55

Closed
gabrielfior opened this issue Sep 1, 2023 · 11 comments · Fixed by #57

Comments

@gabrielfior

gabrielfior commented Sep 1, 2023

Not sure if this is the right forum, hopefully it is.

I'm just getting started with Cryo and want to download a set of transactions, for example all txs between blocks 17042712:17042714. I tried this with two different RPCs for ETH Mainnet (Infura, Quicknode) and with both I see the error below.
Input much appreciated.

$ cryo txs  --csv -r https://greatest-skilled-seed.discover.quiknode.pro/XXXX/ -b 17042712:17042714 --requests-per-second 1
cryo parameters
───────────────
- datatypes: transactions
- network: ethereum
- min block: 17,042,712
- max block: 17,042,713
- total blocks: 2
- block chunk size: 2
- chunks to collect: 1 / 1
- max concurrent chunks: 3
- output format: csv
- output dir: /home/gabrielfior/code/cryo
- report file: .cryo/reports/2023-09-01_12-52-40.json


schema for transactions
───────────────────────
- block_number: int32
- transaction_index: int32
- transaction_hash: hex
- nonce: int32
- from_address: hex
- to_address: hex
- value_string: string
- value_f64: float64
- value_binary: binary
- input: hex
- gas_limit: uint32
- gas_price: uint64
- transaction_type: uint32
- max_priority_fee_per_gas: uint64
- max_fee_per_gas: uint64
- chain_id: uint64

sorting transactions by: block_number, transaction_index

other available columns: gas_used


collecting data
─────────────────
...done


collection summary
──────────────────
- total duration: 2.657 seconds
- t_start: 2023-09-01 12:52:40.799
- t_end:   2023-09-01 12:52:43.456
- chunks errored:   1 / 1
- chunks skipped:   0 / 1
- chunks collected: 0 / 1
- blocks collected: 0
- blocks per second: 0
- blocks per minute: 0
- blocks per hour:   0
- blocks per day:    0
Error: Some chunks failed

Location:
    crates/cli/src/main.rs:19:42
@sslivkoff changed the title from “Getting started - how can I fetch latest 20 transactions for example?” to “CSV Output Broken” Sep 1, 2023
@sslivkoff
Member

it looks like there is a problem with csv output in the latest commit

your command seems to work with cryo v0.2.0

@kskalski
Contributor

kskalski commented Sep 1, 2023

The error comes from polars csv writer:
ComputeError(ErrString("datatype [binary data] cannot be written to csv"))
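
For reference, here is a minimal standalone sketch (not cryo code; assumes polars with the csv feature enabled) that reproduces the same error by writing a DataFrame containing a Binary column to CSV:

    use polars::prelude::*;

    fn main() -> PolarsResult<()> {
        // Build a one-column DataFrame whose dtype is Binary (here via a Utf8 -> Binary cast)
        let value_binary = Series::new("value_binary", &["0011", "ffee"]).cast(&DataType::Binary)?;
        let mut df = DataFrame::new(vec![value_binary])?;

        // The CSV writer rejects the Binary column with the ComputeError quoted above
        let mut buf: Vec<u8> = Vec::new();
        CsvWriter::new(&mut buf).finish(&mut df)?;
        Ok(())
    }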

@gabrielfior
Author

I can confirm the command above works when I remove the --csv flag; however, it also panics if I try to run with --json.
Might be worth opening a new ticket.

@sslivkoff
Member

The error comes from polars csv writer: ComputeError(ErrString("datatype [binary data] cannot be written to csv"))

this probably means that the Hex output is not getting used. under normal circumstances cryo should go into Hex binary mode whenever using csv or json

@0xninjatron

0xninjatron commented Sep 4, 2023

Hi, I'm able to reproduce the error. Looks like value_binary is the issue; if I remove it then it works for me.

My debug statements on the dataframe:

    for column in df.iter() {
        println!("Column name: {}, Data type: {:?}", column.name(), column.dtype());
    }

Shows this:

    Column name: block_number, Data type: UInt64
    Column name: transaction_index, Data type: UInt64
    Column name: transaction_hash, Data type: Utf8
    Column name: nonce, Data type: UInt64
    Column name: from_address, Data type: Utf8
    Column name: to_address, Data type: Utf8
    Column name: value_binary, Data type: Binary
    Column name: value_string, Data type: Utf8
    Column name: value_f64, Data type: Float64
    Column name: input, Data type: Utf8
    Column name: gas_limit, Data type: UInt32
    Column name: gas_price, Data type: UInt64
    Column name: transaction_type, Data type: UInt32
    Column name: max_priority_fee_per_gas, Data type: UInt64
    Column name: max_fee_per_gas, Data type: UInt64
    Column name: chain_id, Data type: UInt64

If I remove value_binary it works:

    let column_names: Vec<String> = df
        .iter()
        .filter(|column| matches!(column.dtype(), DataType::Binary))
        .map(|column| column.name().to_owned())
        .collect();

    println!("filtered column_names to drop: {:#?}", column_names);

    *df = df.drop_many(&column_names);

If you must use value_binary, perhaps convert it to something polars can handle; I do not think the CSV writer is set up to handle Binary.
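
As a sketch of that kind of conversion (a hypothetical helper, not cryo's actual code; assumes polars plus the hex crate), the Binary columns could be re-encoded as 0x-prefixed hex strings instead of being dropped:

    use polars::prelude::*;

    // Hypothetical helper: replace every Binary column with a 0x-prefixed hex string column
    fn binary_columns_to_hex(df: &mut DataFrame) -> PolarsResult<()> {
        let binary_cols: Vec<String> = df
            .iter()
            .filter(|s| matches!(s.dtype(), DataType::Binary))
            .map(|s| s.name().to_owned())
            .collect();
        for name in binary_cols {
            let mut hex_col: Utf8Chunked = df
                .column(&name)?
                .binary()?
                .into_iter()
                .map(|opt| opt.map(|bytes| format!("0x{}", hex::encode(bytes))))
                .collect();
            hex_col.rename(&name);
            df.replace(&name, hex_col.into_series())?;
        }
        Ok(())
    }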

@kskalski
Contributor

kskalski commented Sep 5, 2023

The problem comes from the fact that this code

    let u256_types = if let Some(raw_u256_types) = &args.u256_types {
        let mut u256_types: HashSet<U256Type> = HashSet::new();
        for raw in raw_u256_types.iter() {
            let u256_type = match raw.to_lowercase() {
                raw if raw == "binary" => U256Type::Binary,
                raw if raw == "string" => U256Type::String,
                raw if raw == "str" => U256Type::String,
                raw if raw == "f32" => U256Type::F32,
                raw if raw == "float32" => U256Type::F32,
                raw if raw == "f64" => U256Type::F64,
                raw if raw == "float64" => U256Type::F64,
                raw if raw == "float" => U256Type::F64,
                raw if raw == "u32" => U256Type::U32,
                raw if raw == "uint32" => U256Type::U32,
                raw if raw == "u64" => U256Type::U64,
                raw if raw == "uint64" => U256Type::U64,
                raw if raw == "decimal128" => U256Type::Decimal128,
                raw if raw == "d128" => U256Type::Decimal128,
                _ => return Err(ParseError::ParseError("bad u256 type".to_string())),
            };
            u256_types.insert(u256_type);
        }
        u256_types
    } else {
        HashSet::from_iter(vec![U256Type::Binary, U256Type::String, U256Type::F64])
    };
    let binary_column_format = match args.hex | (output_format != FileFormat::Parquet) {
        true => ColumnEncoding::Hex,
        false => ColumnEncoding::Binary,
    };
    let log_decoder = match args.event_signature {
        Some(ref sig) => LogDecoder::new(sig.clone()).ok(),
        None => None,
    };
    let sort = parse_sort(&args.sort, &datatypes)?;
    let schemas: Result<HashMap<Datatype, Table>, ParseError> = datatypes
        .iter()
        .map(|datatype| {
            datatype
                .table_schema(
                    &u256_types,
will just get the u256 encoding options (or the default trio of string, binary, f64) and pass them on to create a table schema, without checking whether the output file type supports each specific encoding.

This raises a couple of questions that I have about the introduction of u256:

  • is the new default to output 3 different encodings of the same data really a good idea? this inflates the output size unnecessarily by default
  • is generating multiple encoding types for the same column really useful / important?
  • adding type suffixes to the column names is both a breaking change and also a bit contrary to the existing design where column names try to follow origin data models / field names, while encoding of column contents is controlled separately - would it be better to adopt a global option for data encoding like with --hex command line arg?
  • in the current design, would (should?) combining --hex and --u256-types=binary fix the current issue by making the "binary" actually mean "hex"?

A backward compatible way would be to have:

  • instead of --u256-types, have something like --u256-encoding=string|binary|...
  • make it default to string
  • don't change the column names to follow the encoding type, but keep the old names (e.g. value in transactions)
  • can't decide whether it's better to also have hex as another u256 encoding type, or to just let --hex automatically modify the meaning of the binary type...

@kskalski
Contributor

kskalski commented Sep 5, 2023

Actually the simplest fix seems to be modifying the u256 macro to check for global column encoding (#57). The other questions about u256 design remain, but they actually belong to a different discussion.
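
Roughly, the idea (an illustrative sketch only, not the actual #57 patch) is that the expanded u256 columns consult the globally selected binary encoding instead of hard-coding Binary:

    // Stand-ins mirroring the types from the parsing code quoted above
    enum ColumnEncoding { Hex, Binary }
    enum U256Type { Binary, String, F64 }

    // When the global encoding is Hex (csv/json output or --hex), the u256 "binary"
    // column should materialize as a hex string, which the polars csv writer accepts
    fn u256_output_dtype(t: &U256Type, encoding: &ColumnEncoding) -> &'static str {
        match (t, encoding) {
            (U256Type::Binary, ColumnEncoding::Hex) => "hex",
            (U256Type::Binary, ColumnEncoding::Binary) => "binary",
            (U256Type::String, _) => "string",
            (U256Type::F64, _) => "float64",
        }
    }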

@sslivkoff
Member

sslivkoff commented Sep 5, 2023

... from @kskalski

these are good questions. the current solution is an attempt at an optimal compromise between the constraints below

  • sometimes it will be useful to have multiple encodings. e.g. float for aggregate analytical queries vs binary/string for precise forensic accounting. if we need multiple encodings then we probably also need suffixes. a float encoding is necessary for the parquet files to be used with many native analytical tools because the commonly supported int types are too small
  • the default settings should try to be useful in a large number of circumstances, and also to be useful for users who are not advanced power users. having a default of binary/string/float64 makes it so that value is useful in many different circumstances. this comes at the cost of increased storage, however power users that want to optimize file size can either 1) alter the binary encoding beforehand or 2) drop the undesired column from the parquet files after data collection
  • a situation we want to avoid is a noob user spending N hours/days collecting a dataset and then realizing afterward that it doesn't contain the data they want. postprocessing could be used to re-encode the column to what the user actually wants, but this might be too much for a noob user. probably easier to have some amount of batteries included by default, then let power users reject the batteries if they don't want them
  • --u256-encoding might be a better name for the input arg. might even be able to use --u256 for easier typing, and it's probably semantic enough to understand from context what it means. --u256 is about the same level of explicitness as using --hex
  • backward compatibility is always nice, but cryo only came out a few months ago. backwards compatibility will become a higher priority once the data schemas are in an optimal state (I think we are getting close to this)

what do you think?

@sslivkoff
Member

proposed changes

  • remove bytes from defaults, leaving only float64 and string. I don't currently use bytes for anything that I can't get out of float64 or string. this would cut down the file size slightly
  • rename --u256-types to --u256
  • if a user specifies binary in the u256 encoding list AND --hex is being used, then use --hex

@kskalski
Contributor

kskalski commented Sep 6, 2023

Those sound good. I'm not sure about the --u256 flag, it looks a bit like a boolean gate flag, especially to someone who hasn't used it before or when it sits next to --hex; some other ideas: --u256-cols=string or --u256-as=binary.
BTW, we should allow specifying repeated args as a CSV list using https://docs.rs/clap/latest/clap/struct.Arg.html#method.value_delimiter
so it would be cryo --u256-as binary,string,f64 blocks
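
Something like this with clap's derive API (hypothetical arg name, just to illustrate value_delimiter):

    use clap::Parser;

    #[derive(Parser)]
    struct Args {
        /// u256 encodings, accepted as a comma-separated list
        #[arg(long = "u256-as", value_delimiter = ',')]
        u256_as: Vec<String>,
    }

    fn main() {
        let args = Args::parse();
        println!("{:?}", args.u256_as); // --u256-as binary,string,f64 -> ["binary", "string", "f64"]
    }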

@gabrielfior
Author

I believe this issue should be re-opened @sslivkoff. I tried the following command (exporting to CSV) and still received an error (I'm using branch main, revision 7ca0073).

ETH_RPC_URL=https://mainnet.infura.io/v3/XXX cryo txs -b 16000000:16000005 -i transaction_hash --requests-per-second 20 --csv

cryo parameters
───────────────
- datatypes: transactions
- network: ethereum
- min block: 16,000,000
- max block: 16,000,004
- total blocks: 5
- block chunk size: 5
- chunks to collect: 1 / 1
- max concurrent chunks: 3
- output format: csv
- output dir: /home/gabrielfior/code/cryo
- report file: .cryo/reports/2023-09-06_19-40-10.json


schema for transactions
───────────────────────
- block_number: int32
- transaction_index: int32
- transaction_hash: hex
- nonce: int32
- from_address: hex
- to_address: hex
- value_binary: binary
- value_string: string
- value_f64: float64
- input: hex
- gas_limit: uint32
- gas_price: uint64
- transaction_type: uint32
- max_priority_fee_per_gas: uint64
- max_fee_per_gas: uint64
- chain_id: uint64

sorting transactions by: block_number, transaction_index

other available columns: gas_used


collecting data
─────────────────
...done


collection summary
──────────────────
- total duration: 2.214 seconds
- t_start: 2023-09-06 19:40:10.197
- t_end:   2023-09-06 19:40:12.412
- chunks errored:   1 / 1
- chunks skipped:   0 / 1
- chunks collected: 0 / 1
- blocks collected: 0
- blocks per second: 0
- blocks per minute: 0
- blocks per hour:   0
- blocks per day:    0
Error: Some chunks failed

Location:
    crates/cli/src/main.rs:19:42
