Add British Library books dataset (#3603)
* loading script draft

* improve config naming

* move parsing code into function

* fix type hints

* fix default config name

* fix typo

Co-authored-by: Quentin Lhoest <[email protected]>

* add header

Co-authored-by: Quentin Lhoest <[email protected]>

* remove readlines call

Co-authored-by: Quentin Lhoest <[email protected]>

* update copyright date

* add citation to README

* update citation key

* update citation key

* add contact details

* add URLs to configs

* add url

* black formatting

* add config options to readme

* generate dataset_infos

* add dummy data

* fix tags

* Update datasets/blbooks/README.md

Co-authored-by: Quentin Lhoest <[email protected]>
3 people authored Jan 31, 2022
1 parent 6c89c96 commit 4c417d5
Showing 4 changed files with 781 additions and 0 deletions.

1 comment on commit 4c417d5

@github-actions

PyArrow==3.0.0

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010361 | 0.011353 | -0.000992 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004780 | 0.011008 | -0.006228 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.039831 | 0.038508 | 0.001323 |
| read_batch_unformated after write_array2d | 0.036507 | 0.023109 | 0.013398 |
| read_batch_unformated after write_flattened_sequence | 0.346391 | 0.275898 | 0.070493 |
| read_batch_unformated after write_nested_sequence | 0.375956 | 0.323480 | 0.052476 |
| read_col_formatted_as_numpy after write_array2d | 0.007648 | 0.007986 | -0.000337 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005414 | 0.004328 | 0.001085 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009900 | 0.004250 | 0.005650 |
| read_col_unformated after write_array2d | 0.039364 | 0.037052 | 0.002312 |
| read_col_unformated after write_flattened_sequence | 0.346798 | 0.258489 | 0.088308 |
| read_col_unformated after write_nested_sequence | 0.375626 | 0.293841 | 0.081785 |
| read_formatted_as_numpy after write_array2d | 0.043398 | 0.128546 | -0.085148 |
| read_formatted_as_numpy after write_flattened_sequence | 0.014310 | 0.075646 | -0.061336 |
| read_formatted_as_numpy after write_nested_sequence | 0.288432 | 0.419271 | -0.130839 |
| read_unformated after write_array2d | 0.064930 | 0.043533 | 0.021397 |
| read_unformated after write_flattened_sequence | 0.344998 | 0.255139 | 0.089859 |
| read_unformated after write_nested_sequence | 0.348630 | 0.283200 | 0.065430 |
| write_array2d | 0.114255 | 0.141683 | -0.027428 |
| write_flattened_sequence | 1.953525 | 1.452155 | 0.501370 |
| write_nested_sequence | 2.036390 | 1.492716 | 0.543674 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.338470 | 0.018006 | 0.320464 |
| get_batch_of_1024_rows | 0.533627 | 0.000490 | 0.533138 |
| get_first_row | 0.048015 | 0.000200 | 0.047815 |
| get_last_row | 0.000739 | 0.000054 | 0.000684 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.042112 | 0.037411 | 0.004700 |
| shard | 0.027665 | 0.014526 | 0.013139 |
| shuffle | 0.033448 | 0.176557 | -0.143109 |
| sort | 0.075411 | 0.737135 | -0.661725 |
| train_test_split | 0.046408 | 0.296338 | -0.249931 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.552801 | 0.215209 | 0.337592 |
| read 50000 | 5.599749 | 2.077655 | 3.522094 |
| read_batch 50000 10 | 2.259652 | 1.504120 | 0.755532 |
| read_batch 50000 100 | 1.911626 | 1.541195 | 0.370431 |
| read_batch 50000 1000 | 1.901316 | 1.468490 | 0.432826 |
| read_formatted numpy 5000 | 0.691034 | 4.584777 | -3.893743 |
| read_formatted pandas 5000 | 6.638672 | 3.745712 | 2.892959 |
| read_formatted tensorflow 5000 | 3.134887 | 5.269862 | -2.134975 |
| read_formatted torch 5000 | 1.611833 | 4.565676 | -2.953843 |
| read_formatted_batch numpy 5000 10 | 0.084587 | 0.424275 | -0.339688 |
| read_formatted_batch numpy 5000 1000 | 0.014531 | 0.007607 | 0.006924 |
| shuffled read 5000 | 0.731351 | 0.226044 | 0.505307 |
| shuffled read 50000 | 7.883939 | 2.268929 | 5.615010 |
| shuffled read_batch 50000 10 | 3.047520 | 55.444624 | -52.397104 |
| shuffled read_batch 50000 100 | 2.419689 | 6.876477 | -4.456788 |
| shuffled read_batch 50000 1000 | 2.416134 | 2.142072 | 0.274062 |
| shuffled read_formatted numpy 5000 | 0.890184 | 4.805227 | -3.915043 |
| shuffled read_formatted_batch numpy 5000 10 | 0.173913 | 6.500664 | -6.326751 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.070673 | 0.075469 | -0.004796 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.807622 | 1.841788 | -0.034166 |
| map fast-tokenizer batched | 15.535543 | 8.074308 | 7.461235 |
| map identity | 44.617659 | 10.191392 | 34.426267 |
| map identity batched | 1.179021 | 0.680424 | 0.498597 |
| map no-op batched | 0.650141 | 0.534201 | 0.115940 |
| map no-op batched numpy | 0.564323 | 0.579283 | -0.014960 |
| map no-op batched pandas | 0.700177 | 0.434364 | 0.265814 |
| map no-op batched pytorch | 0.378618 | 0.540337 | -0.161720 |
| map no-op batched tensorflow | 0.394635 | 1.386936 | -0.992301 |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009029 | 0.011353 | -0.002324 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.007192 | 0.011008 | -0.003816 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.033446 | 0.038508 | -0.005062 |
| read_batch_unformated after write_array2d | 0.033247 | 0.023109 | 0.010138 |
| read_batch_unformated after write_flattened_sequence | 0.367490 | 0.275898 | 0.091592 |
| read_batch_unformated after write_nested_sequence | 0.356706 | 0.323480 | 0.033226 |
| read_col_formatted_as_numpy after write_array2d | 0.006336 | 0.007986 | -0.001649 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003922 | 0.004328 | -0.000407 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007877 | 0.004250 | 0.003627 |
| read_col_unformated after write_array2d | 0.035214 | 0.037052 | -0.001839 |
| read_col_unformated after write_flattened_sequence | 0.330208 | 0.258489 | 0.071719 |
| read_col_unformated after write_nested_sequence | 0.365098 | 0.293841 | 0.071257 |
| read_formatted_as_numpy after write_array2d | 0.043389 | 0.128546 | -0.085158 |
| read_formatted_as_numpy after write_flattened_sequence | 0.012936 | 0.075646 | -0.062710 |
| read_formatted_as_numpy after write_nested_sequence | 0.280270 | 0.419271 | -0.139002 |
| read_unformated after write_array2d | 0.070812 | 0.043533 | 0.027279 |
| read_unformated after write_flattened_sequence | 0.341617 | 0.255139 | 0.086478 |
| read_unformated after write_nested_sequence | 0.373899 | 0.283200 | 0.090700 |
| write_array2d | 0.107289 | 0.141683 | -0.034394 |
| write_flattened_sequence | 1.967412 | 1.452155 | 0.515258 |
| write_nested_sequence | 1.992520 | 1.492716 | 0.499804 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.269765 | 0.018006 | 0.251759 |
| get_batch_of_1024_rows | 0.518831 | 0.000490 | 0.518341 |
| get_first_row | 0.000693 | 0.000200 | 0.000493 |
| get_last_row | 0.000091 | 0.000054 | 0.000037 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.035021 | 0.037411 | -0.002390 |
| shard | 0.023836 | 0.014526 | 0.009310 |
| shuffle | 0.030891 | 0.176557 | -0.145666 |
| sort | 0.076910 | 0.737135 | -0.660226 |
| train_test_split | 0.036604 | 0.296338 | -0.259735 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.598111 | 0.215209 | 0.382902 |
| read 50000 | 5.884815 | 2.077655 | 3.807160 |
| read_batch 50000 10 | 2.213402 | 1.504120 | 0.709282 |
| read_batch 50000 100 | 1.872340 | 1.541195 | 0.331145 |
| read_batch 50000 1000 | 1.890033 | 1.468490 | 0.421543 |
| read_formatted numpy 5000 | 0.712668 | 4.584777 | -3.872109 |
| read_formatted pandas 5000 | 6.389394 | 3.745712 | 2.643682 |
| read_formatted tensorflow 5000 | 2.861361 | 5.269862 | -2.408501 |
| read_formatted torch 5000 | 1.478328 | 4.565676 | -3.087349 |
| read_formatted_batch numpy 5000 10 | 0.083284 | 0.424275 | -0.340991 |
| read_formatted_batch numpy 5000 1000 | 0.058594 | 0.007607 | 0.050987 |
| shuffled read 5000 | 0.748062 | 0.226044 | 0.522017 |
| shuffled read 50000 | 7.402994 | 2.268929 | 5.134066 |
| shuffled read_batch 50000 10 | 3.045604 | 55.444624 | -52.399021 |
| shuffled read_batch 50000 100 | 2.255251 | 6.876477 | -4.621226 |
| shuffled read_batch 50000 1000 | 2.228919 | 2.142072 | 0.086847 |
| shuffled read_formatted numpy 5000 | 0.887659 | 4.805227 | -3.917568 |
| shuffled read_formatted_batch numpy 5000 10 | 0.172469 | 6.500664 | -6.328195 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.068655 | 0.075469 | -0.006814 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.912016 | 1.841788 | 0.070229 |
| map fast-tokenizer batched | 15.133940 | 8.074308 | 7.059632 |
| map identity | 41.568732 | 10.191392 | 31.377340 |
| map identity batched | 1.060358 | 0.680424 | 0.379935 |
| map no-op batched | 0.596030 | 0.534201 | 0.061829 |
| map no-op batched numpy | 0.548450 | 0.579283 | -0.030833 |
| map no-op batched pandas | 0.730354 | 0.434364 | 0.295990 |
| map no-op batched pytorch | 0.378266 | 0.540337 | -0.162072 |
| map no-op batched tensorflow | 0.394699 | 1.386936 | -0.992237 |
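
Each benchmark value above is reported as a new timing, an old timing, and their difference. As a rough sanity check, the sketch below parses a row written in the raw `new / old (diff)` form and confirms that each diff equals new minus old up to rounding. The helper names `parse_row` and `is_consistent` and the tolerance are illustrative assumptions, not part of the benchmark tooling; the sample row is the `benchmark_getitem_100B.json` result for PyArrow==3.0.0.

```python
import re

# One "new / old (diff)" entry, e.g. "0.338470 / 0.018006 (0.320464)".
ENTRY = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_row(row: str) -> list[tuple[float, float, float]]:
    """Return a list of (new, old, diff) tuples found in a benchmark row."""
    return [tuple(float(x) for x in m.groups()) for m in ENTRY.finditer(row)]


def is_consistent(new: float, old: float, diff: float, tol: float = 2e-6) -> bool:
    """Check diff == new - old, allowing for values rounded to six decimals."""
    return abs((new - old) - diff) <= tol


# Hypothetical usage with the getitem row reported for PyArrow==3.0.0.
row = (
    "0.338470 / 0.018006 (0.320464) 0.533627 / 0.000490 (0.533138) "
    "0.048015 / 0.000200 (0.047815) 0.000739 / 0.000054 (0.000684)"
)
entries = parse_row(row)
for new, old, diff in entries:
    print(f"new={new:.6f} old={old:.6f} diff={diff:+.6f} ok={is_consistent(new, old, diff)}")
```

The tolerance is needed because the reported values appear to be rounded independently, so a recomputed difference can be off by one unit in the last printed digit.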
