
1.5.0-rc1 discussions #333

Open
zevv opened this issue Sep 5, 2024 · 14 comments
@zevv
Owner

zevv commented Sep 5, 2024

Hi @l8gravely,

Lacking a mailing list or forum, I took the liberty of opening an issue for discussing the 1.5.0-rc1 release. I'll use this to jot down some notes in no particular order; feel free to ignore or answer as you please :)

New database format

The default db format moved from tokyocabinet to tkrzw. It's about time we left tokyocabinet behind, as it's just not stable and very much unmaintained. Some comments, though:

  • After upgrading, duc gets confused by the new db format and throws the message Error opening: /home/ico/.cache/duc/duc.db - Database corrupt and not usable. Maybe we could add a hint here that the database format might not match the current version and that the user should clean it up and re-index.
  • Apart from the performance, the small size of the database files was the main reason I went with tokyocabinet years ago, as it manages to create considerably smaller databases than the other engines. On my little test setup, the new duc ends up with a database nearly 9 times larger than before: 71M before vs 590M now, just to index my home dir. I have not tested on humongous file systems since I do not have any, so I have no clue what this looks like when scaling up.
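On the first point, a friendlier error could come from sniffing the file's leading magic bytes before handing the path to the backend. A minimal Python sketch of the idea; the magic strings and the open_hint helper are assumptions for illustration, not duc's actual code:

```python
# Hypothetical sketch: detect the on-disk format from the file header so a
# mismatched database produces a hint instead of a generic "corrupt" error.
# Both magic strings are assumptions about the two formats' headers.
TOKYOCABINET_MAGIC = b"ToKyO CaBiNeT"   # assumed tokyocabinet file header
TKRZW_HASH_MAGIC = b"TkrzwHDB"          # assumed tkrzw HashDBM file header

def sniff_db_format(header: bytes) -> str:
    """Best-effort guess at the database format from its first bytes."""
    if header.startswith(TOKYOCABINET_MAGIC):
        return "tokyocabinet"
    if header.startswith(TKRZW_HASH_MAGIC):
        return "tkrzw"
    return "unknown"

def open_hint(path: str, expected: str = "tkrzw") -> str:
    """Return a helpful message when the db format doesn't match this build."""
    with open(path, "rb") as f:
        found = sniff_db_format(f.read(16))
    if found not in (expected, "unknown"):
        return (f"Error opening {path}: database is in {found} format, but this "
                f"duc build expects {expected}. Remove it and re-run 'duc index'.")
    return "ok"
```

A check like this costs one small read and would only fire on the upgrade path described above.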

topn command

No comments - very useful and a welcome addition!

histogram support

Still early work on the reporting side I see, but at least the info is already there in the db, nice.

@l8gravely
Collaborator

I'll fix the DB checking issues, I thought I had already gotten that right, but obviously I missed something. Was the old DB on tokyocabinet format? I'll run some tests and see what I need to do.

@l8gravely l8gravely self-assigned this Sep 5, 2024
@bougui

bougui commented Oct 9, 2024

Hello @zevv, I will compile the new version and test it next week on our setup, where we index ~120 TB.

Questions:

  • Will the new version also include a user option to get the usage of a specific user only?

  • Also, I was told by a colleague (I have not tested it yet) that if we run duc for a specific user, say bob, duc will not descend into a subfolder that is not owned by bob, even though bob could have files inside it.
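The behavior the colleague describes would follow from pruning by owner at the directory level rather than filtering per file while still descending. A small Python sketch of the difference, with owners simulated in a dict since chown needs root (purely illustrative, not duc's traversal code):

```python
# A toy tree mapping paths to owners. /data/shared is owned by alice but
# contains a file owned by bob.
tree = {
    "/data":                 "root",
    "/data/bob.txt":         "bob",
    "/data/shared":          "alice",
    "/data/shared/bob2.txt": "bob",
}

def children(path):
    """Direct children of a path in the toy tree."""
    prefix = path.rstrip("/") + "/"
    return [p for p in tree if p.startswith(prefix) and "/" not in p[len(prefix):]]

def scan(path, user, prune_dirs):
    """Collect files owned by user. If prune_dirs, skip dirs the user doesn't own."""
    found = []
    for child in children(path):
        if children(child):                       # it's a directory
            if prune_dirs and tree[child] != user:
                continue                          # subtree skipped entirely
            found += scan(child, user, prune_dirs)
        elif tree[child] == user:
            found.append(child)
    return found
```

With pruning, bob's file under /data/shared is never counted; per-file filtering finds it, at the cost of walking the whole tree.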

TIA

@l8gravely
Collaborator

l8gravely commented Oct 9, 2024 via email

@bougui

bougui commented Oct 9, 2024

Hello John,

since DB size is increasing with the new version, and since duc must be usable on large filesystems with more than one user, I would definitely add an option to keep per-user sizes if requested as an argument ;-) And since we have lots of space to check, we should have the room to keep a larger DB.

@l8gravely
Collaborator

l8gravely commented Oct 10, 2024 via email

@stuartthebruce

Is there an option in version 1.5 to specify the different compressors supported by tkrzw?

And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file?

@stuartthebruce

FYI, I was able to use 1.5.0-rc1 to index 1.2B files in a large backup zpool:

[root@origin-staging ~]# duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw

[root@origin-staging ~]# time duc index -vp /backup2 -d /dev/shm/duc.db                                                 
Writing to database "/dev/shm/duc.db"
Indexed 1205911647 files and 44090454 directories, (814.8TB apparent, 599.2TB actual) in 8 hours, 9 minutes, and 11.89 seconds.


real    489m11.917s
user    14m2.310s
sys     327m10.748s
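For scale, a quick sanity check on those numbers: 1,205,911,647 files over 8 hours 9 minutes works out to roughly 41,000 files indexed per second.

```python
# Throughput implied by the index run quoted above.
files = 1_205_911_647
seconds = 8 * 3600 + 9 * 60 + 11.89   # 29351.89 s, matching "real 489m11.917s"
rate = files / seconds                 # roughly 41,000 files/s
print(f"{rate:,.0f} files/s")
```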

It would be nice if there were inline compression for such large database files; e.g., post-facto zstd is able to reduce the database file from 25GB to 16GB:

[root@origin-staging ~]# ls -lh /dev/shm/duc.db 
-rw-r--r-- 1 root root 25G Oct 19 21:19 /dev/shm/duc.db

[root@origin-staging ~]# zstd --verbose -T0 /dev/shm/duc.db 
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Note: 24 physical core(s) detected 
/dev/shm/duc.db      : 64.69%   (26053963960 => 16853415278 bytes, /dev/shm/duc.db.zst) 

[root@origin-staging ~]# ls -lh /dev/shm/duc.db.zst 
-rw-r--r-- 1 root root 16G Oct 19 21:19 /dev/shm/duc.db.zst
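The inspect output later in this thread shows the database already uses per-record lz4 (record_comp_mode=lz4), which is consistent with whole-file zstd still finding 35% to squeeze out: compressing each small record independently cannot exploit redundancy between records, while one stream over the whole file can. A rough Python illustration, using stdlib zlib as a stand-in for lz4/zstd:

```python
import zlib

# Many small, highly similar records, loosely resembling per-directory
# entries in a filesystem index (synthetic data for illustration only).
records = [f"/backup2/home/user{i:04d}/Maildir".encode() for i in range(5000)]

# Per-record compression: each tiny record pays header overhead and sees
# no cross-record redundancy, so it can even expand the data.
per_record = sum(len(zlib.compress(r)) for r in records)

# One stream over the concatenated records exploits the repetition.
whole_stream = len(zlib.compress(b"".join(records)))

print(per_record, whole_stream)
```

The absolute numbers depend on the codec, but the gap between the two modes is the point: inline whole-stream (or block-level) compression would capture savings that per-record lz4 cannot.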

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@stuartthebruce

> Is there an option in version 1.5 to specify the different compressors supported by tkrzw?
>
> Currently this is not an option. Do you have a need?

Only to help test if a choice other than the current default helps with compressibility.

> And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file?
>
> That's a good point, I'll have to look into adding that. John

Thanks.

@stuartthebruce

> That is impressive space reduction. Or depressing, depending on how you look at it. I'll see what I can come up with. I assume you're willing to run tests on proposed patches?

Yes.

@stuartthebruce

> Do you happen to have the tkrzw utils installed?

I do now.

> Can you run the following and send me the results? I'm trying to pick better tuning defaults if I can.
>
> $ tkrzw_dbm_util inspect /dev/shm/duc.db

[root@origin-staging ~]# tkrzw_dbm_util inspect /dev/shm/duc.db
APPLICATION_ERROR: Unknown DBM implementation: db

With what I think are the right additional arguments?

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=3
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=1048583
  num_records=44090459
  eff_data_size=25474436170
  file_size=26053963960
  timestamp=1729397989.120004
  db_type=0
  max_file_size=8796093022208
  record_base=5246976
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 26053963960
Number of Records: 44090459
Healthy: true
Should be Rebuilt: true

> and if you're feeling happy, please do:
>
> $ time tkrzw_dbm_util rebuild /dev/shm/duc.db

[root@origin-staging ~]# time tkrzw_dbm_util rebuild --dbm hash /dev/shm/duc.db                                                            
Old Number of Records: 44090459
Old File Size: 26053963960
Old Effective Data Size: 25474436170
Old Number of Buckets: 1048583
Optimizing the database: ... ok (elapsed=183.065716)
New Number of Records: 44090459
New File Size: 26489626808
New Effective Data Size: 25474436170
New Number of Buckets: 88180927

real    3m3.069s
user    2m31.424s
sys     0m30.468s

> $ tkrzw_dbm_util inspect /dev/shm/duc.db

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=7
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=88180927
  num_records=44090459
  eff_data_size=25474436170
  file_size=26489626808
  timestamp=1729531678.856718
  db_type=0
  max_file_size=8796093022208
  record_base=440909824
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 26489626808
Number of Records: 44090459
Healthy: true
Should be Rebuilt: false
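Comparing the two inspect dumps above: "Should be Rebuilt" was true with about 42 records per bucket, and clears once the rebuild raises num_buckets to roughly 2x num_records. The 2x figure is an inference from these two dumps, not documented tkrzw policy; the arithmetic:

```python
# Load factors implied by the inspect output above.
num_records = 44_090_459

before = num_records / 1_048_583    # num_buckets before rebuild -> ~42 per bucket
after = num_records / 88_180_927    # num_buckets after rebuild  -> ~0.5 per bucket

print(f"load before rebuild: {before:.1f}, after: {after:.2f}")
```

If the indexer could estimate the record count up front (or grow buckets as it writes), the database could be created close to its rebuilt shape and skip the 3-minute rebuild entirely.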

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email
