
1.5.0-rc1 discussions #333

Open
zevv opened this issue Sep 5, 2024 · 14 comments
@zevv
Owner

zevv commented Sep 5, 2024

Hi @l8gravely,

Lacking a mailing list or forum, I took the liberty of opening an issue for discussing the 1.5.0-rc1 release. I'll use this to jot down some notes in no particular order; feel free to ignore or answer as you please :)

New database format

The default db format moved from tokyocabinet to tkrzw. It's about time we left tokyocabinet behind, as it's just not stable and very much unmaintained. Some comments, though:

  • After upgrading, duc gets confused by the new db format and throws the message Error opening: /home/ico/.cache/duc/duc.db - Database corrupt and not usable. Maybe we could add a hint here that the database format might not match the current version and that the user should clean it up and re-index.
  • Apart from the performance, the small size of the database files was the main reason I went with tokyocabinet years ago, as it manages to create considerably smaller databases than the other engines. On my little test setup, the new duc ends up with a database nearly 9 times larger than before: 71M before vs 590M now, just to index my home dir. I have not tested on humongous file systems since I do not have any, so I have no clue what this looks like when scaling up.
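On the first point, a friendlier error could come from sniffing the file's leading magic bytes before handing the path to the backend. A minimal Python sketch of the idea; the magic strings and the open_hint helper are assumptions for illustration, not duc's actual code:

```python
# Hypothetical sketch: detect the on-disk format from the file header so a
# mismatched database produces a hint instead of a generic "corrupt" error.
# Both magic strings are assumptions about the two formats' headers.
TOKYOCABINET_MAGIC = b"ToKyO CaBiNeT"   # assumed tokyocabinet file header
TKRZW_HASH_MAGIC = b"TkrzwHDB"          # assumed tkrzw HashDBM file header

def sniff_db_format(header: bytes) -> str:
    """Best-effort guess at the database format from its first bytes."""
    if header.startswith(TOKYOCABINET_MAGIC):
        return "tokyocabinet"
    if header.startswith(TKRZW_HASH_MAGIC):
        return "tkrzw"
    return "unknown"

def open_hint(path: str, expected: str = "tkrzw") -> str:
    """Return a helpful message when the db format doesn't match this build."""
    with open(path, "rb") as f:
        found = sniff_db_format(f.read(16))
    if found not in (expected, "unknown"):
        return (f"Error opening {path}: database is in {found} format, but this "
                f"duc build expects {expected}. Remove it and re-run 'duc index'.")
    return "ok"
```

A check like this costs one small read and would only fire on the upgrade path described above.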

topn command

No comments - very useful and a welcome addition!

histogram support

Still early work on the reporting side I see, but at least the info is already there in the db, nice.

@l8gravely
Collaborator

I'll fix the DB checking issues, I thought I had already gotten that right, but obviously I missed something. Was the old DB on tokyocabinet format? I'll run some tests and see what I need to do.

@l8gravely l8gravely self-assigned this Sep 5, 2024
@bougui

bougui commented Oct 9, 2024

Hello @zevv, I will compile the new version and test it next week on our setup, where we index ~120 TB.

Questions:

  • Will the new version also include a user option to get the usage of a specific user only?

  • Also, I was told by a colleague (I have not tested it yet) that if we run duc for a specific user, say bob, duc will not descend into a subfolder that is not owned by bob, even though bob could have files inside it.
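The behavior the colleague describes would follow from pruning by owner at the directory level rather than filtering per file while still descending. A small Python sketch of the difference, with owners simulated in a dict since chown needs root (purely illustrative, not duc's traversal code):

```python
# A toy tree mapping paths to owners. /data/shared is owned by alice but
# contains a file owned by bob.
tree = {
    "/data":                 "root",
    "/data/bob.txt":         "bob",
    "/data/shared":          "alice",
    "/data/shared/bob2.txt": "bob",
}

def children(path):
    """Direct children of a path in the toy tree."""
    prefix = path.rstrip("/") + "/"
    return [p for p in tree if p.startswith(prefix) and "/" not in p[len(prefix):]]

def scan(path, user, prune_dirs):
    """Collect files owned by user. If prune_dirs, skip dirs the user doesn't own."""
    found = []
    for child in children(path):
        if children(child):                       # it's a directory
            if prune_dirs and tree[child] != user:
                continue                          # subtree skipped entirely
            found += scan(child, user, prune_dirs)
        elif tree[child] == user:
            found.append(child)
    return found
```

With pruning, bob's file under /data/shared is never counted; per-file filtering finds it, at the cost of walking the whole tree.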

TIA

@l8gravely
Collaborator

l8gravely commented Oct 9, 2024 via email

@bougui

bougui commented Oct 9, 2024

Hello John,

since DB size is increasing with the new version, and since duc must be usable on large filesystems with more than one user, I would definitely add an option to keep per-user sizes if requested as an argument ;-) And since we have lots of space to check, we should have the room to keep a larger DB.

@l8gravely
Collaborator

l8gravely commented Oct 10, 2024 via email

@stuartthebruce

Is there an option in version 1.5 to specify the different compressors supported by tkrzw?

And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file?

@stuartthebruce

FYI, I was able to use 1.5.0-rc1 to index 1.2B files in a large backup zpool:

[root@origin-staging ~]# duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw

[root@origin-staging ~]# time duc index -vp /backup2 -d /dev/shm/duc.db                                                 
Writing to database "/dev/shm/duc.db"
Indexed 1205911647 files and 44090454 directories, (814.8TB apparent, 599.2TB actual) in 8 hours, 9 minutes, and 11.89 seconds.


real    489m11.917s
user    14m2.310s
sys     327m10.748s
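For scale, a quick sanity check on those numbers: 1,205,911,647 files over 8 hours 9 minutes works out to roughly 41,000 files indexed per second.

```python
# Throughput implied by the index run quoted above.
files = 1_205_911_647
seconds = 8 * 3600 + 9 * 60 + 11.89   # 29351.89 s, matching "real 489m11.917s"
rate = files / seconds                 # roughly 41,000 files/s
print(f"{rate:,.0f} files/s")
```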

It would be nice if there were inline compression for such large database files; e.g., post-facto zstd is able to reduce the database file from 25GB to 16GB:

[root@origin-staging ~]# ls -lh /dev/shm/duc.db 
-rw-r--r-- 1 root root 25G Oct 19 21:19 /dev/shm/duc.db

[root@origin-staging ~]# zstd --verbose -T0 /dev/shm/duc.db 
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Note: 24 physical core(s) detected 
/dev/shm/duc.db      : 64.69%   (26053963960 => 16853415278 bytes, /dev/shm/duc.db.zst) 

[root@origin-staging ~]# ls -lh /dev/shm/duc.db.zst 
-rw-r--r-- 1 root root 16G Oct 19 21:19 /dev/shm/duc.db.zst
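The inspect output later in this thread shows the database already uses per-record lz4 (record_comp_mode=lz4), which is consistent with whole-file zstd still finding 35% to squeeze out: compressing each small record independently cannot exploit redundancy between records, while one stream over the whole file can. A rough Python illustration, using stdlib zlib as a stand-in for lz4/zstd:

```python
import zlib

# Many small, highly similar records, loosely resembling per-directory
# entries in a filesystem index (synthetic data for illustration only).
records = [f"/backup2/home/user{i:04d}/Maildir".encode() for i in range(5000)]

# Per-record compression: each tiny record pays header overhead and sees
# no cross-record redundancy, so it can even expand the data.
per_record = sum(len(zlib.compress(r)) for r in records)

# One stream over the concatenated records exploits the repetition.
whole_stream = len(zlib.compress(b"".join(records)))

print(per_record, whole_stream)
```

The absolute numbers depend on the codec, but the gap between the two modes is the point: inline whole-stream (or block-level) compression would capture savings that per-record lz4 cannot.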

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@stuartthebruce

> Is there an option in version 1.5 to specify the different compressors supported by tkrzw?
>
> Currently this is not an option. Do you have a need?

Only to help test if a choice other than the current default helps with compressibility.

> And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file?
>
> That's a good point, I'll have to look into adding that. John

Thanks.

@stuartthebruce

> That is impressive space reduction. Or depressing, depending on how you look at it. I'll see what I can come up with. I assume you're willing to run tests on proposed patches?

Yes.

@stuartthebruce

> Do you happen to have the tkrzw utils installed?

I do now.

> Can you run the following and send me the results? I'm trying to pick better tuning defaults if I can.
>
> $ tkrzw_dbm_util inspect /dev/shm/duc.db

[root@origin-staging ~]# tkrzw_dbm_util inspect /dev/shm/duc.db
APPLICATION_ERROR: Unknown DBM implementation: db

With what I think are the right additional arguments?

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=3
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=1048583
  num_records=44090459
  eff_data_size=25474436170
  file_size=26053963960
  timestamp=1729397989.120004
  db_type=0
  max_file_size=8796093022208
  record_base=5246976
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 26053963960
Number of Records: 44090459
Healthy: true
Should be Rebuilt: true

> and if you're feeling happy, please do:
>
> $ time tkrzw_dbm_util rebuild /dev/shm/duc.db

[root@origin-staging ~]# time tkrzw_dbm_util rebuild --dbm hash /dev/shm/duc.db                                                            
Old Number of Records: 44090459
Old File Size: 26053963960
Old Effective Data Size: 25474436170
Old Number of Buckets: 1048583
Optimizing the database: ... ok (elapsed=183.065716)
New Number of Records: 44090459
New File Size: 26489626808
New Effective Data Size: 25474436170
New Number of Buckets: 88180927

real    3m3.069s
user    2m31.424s
sys     0m30.468s

> $ tkrzw_dbm_util inspect /dev/shm/duc.db

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=7
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=88180927
  num_records=44090459
  eff_data_size=25474436170
  file_size=26489626808
  timestamp=1729531678.856718
  db_type=0
  max_file_size=8796093022208
  record_base=440909824
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 26489626808
Number of Records: 44090459
Healthy: true
Should be Rebuilt: false
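Comparing the two inspect dumps above: "Should be Rebuilt" was true with about 42 records per bucket, and clears once the rebuild raises num_buckets to roughly 2x num_records. The 2x figure is an inference from these two dumps, not documented tkrzw policy; the arithmetic:

```python
# Load factors implied by the inspect output above.
num_records = 44_090_459

before = num_records / 1_048_583    # num_buckets before rebuild -> ~42 per bucket
after = num_records / 88_180_927    # num_buckets after rebuild  -> ~0.5 per bucket

print(f"load before rebuild: {before:.1f}, after: {after:.2f}")
```

If the indexer could estimate the record count up front (or grow buckets as it writes), the database could be created close to its rebuilt shape and skip the 3-minute rebuild entirely.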

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email
