HashStore Utility/Converter #95
That makes sense. So maybe the new move method should do nothing, or
just delete the temp file. The Converter.convert method will do the job of
creating the hard link - it has the original path and the ObjectMetadata
instance, which contains the sha256 checksum (and can be broken down into
the file path), so it can link them together.
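For context, here is a minimal, self-contained sketch of how a permanent address can be derived from a sha256 hex digest. The 3-level / 2-character sharding and the remainder-as-file-name convention are assumptions for illustration only; the real layout comes from the HashStore configuration.

```java
public class ShardedPathDemo {
    // Build the relative object path from a hex digest, e.g.
    // "4d1981..." -> "4d/19/81/<remainder of digest>" for depth=3, width=2.
    static String shardedPath(String hexDigest, int depth, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append(hexDigest, i * width, (i + 1) * width).append('/');
        }
        // Remaining characters become the file name (a convention assumed here;
        // the exact layout is defined by the HashStore configuration).
        return sb.append(hexDigest.substring(depth * width)).toString();
    }

    public static void main(String[] args) {
        String sha256 = "4d198171eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c";
        System.out.println(shardedPath(sha256, 3, 2));
        // -> 4d/19/81/71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
    }
}
```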
…On Thu, Jul 25, 2024 at 1:45 PM, Dou Mok wrote:
Unfortunately, simply overriding move will not be enough to address this
issue. To create a proper hard link, the original supplied path pathToDoc
must be used as the source.
Using the existing storeObject flow, and overriding the final move
operation, will actually only create a link between the tmpFile generated
and the target permanent address. As a result, this link points to the new
inodes/data blocks allocated for tmpFile, rather than referencing the
original inodes of pathToDoc.
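A small, self-contained demo of that inode point on a POSIX filesystem (the file names are hypothetical; it only simulates the tmpFile situation described above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HardLinkInodeDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("linkdemo");
        Path original = Files.writeString(dir.resolve("pathToDoc.xml"), "original bytes");

        // Simulate the tmpFile produced by writing the stream out again:
        // same content, but freshly allocated data blocks and a new inode.
        Path tmpFile = Files.copy(original, dir.resolve("tmpFile"));

        Path linkViaTmp = Files.createLink(dir.resolve("permAddrViaTmp"), tmpFile);
        Path linkViaDoc = Files.createLink(dir.resolve("permAddrViaDoc"), original);

        // The inode numbers make the difference visible.
        System.out.println("original        ino: " + Files.getAttribute(original, "unix:ino"));
        System.out.println("link via tmp    ino: " + Files.getAttribute(linkViaTmp, "unix:ino"));
        System.out.println("link via doc    ino: " + Files.getAttribute(linkViaDoc, "unix:ino"));
    }
}
```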
Sequence diagram: https://github.com/user-attachments/assets/36a15e8b-45dc-45e5-ae02-ff2af81606a9
zenuml source:
title HashStoreConverter Process
Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
"new InputStream existDocStream"
FileHashStoreLinks.storeHardLink(existDocStream, pid) {
FileHashStore.storeObject(existDocStream, pid) {
syncPutObject
writeToTmpFileAndGenerateChecksums
move
// -
// Override 'move'
// Creates a hard link
FileHashStoreLinks->FileHashStoreLinks
return ObjectMetadata
}
FileHashStore.storeMetadata(sysmeta, pid) {
// -
// If sysmeta fails to store,
// an exception will be thrown.
// -
// The hard link/tags created
// for the data obj will remain.
return pathToSysmeta
}
return ObjectMetadata
}
return ObjectMetadata
}
Instead of overriding move, we will not follow the storeObject flow
directly but indirectly (like in the scenario where a client receives a
data stream before the metadata). However, instead of calling
storeObject, we will directly call writeToTmpFileAndGenerateChecksums.
Since we are really only after the map of checksums/hex digests, there is
no need to follow the storeObject synchronization process, as the tmpFile
being written to is discarded afterwards. Once that is completed, we can
call tagObject, which is thread-safe and synchronized, then create the
hard link, and lastly store the sysmeta.
Sequence diagram (New Flow): https://github.com/user-attachments/assets/18d68535-bf4c-4260-8e55-ee7b47bc2e97
zenuml source (New Flow):
title HashStoreConverter Process
Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
"new InputStream existDocStream"
FileHashStoreLinks.storeHardLink(existDocStream, pid) {
FileHashStore.generateTmpFile {
return tmpFile
}
FileHashStore.writeToTmpFileAndGenerateChecksums {
return hexDigests/Checksums
}
delete(tmpFile)
// -
// 'tagObject' is synchronized/thread safe
FileHashStore.tagObject(pid, cid)
FileHashStore.getHashStoreDataObjectPath(cid) {
return cidExpectedObjectPath
}
createHardLink(existDoc, cidExpectedObjectPath)
FileHashStore.storeMetadata(sysmeta, pid) {
// -
// If sysmeta fails to store,
// an exception will be thrown.
// -
// The hard link/tags created
// for the data obj will remain.
return pathToSysmeta
}
return ObjectMetadata
}
return ObjectMetadata
}
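Expressed as Java, the flow above reads roughly like the outline below. The method names come from the diagram, but the signatures, the SHA-256 key, and the ObjectMetadata constructor are assumptions rather than the actual FileHashStore API:

```java
// Rough outline only; signatures are assumed, not FileHashStore's real API.
public ObjectMetadata storeHardLink(Path existDoc, InputStream existDocStream, String pid)
    throws Exception {
    // 1. Write the stream into a tmp file purely to obtain the default checksums,
    //    then discard the tmp file - its data blocks are never linked to.
    Path tmpFile = generateTmpFile();
    Map<String, String> hexDigests = writeToTmpFileAndGenerateChecksums(tmpFile, existDocStream);
    Files.delete(tmpFile);

    String cid = hexDigests.get("SHA-256"); // assumed content-identifier algorithm

    // 2. tagObject is synchronized/thread-safe, so pid<->cid references are recorded safely.
    tagObject(pid, cid);

    // 3. Link the *original* document into its cid-derived permanent address,
    //    so no data is rewritten and no new data blocks are allocated.
    Path cidExpectedObjectPath = getHashStoreDataObjectPath(cid);
    Files.createDirectories(cidExpectedObjectPath.getParent());
    Files.createLink(cidExpectedObjectPath, existDoc);

    return new ObjectMetadata(pid, cid, Files.size(existDoc), hexDigests);
}
```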
|
Thank you for the quick feedback @taojing2002! I believe we actually don't need to
override the move method at all. The HashStoreConverter.convert method will make two
calls to FileHashStoreLinks:
- storeHardLink(Path existDoc, InputStream existDocStream, String pid)
  - This will get the 5 default checksums by writing into and then deleting a tmpFile
  - Then it will call tagObject, and finally create the actual hard link afterwards
  - I am contemplating creating the hard link first before calling tagObject
- storeMetadata(InputStream sysmeta, String pid)
Updated sequence diagram: https://github.com/user-attachments/assets/15992a57-c52f-44bb-987a-bb317d3a8379
zenuml source (Converter Process Updated):
title HashStoreConverter Process
Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
"new InputStream existDocStream"
FileHashStoreLinks.storeHardLink(existDoc, existDocStream, pid) {
FileHashStore.generateTmpFile {
return tmpFile
}
FileHashStore.writeToTmpFileAndGenerateChecksums {
return hexDigests/Checksums
}
delete(tmpFile)
// -
// 'tagObject' is synchronized/thread safe
FileHashStore.tagObject(pid, cid)
FileHashStore.getHashStoreDataObjectPath(cid) {
return cidExpectedObjectPath
}
createHardLink(existDoc, cidExpectedObjectPath)
return ObjectMetadata
}
// -
// Close object stream
"existDocStream.close()"
FileHashStoreLinks.storeMetadata(sysmeta, pid) {
// -
// If sysmeta fails to store,
// an exception will be thrown.
// -
// The hard link/tags created
// for the data obj will remain.
FileHashStore.storeMetadata(sysmeta, pid) {
return pathToSysmeta
}
return pathToSysmeta
}
// -
// - Close sysmeta stream
"sysmeta.close()"
return ObjectMetadata
}
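A hedged sketch of what the convert wrapper could look like with the stream handling shown above; the fileHashStoreLinks field and the method signatures are assumptions that follow the diagram, not the final API:

```java
// Sketch only; follows the diagram above, real signatures may differ.
public ObjectMetadata convert(Path existDoc, String pid, InputStream sysmeta) throws Exception {
    ObjectMetadata objInfo;
    // try-with-resources guarantees the object stream is closed after storeHardLink,
    // mirroring the "existDocStream.close()" step in the diagram.
    try (InputStream existDocStream = Files.newInputStream(existDoc)) {
        objInfo = fileHashStoreLinks.storeHardLink(existDoc, existDocStream, pid);
    }
    // If storing sysmeta fails, the exception propagates; the hard link and tags
    // created for the data object remain in place, as noted in the diagram.
    try (sysmeta) {
        fileHashStoreLinks.storeMetadata(sysmeta, pid);
    }
    return objInfo;
}
```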
|
Sounds good. To me, it sounds reasonable to create the hard link first.
Creating the link means the object is ready, then you can tag it.
|
@doulikecookiedough Some idle noodling on all this, and how long it will take...

Process is still writing, not linking
The purpose of the use of a hard link is to completely avoid having to rewrite the data to disk. If you call …

Batching with CephFS API
Also, one of the things that it would be good to avoid is running an iterative loop across 2 million+ objects and calling a method on each, which will take a long time on its own. When I first mentioned this optimization, I mentioned that it would be much faster to call the CephFS API to modify the metadata server directly (hopefully in batch ops) than it would be to use the posix link …

Possible alternative to consider
An alternative approach could be below, that I have no idea whether it will work or is supported by the Ceph API, just food for thought: …

Some quick metrics on our system write speeds
Of course, this is entirely out-of-band from both hashstore and metacat, but it might be orders of magnitude faster than trying to insert millions of objects one call at a time. In particular, once (2) is done, you can then massively parallelize (3) on our k8s cluster (although we will be quickly limited by read I/O rate). Because this process is out-of-band (and specific to our systems), it probably would not be the best automatic upgrade path for small metacat deployments. Rather, I see this as a pre-processing step to prepare a hashstore dir and set of checksums for our large installations that have a lot of objects (e.g., ADC/KNB/cn.dataone.org). Smaller metacat installs might just use the Hashstore API you are designing to convert directly.

A few stats follow for reference. I created a directory with 100 3 GiB files on our cephfs filesystem (named …).
Hardlinking all of the files was fast, and took 2.0 seconds, which is an average of 50 files per second, or 150 GiB/s.
Copying all of the data to a new file was I/O bound, took 4m58s, which is an average of 1 GiB/s.
We have about 140TB of data on the ADC. So, at that rate, it would take about 39.82 hours.
So, that's about half the time of the file writes, and hopefully something that we can parallelize much more extensively on the k8s cluster because sha calculations are generally compute-bound. All just food for thought and discussion. Not sure what you should do, but thought it was worth pointing out your "link" op is really a "write" op. |
Thank you for sharing your insightful thoughts with us @mbjones 🙏! If I call …

RE: Alternative Approach & CephFS
I will let this simmer in my mind for a bit, your feedback has been extremely helpful! |
@doulikecookiedough thanks. In my comments above, I made a pretty big calculation error on my data throughput rates, which I edited above. So instead of 3982 hours for our 140TB cp, that should have been 39.82 hours. Pretty huge difference. Wanted to point this out as it changes my thinking on the scale of this conversion. |
Thank you again @mbjones for sharing your thoughts with us and the update. I realize that I may have created some confusion regarding our approach to the conversion process as well. While Metacat will likely set up the process with a loop, the execution phase will utilize Java's concurrency APIs (TBD, perhaps via the Java Collection object's …).

The out-of-band process described sounds quite promising - but Metacat will eventually also need to store the system metadata for each data object. Given this, along with how we want to store a hard link to optimize disk usage, I feel that the … |
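Purely as an illustration of the parallel-execution intent mentioned above: the per-object work could be fanned out with a parallel stream. The Converter interface, the work-list shape, and the paths below are hypothetical stand-ins, not Metacat or hashstore code.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ParallelConversionDriver {

    // Stand-in for HashStoreConverter.convert(...), which is still being designed.
    interface Converter {
        void convert(Path pathToDoc, String pid, InputStream sysmeta) throws Exception;
    }

    // workList: pid -> [path to data object, path to its sysmeta], e.g. built from Metacat's DB.
    static void convertAll(Map<String, Path[]> workList, Converter converter) {
        ConcurrentLinkedQueue<String> failedPids = new ConcurrentLinkedQueue<>();

        // parallelStream() spreads the per-object work across the common ForkJoinPool;
        // each task hard-links one object and stores its sysmeta.
        workList.entrySet().parallelStream().forEach(entry -> {
            String pid = entry.getKey();
            Path doc = entry.getValue()[0];
            try (InputStream sysmeta = Files.newInputStream(entry.getValue()[1])) {
                converter.convert(doc, pid, sysmeta);
            } catch (Exception e) {
                failedPids.add(pid); // record and continue; retry failures afterwards
            }
        });

        System.out.println("Conversions failed: " + failedPids.size());
    }
}
```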
Update: After further discussion with @taojing2002 and @artntek, we have agreed to combine the Metacat DB upgrade and the HashStore conversion process.
When operators configure the DB or initiate the process, it will also automatically trigger the HashStore conversion.
RE: Jumbo Repositories (KNB, DataONE, ADC) - Minimize Downtime (<2 minutes or less?)
|
A note on the jumbo repos comment -- the reason that I think the downtime can be kept to a few minutes is that the conversion of files from the existing store to HashStore can be done ahead of time without loss (and hopefully without re-writing existing data), recording the checksum results for later update in Metacat. A rough pseudo algorithm might be:
Hope this is helpful. Other approaches would be feasible too, this is just one possible path. |
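For illustration, one possible reading of the ahead-of-time conversion idea from the comment above, as a hedged sketch; the Registry and Converter types are hypothetical placeholders, not part of hashstore or Metacat.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class PreConversionPass {

    // Hypothetical collaborators for this sketch.
    interface Converter { Object convert(Path doc, String pid, InputStream sysmeta) throws Exception; }
    interface Registry {
        String pidFor(Path doc);                        // map a legacy file to its pid
        InputStream openSysmeta(String pid) throws IOException;
        void saveChecksums(String pid, Object objInfo); // kept for the later DB update
        void markForRetry(String pid, Exception e);
    }

    // Walk the legacy store while Metacat stays online; hard-link each object into
    // hashstore and record its checksums, so the downtime window only needs a DB update.
    static void run(Path legacyDataDir, Converter converter, Registry registry) throws IOException {
        try (Stream<Path> docs = Files.walk(legacyDataDir)) {
            docs.filter(Files::isRegularFile).forEach(doc -> {
                String pid = registry.pidFor(doc);
                try (InputStream sysmeta = registry.openSysmeta(pid)) {
                    registry.saveChecksums(pid, converter.convert(doc, pid, sysmeta));
                } catch (Exception e) {
                    registry.markForRetry(pid, e);
                }
            });
        }
    }
}
```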
@taojing2002 I have tested the hard link conversion process against the normal storeObject flow:
For the normal process: …
For the hardlinking process: …
No unexpected errors - so creating a hard link is about 50% faster than the normal process. |
This has been completed via Feature-95: HashStoreConverter & FileHashStoreLinks - however, leaving this issue open to continue discussing plans/strategy to reduce downtime for jumbo repos. |
The hard link conversion process was run for all data objects on knbvm via the hashstore client and took approximately … . For quick reference:

mok@knbvm:/var/metacat$ du -h -d 1
0 ./benchmark
0 ./inline-data
du: cannot read directory './.metacat': Permission denied
0 ./.metacat
1.5G ./documents
512 ./users
178K ./logs
5.9T ./hashstore
du: cannot read directory './config': Permission denied
0 ./config
910M ./solr-home3
du: cannot read directory './tdb/tdb16272684195302037564': Permission denied
409G ./tdb
4.5K ./dataone
507G ./data
421M ./temporary
57K ./certs
6.8T . |
After further discussion with the team, we have confirmed that HashStore will assist the client, Metacat, in the migration process for existing data and metadata objects by providing a new utility class. Metacat will coordinate the iterative process so that the respective default checksum list for each data object is stored. This class, HashStoreConverter, has a single convert method, which stores a hard link and the sysmeta for the given pid. This utility class will need to call a new class, FileHashStoreLinks, which extends FileHashStore. We will have a new public method storeHardLink(...) which will follow the same existing flow as storeObject, except that this method will eventually arrive at a new override of move, which calls Files.createLink instead of Files.move (a sketch of this override appears below the To Do list).

Process Diagram (via @taojing2002 and @artntek)

To Do:
- HashStoreConverter
  - convert(Path pathToDoc, String pid, Stream sysmeta)
- FileHashStoreLinks
  - storeHardLink(InputStream object, String pid)
  - move(Path sourcePath, Path targetPath)
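To illustrate the override described above, a minimal sketch; the access modifier and return type of move are assumptions, and later comments in this thread revise this approach in favor of linking from the original path:

```java
// Sketch of the initially proposed override; later comments revise this approach.
@Override
protected Path move(Path sourcePath, Path targetPath) throws IOException {
    Files.createDirectories(targetPath.getParent());
    // Hard-link instead of moving: the permanent address shares the source's
    // inode/data blocks rather than receiving relocated ones.
    return Files.createLink(targetPath, sourcePath);
}
```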