-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v3 Planning & Objectives #194
Comments
As I began to understand the OpenMcdf code better and more generally the CFB format while investigating #184, I was thinking that it might be quite difficult to add the features I've previously mentioned to v2. So, at the end of last week I took the liberty and started some proof of concept work for a new approach for version 3. I've adapted/written code to parse headers and directory entries, enumerate FAT sectors, traverse sector chains and read the contents of a stream. There's obviously quite a lot of functionality missing (substorage traversal, reading mini FAT, any kind of writing), but it is at least enough to spin up the equivalent of the "InMemory" benchmark: Windows Structured Storage (ILockBytes over a MemoryStream)
OpenMcdf v2.3.1
OpenMcdf v3 Proof of Concept
So, there's some pretty big performance (400x faster for short reads) and memory reduction (Gen0 GCs are drastically reduced, and Gen1/2 GCs are eliminated) wins to be had on reading, while also enforcing reasonably strict validation. In the proof of concept, BinaryReader and BinaryWriter are extended to handle CFB types, and there is only one reader and one writer stored in a context (along with the header) that is shared across objects that need access to it. Sectors are lightweight structs mostly to record their ID and map to their position within the CFB stream/file. There are a couple of enumerators that do the main work: Although I haven't done any code to write data yet, I'm thinking the enumerators might be converted to mutable iterators which should also be reasonably fast/efficient. I'll share some code when I've cleaned it up and progressed a bit further! |
Honestly speaking... it's a perfect summary. |
@ironfede Yesterday, I added a dedicated enumerator for directory entries in a FAT chain and another enumerator for the directory tree, so it can now traverse storages and streams as part of a tree (it could only traverse them as a list before) . I also improved the enumerators so they're a bit easier to follow and improved validation (enumerators and sector offsets throw if you try to access something that's invalid/out of bounds). I'll have a look at implementing support for mini FAT sectors today, then I think I'll be to the point where I'll have something worth sharing for comments and feedback. But essentially, aside from some clean-up and further validation/testing, I think the POC already meets the following objectives:
|
@Numpsy Looks like you might have a particular interest in the OLE Property Set Data Structures? Do you have anything you'd like to see for v3? I can't say I know too much about it, so aside from some nit-pick refactoring work my only real comment is that perhaps OpenMcdf.Extensions should be renamed to OpenMcdf.Ole (i.e. explicitly about OLE only). Especially since we probably won't require a decorator for streams in v3. |
My current use case is reading and writing metadata (summary information etc) in Office documents, so things like really massive files aren't really an issue. As far as the API goes, I think it would be nice to review how it presents the different property sets - as well as SummaryInformation/DocumentSummaryInformation, there is some amound of support for others, but it's not clear how far the intended support goes. |
There's also scope for a more complete set of functions for adding/updating/deleting properties (e.g. #190) - that's a higher level API than the changes to the storage part. |
My use case is similar to @Numpsy but on Solid Edge (CAD) documents, besides the standard SummaryInformation/DocumentSummaryInformation, some more app-specific metadata streams need to be accessed. Regular documents contain one part per file however some documents may contain more than one part; this means the file's structure becomes multilevel and each part has its own SummaryInformation/DocumentSummaryInformation and so on. If needed I can provide example files with an explanation of their structure. |
@jeremy-visionaid that's great, let me know where to upload the files besides a brief descriptions of them |
@farfilli Maybe you could push them with on a branch of a fork and put some descriptions in the commit message? |
@jeremy-visionaid done in 0c8fc18 |
Okeydoke, I think I've progressed with the v3 proof of concept far enough that I'm happy to push it for some early feedback. There are some notable things still missing:
I'll at least try and add some comments tomorrow, and see if I can add some additional features as above through the week, but comments and feedback would be appreciated! |
I've pushed some changes with some refactoring, improved terminology, filled out the summary comments and improved validation for corrupt CFBs. The reading part at least should be in pretty good shape if anyone wants to take it for a spin and try it out. I could bring the structured storage explorer across if it's of interest? Otherwise it's just unit tests right now. |
Thank you @jeremy-visionaid for all your work. I'm already working on a base v3, including some deep structural changes in manipulation of sector chains and streams in general to set a background to evolve in a future for a SWMR scenario and for a performance enhancement in write operations. I hope to put a draft online in short time since I understand that there's (correctly) a little bit of pressure on this... :-) |
@ironfede That's no worries, it's been quite fun! I did consider trying to work with v2.4 as a base, but I've handled sectors, chains and streams in a totally different way, changed and improved the error handling etc., and even the terminology for types and members is a bit different. Since I'm the one that wants most of the v3 features and robustness etc,, I'm also happy to do the bulk of the work, but I think it was just more efficient for me to start afresh in order to get there. I'd be curious as to what things you were thinking about with regards to SWMR? Although it's not a goal I had in mind, I think the architecture of what I've done in the proof of concept would likely lend itself quite well to that situation already. I have been thinking ahead a bit towards transactions, consolidation and scratch data, etc., but I only really have an idea about how they could be achieved with the POC. It's not so much that there's a huge amount of pressure to get v3 out immediately. I can probably work around the transactions and consolidation stuff on the client side. Releasing v2.4.0 would be helpful though, since v2.3 currently crashes our app, so I can't yet merge the changes to use it. Updating to v3 I hope would improve the performance, enable larger files and reduce memory and disk space requirements, etc, but it otherwise might not be a release blocker. (Though before we go to RTM with our application, I do need to validate some things with 2.4 - or possibly even 3.0 instead depending on how things go on that front). Though I've been somewhat busy with other tasks this week, I hope might find some time to bring the POC up to feature parity with v2.4 around the middle of next week, but I'd also love to see your draft 3.0 to see how it compares. |
There are some points in a version 3 that are honestly a little bit difficult to implement in a clean way. The first is supporting LARGE files for v4 CFB (16TB Stream as specs define). Only this point alone means that a chain with no data should have more than 4 billions of indexes and this is absolutely a huge number of Sector indexes to support so I'm thinking how to partialize chain loading for sections of requested sectors in order to manipulate only required read/write sections. The natural "old" way to manipulate this type of things should be a memory file mapping but this means that all structures need to be rethinked in terms of structs with no reference at all in a C-way and this could be difficult to mantain in the long run. On the other hand, using a Stream-ed approach, I think that some functions to obtain partial chains sections should be introduced... |
That's actually largely the point of why I rewrote the sector/chain/stream handling in the proof of concept. The POC is designed/intended to handle arbitrarily long files/chains/streams, but I couldn't really see how to adapt v2.4 to handle that... I made a start on handling writing, but its limited to in-place alterations at the moment, allowing arbitrarily large writes while deferring commits is still to do, but I think I might have that going in another day or two |
@jeremy-visionaid , @molesmoke , @Numpsy and everybody interested, I've added 3.0 branch. Many thanks to everybody! |
@ironfede Thanks for making that branch. I'll keep pushing to #199 for now, and we can retarget it/rework it as appropriate once the investigation is complete. I've pushed some more reading related improvements to #199 that I found while working on the writing part (though I'm aware that more can still be done). But here's a quick summary of the current POC state/design: CfbBinaryReader/Writer: Read and write CFB types (e.g. GUIDs, DateTimes, Headers, DirectoryEntries) A lot of the initial shortcomings are still present, but I'd like to get writing basically finished before neating it up and filling it out. I'm currently working on the following bits: The writing is obviously a fair bit more complex than the reading, but I'll push something when I reached some reasonable milestone if the above design works out OK. Although I can write a CFB file and read it back, it is still very much work in progress. |
What about implementing #202 in v3? Basically, I need to create \ read \ edit \ delete any kind of available property streams, any advanced function that validates and \ or helps accomplish these tasks would be very welcome. |
@farfilli I'm afraid I'm only really working on the core library, not the extensions. I don't want to speak for @ironfede, but IMHO, you and @Numpsy could reasonably continue to make changes to those parts on the v2 branch (regardless of whether it gets released or not)... It should be a trivial matter to port the extensions across to the v3 core (and I don't mind just handling the porting part to keep the extensions working). |
A quick status update on the POC. I didn't get as much time to work on it this week as I would have liked (I still need to implement transitions to/from mini FAT streams), and haven't really done much with transactions/scratch data at all. I also wrote some benchmarks, suffice to say the POC is about 30% faster at writing compared to v2.3, but is a fair bit slower than the reference implementation, so I'd like to clean it up and optimize it a bit more before sharing it. So, still work in progres at this stage... |
I had some optimizations I wanted to get them out of my brain and into the world, so I have some write benchmarks to share! Windows Structured Storage (ILockBytes from MemoryStream)
OpenMcdf v2.3.1
OpenMcdf v3-poc
I have a couple of last optimizations I'd like to make still... The small buffer writes are allocating more memory than they probably need to, I can probably bring that down a bit. I think the speed of the large buffer writes can also be improved by reserving FAT sectors (for the additional FAT entries) and then reserving data sectors in one contiguous block. Currently FAT sectors are added on the fly, and writes are chunked by the sector size, but if they're chunked by contiguous runs of sectors instead, then that might give a bit of a performance boost. |
So, I'm a little behind where I'd hoped to be as I've had other things I needed to work on. I've pushed the direct write support to the poc branch. The write performance measurements on the previous were out since it wasn't actually using the mini stream. So, that's fixed, along with a bunch of other fixed and improvements. It has dropped a bit of performance as well - I should be able to get some of that back, but still a fair bit faster than v2.3.
In terms of feature partity:
|
I've added transactions (i.e commit/revert). The allocations for in-memory benchmarks look relatively high compared to v2.3 because the modifications get written to a stream rather than individual buffers for the Sectors (and the performance is accordingly worse since it needs to copy every time the capacity needs to be increased). It seems like an unfair comparison to use an in-memory benchmark for that, given the feature I'm interested in is using temp files for scratch/transaction data (which essentially require streams). So, the following is for file streams:
Unfortunately running a fair bit slower than the reference (Windows Structured Storage) at the moment.
|
So, I found the main reason for the performance issue was a miscalculation when extending the scratch data stream, which made it a fair bit longer than it needed to be. I also added a cache for the FAT's last used sector to speed up FAT lookups too. That makes it faster than the native implementation again for straight reads and writes, and somewhat close to parity with the native implementation for transactions.
Still some room for improvement and still some features to implement... but looking OK. |
Now that v2.4 looks like it's coming together. I was wondering what folks thoughts were for objectives for v3?
If I understand correctly, the values/objectives for v2 are:
There are some goals I have in mind for v3:
Other thoughts:
The text was updated successfully, but these errors were encountered: