Refactor freelist management & add more dedicated tests #789

ahrtr · 2024-07-08T14:43:35Z

Backgroud

There are some long standing data corruption issues, which indicate that there might be potential bug(s) in freelist management.

I am not satisfied with the bbolt freelist management for a long time. It's hard to understand and also tightly coupled with bbolt. I have been thinking to refactor it to improve the understandability & testability.

Refactor

The high level idea is to

simplify the implementation to improve understandability;
and introduce interface and decouple it with the bbolt TXN workflow to improve testability.

What we have done and are going to do:

move array related freelist functions into own file #777
Move method freePages into freelist.go #783
introduce a freelist interface #775
No need to handle freelist as a specical case when freeing a page #788
Panicking when a write txn tries to free a page which was allocated by itself #792
Simplify & update the logic related to the pending released free pages, refer to doc

We also need to continue to refactor & simplify the interface from user (bbolt) perspective, in other words, we should have a clear understanding on how the interface will & should be used by bbolt. The motivation is to improve testablity.

The freelist management is the most sensitive & important part. So let's do it step by step.

Test

Any unit tests are welcome.

But more importantly, we should add dedicated randomized test case to simulate concurrent (multiple) read TXNs and (single) writing TXN.

We need to record all the requests or operations sent to the freelist module;
We need to set up a list of expected behavior or invariable properties, and verify all the invariable properties are not broken during the test.

ahrtr · 2024-07-08T15:06:19Z

A good news is that I kept the concurrent test case running (and randomly kill & restart the test) for 22 days and no any issue occurred.
Read #769 (comment)

tjungblu · 2024-07-08T15:11:56Z

Here's one obvious addition I also wrote down during #775.

ReadWriter

Read / Write / EstimatedWritePageSize implementations are already well decoupled through interface functions and thus a very low hanging fruit to refactor.

I propose to split this into another struct, allowing us to embed/compose it in the freelist itself.

Nice side-effect: Copyall can go in there as a private implementation function, removed entirely from the interface.
tx_checker.go requires only a list of pending and free pages, which can be handled with other functions + sorting.

Benefit: four exported functions from the freelist interface less. Test cases become very simple serialization/deserialization of integer slices. We can improve this later on more easily, e.g. with vint compression to reduce the serialized size. We should definitely add a checksum hash there too...

decouple it with the bbolt TXN workflow

Yep, I think this is the biggest one. The main responsibility we still need to rethink seems to track when a page was allocated and when it was freed again (when in terms of transaction ID) and having strong assertions around those. The semantic leaking of "pending" pages also irks me a bit, this should be an implementation detail.

The remaining parts that care about different allocation styles and span merging seem OK to me. Even though there seems to be a lot of double-accounting and ensuring all data structures stay consistent at all times.

tjungblu · 2024-07-08T15:25:35Z

As for the testing, we might also want to consider digging up the lazyfs work in #567. I didn't get very far with getting meaningful tests up last year, but torn writes scare me a lot more now :)

ahrtr · 2024-07-18T12:35:16Z

There is a flaw on the freelist management.

The allocs is a mapping of Pgid -> Txid, see below,

bbolt/internal/freelist/shared.go

Line 22 in efc3eb6

    
           allocs        map[common.Pgid]common.Txid // mapping of Txid that allocated a pgid.

It only records the page IDs which are allocated from the freelist. But if the freelist can't meet the allocation requirement, then bbolt will try to resize the db file and allocate more disk space. In such case, the new allocated page IDs aren't recorded in the allocs.

bbolt/db.go

Lines 1159 to 1165 in efc3eb6

    
           p.SetId(db.freelist.Allocate(txid, count)) 
        
           if p.Id() != 0 { 
        
           	return p, nil 
        
           } 
        
           // Resize mmap() if we're at the end. 
        
           p.SetId(db.rwtx.meta.Pgid())

Usually it isn't a big problem. Because in such case, bbolt needs to re-map the db file, so it needs to wait for all the existing readonly TXNs to finish. But there is still a very small windows, in which new readonly TXNs may be created after re-mapping but before the writing TXN finishes committing.

The new allocated pages in such case are actually invisible to such readonly TXNs. But since the Txid (allocating the page ID) isn't recorded in such case, when the pages are freed in following writing TXNs, they can't be released as long as the readonly TXNs created in the small windows are still running.

It isn't a big problem,

it's low possibility to run into such case as mentioned above.
even when it happens, the only side effect is that the freed pages can't be released in time, accordingly can't be reused in time.

ahrtr · 2024-07-18T15:16:10Z

I am planning to simplify & update the logic related to the pending released free pages, please take a look at doc below and feedback if anyone is interested,
https://docs.google.com/spreadsheets/d/1T7ORCbv3mguq1hQoZkUg0XEzuUXFrNFFj1sUfQysZ2k/edit?gid=0#gid=0

tjungblu · 2024-07-22T13:36:54Z

It isn't a big problem,

I think the more general problem here is that we have tracking of transactions and their pages, but the allocation is also somewhere else entirely. I don't want to open another can of worms, but maybe we can also find a good way to incoperate the transaction and page allocation management along with it.

I am planning to simplify & update the logic related to the pending released free pages

the doc looks good, I see the code has some short-cuts around the txids to avoid looping more than necessary, we should start to simplify first though.

ahrtr · 2024-10-28T16:30:39Z

@tjungblu @fuweid @ivanvc @Elbehery since the refactoring may have a potential big impact, I propose to do it in bbolt 2.0 instead of 1.4.0. Note that etcd 3.6 will depend on bbolt 1.4, so that the refactor will not impact etcd 3.6. I am thinking to stablize the current bbolt main branch, and cut release-1.4 in a proper timing. The near / next plan is to release/tag 1.4.0-alpha.1 or 1.4.0-beta.1 as mentioned in #553 (comment) cc @jmhbnz @serathius @spzala

Noter we also need more review on #789 (comment), Of course target for 2.0.

ahrtr mentioned this issue Jul 9, 2024

Panicking when a write txn tries to free a page which was allocated by itself #792

Merged

tjungblu mentioned this issue Jul 11, 2024

ensure hashmap init clears maps #794

Merged

ahrtr mentioned this issue Jul 16, 2024

introduce a freelist interface #775

Merged

ahrtr pinned this issue Jul 18, 2024

ahrtr added the area/freelist label Jul 22, 2024

tjungblu mentioned this issue Jul 25, 2024

Improve freelist serialization #799

Open

ahrtr mentioned this issue Aug 12, 2024

The freelist.allocs isn't rollbacked when a tx is rollbacked #797

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor freelist management & add more dedicated tests #789

Refactor freelist management & add more dedicated tests #789

ahrtr commented Jul 8, 2024 •

edited

Loading

ahrtr commented Jul 8, 2024

tjungblu commented Jul 8, 2024 •

edited

Loading

tjungblu commented Jul 8, 2024

ahrtr commented Jul 18, 2024

ahrtr commented Jul 18, 2024

tjungblu commented Jul 22, 2024 •

edited

Loading

ahrtr commented Oct 28, 2024

Refactor freelist management & add more dedicated tests #789

Refactor freelist management & add more dedicated tests #789

Comments

ahrtr commented Jul 8, 2024 • edited Loading

Backgroud

Refactor

Test

ahrtr commented Jul 8, 2024

tjungblu commented Jul 8, 2024 • edited Loading

ReadWriter

tjungblu commented Jul 8, 2024

ahrtr commented Jul 18, 2024

ahrtr commented Jul 18, 2024

tjungblu commented Jul 22, 2024 • edited Loading

ahrtr commented Oct 28, 2024

ahrtr commented Jul 8, 2024 •

edited

Loading

tjungblu commented Jul 8, 2024 •

edited

Loading

tjungblu commented Jul 22, 2024 •

edited

Loading