Refactor data structure representing account candidates for cleaning #2296
Conversation
Force-pushed from 9f4a90e to 7c5cd45
Force-pushed from 8a14193 to 77b9935
accounts-db/src/accounts_db.rs
Outdated
```rust
store.accounts.scan_pubkeys(|key| {
    let index = self.accounts_index.bin_calculator.bin_from_pubkey(key);
    let mut pubkeys_bin = pubkeys[index].write().unwrap();
    pubkeys_bin.insert(*key, CleaningInfo::default());
```
We could also create per-thread binned pubkeys and then reduce that at the end. It may reduce lock contention.
I think what we have here is likely good as a v0. The map-reduce suggestion could be done in a subsequent PR.
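For illustration, a minimal sketch of the map-reduce idea under simplified assumptions (u64 stands in for Pubkey, a fixed `BINS` count, and hypothetical names throughout — this is not the PR's code): each worker bins its chunk locally with no locks, and the per-thread bins are merged at the end.

```rust
use std::collections::HashMap;
use std::thread;

const BINS: usize = 4; // assumption; the real bin count comes from the index configuration

fn collect_binned(keys: &[u64]) -> Vec<HashMap<u64, ()>> {
    let chunks: Vec<Vec<u64>> = keys.chunks(2).map(|c| c.to_vec()).collect();
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                // map step: purely thread-local bins, so no lock contention
                let mut local = vec![HashMap::new(); BINS];
                for k in chunk {
                    local[k as usize % BINS].insert(k, ());
                }
                local
            })
        })
        .collect();
    // reduce step: merge the per-thread bins into one binned structure
    let mut merged = vec![HashMap::new(); BINS];
    for handle in handles {
        for (i, bin) in handle.join().unwrap().into_iter().enumerate() {
            merged[i].extend(bin);
        }
    }
    merged
}
```

The trade-off is the extra merge pass and the memory for per-thread bins, in exchange for lock-free scanning.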
accounts-db/src/accounts_db.rs
Outdated
```rust
let index = self.accounts_index.bin_calculator.bin_from_pubkey(key);
let mut pubkeys_bin = pubkeys[index].write().unwrap();
pubkeys_bin.insert(*key, CleaningInfo::default());
```
And the same idea here too.
accounts-db/src/accounts_db.rs
Outdated
```diff
@@ -2742,86 +2755,95 @@ impl AccountsDb {
     /// 1. one of the pubkeys in the store has account info to a store whose store count is not going to zero
     /// 2. a pubkey we were planning to remove is not removing all stores that contain the account
     fn calc_delete_dependencies(
-        purges: &HashMap<Pubkey, (SlotList<AccountInfo>, RefCount)>,
+        purges: &Vec<RwLock<HashMap<Pubkey, CleaningInfo>>>,
```
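A minimal sketch of the shape this new signature takes, with simplified stand-ins (the real `CleaningInfo` carries a `SlotList<AccountInfo>` and a `RefCount`; the `Pubkey` alias and field types here are assumptions for illustration):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type Pubkey = [u8; 32]; // stand-in for the real 32-byte pubkey type

#[derive(Default)]
struct CleaningInfo {
    slot_list: Vec<u64>, // simplified; the real list pairs slots with AccountInfo
    ref_count: u64,
}

// Candidates binned the same way as the accounts index, so each bin can be
// locked and updated independently of the others.
type CleaningCandidates = Vec<RwLock<HashMap<Pubkey, CleaningInfo>>>;

fn count_pubkeys(candidates: &CleaningCandidates) -> usize {
    candidates.iter().map(|bin| bin.read().unwrap().len()).sum()
}
```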
We should drop the RwLock at some point when it is no longer necessary, probably:

```rust
let purges = purges.into_iter().map(|x| x.into_inner().unwrap()).collect::<Vec<_>>();
```

This simplifies fns like this one.
We could also look for places where we are not executing in parallel but perhaps should be. Maybe the RwLocks help us there, but probably not.
Maybe if we separate the keys and values into separate vectors, as Haoran suggested, we don't need the RwLock at all.
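A hedged sketch of that keys/values idea (hypothetical names, not the PR's code): the keys live in a read-only slice, and each scoped worker owns a disjoint portion of the output values, so no lock is needed for the writes.

```rust
use std::thread;

// The key slice is only read; each worker produces its own value chunk,
// which is concatenated afterward — no RwLock involved.
fn compute_values(keys: &[u64]) -> Vec<u64> {
    let mid = keys.len() / 2;
    let (left, right) = keys.split_at(mid);
    thread::scope(|s| {
        let l = s.spawn(|| left.iter().map(|k| k * 2).collect::<Vec<_>>());
        let r = s.spawn(|| right.iter().map(|k| k * 2).collect::<Vec<_>>());
        let mut out = l.join().unwrap();
        out.extend(r.join().unwrap());
        out
    })
}
```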
> We should drop the RwLock at some point when it is no longer necessary, probably: `let purges = purges.into_iter().map(|x| x.into_inner().unwrap()).collect::<Vec<_>>()` This simplifies fns like this one.

I think it only works like this:

```rust
purges.into_iter().map(|x| {
    let y = x.read().unwrap();
    y.clone()
}).collect::<Vec<_>>()
```

because x is a shared reference and RwLock::into_inner cannot move out of a shared reference.
This works and is reasonable (doesn't copy the hashmaps):

```rust
let purges = candidates.into_iter().map(|x| std::mem::take(&mut *x.write().unwrap())).collect::<Vec<_>>();
```

I tried Box::into_inner(candidates), but it complained about not knowing the size at compile time. I don't have a lot of experience with Box. I'm only saying that at some point we can get rid of the RwLock in the candidates. They were convenient early, but no downstream code uses them at the moment. They only add complexity and the possibility of deadlocks and poisoned RwLocks...
I don't think this is necessary for this PR. And the RwLocks could enable better threading/higher performance in the future.
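The two options discussed in this thread can be contrasted in a small sketch (u64 keys stand in for pubkeys; names are hypothetical): `RwLock::into_inner` needs ownership of the lock, while `std::mem::take` works through a shared reference without cloning.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Owning the Vec lets into_inner move each map out with no copy.
fn drain_owned(bins: Vec<RwLock<HashMap<u64, u64>>>) -> Vec<HashMap<u64, u64>> {
    bins.into_iter().map(|x| x.into_inner().unwrap()).collect()
}

// Through a shared reference into_inner is unavailable, but mem::take swaps
// each map with an empty one instead of cloning it.
fn drain_shared(bins: &[RwLock<HashMap<u64, u64>>]) -> Vec<HashMap<u64, u64>> {
    bins.iter()
        .map(|x| std::mem::take(&mut *x.write().unwrap()))
        .collect()
}
```

Note that `drain_shared` leaves empty maps behind in the original vector, which is why it avoids the clone.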
FWIW, separating the keys/values is an interesting idea. I'd have to think about it more. I think what we have is fine for now.
Looks like the right direction to me! I see my comments overlapped Brooks's. I knew he would have a lot to say.
I really appreciate the feedback, and will make the necessary changes. Thank you!
Force-pushed from bf23120 to cb51c76
FYI, I pulled this into the editor and worked through it. There was one more place we needed to filter. There were also several places with code duplication, and several places where the formatting apparently did not match cargo fmt. I made all these changes in individual commits and pushed them.
What should I do with this PR? Is more work required within this PR, or can the other issues be addressed in follow-up work?
```rust
if slot_list.is_empty() {
    continue; // seems simpler than filtering; `candidates` contains all the pubkeys we originally started with
}
```
Is this a correctness or a performance change?
Correctness. Previously we would not have entered this loop for any pubkey that was not a zero-lamport account. I don't know what happens if we have no slot list here, and I don't want to find out.
but please verify my assertion!
Yeah. Nice catch!
Once we test it once more, I think this PR is good, pending review by Brooks/Haoran (and your looking at it again). Bugs here can be very tricky, so this is a great exercise in really digging into this code.
I'm concerned we're changing behavior in addition to refactoring/reshaping the data types. If possible, I'd find it preferable to have separate PRs for behavior changes. Maybe these changes are due to the refactoring/reshaping. I'll do another, more substantive review.
I went over the PR again. It looks good to me.
Just one nit for clean up, which can be done either in this PR or in a follow-up PR.
```diff
@@ -3130,10 +3129,6 @@ impl AccountsDb {
     timings.delta_key_count = Self::count_pubkeys(&candidates);

-    let mut hashset_to_vec = Measure::start("flat_map");
-    hashset_to_vec.stop();
-    timings.hashset_to_vec_us += hashset_to_vec.as_us();
```
Ha, it was never dumped out anyway.
This morning, I started 2 mnb and 1 testnet validator with this change rebased onto master.
I tried to address all recent comments.
Thank you for answering all my questions and addressing all my comments!
LGTM. Thanks for working through all the issues. We want future debugging to be easier, not harder. I think we've all learned new things about clean while also making it use less memory and run faster.
I pulled down the full change again and verified every line.
I have 2 mnb and 1 testnet validator running on this. All are running fine. We verified this on a ledger that was OOMing at startup; it is no longer OOMing.
Problem

AccountsDB::clean_accounts makes unnecessary copies of a large number of pubkeys and accompanying information to find and operate on the accounts that can be deleted from the accounts index.

Summary of Changes

With this change, the candidates for deletion are organized in a single data structure, with the necessary information updated in place, thus reducing the memory requirements of the cleaning procedure.
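As a rough illustration of the in-place update (the entry API and names here are assumptions for the sketch, not the PR's exact code): a candidate's entry is created once and then mutated, instead of copying pubkeys and their info into fresh collections on every pass.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct CleaningInfo {
    ref_count: u64, // simplified; the real struct also tracks a slot list
}

// Create the entry on first sight, then mutate it in place on later visits.
fn record_reference(bin: &mut HashMap<u64, CleaningInfo>, key: u64) {
    bin.entry(key).or_default().ref_count += 1;
}
```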
Fixes #