Add checkpointing to metagraph build #197

danieldanciu · 2020-09-08T16:52:35Z

As discussed via chat, the build now writes a checkpoint after each major operation: generate kmers, generate reverse complements, split into chunks, generate dummy1, concatenate chunks, generate dummy2..k.

The client interface exposes only 2 phases:

phase 1 builds everything up to and including dummy k-mers
phase 2 finishes the build

I tested by inserting exit statements in the code after each phase and making sure the build is successful on resume.
I also added functional tests for building in phases.
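
A minimal sketch of how a checkpointed, phased build could be driven, assuming (hypothetically) that each major operation is a step in an ordered list and that the checkpoint is simply the number of completed steps. `Step`, `run_steps`, and the step indices are illustrative; they are not metagraph's actual identifiers or checkpoint numbering.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One build stage: a label plus the work to perform (hypothetical type).
struct Step {
    std::string name;
    std::function<void()> run;
};

// Runs steps[checkpoint .. stop_after) in order and advances `checkpoint`
// after each completed step, so a later call resumes where this one stopped.
// Returns the names of the steps executed by this call.
inline std::vector<std::string> run_steps(const std::vector<Step> &steps,
                                          std::size_t &checkpoint,
                                          std::size_t stop_after) {
    std::vector<std::string> executed;
    while (checkpoint < stop_after && checkpoint < steps.size()) {
        steps[checkpoint].run();
        executed.push_back(steps[checkpoint].name);
        ++checkpoint;  // a real build would persist this value to disk here
    }
    return executed;
}
```

With the six operations listed above followed by one final step, a call with `stop_after = 6` would play the role of phase 1 and a later call with `stop_after = 7` the role of phase 2. Calling again after everything is done executes nothing.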

karasikov · 2020-09-08T17:31:28Z

Could you add a test that starts building one graph and kills the process, so that the build does not run to the end?
Then it does the same for a different graph. Then it restarts building both graphs in parallel and checks that both graphs are generated correctly.

That way we check that there are no collisions and that multiple graphs can be built correctly in parallel on the same machine.

danieldanciu · 2020-09-08T17:37:40Z

I can add a test that builds 2 graphs in parallel (using phases), but I won't kill the process, as that would make the test non-deterministic.


danieldanciu · 2020-09-09T07:10:31Z

OK, I added a test that builds 2 graphs at the same time. First it runs phase 1 for both graphs, then it runs phase 2 for both and asserts that the graphs are built correctly.



metagraph/src/graph/representation/succinct/build_checkpoint.hpp

metagraph/src/graph/representation/succinct/build_checkpoint.cpp

karasikov · 2020-09-18T15:53:25Z

metagraph/src/graph/representation/succinct/build_checkpoint.cpp

+    if (std::filesystem::exists(checkpoint_file_)) {
+        std::ifstream f(checkpoint_file_);
+        f >> checkpoint_;
+        if (checkpoint_ > 0) {
+            f >> kmer_dir_;


Write the phase too. Otherwise, how do you know which phase this checkpoint corresponds to?

If you assume that different phases have different checkpoints, like 1: (1,2,3), 2: (4,5,6), then make the interface not allow setting checkpoint = 0 for the second phase.

danieldanciu · 2020-09-19

Like you said, the phases correspond to different checkpoints, so writing the phase is not useful: the phase only tells the program up to which checkpoint it should go before stopping.
So if the user enters phase=1, we might only make it up to checkpoint 2, then resume the operation from checkpoint 2 and finish the phase.

Setting checkpoint_=0 for phase_=2 is perfectly valid. One tells us how far we got with the computation, the other how far we wish to go. The other way around, setting checkpoint_=5 and phase_=1 could be invalid, but in our case phase_=1 corresponds to having all checkpoints done, so any combination of valid checkpoint (0..5) and phase values (1..2) is fine.
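
The distinction above (the checkpoint is persisted progress, the phase is a per-invocation target) can be illustrated with a small round-trip sketch. It mirrors the read order in the diff (checkpoint first, then the k-mer directory only when the checkpoint is positive), but `CheckpointState`, `store_checkpoint`, and `load_checkpoint` are hypothetical names and the file layout is an assumption, not the PR's actual format.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>

// Progress that survives a restart. The requested phase is deliberately not
// stored: it is supplied again on every invocation.
struct CheckpointState {
    std::uint32_t checkpoint = 0;      // how far the build got
    std::filesystem::path kmer_dir;    // only written when checkpoint > 0
};

inline void store_checkpoint(const std::filesystem::path &file,
                             const CheckpointState &state) {
    std::ofstream f(file, std::ios::trunc);
    f << state.checkpoint << '\n';
    if (state.checkpoint > 0)
        f << state.kmer_dir << '\n';   // path is written as a quoted string
}

inline CheckpointState load_checkpoint(const std::filesystem::path &file) {
    CheckpointState state;             // defaults to "nothing done yet"
    if (std::filesystem::exists(file)) {
        std::ifstream f(file);
        f >> state.checkpoint;
        if (state.checkpoint > 0)
            f >> state.kmer_dir;       // reads the quoted string back
    }
    return state;
}
```

Because the stream operators for `std::filesystem::path` quote the path, a directory name containing spaces survives the round trip. A missing file is treated as checkpoint 0.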

metagraph/src/graph/representation/succinct/boss_chunk_construct.cpp

karasikov · 2020-09-18T16:58:07Z

metagraph/src/graph/representation/succinct/boss_chunk_construct.cpp

@@ -703,7 +819,7 @@ void recover_dummy_nodes(const KmerCollector &kmer_collector,
                    return kmer::transform<KMER>(reinterpret_cast<const KMER_REAL &>(v), k + 1) + kmer_delta;
                }
            },
-            real_split_by_W, true
+            real_split_by_W, false /* remove sources */


looks like a variable name commented out

Suggested change

real_split_by_W, false /* remove sources */

real_split_by_W, false // remove sources

danieldanciu · 2020-09-19

That's the standard way of commenting parameters represented by constants. You can only use // here because the param happens to be at the end of the line.
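
To illustrate the point about comment placement, here is a small sketch with a hypothetical function (`filter_values` is not part of metagraph). The constant being annotated is not the last argument, so a block comment is the only form that fits inline.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: keeps values >= min_value and, if requested, skips the
// first element (standing in for a "source").
inline std::vector<int> filter_values(const std::vector<int> &values,
                                      bool remove_sources,
                                      int min_value) {
    std::vector<int> result;
    for (std::size_t i = remove_sources ? 1 : 0; i < values.size(); ++i) {
        if (values[i] >= min_value)
            result.push_back(values[i]);
    }
    return result;
}

inline std::vector<int> demo_call(const std::vector<int> &values) {
    // The block comment names the bare constant in the middle of the argument
    // list; a // comment here would comment out the rest of the line.
    return filter_values(values, /* remove_sources */ false, 2);
}
```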


karasikov · 2020-09-18T17:02:04Z

It's better to have non-deterministic tests than no tests.
The expected result is well defined: a second run of metagraph, started after a previous run was killed, must construct a valid graph.
So it's a perfectly well-defined test, isn't it?
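
A sketch of the property such a test would check, under simplifying assumptions: the build state lives in memory rather than on disk, and "killing" the process is simulated by throwing in the middle of a step. `BuildState`, `run_build`, and `kill_at` are hypothetical names, not metagraph code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// State that a real build would keep on disk; held in memory here.
struct BuildState {
    std::size_t checkpoint = 0;         // number of completed steps
    std::vector<std::string> output;    // results committed so far
};

// Runs the remaining steps. If `kill_at` equals the index of a step, the run
// aborts in the middle of that step, before its result is committed, which
// stands in for the process being killed.
inline void run_build(BuildState &state,
                      const std::vector<std::string> &steps,
                      std::size_t kill_at) {
    while (state.checkpoint < steps.size()) {
        std::string partial = steps[state.checkpoint];  // work in progress
        if (state.checkpoint == kill_at)
            throw std::runtime_error("killed during " + partial);
        state.output.push_back(partial);  // commit the result...
        ++state.checkpoint;               // ...then advance the checkpoint
    }
}
```

Wherever the interruption happens, resuming from the saved state should yield the same output as an uninterrupted run. The point at which the process dies may vary between test runs, but the final assertion does not.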


danieldanciu added 9 commits September 4, 2020 09:43

Intermediate

8567bff

First attempt

97294cd

Running'

1f731ff

Clear checkpoint

7e981ba

Small

b9fbc53

Add --phase

d83368d

Working checkpointing

0b27107

Added functional tests

7d53b52

Small changes self review

7476e7a

Add integration tests for parallel building

fa07b12

danieldanciu added 13 commits September 10, 2020 09:48

Remove forgotten optnone

8301c22

Minor rename

71ac989

Support filesystem

12e1ada

Mor elogging

da1aea6

small

bc63e50

small

0756a42

small

a2e6c0f

Don't clean up unmerged files

15b42da

Don't clean up unmerged files in SortedSetDisk, so that computation can be resumed

8a9121b

10 chunks

9fabdb5

Wait for merging before stopping

a4fb98c

Small fix

be4d767

Actually wait for merge to happen

89d1588

danieldanciu requested a review from hmusta September 17, 2020 19:55

Write checkpoint after phase1

4c73921

karasikov requested changes Sep 18, 2020


danieldanciu added 25 commits October 18, 2020 14:01

merged with dev

007845f

Remove double declaration

866e786

Merged with dev

d52ca9e

Default to phase 3

5a3a8a3

Skip phase 2 if no rc's are generated

ad6d965

Flush sorted set at end of phase

6493adb

Better temp dir

57c111c

Don't push kmers into queue if phase is < 3

e6c0d95

Merged with dev

7cd5b63

Set checkpoint to 2 when RC's are not being generated

80f1e48

Clean up temp files in SSD

de1173c

Merge remote-tracking branch 'origin/dev' into phase

fb0ad63

Remove trace logs

5e42d10

s/remove/remove_all

d728edd

Merged with dev

79d97e4

Simplify test_build_phase

03d5b2d

Small fix in checkpoint continuation

e4aed78

Verbose mode for test_build_phase

82a9e8e

A bit more debugging

76c0d1b

All trace logs

2a5b52f

Reset kmers when continuing from cp 1

dc278db

Skip phase 2

98ad5da

Copy file names

8c953a6

Acquire lock when flushing

c238b20

Count orig/rc

0d399b7

karasikov force-pushed the dev branch 2 times, most recently from 58db659 to 5d81032 Compare January 15, 2021 23:52

karasikov force-pushed the dev branch from 56b0b10 to 70adfd6 Compare February 12, 2021 18:11

karasikov changed the base branch from dev to master June 10, 2021 09:11

hmusta marked this pull request as draft October 25, 2023 11:24
