Support for inputting .sam files and .fasta files (with read offset support) #474
note: we need to increase the serialization version!
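For context, a minimal sketch of what such a bump could look like (the constant name and the check are hypothetical, not taken from this codebase):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical version constant; bump whenever the on-disk format changes,
// e.g. when new fields such as read offsets are serialized.
constexpr uint32_t SERIALIZATION_VERSION = 2;

// Reject previously serialized state whose version no longer matches.
inline void checkSerializationVersion(uint32_t stored_version) {
   if (stored_version != SERIALIZATION_VERSION) {
      throw std::runtime_error(
         "serialized data version mismatch: please re-run preprocessing");
   }
}
```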
This is a first round. I'll need to take another look tomorrow. The preprocessing is quite difficult to grasp.
Overall it looks good IMO. Some minor code style things should be fixed.
And as I mentioned, the preprocessing is really hard to understand. In the long term we should think about a concept for making the connections between all the DuckDB tables clear. But that is not an issue for this PR.
Could you please rebase (and maybe squash already)? This should then be good to merge.
LGTM
// optional header data
while ((data.empty() || data.at(0) == '@') && getline(in_file.getInputStream(), data)) {
    ;
}
Do we need this line? Or could we also simply delete it?
We definitely need it; some of the files have many header lines, which we need to skip.
I meant the `;` :)
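For illustration, the lone `;` could be replaced by an empty braced body so the intent is explicit (a sketch assuming `data` is a `std::string` and `in_file.getInputStream()` yields a `std::istream&`):

```cpp
// Skip the optional header: every header line starts with '@'.
// getline() already consumes each line, so the loop body stays empty.
while ((data.empty() || data.at(0) == '@') &&
       getline(in_file.getInputStream(), data)) {
   // intentionally empty
}
```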
After testing, this degrades preprocessing performance (with alpine base) for 2M GenBank sequences from: … For 200k sequences: …
Let's try it on a full (open) dataset? Those numbers look quite acceptable. It would be good to know how they scale on large datasets.
I don't think it makes a big difference between 2M and 8M sequences; the general trend should stay the same.
57 minutes for the open dataset, whereas the image from before the change needs 51 minutes.
resolves #416