Feature request: option for no gap penalty in target start and end of both target/query #69
Comments
Does local alignment not produce the intended result, e.g., |
Not quite, a local align doesn't penalize the beginning of the query if there is a mismatch. |
Prior to version 2.4, there was only a single implementation of semi-global alignment that would not penalize beginning or end gaps from both the query and database sequence. That implementation still exists -- could you try it? Perhaps that is what you are requesting?
Do note that it can lead to some ambiguity when trying to determine the best alignment if the same score is possible using the entire query or only a subsequence of the query. |
I apologize if we're not communicating clearly. From the table below, do any of the semi-global modes match what you're requesting? If not, could you perhaps follow the example of the table and insert the kind of semi-global mode you are requesting?
|
Sorry, I'll try to explain more clearly. In the nomenclature in the table, what I'm interested in is
Here's a reworked example -- same as the original, but with a mismatch at the beginning of the gene query.
Output using
Output using
Output using
Desired output:
|
Would you be comfortable building a feature branch? I have made a first attempt at implementing what you have requested.
git clone -b feature/more_sg https://github.com/jeffdaily/parasail.git
cd parasail
autoremake -fi
mkdir build
cd build
../configure
make -j $(nproc)
make -j $(nproc) check
./tests/test_69 # new test for this issue number
You can update the tests/test_69.c code as needed. Using your code example from this issue, I'm still getting the wrong score for the new function. Now to debug. |
I'm having trouble installing that branch. I'm not too familiar with build tools. It says "autoremake not found". Am I supposed to replace $(nproc) with a number? |
Thanks, I got it to work. The output seems to always be identical to "sg_qe_db". |
I'm glad you got the build to work. This will help since you can evaluate my changes. I hope you have patience; this isn't going to be a quick/easy change. |
I think I figured out how to produce what you want, but it does change the way semi-global is implemented internally. The original semi-global mode ("sg") would not penalize either end of either sequence. This led to ambiguous cases that I resolved somewhat arbitrarily -- selecting the result that gave the shortest alignment. When I added all the new semi-global routines for the 2.4 release, they were naturally deterministic since each leaves at most one of the two sequence beginnings, and at most one of the two sequence ends, unpenalized. What you are requesting is a local alignment interpretation when either both sequence beginnings or both sequence ends are not penalized. I understand that now. And from your "desired output" in an earlier comment, I was able to trace some of my debugging output to see how to make that happen. But I did warn you that this is likely not a quick/easy change, so please be patient. Thank you. |
@traversc I need your help with a new idea. I'm trying to preserve old behavior for maximum backwards compatibility, but I also see the value in your request. I'm thinking of adding even more new semi-global routines. Basically, instead of trying to disambiguate when we are not penalizing both beginnings or both ends, we take the "local" ("l", lowercase "L") approach. Here's a new table, and note the new column, new routines, and new names:
|
That looks quite comprehensive, but I don't fully understand the meaning of "stop local". E.g., how would "sgl" differ from Smith-Waterman (sw)? I think it would be easiest to grok with examples contrasting the different approaches. |
I'm having trouble understanding all the different approaches, as well. I don't have the cycles to spend a ton of time on this. Perhaps the name "stop local" is just poorly chosen. Except for the original semi-global implementation "sg", the first implementation(s) of semi-global were straightforward. Understanding that sequence alignment is a dynamic programming table:
The hard part now is we are introducing ambiguity.
As far as rendering the traceback goes, it was also straightforward when there was no ambiguity. There would always be at least the query or target without a beginning and/or end gap penalty, so it made sense to me to render the entire alignment like we do with global alignment tracebacks. But with these ambiguous cases we are introducing, as pointed out by your original request, it was printing a series of insertions followed by a series of deletions or vice versa -- not very useful. The "stop local" caused my implementation to select the highest score in the table as the answer when not penalizing end gaps from both sequences. In the cases where we weren't penalizing both beginning gaps, it was intended to signal to my traceback function that it should clip the wasteful all-indels output from the beginning of the traceback. When I was playing around with the implementation of "sgl" versus plain local alignment, the only difference I could tell was that "sgl" allowed the scoring table to become negative instead of clamping the score to >= 0. In the resulting SW traceback, starting from the highest score in the table, the traceback stops when a 0 is found. I wasn't sure what the "sgl" traceback should look like. Otherwise, yes, it was exactly the same as local alignment besides the possible negative score during the calculation. |
I don't see how I found an excel solver, which I could understand and modify to what I think should be "sg(l?)_qb_dx": https://i.imgur.com/xwb0Cf0.png This gives the expected alignment based on the traceback:
I'm not really sure if that helps, but for the question: "Not penalizing query and target end gaps? Should we select the highest score from the table like we do with local alignment?" I think the answer is "yes", but I also think that automatically follows from the two previous statements. "Not penalizing query end gaps? Select highest score from last column." + = Similarly, if we took the reverse complement of the example and did "sg_qx_de" you'd get: https://i.imgur.com/KZEiAZm.png Which corresponds to the alignment:
Hope I'm making sense :o |
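The selection rules quoted above can be sketched in a few lines of Python. This is only an illustration, not parasail's code: the function name `select_score`, its flags, and the small table `H` are made up for the example, and it assumes the query indexes the rows of the table and the target indexes the columns, so that "last column" means the whole target has been consumed.

```python
def select_score(H, query_end_free, target_end_free):
    """Pick the final score from a filled DP table H (hypothetical helper).

    Assumes rows are indexed by query position, columns by target position.
    Returns (score, (row, col)) for the chosen cell.
    """
    last_row, last_col = len(H) - 1, len(H[0]) - 1
    if query_end_free and target_end_free:
        # Both end gaps free: any cell may end the alignment, as in local alignment.
        cells = [(i, j) for i in range(len(H)) for j in range(len(H[0]))]
    elif query_end_free:
        # Query end gaps free: best cell in the last column.
        cells = [(i, last_col) for i in range(len(H))]
    elif target_end_free:
        # Target end gaps free: best cell in the last row.
        cells = [(last_row, j) for j in range(len(H[0]))]
    else:
        # Global: only the bottom-right corner counts.
        cells = [(last_row, last_col)]
    best = max(cells, key=lambda c: H[c[0]][c[1]])  # first maximum wins on ties
    return H[best[0]][best[1]], best


# Made-up table, just to show which cell each mode reads.
H = [[0, -2, -4],
     [-2, 5, 3],
     [-4, 3, 1]]
print(select_score(H, False, False))  # (1, (2, 2))
print(select_score(H, True, False))   # (3, (1, 2))
print(select_score(H, False, True))   # (3, (2, 1))
print(select_score(H, True, True))    # (5, (1, 1))
```

With both flags set, the search covers every cell, which is why that case behaves like the local-alignment rule of taking the highest score in the table.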
What I meant about sgl producing negative numbers is that in the middle of the alignment algorithm the scores are allowed to be negative. Unlike with SW where scores are never allowed to become less than 0 even in the middle of the alignment. Thank you for the pictures of the score tables. |
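To make the difference concrete, here is a minimal Python sketch that fills the same table twice, once allowing negative intermediate scores and once clamping at 0 as Smith-Waterman does. It is a simplification under stated assumptions, not parasail's implementation: it uses a single linear gap penalty instead of affine gaps, leaves both beginnings unpenalized (first row and column are 0), reads the answer as the highest score anywhere in the table, and the names `fill_table` and `best_score` as well as the sequences and scores are invented for the example.

```python
def fill_table(query, target, clamp, match=2, mismatch=-1, gap=-2):
    """Fill a DP table with free begin gaps on both sequences.

    clamp=True  -> scores are floored at 0 (Smith-Waterman style).
    clamp=False -> intermediate scores may go negative.
    Linear gap penalty is an assumption made to keep the sketch short.
    """
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == target[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + s,  # align query[i-1] with target[j-1]
                       H[i - 1][j] + gap,    # gap in target
                       H[i][j - 1] + gap)    # gap in query
            H[i][j] = max(best, 0) if clamp else best
    return H


def best_score(H):
    """Highest score anywhere in the table (both end gaps unpenalized)."""
    return max(max(row) for row in H)


unclamped = fill_table("CCAT", "GGAT", clamp=False)
clamped = fill_table("CCAT", "GGAT", clamp=True)
print(best_score(unclamped), min(min(r) for r in unclamped))  # 2 -2
print(best_score(clamped), min(min(r) for r in clamped))      # 4 0
```

In this toy case the unclamped table still has to pay for the two leading mismatches before reaching the matching "AT", while the clamped table can restart from 0 in the interior, so in this simplified model the two variants can report different scores as well as different intermediate values.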
- additional semi-global routines in sg_helper.h
- update CMakeLists.txt with new semi-global dispatchers
- update meson.build with new semi-global dispatchers
- add desired output to test case 69
- update test case for #69
Here's a quick example of what I think would be useful:
This is almost what I'd want, but the desired output would be a score of 60.
The biological use case would be if I was trying to identify a recombinant gene or a chromosome translocation, where I expect the end of the alignment to not match up (or vice versa, the beginning).
Would it be possible to implement something like this?
Thanks for the great alignment package.