Very long gap in an alignment, due to extreme mononucleotide repeat in profile #201
Comments
I've seen this sort of thing when an insert-insert transition parameter is 0, which can happen after rounding off to a scaled integer. There's code to prevent this: scaled int insert parameters are bounded at -1, so there's always a penalty for an insertion. Maybe nhmmer is somehow overriding that?
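To illustrate the rounding problem described above, here is a minimal sketch, not HMMER's actual code. The scale factor and both function names are assumptions made up for illustration; only the idea (round a log-odds score to a scaled integer, then cap it at -1) comes from the comment.

```python
import math

# Hypothetical scale: integer units per bit. HMMER's real scale factors differ
# between filters; this value is chosen only to make the rounding visible.
SCALE = 3.0


def to_scaled_int(p, scale=SCALE):
    """Round a transition probability's log2 score to a scaled integer."""
    return round(scale * math.log2(p))


def bounded_insert_score(p, scale=SCALE):
    """Same as to_scaled_int, but capped at -1 so an insertion is never free."""
    return min(to_scaled_int(p, scale), -1)


if __name__ == "__main__":
    for p in (0.5, 0.95):
        print(p, to_scaled_int(p), bounded_insert_score(p))
```

With a probability close to 1, the unbounded score rounds to 0 and an insertion of any length would cost nothing; the cap keeps a penalty of at least one unit per inserted residue.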
Poking around a bit - this is not specific to nhmmer. I get the same crazy deletion running hmmsearch:
That's >9000 model states aligned to 75 target sequence characters, which means a whole lot of delete states are getting used. In the HMM, all the delete transitions are the same, roughly 33% and 67%:
What I see: Then in At this point, I've run out of juice, and won't have the free time to hunt it any further for a while. Since this is generic to |
I've looked into this (at last). This is an unusual new case where H3's domain definition heuristics are failing. I won't be able to fix this in HMMER3, but I believe it has been solved by new algorithms that will be in HMMER4. I'm going to leave the issue open, tag it for H4, and we'll revisit it in the future.

The problem arises from the fact that the profile here consists of ~30 tandem repeats of ~380nt (plus a few kb of polyG; the polyG mononuc repeats are a red herring, they are not the problem). The target consists of a piece of one repeat. HMMER isn't designed for this; it's designed for the reverse, for your profile to be one "domain" or repeat, and for the target to consist of one or more domains/repeats. It would usually still work anyway (some Pfam models consist of two tandem domains, to increase signal/noise on some short tandem domain families), but this is a major factor in the problem. So one workaround is to define your profile to be a single ~380nt repeat.

In more detail: the target aligns to this profile in ~30 different places. The marginalized posterior probability matrix, projected onto the target, says that the target definitely has a homologous region (an "envelope"). But the model is uncertain where the alignment is on the profile; with the posterior probability spread over ~30 possible alignments, all places have low posterior probability. You can see this uncertainty in the per-residue posterior probabilities, which are very low (2's; ~20% probable). A Viterbi optimal alignment would pick the highest-scoring possibility. But we do "optimal accuracy" alignment, not Viterbi, by default; OA looks for an alignment that has the greatest sum of posterior decoding probabilities.
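A toy sketch of that objective may help. It is not HMMER's implementation: it ignores insert states, transition scores, and the envelope heuristics, and treats deletions as free. The posterior matrix, the function names, and the tiny sizes (3 repeats of 4 positions instead of ~30 repeats of ~380nt) are all made up for illustration. The first half of the target is slightly more probable in the first repeat and the second half slightly more probable in the last repeat.

```python
def oa_path(post):
    """Return (path, total): strictly increasing model positions, one per
    target residue, that maximise the sum of post[i][k]."""
    n_res, n_pos = len(post), len(post[0])
    neg = float("-inf")
    best = [[neg] * n_pos for _ in range(n_res)]
    back = [[-1] * n_pos for _ in range(n_res)]
    best[0] = list(post[0])
    for i in range(1, n_res):
        run_best, run_arg = neg, -1  # best cell in row i-1 left of column k
        for k in range(n_pos):
            if k > 0 and best[i - 1][k - 1] > run_best:
                run_best, run_arg = best[i - 1][k - 1], k - 1
            if run_arg >= 0:
                best[i][k] = run_best + post[i][k]
                back[i][k] = run_arg
    k = max(range(n_pos), key=lambda c: best[n_res - 1][c])
    total = best[n_res - 1][k]
    path = [k]
    for i in range(n_res - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    return path[::-1], total


def best_contiguous(post):
    """Best gap-free placement: (offset, summed posterior)."""
    n_res, n_pos = len(post), len(post[0])
    sums = [(sum(post[i][o + i] for i in range(n_res)), o)
            for o in range(n_pos - n_res + 1)]
    total, offset = max(sums, key=lambda t: t[0])
    return offset, total


# Made-up posteriors: 4 target residues, 3 repeats of 4 model positions.
WEIGHTS = [(0.4, 0.3, 0.3), (0.4, 0.3, 0.3), (0.3, 0.3, 0.4), (0.3, 0.3, 0.4)]
POST = [[0.0] * 12 for _ in range(4)]
for i, row in enumerate(WEIGHTS):
    for r, w in enumerate(row):
        POST[i][4 * r + i] = w

if __name__ == "__main__":
    print(oa_path(POST))
    print(best_contiguous(POST))
```

In this toy case the sum-of-posteriors path takes the first two residues from the first repeat and the last two from the third, skipping eight model positions, and it scores higher than any single contiguous placement.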
Because there are ~30 different places where the target envelope can align more or less as well to the profile, it just happens that in this particular example it can cobble together an OA alignment that picks up the first part of a repeat at one place in the profile and the rest of the repeat at a distant place in the profile. The result is a stupid alignment, but it is a "correct" consequence of the H3 domain definition heuristics interacting with OA alignment, as I defined the procedure - i.e. it's not a "bug" per se, I can't do anything about it without changing the algorithm. And that's what H4 will do. One of the main goals of H4 has been to get rid of the domain definition heuristics, which I've seen fail on some other rare cases too.

Biologically, your profile is interesting and unusual. Although you have it annotated as a "putative TE family", this is not likely to be a transposon; it's an array of at least 30 tRNA genes, consisting of 4+ repeats of an Arg/iMet{x5-6} unit of an Arg tRNA followed by 5-6 iMet tRNAs. tRNAscan-SE output on the 18309nt consensus of your profile:
The metadata on the profile say there were 164 sequences in the alignment, all from one genome assembly of the redlip mullet Planiliza haematocheilus (a teleost fish; GCA_005024645.1). Are there really 164 copies of this ~18kb unit in that genome? That'd be ~4800 tRNAs; nobody needs that many iMet tRNAs, and (assuming the assembly is correct) there must be an interesting reason they're there.
Thanks - the role of OA in these shenanigans makes sense. Your question about the tRNAs is a really good one. I'll pass it on to the developers of the model.
The attached hmm and target sequence (both with .txt extension to force github to allow them to be uploaded) pair to yield a >9Kb long gap in an alignment.
target.fa.txt
DR0224948.hmm.txt
% nhmmer DR0224948.hmm target.fa > test.out
produces a single alignment with these summary stats:
In other words: >9Kb of model aligns to 76 target nucleotides - the alignment is basically just one big gap.
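A rough way to spot hits like this in bulk is to compare the model span with the target span for each hit. The sketch below assumes the tabular output written by `nhmmer --tblout`, with whitespace-separated columns in the order target name, accession, query name, accession, hmmfrom, hmm to, alifrom, ali to, and so on. The function name, the ratio threshold, and the example line (including its coordinates) are made up for illustration and are not taken from the attached files.

```python
def span_check(tblout_line, max_ratio=5.0):
    """Parse one nhmmer --tblout hit line and compare model and target spans.

    max_ratio is an arbitrary, hypothetical threshold.
    """
    fields = tblout_line.split()
    hmm_from, hmm_to = int(fields[4]), int(fields[5])
    ali_from, ali_to = int(fields[6]), int(fields[7])
    model_span = abs(hmm_to - hmm_from) + 1
    # alifrom > ali to on the reverse strand, hence abs()
    target_span = abs(ali_to - ali_from) + 1
    return {
        "target": fields[0],
        "query": fields[2],
        "model_span": model_span,
        "target_span": target_span,
        "suspicious": model_span > max_ratio * target_span,
    }


# Synthetic hit line with invented coordinates, for illustration only.
EXAMPLE = "target1 - DR0224948 - 100 9400 1 76 1 76 76 + 1e-10 50.0 1.0 -"

if __name__ == "__main__":
    print(span_check(EXAMPLE))
```

Lines starting with `#` in a real table are comments and would need to be skipped before calling the function.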
The query hmm contains several very long runs of G, including one that's almost 4KB long. That's a strange model, but still shouldn't produce this kind of alignment. I haven't identified a fix yet, but want to be sure this is documented, and am planning to investigate.