List of TODOs and other quick remarks.
By definition, now that we are more conservative with checking neighbours before pruning.
Is this …? Same for `Heuristic`. Compilation is very slow after enumerating over all possible implementations in `algorithms.rs`.
WIP, but not so efficient yet.
- Cache locality
- data can be a slice of a larger vector.
Not much speedup, but fixes a potential bug, because checking `f_queue < f` isn't always accurate in the context of pruning.
Double-expands slightly more now, but retries much less, because the check for `g_queue == g` (which simply ignores the element when false) allows skipping some retries.
Currently this can only work if the pruned match is preceded by another exact match, since expanded states just above/left of the pruned position will be larger than the pruned position in the transformed domain.
For large n and e=0.01 or e=0.05, this reduces the number of retries by 10x to 100x.
This only works when we only push values equal to the minimum f or 1 larger (so that a single swap is sufficient).
Allocating all the vectors is slow. Also reserve capacity for the hashmap.
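As a hedged illustration (the capacity below is made up, not measured), reserving hashmap capacity up front avoids rehashing mid-search:

```rust
use std::collections::HashMap;

fn main() {
    // Illustrative capacity: size the map for the expected number of
    // expanded states so it never rehashes during the search.
    let mut expanded: HashMap<(u32, u32), u32> = HashMap::with_capacity(1 << 20);
    expanded.insert((0, 0), 0);
    assert_eq!(expanded.get(&(0, 0)), Some(&0));
}
```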
This can simplify the `Contours` data structures.
Instead of storing a vec per contour, we can take adjacent slices of one larger vector. When all contours contain only one point, this is much more compact.
These can be very small, so they fit in L1 cache and can quickly discard elements not in the hashmap.
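A minimal sketch of that flattened layout (names are illustrative, not the actual `Contours` API): one shared points vector plus an offsets vector, so each contour is a contiguous slice.

```rust
// Sketch only: all contours stored back-to-back in one vector,
// with contour i occupying points[offsets[i]..offsets[i + 1]].
struct Pos(u32, u32); // hypothetical point type

struct FlatContours {
    points: Vec<Pos>,    // all points, grouped by contour
    offsets: Vec<usize>, // offsets.len() == number of contours + 1
}

impl FlatContours {
    fn contour(&self, i: usize) -> &[Pos] {
        &self.points[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

When every contour holds a single point, this costs one `Pos` plus one `usize` per contour, with no per-contour allocation.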
Given a seed, find the best match in b. Then find a lower bound on the cost of aligning all other matches of the seed. For something like k=20, e=0.1, we may have an on-diagonal match of cost 2, and find that all other matches have cost at least in the range 5-10. This allows much more aggressive pruning.
Similar, but uses all kmers instead of disjoint kmers.
- Maximize h(0,0) or r/k
- Minimize number of extra seeds.
- A: Each seed does not match, and covers exactly max_dist+1 mutations.
- This way, no pruning is needed because there are no matches on the diagonal, and h(0,0) exactly equals the actual distance, so that only a very narrow region is expanded.
- B: Maximize the number of seeds that match exactly (at most 10 times).
- Experiment: make one mutation every k positions, and make seeds of length k (see the sketch after this list).
- Could be done by keeping a dynamic trie, only inserting positions in b once they fall within the cone, and removing them as soon as they leave the cone again.
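A hedged sketch of that experiment (alphabet and mutated positions are assumptions): substitute one character in each disjoint window of length k, so that every length-k seed contains exactly one mutation.

```rust
// Sketch only: plant exactly one substitution in each disjoint k-length
// window, so no seed of length k matches exactly.
fn mutate_every_k(a: &mut [u8], k: usize) {
    let sigma = [b'A', b'C', b'G', b'T'];
    let mut i = k / 2; // mutate the middle of each window
    while i < a.len() {
        let old = a[i];
        // Pick any different character (deterministic here for simplicity).
        a[i] = *sigma.iter().find(|&&c| c != old).unwrap();
        i += k;
    }
}
```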
- HashMap -> FxHashMap: a faster hash function for ints
- HashMap -> DiagonalMap: for expanded/explored states, since these are dense on the diagonal.
- BinaryHeap -> BucketHeap: much, much faster; turns the O(log n) pop into O(1) push and pop (see the sketch below).
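A rough sketch of such a bucket queue (assumed semantics, not the actual `BucketHeap`): it relies on keys being small non-negative `f` values that never drop below the current minimum, as in A* with a consistent heuristic.

```rust
// Sketch only: buckets[f] holds every element with key f; `min` is a
// lower bound on the smallest non-empty bucket.
struct BucketHeap<T> {
    buckets: Vec<Vec<T>>,
    min: usize,
}

impl<T> BucketHeap<T> {
    fn new() -> Self {
        BucketHeap { buckets: Vec::new(), min: 0 }
    }

    fn push(&mut self, f: usize, t: T) {
        if f >= self.buckets.len() {
            self.buckets.resize_with(f + 1, Vec::new);
        }
        self.buckets[f].push(t);
        self.min = self.min.min(f);
    }

    // Amortized O(1): `min` only scans forward past emptied buckets.
    fn pop(&mut self) -> Option<(usize, T)> {
        while self.min < self.buckets.len() {
            if let Some(t) = self.buckets[self.min].pop() {
                return Some((self.min, t));
            }
            self.min += 1;
        }
        None
    }
}
```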
- $k \geq \log_\Sigma(n)$
- $k \ll q/e$, but by how much? $k \leq \frac{3}{4} \cdot \frac{1}{e}$ seems good (e.g. $e = 0.05$ gives $k \leq 15$, consistent with the tables below)? -> next theoretical paper.
- Expanded states plots
- Memory usage plots
Has to do with h(0,0) being smaller.
Inexact matches that cannot occur as a result of greedy matching can be disregarded.
When pruning is slow, we can batch multiple prunes and wait until the band becomes too large.
What if in the D-T method we do not allow leaving the path of a greedy match?
For CSH, we first put seeds in a map and then only store seeds matching a key. For SH, we currently make a map of all kmers of B, which is inefficient.
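A hedged sketch of the cheaper direction for SH (identifiers are made up): index only the seed kmers of A, then stream over the kmers of B and look them up, instead of building a map over all of B.

```rust
use std::collections::HashMap;

// Sketch only: `seed_kmers` holds (kmer, seed-start-in-A) pairs; `b_kmers`
// streams (kmer, position-in-B) pairs. Only A's few seed kmers are indexed.
fn find_matches(
    seed_kmers: &[(u64, u32)],
    b_kmers: impl Iterator<Item = (u64, u32)>,
) -> Vec<(u32, u32)> {
    let mut map: HashMap<u64, Vec<u32>> = HashMap::with_capacity(seed_kmers.len());
    for &(kmer, i) in seed_kmers {
        map.entry(kmer).or_default().push(i);
    }
    let mut matches = Vec::new();
    for (kmer, j) in b_kmers {
        if let Some(starts) = map.get(&kmer) {
            for &i in starts {
                matches.push((i, j)); // (position in A, position in B)
            }
        }
    }
    matches
}
```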
| e    | n    | k (m=0)  | k (m=1) | remark         |
|------|------|----------|---------|----------------|
| 0.01 | 10k  | 8+       |         |                |
| 0.01 | 100k | 10+      |         |                |
| 0.01 | 1M   | 12+      |         |                |
| 0.05 | 10k  | 9 - ~15  |         |                |
| 0.05 | 100k | 10 - ~15 |         |                |
| 0.05 | 1M   | 12 - ~15 |         |                |
| 0.1  | 10k  | 8 - 9    | 11 - 18 | m=1 30% slower |
| 0.1  | 100k | 9 - 10   | 12 - 18 | m=1 40% faster |
| 0.1  | 1M   | *        | 14 - 18 |                |
| 0.2  | 10k  | *        | 10 (11) |                |
| 0.2  | 100k | *        | 11      |                |
| 0.2  | 1M   | *        | *       |                |
Parameter choice:
| e    | m | k  | remark                  |
|------|---|----|-------------------------|
| 0.01 | 0 | 31 |                         |
| 0.05 | 0 | 14 |                         |
| 0.1  | 1 | 16 | for simplicity, fix m=1 |
| 0.2  | 1 | 11 |                         |
| e    | n    | k (m=0)  | k (m=1) | remark         |
|------|------|----------|---------|----------------|
| 0.01 | 10k  | 8+       |         |                |
| 0.01 | 100k | 10+      |         |                |
| 0.01 | 1M   | 12+      |         |                |
| 0.05 | 10k  | 8 - ~16  |         |                |
| 0.05 | 100k | 9 - ~16  |         |                |
| 0.05 | 1M   | 11 - ~16 |         |                |
| 0.1  | 10k  | 8 - 9    | 11 - 18 | m=1 10% faster |
| 0.1  | 100k | *        | 13 - 18 |                |
| 0.1  | 1M   | *        | 15 - 18 |                |
| 0.2  | 10k  | *        | 12      |                |
| 0.2  | 100k | *        | *       |                |
| 0.2  | 1M   | *        | *       |                |
Parameter choice v1:
| m | e    | k  | remark                                                 |
|---|------|----|--------------------------------------------------------|
| 0 | 0.01 | 31 |                                                        |
| 0 | 0.05 | 14 |                                                        |
| 1 | 0.1  | 16 | for simplicity, fix m=1                                |
| 1 | 0.2  | 11 | 12 is better at large n, but 11 is consistent with CSH |
Parameter choice v2:
| m | e       | k  | remark                                        |
|---|---------|----|-----------------------------------------------|
| 0 | <= 0.07 | 14 | works reasonably well everywhere              |
| 1 | > 0.07  | 14 | 12 works better for larger e, 14 for larger n |
`.cargo/config`:

    [target.'cfg(any(windows, unix))']
    rustflags = [
        "-C", "target-cpu=native",
        "-C", "llvm-args=-ffast-math",
        "-C", "opt-level=3",
        "-C", "remark=loop-vectorize",
        "-C", "debuginfo=2",
    ]
Target function may be inlined elsewhere!

    cargo asm --lib --rust --comments pairwise_aligner::aligners::nw::test
- inclusive scan (prefix min) may be useful to do col-wise NW faster (see the sketch after these links):
- https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-scan-operations-explicit-vectorization.html#gs.3ym2aq
- https://stackoverflow.com/questions/36085271/horizontal-running-diff-and-conditional-update-using-simd-sse
- Blog post on exactly this recursion: https://matklad.github.io/2017/03/18/min-of-three-part-2.html
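As a concrete scalar illustration of the two-pass trick from these links, here is a hedged sketch that computes one column of a unit-cost NW/edit-distance matrix; the names and exact recurrence are assumptions, not the crate's code.

```rust
// Sketch: compute column j of the edit-distance matrix from column j-1.
// Pass 1 handles the diagonal and horizontal dependencies; pass 2 resolves
// the vertical dependency as an inclusive prefix-min scan, the part that
// vectorizes well.
fn next_column(prev: &[u32], a: &[u8], bj: u8) -> Vec<u32> {
    let mut next = Vec::with_capacity(prev.len());
    next.push(prev[0] + 1); // top cell: one more insertion
    // Pass 1: min(diagonal + substitution cost, left + 1).
    for i in 1..prev.len() {
        let sub = if a[i - 1] == bj { 0 } else { 1 };
        next.push((prev[i - 1] + sub).min(prev[i] + 1));
    }
    // Pass 2: prefix-min with +1 per step, for deletions down the column.
    for i in 1..next.len() {
        next[i] = next[i].min(next[i - 1] + 1);
    }
    next
}
```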
Flamegraphs after running `make flamegraphs`. (Download them for better interaction.)
Breakdown:

- $e=5\%$:
  - $9\%$: finding all matches,
  - $31\%$: exploring edges,
  - $\mathbf{21}\%$: traceback.
- $e=15\%$:
  - $14\%$: computing $h$,
  - $10\%$: exploring edges,
  - $\mathbf{60}\%$: hashmap indexing. Probably slower due to the large memory usage?
- Allow shifting everything less than a given position, even when a few positions remain constant.
- From the `HintContours`, return a pair `(Pos, shift)`: the bottom-right-most position `p` for which all other positions `q <= p` are shifted by the given amount `shift`.
- The last few layers could be stored separately, so that accessing the front is faster. Especially since this memory will remain hot, while indexing the larger hashmap may go to random parts of memory.
- When a potential optimal path is given, we can compute in advance which regions need to be computed.
- Together with storing matches in a vector per diagonal, this should make most indexing operations more predictable.
- Should save a bit of time.
- Using less memory by only storing positions where the traceback joins/splits should make the hashmap smaller, leading to faster operations.
- Hypothesis: it is sufficient to only store those states at the parent of a critical substitution edge.
- This way, computing the value of the heuristic anyway is trivial, and fewer data structures need to be kept. The only numbers needed are the total number of seeds and the number of seeds starting before the given position.
- NOTE: This first requires reviving/re-implementing the dynamic seed choosing.
Instead of going from two sides, go from one side and keep the middle layer. For each position in the front keep the parent in the middle layer, so we can restart there.
All we need is that Lemma 6 holds for some T'.
For linear gap cost $gap(x) = ax$:

$$gapcost(u, v) = a \cdot |(i'-i) - (j'-j)| \leq P(u) - P(v) = seedcost$$

$$T(i, j) = (a(i-j) - P,\ a(j-i) - P)$$

(substitution cost doesn't matter)

For affine gap cost $gap(x) = ax + b$ there are 2 options:

- $gapcost(u, v) = 0$
- $gapcost(u, v) = a \cdot |(i'-i) - (j'-j)| + b$

We want $T(u) \leq T(v)$, which is equivalent to

$$a((i'-i) - (j'-j)) + b \leq P(u) - P(v)$$

i.e.

$$a(i-j) - P(u) + b \leq a(i'-j') - P(v)$$
$$a(j-i) - P(u) + b \leq a(j'-i') - P(v)$$