Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pad the target when aligning #312

Open
wants to merge 59 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
59 commits
Select commit Hold shift + click to select a range
7c95af9
feat: Add target padding option for sequence alignment
ekg Dec 2, 2024
5a3ce52
refactor: Add target padding with bounds checking during mapping reco…
ekg Dec 2, 2024
d34c936
refactor: Remove target_padding from sequence record padding calculation
ekg Dec 2, 2024
abb0680
fix: Modify parseMashmapRow to accept target_padding as parameter
ekg Dec 2, 2024
971171a
fix: Update parseMashmapRow call with target_padding parameter
ekg Dec 2, 2024
32705e7
Merge branch 'main' of github.com:waveygang/wfmash into target-pad
ekg Jan 22, 2025
3f066a6
Merge branch 'main' of github.com:waveygang/wfmash into target-pad
ekg Jan 22, 2025
28e3527
feat: Add chain metadata parsing to MappingBoundaryRow
ekg Jan 22, 2025
19d7dc3
refactor: Remove duplicate MappingBoundaryRow struct and add chain me…
ekg Jan 22, 2025
0183028
feat: Add indel swizzling functions for CIGAR alignment optimization
ekg Jan 22, 2025
3946135
feat: Add wflign_swizzle.cpp to CMakeLists.txt source files
ekg Jan 22, 2025
a0cc4e4
feat: Add CIGAR conversion and swizzling for internal chain chunks
ekg Jan 22, 2025
7b071ad
fix: Add missing WFA header and edit_cigar_t type definition
ekg Jan 22, 2025
8a4f856
refactor: Update include paths and function signatures in wflign files
ekg Jan 22, 2025
f709f87
fix: Update WFA CIGAR handling and include paths in wflign module
ekg Jan 22, 2025
11d86a5
fix: Resolve WFA2 library compilation issues in wflign module
ekg Jan 22, 2025
c45e848
fix: Resolve include path and type definition errors in wflign module
ekg Jan 22, 2025
36ff9c8
refactor: Remove align_types.hpp dependency and simplify chain metada…
ekg Jan 22, 2025
47675d1
fix: Add missing header and update biWFA alignment call parameters
ekg Jan 22, 2025
f20ef73
fix: Update CIGAR conversion functions to use wflign_cigar_t
ekg Jan 22, 2025
eac77f1
fix: Define wflign_cigar_t and fix CIGAR conversion functions
ekg Jan 22, 2025
fff43e3
fix: Resolve namespace and function declaration conflicts in wflign_s…
ekg Jan 23, 2025
d161a92
fix: Resolve namespace and type compatibility issues in wflign CIGAR …
ekg Jan 23, 2025
033cc1b
fix: Resolve namespace and compilation issues in wflign module
ekg Jan 23, 2025
e4f1922
fix: Resolve duplicate function definitions and namespace issues in w…
ekg Jan 23, 2025
3b6347c
fix: Resolve namespace and type resolution issues in wflign CIGAR con…
ekg Jan 23, 2025
e49708b
refactor: Remove duplicate wflign_cigar_t definition and include wfli…
ekg Jan 23, 2025
c7110ca
fix: Resolve typedef and function declaration issues in wflign_swizzle
ekg Jan 23, 2025
00da2f8
fix: Update do_biwfa_alignment call with correct chain parameters
ekg Jan 23, 2025
8d9eb87
fix: Remove static from wflign namespace functions to resolve linker …
ekg Jan 23, 2025
b96e542
fix: Resolve conflicting function declarations in wflign_swizzle.hpp
ekg Jan 23, 2025
02cfe9f
refactor: Add debug logging for CIGAR swizzling in internal chain chunks
ekg Jan 23, 2025
ad6d5ef
feat: Add debug logging for chain swizzling in wflign alignment
ekg Jan 23, 2025
269fe10
feat: Add debug logging to track swizzling conditions in wflign align…
ekg Jan 23, 2025
025ec84
feat: Add chain information tracking in mapping generation
ekg Jan 23, 2025
caa9cd8
ok
ekg Jan 23, 2025
3f7a65b
feat: Add CIGAR conversion and swizzling for internal chain chunks
ekg Jan 23, 2025
7566965
fix: Implement `alignment_end_coords` in `wflign_swizzle.hpp`
ekg Jan 23, 2025
6b1ba44
fix: Resolve redefinition of alignment_end_coords function
ekg Jan 23, 2025
1417d55
fix: Remove redundant namespace qualifier in wflign_swizzle.cpp
ekg Jan 23, 2025
f8b32b4
feat: Enhance chain mapping output with position and length information
ekg Jan 23, 2025
93e2bf5
feat: Update chain parsing and target padding logic in wfmash
ekg Jan 23, 2025
d0d069d
refactor: Always attempt CIGAR swizzling for all alignment chunks
ekg Jan 23, 2025
881eecf
refactor: Update target padding logic to apply to all segments
ekg Jan 23, 2025
c97c8ce
feat: Add debug logging for chain info parsing in computeAlignments.hpp
ekg Jan 23, 2025
3759d2a
fix: Correct chain info parsing index in wfmash alignment
ekg Jan 23, 2025
f999313
refactor: Remove debug print statements from wflign and computeAlignm…
ekg Jan 23, 2025
d76b540
refactor: Improve indel swizzling robustness with enhanced debug logg…
ekg Jan 23, 2025
ade9571
fix: Add missing iostream and resolve function declaration ambiguity
ekg Jan 23, 2025
5b1c49c
fix: Remove default arguments from function implementations in wflign…
ekg Jan 23, 2025
5dbd289
feat: Add debug logging for CIGAR string swizzling in wflign
ekg Jan 23, 2025
5edaf84
feat: Add detailed debug logging for CIGAR string swizzling
ekg Jan 23, 2025
7dba9a2
fix: Convert CIGAR 'M' to '=' for exact matches in wflign
ekg Jan 23, 2025
16a1731
The commit message for this change would be:
ekg Jan 23, 2025
3b3a256
refactor: Reduce debug output to 100 characters for better readability
ekg Jan 23, 2025
7da3cfc
fix: Adjust target sequence coordinates in swizzle to use 0-based ind…
ekg Jan 23, 2025
c35bebf
the target and query offset are 0 and 0 in global alignment
ekg Jan 23, 2025
7f6b210
feat: Add functionality to trim leading and trailing deletions from a…
ekg Jan 23, 2025
1d9c923
feat: Add trim_deletions method to handle CIGAR string trimming
ekg Jan 23, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions src/align/include/align_parameters.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ struct Parameters {
bool sam_format; //Emit the output in SAM format (PAF default)
bool no_seq_in_sam; //Do not fill the SEQ field in SAM format
bool multithread_fasta_input; //Multithreaded fasta input
uint64_t target_padding; //Additional padding around target sequence

#ifdef WFA_PNG_TSV_TIMING
// plotting
Expand Down
5 changes: 5 additions & 0 deletions src/align/include/align_types.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,11 @@ namespace align
skch::offset_t rEndPos; //mapping boundary end offset on ref
skch::strand_t strand; //mapping strand
float mashmap_estimated_identity;

// Chain metadata
int32_t chain_id{-1}; // Unique ID for this chain (-1 if not part of chain)
int32_t chain_length{1}; // Total segments in chain (1 if not part of chain)
int32_t chain_pos{1}; // Position in chain, 1-based (1 if not part of chain)
};

typedef std::unordered_map <std::string, std::string> refSequenceMap_t;
Expand Down
159 changes: 149 additions & 10 deletions src/align/include/computeAlignments.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,60 @@ typedef atomic_queue::AtomicQueue<std::string*, 1024, nullptr, true, true, false
/**
* @brief compute alignments
*/
/**
* @brief Trim leading and trailing deletions from CIGAR string and adjust coordinates
* @param[in] cigar The CIGAR string to trim
* @param[in] ref_start Reference start position
* @param[in] ref_end Reference end position
* @return Tuple of (trimmed_cigar, new_ref_start, new_ref_end)
*/
static std::tuple<std::string, uint64_t, uint64_t> trim_deletions(
const std::string& cigar,
uint64_t ref_start,
uint64_t ref_end) {

std::string trimmed_cigar;
uint64_t new_ref_start = ref_start;
uint64_t new_ref_end = ref_end;

// Parse CIGAR string
std::vector<std::pair<int, char>> cigar_ops;
size_t i = 0;
while (i < cigar.size()) {
int count = 0;
while (i < cigar.size() && std::isdigit(static_cast<unsigned char>(cigar[i]))) {
count = count * 10 + (cigar[i] - '0');
i++;
}
if (i < cigar.size()) {
cigar_ops.push_back({count, cigar[i]});
i++;
}
}

// Find leading deletions
size_t leading_dels = 0;
while (leading_dels < cigar_ops.size() && cigar_ops[leading_dels].second == 'D') {
new_ref_start += cigar_ops[leading_dels].first;
leading_dels++;
}

// Find trailing deletions
size_t trailing_dels = 0;
while (trailing_dels < cigar_ops.size() - leading_dels &&
cigar_ops[cigar_ops.size() - 1 - trailing_dels].second == 'D') {
new_ref_end -= cigar_ops[cigar_ops.size() - 1 - trailing_dels].first;
trailing_dels++;
}

// Build new CIGAR string without leading/trailing deletions
for (size_t j = leading_dels; j < cigar_ops.size() - trailing_dels; j++) {
trimmed_cigar += std::to_string(cigar_ops[j].first) + cigar_ops[j].second;
}

return std::make_tuple(trimmed_cigar, new_ref_start, new_ref_end);
}

void compute()
{
this->computeAlignments();
Expand All @@ -149,7 +203,7 @@ typedef atomic_queue::AtomicQueue<std::string*, 1024, nullptr, true, true, false
* @param[in] mappingRecordLine
* @param[out] currentRecord
*/
inline static void parseMashmapRow(const std::string &mappingRecordLine, MappingBoundaryRow &currentRecord) {
inline static void parseMashmapRow(const std::string &mappingRecordLine, MappingBoundaryRow &currentRecord, const uint64_t target_padding) {
std::stringstream ss(mappingRecordLine); // Insert the string into a stream
std::string word; // Have a buffer string

Expand All @@ -169,15 +223,54 @@ typedef atomic_queue::AtomicQueue<std::string*, 1024, nullptr, true, true, false
// if the estimated identity is missing, avoid assuming too low values
const float mm_id = wfmash::is_a_number(mm_id_vec.back()) ? std::stof(mm_id_vec.back()) : skch::fixed::percentage_identity;

// Parse chain info if present (expecting format "chain:i:id.pos.len" in tokens[14])
int32_t chain_id = -1;
int32_t chain_length = 1;
int32_t chain_pos = 1;
if (tokens.size() > 14) {
const vector<string> chain_vec = skch::CommonFunc::split(tokens[14], ':');
if (chain_vec.size() == 3 && chain_vec[0] == "chain" && chain_vec[1] == "i") {
// Split the id.pos.len format
const vector<string> chain_parts = skch::CommonFunc::split(chain_vec[2], '.');
if (chain_parts.size() == 3) {
chain_id = std::stoi(chain_parts[0]);
chain_pos = std::stoi(chain_parts[1]);
chain_length = std::stoi(chain_parts[2]);
}
}
}

// Save words into currentRecord
{
currentRecord.qId = tokens[0];
currentRecord.qStartPos = std::stoi(tokens[2]);
currentRecord.qEndPos = std::stoi(tokens[3]);
currentRecord.strand = (tokens[4] == "+" ? skch::strnd::FWD : skch::strnd::REV);
currentRecord.refId = tokens[5];
currentRecord.rStartPos = std::stoi(tokens[7]);
currentRecord.rEndPos = std::stoi(tokens[8]);
const uint64_t ref_len = std::stoi(tokens[6]);
currentRecord.chain_id = chain_id;
currentRecord.chain_length = chain_length;
currentRecord.chain_pos = chain_pos;

// Apply target padding while ensuring we don't go below 0 or above reference length
uint64_t rStartPos = std::stoi(tokens[7]);
uint64_t rEndPos = std::stoi(tokens[8]);

// Always apply target padding
if (target_padding > 0) {
if (rStartPos >= target_padding) {
rStartPos -= target_padding;
} else {
rStartPos = 0;
}
if (rEndPos + target_padding <= ref_len) {
rEndPos += target_padding;
} else {
rEndPos = ref_len;
}
}
currentRecord.rStartPos = rStartPos;
currentRecord.rEndPos = rEndPos;
currentRecord.mashmap_estimated_identity = mm_id;
}
}
Expand All @@ -193,7 +286,7 @@ seq_record_t* createSeqRecord(const MappingBoundaryRow& currentRecord,
// Get the query sequence length
const int64_t query_size = faidx_seq_len(query_faidx, currentRecord.qId.c_str());

// Compute padding
// Compute padding for sequence extraction
const uint64_t head_padding = currentRecord.rStartPos >= param.wflign_max_len_minor
? param.wflign_max_len_minor : currentRecord.rStartPos;
const uint64_t tail_padding = ref_size - currentRecord.rEndPos >= param.wflign_max_len_minor
Expand Down Expand Up @@ -273,7 +366,10 @@ std::string processAlignment(seq_record_t* rec) {
param.no_seq_in_sam,
param.min_identity,
param.wflign_max_len_minor,
rec->currentRecord.mashmap_estimated_identity);
rec->currentRecord.mashmap_estimated_identity,
rec->currentRecord.chain_id,
rec->currentRecord.chain_length,
rec->currentRecord.chain_pos);

return output.str();
}
Expand Down Expand Up @@ -310,7 +406,7 @@ void processor_thread(std::atomic<size_t>& total_alignments_queued,
std::string* line_ptr = nullptr;
if (line_queue.try_pop(line_ptr)) {
MappingBoundaryRow currentRecord;
parseMashmapRow(*line_ptr, currentRecord);
parseMashmapRow(*line_ptr, currentRecord, param.target_padding);

// Process the record and create seq_record_t
seq_record_t* rec = createSeqRecord(currentRecord, *line_ptr, local_ref_faidx, local_query_faidx);
Expand Down Expand Up @@ -408,9 +504,52 @@ void worker_thread(uint64_t tid,
if (seq_queue.try_pop(rec)) {
is_working.store(true);
std::string alignment_output = processAlignment(rec);

// Push the alignment output to the paf_queue
paf_queue.push(new std::string(std::move(alignment_output)));

// Parse the alignment output to find CIGAR string and coordinates
std::stringstream ss(alignment_output);
std::string line;
while (std::getline(ss, line)) {
if (line.empty()) continue;

std::vector<std::string> fields;
std::stringstream field_ss(line);
std::string field;
while (field_ss >> field) {
fields.push_back(field);
}

// Find the CIGAR string field (should be after cg:Z:)
auto cigar_it = std::find_if(fields.begin(), fields.end(),
[](const std::string& s) { return s.substr(0, 5) == "cg:Z:"; });

if (cigar_it != fields.end()) {
std::string cigar = cigar_it->substr(5); // Remove cg:Z: prefix
uint64_t ref_start = std::stoull(fields[7]);
uint64_t ref_end = std::stoull(fields[8]);

// Trim deletions and get new coordinates
auto [trimmed_cigar, new_ref_start, new_ref_end] = trim_deletions(cigar, ref_start, ref_end);

// Update the fields with new values
fields[7] = std::to_string(new_ref_start);
fields[8] = std::to_string(new_ref_end);
*cigar_it = "cg:Z:" + trimmed_cigar;

// Reconstruct the line
std::string new_line;
for (const auto& f : fields) {
if (!new_line.empty()) new_line += '\t';
new_line += f;
}
new_line += '\n';

// Push the modified alignment output to the paf_queue
paf_queue.push(new std::string(std::move(new_line)));
} else {
// If no CIGAR string found, output the line unchanged
paf_queue.push(new std::string(line + '\n'));
}
}

// Update progress meter and processed alignment length
uint64_t alignment_length = rec->currentRecord.qEndPos - rec->currentRecord.qStartPos;
Expand Down Expand Up @@ -514,7 +653,7 @@ void computeAlignments() {

while(std::getline(mappingListStream, mappingRecordLine)) {
if (!mappingRecordLine.empty()) {
parseMashmapRow(mappingRecordLine, currentRecord);
parseMashmapRow(mappingRecordLine, currentRecord, param.target_padding);
total_alignment_length += currentRecord.qEndPos - currentRecord.qStartPos;
}
}
Expand Down
1 change: 1 addition & 0 deletions src/common/wflign/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ set(wflign_SOURCE
src/wflign.cpp
src/wflign_alignment.cpp
src/wflign_patch.cpp
src/wflign_swizzle.cpp
src/rkmh.cpp
src/murmur3.cpp
)
Expand Down
44 changes: 42 additions & 2 deletions src/common/wflign/src/wflign.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

#include "wflign.hpp"
#include "wflign_patch.hpp"
#include "wflign_swizzle.hpp"


// Namespaces
Expand Down Expand Up @@ -34,7 +35,10 @@ void do_biwfa_alignment(
const bool no_seq_in_sam,
const float min_identity,
const uint64_t wflign_max_len_minor,
const float mashmap_estimated_identity) {
const float mashmap_estimated_identity,
const int32_t chain_id,
const int32_t chain_length,
const int32_t chain_pos) {

// Create WFA aligner with the provided penalties
wfa::WFAlignerGapAffine2Pieces wf_aligner(
Expand Down Expand Up @@ -64,7 +68,42 @@ void do_biwfa_alignment(

// Copy alignment CIGAR
wflign_edit_cigar_copy(wf_aligner, &aln.edit_cigar);

// Convert WFA CIGAR to string format for potential swizzling
std::string cigar_str = wfa_edit_cigar_to_string(aln.edit_cigar);
// Try swizzling the CIGAR at both ends with debug enabled
std::cerr << "[wflign::debug] Original CIGAR: " << cigar_str << std::endl;

std::string swizzled = try_swap_start_pattern(cigar_str, query, target, 0, 0);
if (swizzled != cigar_str) {
std::cerr << "[wflign::debug] Start pattern swizzled from " << cigar_str << " to " << swizzled << std::endl;
cigar_str = swizzled;
} else {
std::cerr << "[wflign::debug] Start pattern swizzle attempt failed" << std::endl;
}

swizzled = try_swap_end_pattern(cigar_str, query, target, 0, 0);
if (swizzled != cigar_str) {
std::cerr << "[wflign::debug] End pattern swizzled from " << cigar_str << " to " << swizzled << std::endl;
cigar_str = swizzled;
} else {
std::cerr << "[wflign::debug] End pattern swizzle attempt failed" << std::endl;
}
if (swizzled != cigar_str) {
cigar_str = swizzled;
}

// If the CIGAR changed, update coordinates and alignment
if (cigar_str != wfa_edit_cigar_to_string(aln.edit_cigar)) {
// Update coordinates based on swizzled CIGAR
auto new_coords = alignment_end_coords(cigar_str, query_offset, target_offset);
aln.j = new_coords.first; // new query start
aln.i = new_coords.second; // new target start

// Convert back to WFA format
wfa_string_to_edit_cigar(cigar_str, &aln.edit_cigar);
}

// Write alignment
if (paf_format_else_sam) {
write_alignment_paf(
Expand Down Expand Up @@ -451,7 +490,8 @@ void WFlign::wflign_affine_wavefront(
query_name, query, query_total_length, query_offset, query_length, query_is_rev,
target_name, target, target_total_length, target_offset, target_length,
*out, wfa_convex_penalties, emit_md_tag, paf_format_else_sam, no_seq_in_sam,
min_identity, wflign_max_len_minor, mashmap_estimated_identity);
min_identity, wflign_max_len_minor, mashmap_estimated_identity,
-1, 1, 1); // Not part of a chain when using direct biWFA
return;
}

Expand Down
5 changes: 4 additions & 1 deletion src/common/wflign/src/wflign.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,10 @@ namespace wflign {
const bool no_seq_in_sam,
const float min_identity,
const uint64_t wflign_max_len_minor,
const float mashmap_estimated_identity);
const float mashmap_estimated_identity,
const int32_t chain_id,
const int32_t chain_length,
const int32_t chain_pos);

class WFlign {
public:
Expand Down
55 changes: 55 additions & 0 deletions src/common/wflign/src/wflign_cigar_utils.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
#ifndef WFLIGN_CIGAR_UTILS_HPP
#define WFLIGN_CIGAR_UTILS_HPP

#include <string>
#include <sstream>
#include "edit_cigar.hpp"

namespace wflign {

// Convert WFA edit_cigar_t to SAM-style CIGAR string
inline std::string wfa_edit_cigar_to_string(const edit_cigar_t& edit_cigar) {
std::stringstream cigar;
int count = 0;
char last_op = '\0';

for (int i = edit_cigar.begin_offset; i < edit_cigar.end_offset; ++i) {
const char op = edit_cigar.operations[i];
if (op == last_op) {
++count;
} else {
if (count > 0) {
cigar << count << last_op;
}
count = 1;
last_op = op;
}
}

if (count > 0) {
cigar << count << last_op;
}

return cigar.str();
}

// Convert SAM-style CIGAR string back to WFA edit_cigar_t
inline void wfa_string_to_edit_cigar(const std::string& cigar_str, edit_cigar_t* edit_cigar) {
edit_cigar_clear(edit_cigar);

int num = 0;
for (char c : cigar_str) {
if (std::isdigit(c)) {
num = num * 10 + (c - '0');
} else {
for (int i = 0; i < num; ++i) {
edit_cigar_add_operation(edit_cigar, c);
}
num = 0;
}
}
}

} // namespace wflign

#endif // WFLIGN_CIGAR_UTILS_HPP
Loading
Loading