Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent alignments depending on whether reference file is gzipped or not #70

Open
AaronRuben opened this issue Oct 7, 2024 · 7 comments

Comments

@AaronRuben
Copy link

Hi,

I am using mashmap v3.1.3 and I noticed that the same alignment was reported multiple times in the output file. For example, here is the output for reference chromosome 21:

ptg000091l	861542	0	861542	+	chr21	45090682	152523	1014065	13	861542	20	id:f:0.990785	kc:f:0.283563
ptg000217l	527266	0	527266	+	chr21	45090682	178973	706239	12	527266	20	id:f:0.990089	kc:f:0.331461
ptg000217l	527266	0	527266	+	chr21	45090682	178973	706239	12	527266	20	id:f:0.990089	kc:f:0.331461
ptg000203l	686331	0	686331	+	chr21	45090682	179063	865394	15	686331	25	id:f:0.996609	kc:f:0.235299
ptg000268l	916773	0	916773	+	chr21	45090682	582998	1499771	3	916773	12	id:f:0.93172	kc:f:0.0635271
ptg000268l	916773	0	916773	+	chr21	45090682	582998	1499771	3	916773	12	id:f:0.93172	kc:f:0.0635271
ptg000268l	916773	0	916773	+	chr21	45090682	582998	1499771	3	916773	12	id:f:0.93172	kc:f:0.0635271
ptg000145l	2623179	0	2623179	+	chr21	45090682	1133648	3756827	12	2623179	19	id:f:0.98662	kc:f:0.255438
ptg000158l	1222257	0	1222257	+	chr21	45090682	1565548	2787805	3	1222257	12	id:f:0.936192	kc:f:0.214653
ptg000158l	1222257	0	1222257	+	chr21	45090682	1565548	2787805	3	1222257	12	id:f:0.936192	kc:f:0.214653
ptg000158l	1222257	0	1222257	+	chr21	45090682	1565548	2787805	3	1222257	12	id:f:0.936192	kc:f:0.214653
ptg000158l	1222257	0	1222257	+	chr21	45090682	1565548	2787805	3	1222257	12	id:f:0.936192	kc:f:0.214653
ptg000158l	1222257	0	1222257	+	chr21	45090682	1565548	2787805	3	1222257	12	id:f:0.936192	kc:f:0.214653
ptg000158l	1222257	0	1222257	+	chr21	45090682	1565548	2787805	3	1222257	12	id:f:0.936192	kc:f:0.214653
ptg000160l	848270	0	848270	+	chr21	45090682	2308763	3157033	8	848270	16	id:f:0.972836	kc:f:0.583853
ptg000160l	848270	0	848270	+	chr21	45090682	2308763	3157033	8	848270	16	id:f:0.972836	kc:f:0.583853
ptg000151l	555938	0	555938	+	chr21	45090682	2459556	3015494	15	555938	22	id:f:0.993434	kc:f:1.23635
ptg000172l	1014256	0	1014256	+	chr21	45090682	2577878	3592134	10	1014256	17	id:f:0.980634	kc:f:0.597447
ptg000215l	318592	0	318592	+	chr21	45090682	5433633	5752225	19	318592	29	id:f:0.998634	kc:f:0.130072
ptg000025l	2865536	0	2865536	+	chr21	45090682	5830375	8695911	10	2865536	17	id:f:0.978886	kc:f:0.474346
ptg000176l	584981	0	584981	-	chr21	45090682	8297841	8882822	9	584981	16	id:f:0.977014	kc:f:0.599697
ptg000176l	584981	0	584981	+	chr21	45090682	8297841	8882822	9	584981	16	id:f:0.977014	kc:f:0.599697
ptg000176l	584981	0	584981	+	chr21	45090682	8297841	8882822	9	584981	16	id:f:0.977014	kc:f:0.599697
ptg000176l	584981	0	584981	+	chr21	45090682	8297841	8882822	9	584981	16	id:f:0.977014	kc:f:0.599697
ptg000093l	3796125	0	3796125	+	chr21	45090682	8542762	12338887	19	3796125	29	id:f:0.998634	kc:f:0.41517
ptg000125l	2794119	0	2794119	+	chr21	45090682	9318318	12112437	18	2794119	25	id:f:0.997158	kc:f:0.524907
ptg000290l	812527	0	812527	+	chr21	45090682	10614047	11426574	4	812527	13	id:f:0.945935	kc:f:0.117817
ptg000293l	206332	0	206332	+	chr21	45090682	10988834	11195166	10	206332	17	id:f:0.978886	kc:f:0.0164015
ptg000292l	229265	0	229265	+	chr21	45090682	11199120	11428385	11	229265	17	id:f:0.982112	kc:f:0.0476858
ptg000033l	27686851	0	27686851	+	chr21	45090682	24117979	45090681	20	27686851	255	id:f:1	kc:f:0.872569
ptg000115l	3037875	0	3037875	+	chr21	45090682	38473000	41510875	18	3037875	25	id:f:0.997158	kc:f:0.865375
ptg000147l	1146249	0	1146249	+	chr21	45090682	42104328	43250577	19	1146249	29	id:f:0.998634	kc:f:1.36605
ptg000047l	2315813	0	2315813	+	chr21	45090682	43904919	45090681	19	2315813	29	id:f:0.998634	kc:f:1.30252

I found this odd so I re-ran it. This time I happened to use an uncompressed version of my reference sequence file and I didn't get duplicated alignments, but I got some new alignments and the positions of previously found alignments changed. Here is again the output for reference chr21:

ptg000091l	861542	0	861542	+	chr21	45090682	173803	1035345	29	861542	22	id:f:0.993222	kc:f:0.33354
ptg000203l	686331	0	686331	+	chr21	45090682	179063	865394	30	686331	22	id:f:0.994209	kc:f:0.315678
ptg000145l	2623179	0	2623179	+	chr21	45090682	1170940	3794119	24	2623179	19	id:f:0.98662	kc:f:0.182862
ptg000151l	555938	0	555938	+	chr21	45090682	2499197	3055135	24	555938	19	id:f:0.98662	kc:f:0.900919
ptg000172l	1014256	0	1014256	+	chr21	45090682	2682656	3696912	27	1014256	20	id:f:0.990289	kc:f:0.642754
ptg000215l	318592	0	318592	+	chr21	45090682	5420070	5738662	38	318592	29	id:f:0.998634	kc:f:0.145981
ptg000025l	2865536	0	2865536	+	chr21	45090682	7220950	10086486	21	2865536	17	id:f:0.981403	kc:f:0.455276
ptg000176l	584981	0	584981	-	chr21	45090682	8381941	8966922	16	584981	16	id:f:0.971897	kc:f:0.557647
ptg000176l	584981	0	584981	+	chr21	45090682	8381941	8966922	16	584981	16	id:f:0.971897	kc:f:0.557647
ptg000176l	584981	0	584981	+	chr21	45090682	8381941	8966922	16	584981	16	id:f:0.971897	kc:f:0.557647
ptg000176l	584981	0	584981	+	chr21	45090682	8381941	8966922	16	584981	16	id:f:0.971897	kc:f:0.557647
ptg000125l	2794119	0	2794119	+	chr21	45090682	9374562	12168681	33	2794119	23	id:f:0.995431	kc:f:0.500231
ptg000293l	206332	0	206332	+	chr21	45090682	11091705	11298037	21	206332	17	id:f:0.980549	kc:f:0.0194184
ptg000292l	229265	0	229265	+	chr21	45090682	11199120	11428385	25	229265	19	id:f:0.986286	kc:f:0.0644872
ptg000033l	27686851	0	27686851	+	chr21	45090682	24229362	45090681	40	27686851	255	id:f:1	kc:f:1.00259
ptg000115l	3037875	0	3037875	+	chr21	45090682	39989460	43027335	37	3037875	27	id:f:0.997911	kc:f:0.951376
ptg000147l	1146249	0	1146249	+	chr21	45090682	42183625	43329874	38	1146249	29	id:f:0.998634	kc:f:1.11024
ptg000047l	2315813	0	2315813	+	chr21	45090682	43904919	45090681	38	2315813	29	id:f:0.998634	kc:f:1.22872

I used these commands:

mashmap --perc_identity 95 --noSplit -r hs1.fa.gz -q hifiasm.bp.unified.fa --threads 32 -o test1.mashmap
gunzip hs1.fa.gz
mashmap --perc_identity 95 --noSplit -r hs1.fa -q hifiasm.bp.unified.fa --threads 32 -o test.mashmap

Any ideas what could trigger such a behavior?

Thanks,
Aaron

@bkille
Copy link
Contributor

bkille commented Oct 9, 2024

Thanks for posting the issue, thats definitely not right... I'll take a look and see if I can replicate it. In the meantime, please feel free to post the stderr output of the two commands.

@AaronRuben
Copy link
Author

Hi, Thanks for the quick response. I did some debugging on my end, and the duplication of exactly the same alignment was a bug on my end (I had the same contig was included multiple times in the query fasta). However, even on a clean fasta, I do get different alignments depending on whether the reference is gzipped or not. I have the stderr attached.

Thanks again,
Aaron
compressed.stderr.txt
uncompressed.stderr.txt

@bkille
Copy link
Contributor

bkille commented Oct 14, 2024

Ahh! Ok that makes sense. As far as the different results based on the reference goes, it looks MashMap is finding twice as many unique k-mer seeds in the uncompressed reference as opposed to the compressed reference (19,972,584 vs 10,660,334).

  1. Can you confirm that the compressed and uncompressed reference files contain the same data? The following command should not output anything if they are identical:
diff <(gzip -dc hs1.fa.gz) hs1.fa
  1. In the case that they are identical, can you confirm that the issue persists even if you delete any index files (hs1.fa.gz.gzi, hs1.fa.gz.fai, and hs1.fa.fai)? Perhaps the index file was created for a previous version of the file and has not been updated.

@AaronRuben
Copy link
Author

Hi,

Re 1: Yes, I could verify that hs1.fa and hs1.fa.gz are identical

Re 2: I created copies of the reference files to and re-ran it mashmap but the issue persists. So I don't think lingering index files are the issue.

Unfortunately, I can't share my data, yet, but I think it might be an issue specific to my input sequence (it's a primary assembly from hifiasm). I tried to align the reference against itself using the uncompressed version as query and once the compressed and once the uncompressed version as reference, respectively. It still generates twice as many unique k-mer seeds in the case of using an uncompressed reference but it generates consistent alignments. I don't know if it's relevant but the k-mer complexities are different. I have attached stderr outputs and alignment outputs.
uncompressed.stderr.txt
uncompressed.mashmap.txt
compressed.mashmap.txt
compressed.stderr.txt

And the reference sequence is available from here:
https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz

Thanks,
Aaron

@bkille
Copy link
Contributor

bkille commented Oct 15, 2024

From looking at the output logs, the sketch size of the compressed version was 20, but the uncompressed was 40.

I was replicate the issue and actually, it looks like this has been around since MashMap2! Basically, in Section 5 of the original MashMap paper, they show that the sketch size which satisfies their constraints depends on the reference size. Since MashMap2, the raw file size has been used to determine the reference size.

The easiest hack would to just be to decompress the file twice (once to compute the reference length, then again to actually index it), but w/ a large file thats an extra 30 seconds for no good reason. Perhaps we'll just warn users that without a .fai/.gzi index, MashMap will have to compute the file size.

In the meantime, you can set the sketch size to 40 manually to ensure consistent results.

@bkille
Copy link
Contributor

bkille commented Oct 15, 2024

As a side note, with --noSplit, the sketch size is the same for each query sequence, i.e. even a query contig thats 10Mbp will only have 40 sketched k-mers. If you are using --noSplit, I'd recommend using a larger sketch size (maybe ~100) setting and a --segmentLength closer to the size of the smallest contig.

@AaronRuben
Copy link
Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants