the use of the -ci parameter #198

Open
SUPERI-SAI opened this issue Sep 5, 2022 · 3 comments


@SUPERI-SAI

Hi,
I am confused about the use of the -ci parameter. To obtain the k-mers specific to each of two genomes, I first counted the k-mers of each genome with KMC, using the following commands with the default -ci2:
kmc -t20 -k13 -fm -cs1000000000 Nepal.fa Nepal_out Nepal_temp
kmc -t20 -k13 -fm -cs1000000000 ZG.fasta ZG_out ZG_temp

Here is the result:
ZG_specific.txt
Nepal_specific.txt

When I changed "-ci" to 100:
kmc -t20 -k13 -fm -ci100 -cs1000000000 Nepal.fa Nepal_out Nepal_temp
kmc -t20 -k13 -fm -ci100 -cs1000000000 ZG.fasta ZG_out ZG_temp
The result is very strange:
Nepal_specific.txt
ZG_specific.part.txt.gz

The documented function of "-ci" is: -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
I thought that using "-ci100" would produce only a fraction of the "-ci2" result (the k-mers that occur at least 100 times), but it doesn't seem to. The result with "-ci100" is much larger than the result with "-ci2". Why does this happen?

Can you help me figure out where the error is?
Thanks!
YYY

@marekkokot
Contributor

Hi,

Could you also send your input files and the commands used to produce the textual representation of the output (was it with kmc_tools or with kmc_dump)?
Also, does the "total number of unique counted k-mers" reported by kmc look like nonsense, or does it seem correct? (I'm asking to determine whether the bug is in the dump or in kmc itself.)

@SUPERI-SAI
Author

Hi,
Thanks for the quick reply.
Here are my commands and the results (-ci2):

  1. kmc -t20 -k13 -fm -cs1000000000 Nepal.fa Nepal_out Nepal_temp
    kmc -t20 -k13 -fm -cs1000000000 ZG.fasta ZG_out ZG_temp
    results: (Some files were split into parts due to file size limitations. To unpack them successfully, you need to rename the part files, e.g. "nepal_out.kmc_suf.01.zip" to "nepal_out.kmc_suf.z01")
    Nepal_out.kmc_pre.zip
    Nepal_out.kmc_suf.01.zip
    Nepal_out.kmc_suf.02.zip
    Nepal_out.kmc_suf.zip
    ZG_out.kmc_pre.zip
    ZG_out.kmc_suf.01.zip
    ZG_out.kmc_suf.02.zip
    ZG_out.kmc_suf.03.zip
    ZG_out.kmc_suf.zip

  2. kmc_tools simple Nepal_out ZG_out kmers_subtract Nepal_specific reverse_kmers_subtract ZG_specific
    results:
    Nepal_specific.kmc_pre.zip
    Nepal_specific.kmc_suf.zip
    ZG_specific.kmc_pre.zip
    ZG_specific.kmc_suf.zip

  3. kmc_tools transform Nepal_specific dump Nepal_specific.txt
    kmc_tools transform ZG_specific dump ZG_specific.txt
    results:
    ZG_specific.txt
    Nepal_specific.txt

The log file is here:
kmc-ci2.log

The commands and the results (-ci100):

  1. kmc -t20 -k13 -fm -ci100 -cs1000000000 Nepal.fa Nepal_out Nepal_temp
    kmc -t20 -k13 -fm -ci100 -cs1000000000 ZG.fasta ZG_out ZG_temp
    results:
    Nepal_out.kmc_suf.zip
    Nepal_out.kmc_pre.zip
    ZG_out.kmc_pre.zip
    ZG_out.kmc_suf.01.zip
    ZG_out.kmc_suf.zip

  2. kmc_tools simple Nepal_out ZG_out kmers_subtract Nepal_specific reverse_kmers_subtract ZG_specific
    results:
    Nepal_specific.kmc_suf.zip
    Nepal_specific.kmc_pre.zip
    ZG_specific.kmc_pre.zip
    ZG_specific.kmc_suf.zip

  3. kmc_tools transform Nepal_specific dump Nepal_specific.txt
    kmc_tools transform ZG_specific dump ZG_specific.txt
    results:
    Nepal_specific.txt
    ZG_specific.txt.01.zip
    ZG_specific.txt.zip

The log file is here:
kmc-ci100.log
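
Since the two runs are being compared by the size of the dumped files, it may help to compare them by number of k-mers instead. The following is only a sketch: it assumes each line of the text dump holds one k-mer followed by its count, and the helper name `count_dump_kmers` is made up for illustration.

```python
def count_dump_kmers(path):
    """Count non-empty lines in a text dump, assuming one k-mer per line."""
    with open(path, "r", encoding="ascii") as handle:
        return sum(1 for line in handle if line.strip())

# Example call (file name taken from the commands above):
# print(count_dump_kmers("ZG_specific.txt"))
```

The resulting counts can then be compared directly between the -ci2 and -ci100 runs.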

Looking forward to your reply
Thanks!
YYY

@marekkokot
Contributor

Hello,

Sorry for the late response.
I see that you also perform some operations with kmc_tools.
I also computed the intersection to get the total number of common k-mers.
Here are some stats:
ci2:

ZG total: 33,473,284
Nepal total: 33,279,496
common k-mers: 33,233,048
Nepal specific total k-mers: 46,448
ZG specific total k-mers: 240,236

ci100:

ZG total: 14,326,835
Nepal total: 5,391,748
common k-mers: 5,272,067
Nepal specific total k-mers: 119,681
ZG specific total k-mers: 9,054,768
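
The figures above can be checked with simple arithmetic, assuming each "specific" count is the genome's total minus the number of common k-mers. The dictionary layout and the function name below are just for illustration; the numbers are copied from the statistics.

```python
# Totals and common k-mer counts copied from the statistics above.
stats = {
    "ci2": {"zg_total": 33_473_284, "nepal_total": 33_279_496, "common": 33_233_048},
    "ci100": {"zg_total": 14_326_835, "nepal_total": 5_391_748, "common": 5_272_067},
}

def specific_counts(s):
    """Return (Nepal-specific, ZG-specific), each computed as total minus common."""
    return s["nepal_total"] - s["common"], s["zg_total"] - s["common"]

for name, s in stats.items():
    nepal_specific, zg_specific = specific_counts(s)
    print(name, nepal_specific, zg_specific)
# prints:
# ci2 46448 240236
# ci100 119681 9054768
```

Both pairs match the specific k-mer counts listed for each run.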

As you can see, for ci2 the total number of k-mers in each data set is similar (about 33M) and most of them are common, so the number of specific k-mers is relatively low. For ci100 we have ~14M k-mers for ZG but only ~5M for Nepal. Most of the Nepal k-mers also occur in ZG (the number of common k-mers is only slightly lower than the number of k-mers in Nepal).
To me these numbers seem reasonable. If we have 14M items in one set and 5M items in another, the difference between the bigger and the smaller set cannot have fewer than 9M items. Does it make sense, or am I missing something?
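
The effect described above can be reproduced on a toy example. This is a minimal sketch with made-up counts (not taken from the real genomes), and the helper `kept` only mimics a lower-count cutoff on an in-memory dictionary: a k-mer present in both genomes becomes "specific" as soon as its count passes the cutoff in only one of them.

```python
def kept(counts, ci):
    """K-mers whose count is at least ci (mimics a lower-count cutoff)."""
    return {kmer for kmer, n in counts.items() if n >= ci}

# Hypothetical toy counts: every k-mer occurs in both genomes.
zg = {"AAA": 150, "AAC": 120, "AAG": 300, "AAT": 5}
nepal = {"AAA": 140, "AAC": 60, "AAG": 90, "AAT": 4}

for ci in (2, 100):
    a, b = kept(zg, ci), kept(nepal, ci)
    # The set difference can never be smaller than the difference in sizes.
    assert len(a - b) >= len(a) - len(b)
    print(ci, sorted(a - b), sorted(b - a))
# prints:
# 2 [] []
# 100 ['AAC', 'AAG'] []
```

With the cutoff at 2 nothing is specific, while with the cutoff at 100 two shared k-mers are reported as specific to the first set, because the cutoff is applied to each genome before the subtraction.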

Best
Marek
