corrections for small samples #37

Open
psyi opened this issue Nov 14, 2023 · 4 comments

Comments


psyi commented Nov 14, 2023

When handling a small number of sequences, Logomaker takes into account all characters from all columns, which generates less meaningful outputs.

Is there a way to add corrections for this?

For example, WebLogo (https://weblogo.berkeley.edu/logo.cgi) has a Small Sample Correction option.
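For readers unfamiliar with the term: the small sample correction commonly used for sequence logos subtracts an approximate bias term, e(n) = (s - 1) / (2 ln(2) n), from a column's information content, where s is the alphabet size and n is the number of non-gap residues in the column. The sketch below illustrates that formula only. The function name `column_information` is made up for this example, clamping the result at zero is a choice made here, and this is not claimed to be WebLogo's or Logomaker's exact code.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20, small_sample_correction=True):
    """Information content (bits) of one alignment column, ignoring gaps.

    Hypothetical helper for illustration; not part of Logomaker or WebLogo.
    """
    residues = [c for c in column if c not in "-."]
    n = len(residues)
    if n == 0:
        return 0.0
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    # Approximate small-sample bias term e(n) = (s - 1) / (2 ln(2) n)
    correction = 0.0
    if small_sample_correction:
        correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    # Clamping at zero is an assumption made for this sketch
    return max(0.0, math.log2(alphabet_size) - entropy - correction)


# A fully conserved column of only four sequences loses most of its height
print(column_information("AAAA", small_sample_correction=False))
print(column_information("AAAA", small_sample_correction=True))
```

The correction shrinks as n grows, so it mainly affects alignments with few sequences or columns that are mostly gaps.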

Collaborator

atareen commented Nov 15, 2023

Hi, have you looked into Logomaker's pseudocount correction parameter in the function alignment_to_matrix(), when creating a matrix? Please take a look and let me know if that's the type of correction you meant.

def alignment_to_matrix(sequences,
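To make the role of the pseudocount concrete, here is a minimal standalone sketch of the usual scheme: add the pseudocount to every character's count in a column, then normalize. It assumes gap characters are dropped before counting and that Logomaker behaves this way internally, which is an assumption rather than something confirmed in this thread. The helper `column_probabilities` is hypothetical and not part of Logomaker.

```python
from collections import Counter


def column_probabilities(column, alphabet, pseudocount=1.0, ignore="-."):
    """Per-character probabilities for one column with an additive pseudocount.

    Hypothetical helper for illustration; gaps in `ignore` are not counted.
    """
    counts = Counter(c for c in column if c not in ignore)
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    if total == 0:
        # Column of gaps only and no pseudocount: nothing to normalize
        return {a: 0.0 for a in alphabet}
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


column = "AA-C"  # four sequences, one gap
print(column_probabilities(column, "ACGT", pseudocount=0))
print(column_probabilities(column, "ACGT", pseudocount=1))
```

With `pseudocount=0` only the observed characters get nonzero probability; with a positive pseudocount every character in the alphabet gets some weight, which matters most when few sequences are available. In Logomaker itself the corresponding call would be along the lines of `logomaker.alignment_to_matrix(sequences, to_type='probability', pseudocount=0)`.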

Author

psyi commented Nov 16, 2023

Hi, atareen

Thanks for your reply. Setting pseudocount = 0 partly solves my problem. It forces the probability calculation to use only the characters present in each column, rather than including additional (perhaps random?) characters. This works well when a column has few gaps.

However, if a column in the alignment has many gaps, the few characters present in it get an extremely high probability. Setting pseudocount = 0.1 or higher reduces that probability, but it also brings in other characters, as the default setting does.

Is there a way to calculate the probability of each character in a column by including the gaps, but without adding pseudocounts? I looked at the parameters of the logomaker.alignment_to_matrix function, but could not find a solution.

The WebLogo tool does not have this problem.
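One way to picture what is being asked for here is to divide each character's count by the total number of sequences, so that gaps stay in the denominator but are never drawn. The sketch below is a standalone illustration under that assumption; `gap_aware_frequencies` is a hypothetical helper, not a Logomaker function, and it is not claimed to reproduce WebLogo's output.

```python
from collections import Counter


def gap_aware_frequencies(sequences, alphabet):
    """Per-position character frequencies with gaps kept in the denominator.

    Hypothetical helper for illustration. Each count is divided by the
    number of sequences, so a row sums to less than 1 where gaps occur.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(sequences)
    rows = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        # Gap characters are not in `alphabet`, so they only affect n_seqs
        rows.append({a: counts[a] / n_seqs for a in alphabet})
    return rows


aligned = ["A-", "A-", "AC", "G-"]  # synthetic alignment for illustration
for row in gap_aware_frequencies(aligned, "ACG"):
    print(row)
```

In the second column, C appears once among four sequences and gets 0.25 rather than 1.0. The rows could then be turned into a table with one row per position and one column per character (for example `pandas.DataFrame(rows)`) and passed to `logomaker.Logo`, assuming the stack heights are meant to reflect column occupancy rather than sum to 1.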


Collaborator

atareen commented Nov 17, 2023

Hi, I think I'll need to see an example of what you're asking for, with some code and synthetic/artificial data, to be able to help. Can you provide an example or a notebook?

Author

psyi commented Nov 17, 2023

Hi, atareen

Please find the attached files. There are six of them:

python code.txt contains the code for generating a logo.
aa_seq.fasta is the aligned amino acid sequence I use as the input.
aa_logo_pseudocount=0/0.2/1.pdf are three logos generated by setting the pseudocount to 0, 0.2, and 1, respectively.

aa_WebLogo.png was generated by WebLogo (https://weblogo.berkeley.edu/logo.cgi) using the same sequences as above.

Thank you.

files.zip
