corrections for small samples #37

psyi · 2023-11-14T04:57:25Z

When handling a small number of sequences, Logomaker takes into account all characters from all columns, which generates less meaningful outputs.

Is there a way to add corrections for this?

For example, Weblogo (https://weblogo.berkeley.edu/logo.cgi) has an option for Small Sample Correction.
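
For readers unfamiliar with the option: the small-sample correction used for sequence logos is usually described as subtracting an approximate error term e(n) = (s - 1) / (2 ln(2) n) from each column's information content, where s is the alphabet size and n is the number of non-gap residues in the column. The sketch below illustrates that formula in plain Python. The function name `column_information` and the clamping at zero are assumptions made for this illustration; it is not part of Logomaker or WebLogo.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20, gap_chars="-."):
    """Information content (bits) of one alignment column, reduced by the
    approximate small-sample correction e(n) = (s - 1) / (2 * ln(2) * n)."""
    # Only non-gap residues count as observations.
    residues = [c for c in column if c not in gap_chars]
    n = len(residues)
    if n == 0:
        return 0.0
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies, in bits.
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    # Clamping at zero is an illustrative choice, not taken from either tool.
    return max(0.0, math.log2(alphabet_size) - entropy - correction)


# A column with one residue among gaps gets no information after correction,
# whereas the uncorrected value would be log2(20) bits.
sparse = column_information("A---", alphabet_size=20)
dense = column_information("AAAA", alphabet_size=4)
```

With few residues in a column the correction term dominates, which is why a corrected logo shrinks sparsely populated columns instead of showing them at full height.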

atareen · 2023-11-15T14:12:04Z

Hi, have you looked into Logomaker's pseudocount correction parameter in the function alignment_to_matrix(), when creating a matrix? Please take a look and let me know if that's the type of correction you meant.

logomaker/logomaker/src/matrix.py

Line 467 in 76aae02

def alignment_to_matrix(sequences,

psyi · 2023-11-16T05:55:30Z

Hi, atareen

Thanks for your reply. Setting pseudocount = 0 partly solves my problem. It forces the probability calculation to use only the characters present in each column, rather than including additional (perhaps random?) characters. When the columns have few gaps, it works well.

However, if a column in the alignment has many gaps, the few characters it does contain get an extremely high probability. Setting pseudocount = 0.1 or higher reduces that probability, but other characters then appear in the column, as with the default setting.

Is there a way to calculate the probability for each character in a column by including the gaps, but without adding pseudocounts? I tried to look at the parameters in the logomaker.alignment_to_matrix function, but did not figure out a solution.

The Weblogo tool does not have this problem.
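
One way to read the request above is: divide each character count by the total number of sequences, so that gaps lower a column's probabilities without being drawn and without any pseudocount. The sketch below shows that calculation in plain Python. The function name `column_frequencies` and the toy sequences are hypothetical, and this is not an existing Logomaker option.

```python
from collections import Counter


def column_frequencies(sequences, gap_chars="-."):
    """Per-column character frequencies whose denominator is the total
    number of sequences, so gaps lower the frequencies of real residues."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n = len(sequences)
    result = []
    for i in range(length):
        # Count only real residues, but divide by every sequence (gaps included).
        counts = Counter(s[i] for s in sequences if s[i] not in gap_chars)
        result.append({ch: k / n for ch, k in counts.items()})
    return result


# Toy alignment (made up for illustration): the last column is mostly gaps.
seqs = ["AC-", "A--", "AG-", "T-K"]
freqs = column_frequencies(seqs)
```

Here the single `K` in the last column gets a frequency of 0.25 rather than 1.0. The resulting per-column dictionaries could presumably be assembled into a pandas DataFrame (positions as rows, characters as columns, missing values filled with 0) and passed to `logomaker.Logo`; whether Logomaker accepts rows that do not sum to one as a probability matrix is not confirmed here.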

atareen · 2023-11-17T03:23:10Z

Hi, I think I'll need to see an example of what you're asking for, with some code and synthetic/artificial data, to be able to help. Can you provide an example or notebook?

psyi · 2023-11-17T09:13:09Z

Hi, atareen

Please find the attached files. The archive contains six files:

python code.txt contains the code for generating a logo.
aa_seq.fasta is the aligned amino acid sequence I use as the input.
aa_logo_pseudocount=0/0.2/1.pdf are three logos generated by setting the pseudocount to 0, 0.2, and 1, respectively.

aa_WebLogo.png is generated by Weblogo (https://weblogo.berkeley.edu/logo.cgi) using the same sequence above.

Thank you.

files.zip
