-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mutational load function (SHM) #536
base: main
Are you sure you want to change the base?
Conversation
…nce and germline alignment
for more information, see https://pre-commit.ci
… function to api.rst
for more information, see https://pre-commit.ci
…cirpy into mutational_load
for more information, see https://pre-commit.ci
src/scirpy/tl/_mutational_load.py
Outdated
mutation_dict = {"fwr1": [], "fwr2": [], "fwr3": [], "fwr4": [], "cdr1": [], "cdr2": [], "cdr3": []} | ||
|
||
for row in range(len(airr_df)): | ||
fwr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][:78] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do the numbers of the indices come from? Can we be sure they will remain stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These indices come from the IMGT unique numbering scheme (https://pubmed.ncbi.nlm.nih.gov/12477501/). This scheme is a standard approach to ensure that we can compare different V-regions of different cells. The neat thing is that sequences are aligned in a way that fwr 1-3 and cdr1-2 are always on the same spot in the germline and sequence alignment that's why these fixed indices work. cdr3 and fwr4 can be inferred by knowing the junction length and total sequence length as it is used in my code.
for more information, see https://pre-commit.ci
src/scirpy/tl/_mutational_load.py
Outdated
), | ||
} | ||
|
||
for v, coordinates in regions.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for v, coordinates in regions.items(): | |
for region, coordinates in regions.items(): |
One letter loop variables should only be used if they follow certain conventions, e.g. i/j/k
for counters in for loops,
or k, v
for key, value pairs from dict.items()
.
Since you use v
for the dict key, this can be confusing and I suggest to use a "proper" variable name like region here.
In terms of implementation, I think we're getting there :) |
…cirpy into mutational_load
for more information, see https://pre-commit.ci
…cirpy into mutational_load
Hi Gregor, For some reason pushing these changes seem to have broken something with MuData, but I have no idea why and what I could possibly have done to cause this 😢 The error massage seems to be everywhere the same: |
Breaking mudata is not your fault. It was caused by an anndata release and should be fixed by now. Just rerun the tests :) |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #536 +/- ##
==========================================
+ Coverage 81.43% 81.75% +0.31%
==========================================
Files 49 50 +1
Lines 4213 4451 +238
==========================================
+ Hits 3431 3639 +208
- Misses 782 812 +30 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
sequence_alignment | ||
Awkward array key to access sequence alignment information | ||
germline_alignment | ||
Awkward array key to access germline alignment information -> best practice mask D-gene segment (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0243-2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do these information get there? Does dandelion / immcantation preprocessing populate these fields with "best practice" values? Does cellranger also provide them without any further processing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these columns are populated by such tools like Dandelion and Immcantation. In fact, it's the result of re-annotation with igBLAST or imgt/highv-quest that is run "under the hood" of these tools (see also https://immcantation.readthedocs.io/en/stable/getting_started/10x_tutorial.html#assign-v-d-and-j-genes-using-igblast). Both dandelion and Immcantation follow the AIRR Community Standard, meaning that both "sequence_alignment" and "germline_alignment" should always result in the IMGT-gapped sequence (see also https://immcantation.readthedocs.io/en/stable/datastandards.html)
As far as I am aware does cellranger not provide it in this format, which is the main reason why we have to re-annotate cellranger output in the first place 😢
@MKanetscheider, two more considerations
LMK what you think! |
Unfortunately yes, if sequences are not IMGT aligned this whole function would be either non-functional any more or just returning some random nonsense 😢 I think I actually do already check for length, because I count differences via hamming-distance, which raises a ValueError if germline and sequence alignments have different lengths. However, I would also like to have a better safety net, but I couldn't come up with anything else so far...
I think this sounds rather amazing...as you are already well aware, this whole function is rather ugly and not that user-friendly at the moment...I would really love to see it in a more compact format |
I think that's good enough then. After all it's also clearly stated in the documentation.
I don't think it would be an issue, because it will still be linear over the number of chains. The part Felix has been working on compares all-vs-all sequences, which is quadratic over the number of sequences and therefore a much harder problem. |
if num_chars == 0: | ||
return np.nan # can be used as a flag for filtering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this basically returns None when no (non-ignored) characters were compared. Could you please elaborate what's the reasoning behind this instead of returning 0?
I gave it a try in #573. I'd still need to deal with a bunch of edge cases, but I think the approach is viable. |
"cdr3": (312, 312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6), | ||
"fwr4": ( | ||
312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6, | ||
len(airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"]), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a reference for this calculation?
From Lefranc 2023:
rearranged CDR3-IMGT of 13 amino acids (and toa JUNCTION of 15 amino acids, 2nd-CYS 104 and J-TRP or J-PHE 118 being included in the JUNCTION definition). This numbering is convenient to use since 80% of the IMGT/LIGM-DBIG and TR rearranged sequences have a CDR3 IMGT length less than or equal to 13 amino acids (IMGT statistics, October 2001). For rearranged CDR3-IMGT less than 13 amino acids, gaps are created from the top of the loop, in the following order 111, 112, 110, 113, 109, 114, etc. (Table 4A). For rearranged CDR3-IMGT more than 13 amino acids, additional positions are created, between positions 111 and 112 at the top of the CDR3-IMGT loop, in the following order 112.1, 111.1, 112.2, 111.2, 112.3, 111.3, etc. (Table 4B).
From that text, I'd rather deduce something like
312 + max(junction_length, 13 * 3) # * 3 because 13 amino acids into nucleotides
But I haven't looked at the actual data myself.
Added mutational_load function to calculate differences between sequence and germline alignment. This is especially useful/insightful for BCR due to SHM and help to understand how much mutational actually occurred. However, this is a rather simple approach!
Closes #...