KeyError when scoring interactions in CellPhoneDB #190

GUOYF0412 · 2024-05-23T02:32:19Z

Description

I encountered a KeyError when running the statistical analysis method in CellPhoneDB with interaction scoring enabled. I want to gracefully handle missing keys by returning np.nan instead of raising an error.

Steps to Reproduce

Load the anndata object with normalized counts and cell type metadata.
Generate the metadata file for CellPhoneDB.
Run the cpdb_statistical_analysis_method.call with the following parameters:

Load the anndata object with normalized counts and cell type metadata.
Generate the metadata file for CellPhoneDB.
Run the cpdb_statistical_analysis_method.call with the following parameters:

prepare data

import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import glob
import os
import sys
from scipy import sparse

sc.settings.verbosity = 1  # verbosity: errors (0), warnings (1), info (2), hints (3)
sys.executable

# 1. Load anndata
adata = sc.read_h5ad('./write/adata.h5ad')
# 2. Generate meta
adata.obs['subtype'].values.describe()
df_meta = pd.DataFrame(data={'Cell':list(adata.obs.index),
                             'celltype':[ i for i in adata.obs['subtype']]
                            })
df_meta.set_index('Cell', inplace=True)
df_meta.to_csv('./write/adata_meta.tsv', sep = '\t')

cpdb_statistical_analysis_method.call

from IPython.display import HTML, display
from cellphonedb.utils import db_releases_utils

display(HTML(db_releases_utils.get_remote_database_versions_html()['db_releases_html_table']))

Version Release date
v5.0.0 2023-10-31
v4.1.0 2023-03-09

cpdb_version = 'v5.0.0'
cpdb_target_dir = os.path.join('./write/cellphoneDB_database', cpdb_version)
from cellphonedb.utils import db_utils
db_utils.download_database(cpdb_target_dir, cpdb_version)

meta_file_path = './write/adata_meta.tsv'
counts_file_path = './write/adata.h5ad'
cpdb_file_path = './write/cellphoneDB_database/v5.0.0/cellphonedb.zip'

following is the keyerror code

First Attempt

cpdb_results = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 # mandatory: CellphoneDB database zip file.
    meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,             # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
    counts_data = 'hgnc_symbol',                     # defines the gene annotation in counts matrix.
   # active_tfs_file_path = active_tf_path,          # optional: defines cell types and their active TFs.
   # microenvs_file_path = microenvs_file_path,      # optional (default: None): defines cells per microenvironment.
    score_interactions = True,                       # optional: whether to score interactions or not. 
    iterations = 1000,                               # denotes the number of shufflings performed in the analysis.
    threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
    threads = 5,                                     # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug randome seed. To disable >=0.
    result_precision = 3,                            # Sets the rounding for the mean values in significan_means.
    pvalue = 0.05,                                   # P-value threshold to employ for significance.
    subsampling = False,                             # To enable subsampling the data (geometri sketching).
    subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # Number of componets to subsample via geometric skectching (dafault: 100).
    subsampling_num_cells = 1000,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
    output_path =  "./write/cellphonedb_out",                          # Path to save results.
    output_suffix = "SMGs"                             # Replaces the timestamp in the output files by a user defined string in the  (default: None).
    )

[CORE][22/05/24-17:42:04][INFO] Scoring interactions: Calculating scores for all interactions and cell types..
100%|██████████| 1444/1444 [05:53<00:00, 4.08it/s]

KeyError Cell In[15], line 1
----> 1 cpdb_results 2 cpdb_file_path = cpdb_file_path, 3 meta_file_path = meta_file_path, 4 counts_file_path = counts_file_path, 5 counts_data = 'hgnc_symbol', 6 # active_tfs_file_path 7 # microenvs_file_path 8 score_interactions = True, 9 iterations = 1000, 10 threshold = 0.1, 11 threads = 5, 12 debug_seed = 42, 13 result_precision = 3, 14 pvalue = 0.05, 15 subsampling = False, 16 subsampling_log = False, 17 subsampling_num_pc = 100, 18 subsampling_num_cells = 1000, 19 separator = '|', 20 debug = False, 21 output_path = 22 output_suffix = "SMGs" 23 ) Traceback (most recent call last)
= cpdb_statistical_analysis_method.call(
# mandatory: CellphoneDB database zip file.
# mandatory: tsv file defining barcodes to cell label.
# mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
# defines the gene annotation in counts matrix.
= active_tf_path, # optional: defines cell types and their active TFs.
= microenvs_file_path, # optional (default: None): defines cells per microenvironment.
# optional: whether to score interactions or not.
# denotes the number of shufflings performed in the analysis.
# defines the min % of cells expressing a gene for this to be employed in the analysis.
# number of threads to use in the analysis.
# debug randome seed. To disable >=0.
# Sets the rounding for the mean values in significan_means.
# P-value threshold to employ for significance.
# To enable subsampling the data (geometri sketching).
# (mandatory) enable subsampling log1p for non log-transformed data inputs.
# Number of componets to subsample via geometric skectching (dafault: 100).
# Number of cells to subsample (integer) (default: 1/3 of the dataset).
# Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
# Saves all intermediate tables employed during the analysis in pkl format.
"./write/cellphonedb_out", # Path to save results.
# Replaces the timestamp in the output files by a user defined string in the (default: None).

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_method.py:157, in call(cpdb_file_path, meta_file_path, counts_file_path, counts_data, output_path, microenvs_file_path, active_tfs_file_path, iterations, threshold, threads, debug_seed, result_precision, pvalue, subsampling, subsampling_log, subsampling_num_pc, subsampling_num_cells, separator, debug, output_suffix, score_interactions)
154 if score_interactions:
155 # Make sure all cell types are strings
156 meta['cell_type'] = meta['cell_type'].apply(str)
--> 157 interaction_scores = scoring_utils.score_interactions_based_on_participant_expressions_product(
158 cpdb_file_path, counts4scoring, means_result.copy(), separator, meta, threshold, "cell_type", threads)
159 analysis_result['interaction_scores'] = interaction_scores
161 file_utils.save_dfs_as_tsv(output_path, output_suffix, "statistical_analysis", analysis_result)

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/utils/scoring_utils.py:344, in score_interactions_based_on_participant_expressions_product(cpdb_file_path, counts, means, separator, metadata, threshold, cell_type_col_name, threads)
340 cpdb_fms = scale_expression(cpdb_fmsh,
341 upper_range=10)
343 # Step 5: calculate the ligand-receptor score.
--> 344 interaction_scores = score_product(matrix=cpdb_fms,
345 means=means,
346 separator=separator,
347 interactions=interactions,
348 id2name=id2name,
349 threads=threads)
350 return interaction_scores

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/utils/scoring_utils.py:290, in score_product(matrix, interactions, means, separator, id2name, threads)
288 for ct_pair, lr_scores_filtered in results:
289 interacting_pair2score = dict(zip(lr_scores_filtered['interacting_pair'], lr_scores_filtered['score']))
--> 290 interaction_scores[ct_pair] = [interacting_pair2score[id] for id in interaction_scores['interacting_pair']]
292 return interaction_scores

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/utils/scoring_utils.py:290, in (.0)
288 for ct_pair, lr_scores_filtered in results:
289 interacting_pair2score = dict(zip(lr_scores_filtered['interacting_pair'], lr_scores_filtered['score']))
--> 290 interaction_scores[ct_pair] = [interacting_pair2score[id] for id in interaction_scores['interacting_pair']]
292 return interaction_scores

KeyError: 'COL11A1_integrin_a11b1_complex'

Proposed Solution:

I propose to modify the score_product function to handle missing keys gracefully by returning np.nan. Here is the modified score_product function:

from joblib import Parallel, delayed
import numpy as np
import pandas as pd

def score_interaction(ct_pair, matrix, interactions, id2name, separator):
    sub_matrix = matrix[[ct_pair]].copy()
    sub_matrix = sub_matrix.dropna()
    sub_matrix = sub_matrix.T
    scores = []

    for idx, row in sub_matrix.iterrows():
        interaction_id = row.name
        sub_prod = row.product()
        try:
            geom = np.power(sub_prod, 1 / len(row))
        except ValueError:
            geom = np.nan
        scores.append((interaction_id, geom))

    scores_filtered = pd.DataFrame(scores, columns=['interacting_pair', 'score'])
    return ct_pair, scores_filtered

def score_product(matrix, interactions, means, separator, id2name, threads):
    interaction_scores = means[['interacting_pair']].copy()
    interaction_scores = interaction_scores.assign(**{col: np.nan for col in matrix.columns})
    
    results = Parallel(n_jobs=threads, backend='multiprocessing')(
        delayed(score_interaction)(
            ct_pair, matrix, interactions, id2name, separator
        ) for ct_pair in matrix.columns
    )

    for ct_pair, lr_scores_filtered in results:
        interacting_pair2score = dict(zip(lr_scores_filtered['interacting_pair'], lr_scores_filtered['score']))
        scores = []
        for id in interaction_scores['interacting_pair']:
            try:
                scores.append(interacting_pair2score[id])
            except KeyError:
                scores.append(np.nan)
        interaction_scores[ct_pair] = scores

    return interaction_scores

KeyError: 'COL11A1_integrin_a11b1_complex'

Environment

Python version: 3.8
CellPhoneDB version: [specific version]
Operating System: [your operating system]

The text was updated successfully, but these errors were encountered:

GUOYF0412 · 2024-05-23T02:35:37Z

thanks for authors and scRNA-seq genius answering my questions,

cakirb · 2024-05-28T15:29:05Z

Hi @GUOYF0412,

Just to double-check, are you still having this issue since I couldn't be sure because the issue was closed and reopened.
If you're still having the issues, could you make sure that your data was normalised without z-scaling so the zeros in the matrix are not transformed?

Best,
Batu

GUOYF0412 · 2024-05-29T02:11:29Z

Hi @GUOYF0412,

Just to double-check, are you still having this issue since I couldn't be sure because the issue was closed and reopened. If you're still having the issues, could you make sure that your data was normalised without z-scaling so the zeros in the matrix are not transformed?

Best, Batu

thanks for your explanation, adata.X is here

and the adata is here

here are my harmony codes

def run_harmony(adata):
    # normalize
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.raw = adata
    adata.layers['lognorm'] = adata.X.copy()
    # HVGs
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
    #  regress mt genes
    sc.pp.regress_out(adata, 'pct_counts_mt')
    
    # scale
    sc.pp.scale(adata)
    # regress cell cycle
    sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
    sc.pp.regress_out(adata, ['S_score', 'G2M_score'])
    
    # scale
    sc.pp.scale(adata)
    
    # pca
    sc.tl.pca(adata, svd_solver='arpack')
    sc.external.pp.harmony_integrate(adata, key='sample')
    
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, use_rep='X_pca_harmony')
    sc.tl.umap(adata)
    
    # res
    for i in [0.1, 0.2, 0.3, 0.5, 0.8, 1]:
        sc.tl.leiden(adata,resolution=i, key_added='leiden {}'.format(i))
    if not os.path.isdir('harmony'):
        os.mkdir('harmony')
    sc.pl.umap(adata,color=['leiden 0.1','leiden 0.2','leiden 0.3'],show = False)
    plt.savefig('harmony/resolution_low.pdf')
    sc.pl.umap(adata,color=['leiden 0.5','leiden 0.8','leiden 1'],show = False)
    plt.savefig('harmony/resolution_high.pdf')
    return adata

besides, I resolve code errors by modifying the source code
my new question is that there are so many warning when I running the analysis method
RuntimeWarning: invalid value encountered in power
geom = np.power(sub_prod, 1 / len(sub_values))

Xiongsq0720 · 2024-06-03T03:05:23Z

I also found an issue where score_interactions function works successfully in v4.1.0 and v5.0.0, but in v4.0.0 it also gives an error (keyerror), and when set to false, it runs successfully.

GUOYF0412 · 2024-06-26T13:31:31Z

I also found an issue where score_interactions function works successfully in v4.1.0 and v5.0.0, but in v4.0.0 it also gives an error (keyerror), and when set to false, it runs successfully.

perhaps the utlis file error, you could try to revise the original .py code. I did this and the score_interactions function works well!

GUOYF0412 · 2024-08-19T12:50:53Z

Probably the problems count for your adata count, which is the count data nor the log normalized data From: XiongSQ ***@***.***> Sent: Monday, June 3, 2024 11:06 AM To: ventolab/CellphoneDB ***@***.***> Cc: Jason GUO ***@***.***>; Mention ***@***.***> Subject: Re: [ventolab/CellphoneDB] KeyError when scoring interactions in CellPhoneDB (Issue #190) I also found an issue where score_interactions function works successfully in v4.1.0 and v5.0.0, but in v4.0.0 it also gives an error (keyerror), and when set to false, it runs successfully. — Reply to this email directly, view it on GitHub<#190 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AZNZINDQ3MH7BNTV4M5MMHDZFPMQRAVCNFSM6AAAAABIEXUIJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBUGE4TGMRYHA>. You are receiving this because you were mentioned.Message ID: ***@***.******@***.***>>

Xiongsq0720 · 2024-08-27T12:58:53Z

Probably the problems count for your adata count, which is the count data nor the log normalized data From: XiongSQ @.> Sent: Monday, June 3, 2024 11:06 AM To: ventolab/CellphoneDB @.> Cc: Jason GUO @.>; Mention @.> Subject: Re: [ventolab/CellphoneDB] KeyError when scoring interactions in CellPhoneDB (Issue #190) I also found an issue where score_interactions function works successfully in v4.1.0 and v5.0.0, but in v4.0.0 it also gives an error (keyerror), and when set to false, it runs successfully. — Reply to this email directly, view it on GitHub<#190 (comment)>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AZNZINDQ3MH7BNTV4M5MMHDZFPMQRAVCNFSM6AAAAABIEXUIJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBUGE4TGMRYHA. You are receiving this because you were mentioned.Message ID: @.@.>>

Sorry for the delayed response, and thank you very much for your reply. However, my adata X is indeed a normalized count matrix. And running in two other versions of db is normal. QAQ

L-wang17 · 2024-09-26T03:00:49Z

Hi @GUOYF0412,

Just to double-check, are you still having this issue since I couldn't be sure because the issue was closed and reopened. If you're still having the issues, could you make sure that your data was normalised without z-scaling so the zeros in the matrix are not transformed?

Best, Batu

Hi@

Hi @GUOYF0412,

Just to double-check, are you still having this issue since I couldn't be sure because the issue was closed and reopened. If you're still having the issues, could you make sure that your data was normalised without z-scaling so the zeros in the matrix are not transformed?

Best, Batu

Hi @cakirb,

I have similar issue with same iterate RuntimeWarning: invalid value encountered in power
geom = np.power(sub_prod, 1 / len(sub_values)) and a KeyError.

I am using the newly updated CellphoneDB and database v5.0.0.

And it is not the dash "-"issue. I was thinking is it the input value problem? They are normalized and scaled because the 3 samples need to be pooled with harmony. I have no clue what is wrong.

GUOYF0412 closed this as completed May 23, 2024

GUOYF0412 reopened this May 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KeyError when scoring interactions in CellPhoneDB #190

KeyError when scoring interactions in CellPhoneDB #190

GUOYF0412 commented May 23, 2024

GUOYF0412 commented May 23, 2024

cakirb commented May 28, 2024

GUOYF0412 commented May 29, 2024

Xiongsq0720 commented Jun 3, 2024

GUOYF0412 commented Jun 26, 2024

GUOYF0412 commented Aug 19, 2024 via email

Xiongsq0720 commented Aug 27, 2024

L-wang17 commented Sep 26, 2024 •

edited

Loading

KeyError when scoring interactions in CellPhoneDB #190

KeyError when scoring interactions in CellPhoneDB #190

Comments

GUOYF0412 commented May 23, 2024

Description

Steps to Reproduce

prepare data

cpdb_statistical_analysis_method.call

First Attempt

[CORE][22/05/24-17:42:04][INFO] Scoring interactions: Calculating scores for all interactions and cell types.. 100%|██████████| 1444/1444 [05:53<00:00, 4.08it/s]

Proposed Solution:

Environment

GUOYF0412 commented May 23, 2024

cakirb commented May 28, 2024

GUOYF0412 commented May 29, 2024

Xiongsq0720 commented Jun 3, 2024

GUOYF0412 commented Jun 26, 2024

GUOYF0412 commented Aug 19, 2024 via email

Xiongsq0720 commented Aug 27, 2024

L-wang17 commented Sep 26, 2024 • edited Loading

[CORE][22/05/24-17:42:04][INFO] Scoring interactions: Calculating scores for all interactions and cell types..
100%|██████████| 1444/1444 [05:53<00:00, 4.08it/s]

L-wang17 commented Sep 26, 2024 •

edited

Loading