Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: library and sample format in cas9.py #45

Closed
akshitasax opened this issue Apr 16, 2024 · 6 comments
Closed

Question: library and sample format in cas9.py #45

akshitasax opened this issue Apr 16, 2024 · 6 comments
Assignees
Labels
question Further information is requested

Comments

@akshitasax
Copy link

Hello!

I am interested in using ScreenPro2/ScreenPro to analyze reads from a dual-guide CRISPR screen. Specifically, I wanted to get counts from my reads files and map them to my dual-guide library. I understand that user guidelines for the 'Fastq to counts' pipeline have not been posted to the readme yet, so I wanted to try to run the functions in Cas9.py on a jupyter notebook. What is the format of the dual-guide library that is given as an input in the map_sample_counts_to_library(library, sample) function? And is 'sample' the df_count returned by fastq_to_count_dual_guide()?

Thank you,
Akshita

@abearab abearab added the question Further information is requested label Apr 16, 2024
@abearab
Copy link
Collaborator

abearab commented Apr 16, 2024

Hello Akshita,

Thanks for your interest in using ScreenPro2, happy to respond to any questions!

I understand that user guidelines for the 'Fastq to counts' pipeline have not been posted to the readme yet, so I wanted to try to run the functions in Cas9.py on a jupyter notebook.

Sounds like a good plan! As part of #44, I'm actively working on more user friendly interface for ScreenPro2.

What is the format of the dual-guide library that is given as an input in the map_sample_counts_to_library(library, sample) function?

That's a good question! You need to have the have the library table and modify that based on your sequencing strategy. Can you tell me if what CRISPR library and sequencing protocol you are using? Is it from Replogle et al., eLife (2022) and in that case your library-prep protocol should be this, right? Once you clarify I can help you run the codes in jupyter.

And is 'sample' the df_count returned by fastq_to_count_dual_guide()?

That's true. This function is going to be changes in the next release. For the meantime, this is what I have done: I modified df_count to created a "sequence" column which would be matched with the "sequence" column in the library table. It depends on the sequencing strategy.


This is how I did fastq2count task for a dataset based on the Replogle et al., eLife (2022) protocol:

    df_count = cas9.fastq_to_count_dual_guide(
        R1_fastq_file_path=f'{fastq_dir}/{sample_id}_R1.fastq.gz',
        R2_fastq_file_path=f'{fastq_dir}/{sample_id}_R2.fastq.gz',
        trim5p_pos1_start=1,
        trim5p_pos1_length=19,
        trim5p_pos2_start=1,
        trim5p_pos2_length=19,
        verbose=verbose
    )

@akshitasax
Copy link
Author

Thanks for your response!

Our library is a dual-gRNA library, each protospacer being 20 bp, which we cloned using this protocol from the Weissman lab https://weissman.wi.mit.edu/resources/2022_crispri_protocols/Protocol_1_dual_sgRNA_lib_cloning.pdf
The library consists of 20 different guide pairs, without any unique indexing sequence for each pair.

We sequenced our samples via paired-end 250 NGS. And I already have trimmed R1 and R2 files that we processed using Geneious. So every read should already be trimmed to the 20 bp guide sequence that we are interested in counting. So, I was currently omitting the trimming arguments in the count function.

I was able to run the count function and obtain a df_count output. Let me know how the library table should be set up, what column names I should be using, etc. And let me know if any more information would help!

Thanks,
Akshita

@abearab
Copy link
Collaborator

abearab commented Apr 16, 2024

Thanks for your response!

Our library is a dual-gRNA library, each protospacer being 20 bp, which we cloned using this protocol from the Weissman lab https://weissman.wi.mit.edu/resources/2022_crispri_protocols/Protocol_1_dual_sgRNA_lib_cloning.pdf The library consists of 20 different guide pairs, without any unique indexing sequence for each pair.

We sequenced our samples via paired-end 250 NGS. And I already have trimmed R1 and R2 files that we processed using Geneious. So every read should already be trimmed to the 20 bp guide sequence that we are interested in counting. So, I was currently omitting the trimming arguments in the count function.

I was able to run the count function and obtain a df_count output. Let me know how the library table should be set up, what column names I should be using, etc. And let me know if any more information would help!

Thanks, Akshita

Okay, it make sense. If you already have your df_count, then that tells you the abundance for each sgRNA pairs. Now you need to map that to your library table. Your library table should have protospacer A/B sequence and you can concatenate them with a ; separator. I would name this new column "sequence" and make a similar column from df_count. See this example code:

    res = df_count.copy()
    res = res.sort('count', descending=True)
    res = res.with_columns(
        pl.concat_str(
            [
                pl.col("protospacer_a"),
                pl.col("protospacer_b")
            ],
            separator=";",
        ).alias("sequence"),
    )

    res_map = pl.DataFrame(library).join(
            res, on="sequence", how="left"
        )

** pl stands for polars


You may need to iterate at this step to make sure your "sequence" columns are defined properly (for instance, we don't sequence the first bp so we have 19bp in R1 / R2 instead of 20; yours can be a bit different). As a simple QC, you can do something like this

        print("% mapped reads",
            100 * \
            res_map.to_pandas()['count'].fillna(0).sum() / \
            int(res.select(pl.sum("count")).to_pandas()['count'])
        )

Best of luck.

@abearab
Copy link
Collaborator

abearab commented Apr 16, 2024

@akshitasax, in the meantime, if you don't mind please ⭐ the repo ;)

@abearab abearab self-assigned this Apr 16, 2024
@abearab
Copy link
Collaborator

abearab commented Apr 26, 2024

#47 discussed here.

@abearab
Copy link
Collaborator

abearab commented Apr 26, 2024

@akshitasax – I think you get your answers here, right? Fell free to open new issue / reopen this / or email me if you have other questions.

@abearab abearab closed this as completed Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants