Stats for gene_models_main Files #31

Open
ctcncgr opened this issue Oct 9, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@ctcncgr
Member

ctcncgr commented Oct 9, 2023

We need to come up with some stats that we want to display for the gene_models files that are consumed.

It's pretty easy to do counts of field 3 (the feature type column) and just report these, but should we do more?
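For reference, counting field 3 could look something like the minimal sketch below. It assumes the gene_models files are tab-delimited GFF3-style text; the function name and the tiny example records are made up for illustration and are not part of any existing code.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each value appears in column 3 of GFF3-style lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section has no feature rows
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a feature row
        counts[fields[2]] += 1
    return counts


# Tiny made-up example, not taken from any real gene_models file.
example = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t500\t.\t+\t.\tID=g1",
    "chr1\t.\tmRNA\t1\t500\t.\t+\t.\tID=g1.1;Parent=g1",
    "chr1\t.\tCDS\t1\t200\t.\t+\t0\tParent=g1.1",
    "chr1\t.\tCDS\t300\t500\t.\t+\t1\tParent=g1.1",
]
for feature_type, n in sorted(count_feature_types(example).items()):
    print(f"{feature_type}\t{n}")
```

Because the function takes any iterable of lines, it should also work on an open file handle (plain or opened via `gzip.open(path, "rt")`).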

@ctcncgr ctcncgr added the enhancement New feature or request label Oct 9, 2023
@adf-ncgr
Contributor

Hi Connor-
I haven't thought about this too much, but I think it would be nice to have somewhat more comprehensive stats. After glancing at and quick-testing a few ready-made options for getting something along these lines, I'm leaning towards agat_sp_statistics.pl, which is part of AGAT (https://agat.readthedocs.io/en/latest/index.html).
Here's a snippet of the output for ann1 vs. ann2 on arahy.Tifrunner.gnm2 (I just pasted the individual results together to get a side-by-side comparison and satisfy my own curiosity):

Compute mrna with isoforms if any	Compute mrna with isoforms if any

Number of genes                              67005	Number of genes                              81717
Number of mrnas                              84519	Number of mrnas                              81717
Number of mrnas with utr both sides          40230	Number of mrnas with utr both sides          45192
Number of mrnas with at least one utr        58797	Number of mrnas with at least one utr        51453
Number of cdss                               84519	Number of cdss                               81717
Number of exons                              543827	Number of exons                              424835
Number of five_prime_utrs                    48482	Number of five_prime_utrs                    47678
Number of three_prime_utrs                   50545	Number of three_prime_utrs                   48967
Number of exon in cds                        501220	Number of exon in cds                        395401
Number of exon in five_prime_utr             69934	Number of exon in five_prime_utr             64811
Number of exon in three_prime_utr            71327	Number of exon in three_prime_utr            60798
Number of intron in cds                      416701	Number of intron in cds                      313684
Number of intron in exon                     459308	Number of intron in exon                     343118
Number of intron in five_prime_utr           21452	Number of intron in five_prime_utr           17133
Number of intron in three_prime_utr          20782	Number of intron in three_prime_utr          11831
Number gene overlapping                      2619	Number gene overlapping                      6252
Number of single exon gene                   4342	Number of single exon gene                   10607
Number of single exon mrna                   4342	Number of single exon mrna                   10607
mean mrnas per gene                          1.3	mean mrnas per gene                          1.0
mean cdss per mrna                           1.0	mean cdss per mrna                           1.0
mean exons per mrna                          6.4	mean exons per mrna                          5.2
mean five_prime_utrs per mrna                0.6	mean five_prime_utrs per mrna                0.6
mean three_prime_utrs per mrna               0.6	mean three_prime_utrs per mrna               0.6
mean exons per cds                           5.9	mean exons per cds                           4.8
mean exons per five_prime_utr                1.4	mean exons per five_prime_utr                1.4
mean exons per three_prime_utr               1.4	mean exons per three_prime_utr               1.2
mean introns in cdss per mrna                4.9	mean introns in cdss per mrna                3.8
mean introns in exons per mrna               5.4	mean introns in exons per mrna               4.2
mean introns in five_prime_utrs per mrna     0.3	mean introns in five_prime_utrs per mrna     0.2
mean introns in three_prime_utrs per mrna    0.2	mean introns in three_prime_utrs per mrna    0.1
Total gene length                            262875621	Total gene length                            303584458
Total mrna length                            352990253	Total mrna length                            303584458
Total cds length                             102403200	Total cds length                             89246482
Total exon length                            153915895	Total exon length                            130524008
Total five_prime_utr length                  19543492	Total five_prime_utr length                  16312458
Total three_prime_utr length                 31969203	Total three_prime_utr length                 24965068
Total intron length per cds                  180667164	Total intron length per cds                  150143918
Total intron length per exon                 199074358	Total intron length per exon                 173060450
Total intron length per five_prime_utr       10519071	Total intron length per five_prime_utr       13710523
Total intron length per three_prime_utr      7653779	Total intron length per three_prime_utr      8813277
mean gene length                             3923	mean gene length                             3715
mean mrna length                             4176	mean mrna length                             3715
mean cds length                              1211	mean cds length                              1092
mean exon length                             283	mean exon length                             307
mean five_prime_utr length                   403	mean five_prime_utr length                   342
mean three_prime_utr length                  632	mean three_prime_utr length                  509
...
Longest gene                                 342359	Longest gene                                 342359
Longest mrna                                 342359	Longest mrna                                 342359
Longest cds                                  16374	Longest cds                                  16272
Longest exon                                 14759	Longest exon                                 72007
Longest five_prime_utr                       15289	Longest five_prime_utr                       54108
Longest three_prime_utr                      15367	Longest three_prime_utr                      51521
Longest cds piece                            7977	Longest cds piece                            7977
Longest five_prime_utr piece                 14561	Longest five_prime_utr piece                 54108
Longest three_prime_utr piece                9844	Longest three_prime_utr piece                51521
Longest intron into cds part                 177377	Longest intron into cds part                 192003
Longest intron into exon part                177377	Longest intron into exon part                194085
Longest intron into five_prime_utr part      10997	Longest intron into five_prime_utr part      129751
Longest intron into three_prime_utr part     9921	Longest intron into three_prime_utr part     194085
Shortest gene                                163	Shortest gene                                102
Shortest mrna                                163	Shortest mrna                                102
Shortest cds                                 75	Shortest cds                                 78
Shortest exon                                3	Shortest exon                                1
Shortest five_prime_utr                      1	Shortest five_prime_utr                      1
Shortest three_prime_utr                     1	Shortest three_prime_utr                     1
Shortest cds piece                           1	Shortest cds piece                           1
Shortest five_prime_utr piece                1	Shortest five_prime_utr piece                1
Shortest three_prime_utr piece               1	Shortest three_prime_utr piece               1
Shortest intron into cds part                4	Shortest intron into cds part                4
Shortest intron into exon part               4	Shortest intron into exon part               4
Shortest intron into five_prime_utr part     5	Shortest intron into five_prime_utr part     17
Shortest intron into three_prime_utr part    12	Shortest intron into three_prime_utr part    18
...

Arguably that's more info than we'd want to cram into a DSCensor report, but we could always be more selective about what we expose there.
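As a rough sketch of what "being selective" might look like: the snippet below parses `label  value` lines of the kind shown in the output above (one annotation at a time, not the pasted side-by-side form) and keeps only a handful of metrics. The function names and the `SELECTED` list are hypothetical choices for illustration, not anything DSCensor or AGAT defines.

```python
def parse_stats(text):
    """Parse 'label   value' lines into a dict; lines without a trailing number are skipped."""
    stats = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated token
        if len(parts) != 2:
            continue  # e.g. blank lines or a bare "..."
        label, value = parts
        try:
            number = float(value) if "." in value else int(value)
        except ValueError:
            continue  # e.g. header lines such as "Compute mrna with isoforms if any"
        stats[label.strip()] = number
    return stats


def select_stats(stats, wanted):
    """Keep only the wanted labels, in the order given, ignoring any that are absent."""
    return {label: stats[label] for label in wanted if label in stats}


# Hypothetical short list of metrics to expose in a report.
SELECTED = ["Number of genes", "Number of mrnas", "mean exons per mrna", "mean gene length"]

# A few lines copied from the left-hand column of the output above.
sample = """\
Compute mrna with isoforms if any

Number of genes                              67005
Number of mrnas                              84519
Number of cdss                               84519
mean exons per mrna                          6.4
mean gene length                             3923
...
"""

report = select_stats(parse_stats(sample), SELECTED)
for label, value in report.items():
    print(f"{label}: {value}")
```

Keeping the selection as a plain list means the set of exposed metrics can be changed without touching the parsing.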
One nice thing about this tool is that it seems to do a lot of inference about things like exons and introns even when they are not explicit in the file (e.g. if you only have CDS and UTRs). Also, it doesn't seem to be as fussy about validation as some tools are.
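To illustrate the general idea of that kind of inference (this is only a sketch of the concept, not how AGAT actually implements it): if a transcript has only CDS and UTR segments, exons can be approximated by merging touching or overlapping segments, and introns are then the gaps between consecutive exons. Coordinates are assumed to be 1-based and inclusive, as in GFF3; the function names and example coordinates are made up.

```python
def merge_into_exons(segments):
    """Merge overlapping or directly adjacent (start, end) segments into exons."""
    exons = []
    for start, end in sorted(segments):
        if exons and start <= exons[-1][1] + 1:
            # Segment touches or overlaps the previous exon: extend it.
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons


def infer_introns(exons):
    """Return the gaps between consecutive exons as (start, end) introns."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Made-up transcript: 5' UTR, two CDS pieces, 3' UTR.
segments = [(1, 50), (51, 100), (201, 300), (301, 350)]
exons = merge_into_exons(segments)
introns = infer_introns(exons)
print("exons:", exons)
print("introns:", introns)
```

A real tool also has to deal with strand, multiple isoforms per gene, and features split across parents, none of which this sketch attempts.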

We could (and probably should) bring others into the conversation about it, but I wanted to at least give you something to chew on for starters.

2 participants