Data Sources Stats and Figure #71

Open
amc-corey-cox opened this issue Feb 21, 2024 · 5 comments
@amc-corey-cox
Collaborator

We'd like to recapitulate this figure, at least the stats:
Screenshot 2024-02-21 at 10 23 34 AM

This will likely require some retooling of the reports we use to generate the site. We may also need to re-architect the site a bit to allow for different QC/Stat views.

@amc-corey-cox
Collaborator Author

Here is some additional context. Kevin was able to generate this table with the query below:

select category, namespace, count(*) from denormalized_nodes where category in ('biolink:Gene','biolink:Pathway') group by 1,2 having count(*) > 1 order by 1,2;
┌─────────────────┬───────────┬──────────────┐
│    category     │ namespace │ count_star() │
│     varchar     │  varchar  │    int64     │
├─────────────────┼───────────┼──────────────┤
│ biolink:Gene    │ FB        │        30284 │
│ biolink:Gene    │ HGNC      │        43840 │
│ biolink:Gene    │ MGI       │        79680 │
│ biolink:Gene    │ NCBIGene  │       196312 │
│ biolink:Gene    │ PomBase   │         5134 │
│ biolink:Gene    │ RGD       │        57146 │
│ biolink:Gene    │ SGD       │         7153 │
│ biolink:Gene    │ WB        │        48779 │
│ biolink:Gene    │ Xenbase   │        38732 │
│ biolink:Gene    │ ZFIN      │        38000 │
│ biolink:Gene    │ dictyBase │        14222 │
│ biolink:Pathway │ GO        │          645 │
│ biolink:Pathway │ Reactome  │        21441 │
├─────────────────┴───────────┴──────────────┤
│ 13 rows                          3 columns │
└────────────────────────────────────────────┘
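For anyone who wants to try the shape of this query without the real database, here is a minimal sketch. It assumes a toy `denormalized_nodes` table with invented rows, and it uses Python's built-in `sqlite3` only so it runs without extra dependencies; the table above looks like DuckDB output, and the real schema may well have more columns. The helper name `node_counts` is hypothetical.

```python
import sqlite3

# Invented rows standing in for denormalized_nodes (id, category, namespace).
# None of these are real counts or real identifiers.
TOY_NODES = [
    ("HGNC:1", "biolink:Gene", "HGNC"),
    ("HGNC:2", "biolink:Gene", "HGNC"),
    ("MGI:1", "biolink:Gene", "MGI"),
    ("MGI:2", "biolink:Gene", "MGI"),
    ("MGI:3", "biolink:Gene", "MGI"),
    ("ZFIN:1", "biolink:Gene", "ZFIN"),        # only one row, dropped by HAVING
    ("REACT:1", "biolink:Pathway", "Reactome"),
    ("REACT:2", "biolink:Pathway", "Reactome"),
    ("MONDO:1", "biolink:Disease", "MONDO"),   # filtered out by the category list
    ("MONDO:2", "biolink:Disease", "MONDO"),
]

# Same query as in the comment above, with an alias on the count column.
NODE_QUERY = """
select category, namespace, count(*) as n
from denormalized_nodes
where category in ('biolink:Gene', 'biolink:Pathway')
group by 1, 2
having count(*) > 1
order by 1, 2
"""


def node_counts(conn):
    """Return (category, namespace, count) rows for the node stats table."""
    return conn.execute(NODE_QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table denormalized_nodes (id text, category text, namespace text)"
)
conn.executemany("insert into denormalized_nodes values (?, ?, ?)", TOY_NODES)

node_rows = node_counts(conn)
for category, namespace, n in node_rows:
    print(f"{category}\t{namespace}\t{n}")
```

Returning plain tuples keeps the result easy to hand to whatever report template ends up rendering the stats.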

Also from Kevin:

Part of why the edge counts get really weird and gross is that we sometimes name the primary source and sometimes name the aggregator

And here is another query and table:

select primary_knowledge_source, count(*) from denormalized_edges where category not in ('biolink:Association','biolink:MacromolecularMachineToMolecularActivityAssociation', 'biolink:MacromolecularMachineToCellularComponentAssociation','biolink:MacromolecularMachineToBiologicalProcessAssociation') group by 1 having count(*) > 1 order by
┌──────────────────────────┬──────────────┐
│ primary_knowledge_source │ count_star() │
│         varchar          │    int64     │
├──────────────────────────┼──────────────┤
│ infores:bgee             │       436170 │
│ infores:biogrid          │      1336609 │
│ infores:flybase          │       407615 │
│ infores:hpo-annotations  │       554449 │
│ infores:mgi              │      1066490 │
│ infores:omim             │         7258 │
│ infores:orphanet         │         7997 │
│ infores:panther          │       551383 │
│ infores:pombase          │       168073 │
│ infores:reactome         │       251408 │
│ infores:rgd              │         9696 │
│ infores:sgd              │        16732 │
│ infores:string           │      1422026 │
│ infores:wormbase         │       130283 │
│ infores:xenbase          │         2232 │
│ infores:zfin             │       666695 │
├──────────────────────────┴──────────────┤
│ 16 rows                       2 columns │
└─────────────────────────────────────────┘
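The edge query can be sketched the same way. This is an illustration under assumptions, not the actual report code: the `denormalized_edges` rows below are invented, `sqlite3` stands in for the real database, and because the query as pasted is cut off after `order by`, ordering by the first column here is a guess. The excluded category list is copied from the query above and passed as bound parameters.

```python
import sqlite3

# Categories excluded in the query above.
EXCLUDED_CATEGORIES = (
    "biolink:Association",
    "biolink:MacromolecularMachineToMolecularActivityAssociation",
    "biolink:MacromolecularMachineToCellularComponentAssociation",
    "biolink:MacromolecularMachineToBiologicalProcessAssociation",
)

# Invented rows standing in for denormalized_edges (category, primary_knowledge_source).
TOY_EDGES = [
    ("biolink:GeneToPhenotypicFeatureAssociation", "infores:mgi"),
    ("biolink:GeneToPhenotypicFeatureAssociation", "infores:mgi"),
    ("biolink:PairwiseGeneToGeneInteraction", "infores:string"),
    ("biolink:PairwiseGeneToGeneInteraction", "infores:string"),
    ("biolink:PairwiseGeneToGeneInteraction", "infores:biogrid"),  # one row, dropped by HAVING
    ("biolink:Association", "infores:zfin"),                       # excluded category
    ("biolink:Association", "infores:zfin"),                       # excluded category
]

# One "?" placeholder per excluded category, so the values are bound, not inlined.
placeholders = ", ".join("?" for _ in EXCLUDED_CATEGORIES)
EDGE_QUERY = f"""
select primary_knowledge_source, count(*) as n
from denormalized_edges
where category not in ({placeholders})
group by 1
having count(*) > 1
order by 1  -- assumption: the original ordering is not shown
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table denormalized_edges (category text, primary_knowledge_source text)"
)
conn.executemany("insert into denormalized_edges values (?, ?)", TOY_EDGES)

edge_rows = conn.execute(EDGE_QUERY, EXCLUDED_CATEGORIES).fetchall()
for source, n in edge_rows:
    print(f"{source}\t{n}")
```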

@kevinschaper
Member

It turns out that we actually get clean tables with namespace/prefix for nodes and primary_knowledge_source for edges as long as we're filtering to these categories.

@monicacecilia
Copy link

And this one, please!
Screen Shot 2024-02-27 at 2 01 08 PM

@amc-corey-cox
Collaborator Author

Here is another one.

Image

@monicacecilia

sorry about the dupe! 🙈
