Gene Signature Annotation Pipeline

bedapub/sapiens

🐵 SAPIENS

Signature Annotation Pipeline for Entity Normalisation.

Brief Description

SAPIENS uses neural information retrieval to annotate gene signatures with classes from a structured ontology, e.g. a subclass of GO:biological_process or of the Cell Ontology (CL).

For fast retrieval, SAPIENS uses a lightweight CNN as an encoder, jointly embedding classes and signatures in the same latent space using text metadata, e.g. signature title and description. Results are fetched from a precomputed embedding index with millisecond retrieval times.
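The retrieval step described above can be illustrated with a minimal sketch. It is not the SAPIENS implementation: `toy_encode` is a hypothetical bag-of-words stand-in for the CNN encoder, the three GO classes are illustrative only, and the "index" is a plain list of precomputed vectors rather than whatever index structure SAPIENS actually uses. It only shows the general idea of embedding classes once, embedding the query text at lookup time, and ranking by similarity.

```python
import math

# Illustrative class labels; a real index would cover a full ontology subset.
CLASSES = {
    "GO:0006955": "immune response",
    "GO:0007049": "cell cycle",
    "GO:0006915": "apoptotic process",
}

# Static vocabulary built from the class labels (toy stand-in for token embeddings).
VOCAB = {
    tok: i
    for i, tok in enumerate(
        sorted({t for label in CLASSES.values() for t in label.split()})
    )
}


def toy_encode(text):
    """Hypothetical encoder: L2-normalised bag-of-words over VOCAB (not a CNN)."""
    vec = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in VOCAB:  # out-of-vocabulary tokens are silently dropped
            vec[VOCAB[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


# Precompute the class embeddings once; queries only pay for one encode + dot products.
CLASS_IDS = list(CLASSES)
INDEX = [toy_encode(CLASSES[cid]) for cid in CLASS_IDS]


def retrieve(query, k=2):
    """Return the top-k (class_id, cosine_score) pairs for a signature description."""
    q = toy_encode(query)
    scored = [
        (cid, sum(a * b for a, b in zip(vec, q)))
        for cid, vec in zip(CLASS_IDS, INDEX)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


hits = retrieve("Genes up-regulated in the immune response to infection")
```

Because both classes and signatures pass through the same encoder, ranking reduces to a similarity lookup against vectors computed ahead of time, which is what keeps query latency low.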

Please see the wiki for instructions and information on how to set up SAPIENS, or go to SAPIENS_API for instructions on deployment.

Example Output

An example of the results over the C7 subset of MSigDB can be found here.

Disclaimer

SAPIENS has high retrieval accuracy but is imperfect. Some known issues are:

  • Character-level sensitivity: due to the use of pre-trained token embeddings, the static vocabulary does not correctly segment long abbreviations or allow for character-level invariances to spelling mistakes. This should eventually be fixed by incorporating character-level embeddings.
  • NIL queries: if a relevant term does not exist within the structured vocabulary (GO or CL), then spurious results may be returned. A NIL component should eventually be incorporated into the pipeline to account for this. Alternatively, more terms can be added from other vocabularies, but this may increase the sensitivity to noise in the query.
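Until a NIL component exists, one simple way to guard against spurious hits is to reject results whose top score falls below a cutoff. The sketch below is an assumption, not part of SAPIENS: the function name `annotate_with_nil`, the `(class_id, score)` input format, and the `min_score` default are all hypothetical, and a usable threshold would have to be calibrated on held-out queries.

```python
NIL = "NIL"


def annotate_with_nil(hits, min_score=0.5):
    """Return the best class ID, or NIL when no hit is convincing.

    hits: list of (class_id, score) pairs sorted by descending score.
    min_score: hypothetical cutoff; tune it on labelled data before use.
    """
    if not hits or hits[0][1] < min_score:
        return NIL
    return hits[0][0]


confident = annotate_with_nil([("GO:0006955", 0.91), ("GO:0007049", 0.20)])
rejected = annotate_with_nil([("CL:0000084", 0.12)])
```

A fixed threshold is the crudest option; it trades recall for precision and does not address the character-level issue listed above.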

Contributors

sapiens
