Separate description source from gene description text #11

Closed
dhimmel opened this issue Dec 1, 2021 · 3 comments

Comments

dhimmel commented Dec 1, 2021

Example gene descriptions by species:

  • human: tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
  • mouse: guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
  • rat: glutamate decarboxylase 1 [Source:RGD Symbol;Acc:2652]

Notice the trailing bracketed source information, like "[Source:HGNC Symbol;Acc:HGNC:11858]". It would be nice to move this source information into its own column, so that the description column contains only the actual description text.

Question: is the source string always going to be in the format of [Source:SOURCE;Acc:CURIE] for all species and descriptions?

dhimmel commented Dec 10, 2021

I'm looking to extract the gene_description source information in SQL, but when I use the MySQL REGEXP_SUBSTR function, I get the error:

ProgrammingError: (mysql.connector.errors.ProgrammingError) 1370 (42000): execute command denied to user 'anonymous'@'%' for routine 'homo_sapiens_core_105_38.REGEXP_SUBSTR'

Also, I don't think REGEXP_SUBSTR supports extracting matched groups.

Based on these issues, it seems like we should parse the description in Python instead.
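
A minimal sketch of what parsing in Python could look like, assuming the source suffix always has the form [Source:SOURCE;Acc:ACCESSION] (which is still the open question above). The names SOURCE_PATTERN and split_description are hypothetical and are not taken from the repository. Unlike REGEXP_SUBSTR, Python's re module exposes each matched group directly:

```python
import re

# Assumed format: free text followed by a trailing "[Source:<source>;Acc:<accession>]".
# This pattern is an illustration, not the project's actual implementation.
SOURCE_PATTERN = re.compile(
    r"^(?P<description>.*?)\s*"
    r"\[Source:(?P<source>[^;\]]+);Acc:(?P<accession>[^\]]+)\]$"
)


def split_description(text: str) -> dict:
    """Split a raw gene description into description, source, and accession."""
    match = SOURCE_PATTERN.match(text)
    if match is None:
        # Strict on purpose: this sketch only covers descriptions with a source suffix.
        raise ValueError(f"no trailing source information in: {text!r}")
    return match.groupdict()


examples = [
    "tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]",
    "glutamate decarboxylase 1 [Source:RGD Symbol;Acc:2652]",
]
for raw in examples:
    print(split_description(raw))
```

The named groups keep the three parts addressable by key, which maps naturally onto separate output columns.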

dhimmel commented Dec 10, 2021

Note that not all descriptions have source information. Here are some examples without it:

  • GNAS complex locus (GNAS) pseudogene
  • novel transcript

There are also cases where gene_description is null.
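
These cases suggest that any parser has to treat the source as optional. A tolerant sketch, assuming a missing suffix or a null description should pass through with the source set to None (strip_source and TRAILING_SOURCE are hypothetical names, and the commit referenced below may handle this differently):

```python
import re
from typing import Optional, Tuple

# Matches an optional run of whitespace plus a trailing "[Source:...]" block.
TRAILING_SOURCE = re.compile(r"\s*\[Source:[^\]]*\]$")


def strip_source(text: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (description, source); source is None when no suffix is present."""
    if text is None:
        # Null gene_description: nothing to split.
        return None, None
    match = TRAILING_SOURCE.search(text)
    if match is None:
        # No bracketed suffix, e.g. "novel transcript": keep the text unchanged.
        return text, None
    return text[: match.start()], match.group().strip()


for raw in ["novel transcript", None, "tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]"]:
    print(strip_source(raw))
```

Returning None rather than an empty string for a missing source keeps "no source" distinguishable from "empty source" in the exported column.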

dhimmel added a commit that referenced this issue Dec 10, 2021

dhimmel commented Dec 10, 2021

Rerunning 105 exports in https://github.com/related-sciences/ensembl-genes/actions/runs/1564648697 to include gene description updates.

dhimmel closed this as completed Dec 10, 2021