Mapping drug ids from Drugbank, STITCH, UMLS, KEGG, PubChem, ChEMBL and other databases:

This project includes various transformation tools that create and enrich a TSV file listing thousands of known drugs, together with all the ids that could be found for them in drug databases.

In particular, we start by collecting into a file the drug information included in the latest Drugbank [1] (version 5.1.8, released on 2021-01-03) and in the latest Therapeutic Target Database [2] (version 7.1.01, released on 2019-07-14). We then enrich the drug fields by querying the following sources:

  • the web services API of ChEMBL Database [3][4]
  • the PUG REST API of PubChem Database [5]
  • the drugs file in the FTP server of the KEGG Database [6][7][8]
  • the UMLS Metathesaurus vocabulary Database [9], using the MetamorphoSys tool
  • the mapping files of the STITCH Database
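As an illustration of the enrichment step, the following is a minimal sketch of a name-to-CID lookup against the PUG REST API of PubChem. It is not the project's own code: the class `PubChemCidLookup` and its methods are hypothetical names, and the sketch assumes Java 11 or later (for `java.net.http`) and that a lookup by drug name is an acceptable way to find the CID.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the project's code base.
class PubChemCidLookup {

    private static final String PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug";

    /** Builds the PUG REST URL that returns, as plain text, the CIDs matching a compound name. */
    static URI buildNameToCidUri(String drugName) {
        // URLEncoder produces form encoding, where a space becomes '+'; a URL path needs %20.
        String encoded = URLEncoder.encode(drugName, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(PUG_REST_BASE + "/compound/name/" + encoded + "/cids/TXT");
    }

    /** Parses the plain-text response body, which holds one CID per line. */
    static List<String> parseCids(String body) {
        List<String> cids = new ArrayList<>();
        for (String line : body.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                cids.add(trimmed);
            }
        }
        return cids;
    }

    /** Queries PubChem and returns the matching CIDs, or an empty list on a non-200 response. */
    static List<String> fetchCids(String drugName) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildNameToCidUri(drugName)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            return List.of();
        }
        return parseCids(response.body());
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Without arguments, only show the URL that would be requested.
            System.out.println(buildNameToCidUri("Nefazodone"));
        } else {
            System.out.println(fetchCids(args[0]));
        }
    }
}
```

A name can match more than one compound, so a real pipeline would still need to cross-check the returned CIDs against the other identifiers of the drug.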

Licence & Required Citation

For any use of the drug-mappings.tsv file in your work, a citation to the following paper is expected:

Aisopos, F., Paliouras, G. Comparing methods for drug–gene interaction prediction on the biomedical literature knowledge graph: performance versus explainability. BMC Bioinformatics 24, 272 (2023), DOI.

drug_id_mapping - NCSR Demokritos module
Copyright 2021 Fotis Aisopos

The Java code and TSV file are provided only for academic/research use and are licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: https://www.apache.org/licenses/LICENSE-2.0

Mapping TSV file data format

The resulting file (drug-mappings.tsv) contains one tab-separated entry per drug, listing the ids that could be found and cross-checked in the aforementioned databases. For ids not found in any of the above sources, the string 'null' is added. Multiple CUIs for a specific drug are separated by a comma (,). An example of the format of the TSV data file is as follows:

drugbankId	name	ttd_id	pubchem_cid	cas_num	chembl_id	zinc_id	chebi_id	kegg_cid	kegg_id	bindingDB_id	UMLS_cuis	stitch_id
DB01149	Nefazodone	DAP000042	4449	83366-66-9	CHEMBL623	ZINC000000538065	7494	C07256	D08257	50069447	C0068485	CID000004449
DB01157	Trimetrexate	DAP000635	5583	52128-35-5	CHEMBL119	ZINC000000598852	9737	C11154	D06238	18268	C0085176	CID100005582
DB01248	Docetaxel	DAP000590	148124	114977-28-5	CHEMBL92	ZINC000085537053	4672	C11231	D02165	36351	C0246415,C0771375	CID100003143
DB02579	Acrylic Acid	D0E3MA	6581	79-10-7	CHEMBL1213529	ZINC000000895281	18308	C00511	null	null	null	null
...
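The format described above can be read with a few lines of Java. The following is a minimal sketch, not code from the project: the class `DrugMappingRow` and its methods are hypothetical, and it assumes that every column, including the last one, is separated by a single tab and that the first line of the file is the header.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for illustration only; not part of the project's code base.
class DrugMappingRow {

    // Column name -> raw cell value, in file order.
    private final Map<String, String> fields = new LinkedHashMap<>();

    /** Parses one data line against the header line of drug-mappings.tsv. */
    static DrugMappingRow parse(String headerLine, String dataLine) {
        // The -1 limit keeps trailing empty cells, so the column count stays reliable.
        String[] names = headerLine.split("\t", -1);
        String[] values = dataLine.split("\t", -1);
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                    "Expected " + names.length + " columns but found " + values.length);
        }
        DrugMappingRow row = new DrugMappingRow();
        for (int i = 0; i < names.length; i++) {
            row.fields.put(names[i].trim(), values[i].trim());
        }
        return row;
    }

    /** Returns the id stored in a column, or null when the file holds the 'null' placeholder. */
    String get(String column) {
        String value = fields.get(column);
        return (value == null || value.equals("null")) ? null : value;
    }

    /** Returns the UMLS CUIs of the drug, or an empty list when none were found. */
    List<String> umlsCuis() {
        String value = get("UMLS_cuis");
        return value == null ? List.of() : Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        String header = "drugbankId\tname\tttd_id\tpubchem_cid\tcas_num\tchembl_id\tzinc_id"
                + "\tchebi_id\tkegg_cid\tkegg_id\tbindingDB_id\tUMLS_cuis\tstitch_id";
        String line = "DB01248\tDocetaxel\tDAP000590\t148124\t114977-28-5\tCHEMBL92"
                + "\tZINC000085537053\t4672\tC11231\tD02165\t36351\tC0246415,C0771375\tCID100003143";
        DrugMappingRow row = parse(header, line);
        System.out.println(row.get("name") + " -> " + row.umlsCuis());
    }
}
```

Mapping the 'null' placeholder to a Java `null` in one place keeps the rest of a program from comparing against the literal string.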

Java Project File Structure & running

The code includes the basic package gr.demokritos.tranformations with various classes serving different functionalities, e.g.:

  • CreateDrugMappings class: the main class, which can be used to call all other classes of interest
  • xx_DrugbankMapper class: maps the ids of the xx Database to Drugbank
  • xxIdTransformer classes: transform the ids of the xx Database to Drugbank and retrieve the respective UMLS_cuis
  • OpenXML class: parses the Drugbank XML file and retrieves all information of interest
  • MetathesaurusAPIticketService class: creates a new TGT and API key in order to query the UMLS REST API (as an alternative to using the MetamorphoSys tool)

To run the Java project, we need access to the following sources:

  • Drugbank (to download the latest XML file)
  • TTD (to download the drugs' information file in raw format)
  • Entrez Programming Utilities (E-utilities) API (query PUG for PubChem ids and obtain a token to query for a TGT and an API key)
  • KEGG (to download the KEGG drug file)
  • UniChem API (to query for ChEMBL ids)
  • DrugBank-Sider_mapping files (to query for STITCH ids)

The needed jar libraries must also be included in the CLASSPATH.

References

[1]: Wishart, D. S., Knox, C., Guo, A. C., Shrivastava, S., Hassanali, M., Stothard, P., ... & Woolsey, J. (2006). DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic acids research, 34(suppl_1), D668-D672.

[2]: Wang, Y. X., Zhang, S., Li, F. C., Zhou, Y., Zhang, Y., Zhang, R. Y., Zhu, J., Ren, Y. X., Tan, Y., Qin, C., Li, Y. H., Li, X. X., Chen, Y. Z., & Zhu, F. (2020). Therapeutic Target Database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic acids research, 48(D1), D1031-D1041. PubMed ID: 31691823

[3]: Mendez, D., Gaulton, A., Bento, A. P., Chambers, J., De Veij, M., Félix, E., ... & Leach, A. R. (2019). ChEMBL: towards direct deposition of bioassay data. Nucleic acids research, 47(D1), D930-D940.

[4]: Davies, M., Nowotka, M., Papadatos, G., Dedman, N., Gaulton, A., Atkinson, F., ... & Overington, J. P. (2015). ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic acids research, 43(W1), W612-W620.

[5]: Kim, S., Thiessen, P. A., Cheng, T., Yu, B., & Bolton, E. E. (2018). An update on PUG-REST: RESTful interface for programmatic access to PubChem. Nucleic acids research, 46(W1), W563-W570.

[6]: Kanehisa, M., & Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1), 27-30.

[7]: Kanehisa, M. (2019). Toward understanding the origin and evolution of cellular organisms. Protein Science, 28(11), 1947-1951.

[8]: Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., & Tanabe, M. (2021). KEGG: integrating viruses and cellular organisms. Nucleic acids research, 49(D1), D545-D551.

[9]: Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(suppl_1), D267-D270.