Properties should not be added to KG2 on a whim. Every KG2 node and every KG2 edge has the same set of properties. We use TSV files as the basis for our Neo4j import, and that format requires every column to line up with the header. Thus, every KG2 node and every KG2 edge must contain all of the same properties in its respective TSV file. Given the scope of KG2, storage costs are a real concern, so it is important that we don't add properties that won't be filled: rows and rows of null values add no value to KG2 but do increase our computational costs. Before adding properties, consider the following rules of thumb:
- A new property should be able to be filled by multiple sources.
- A new property should serve a defined role in the downstream system.
There are two general sections of the build code where properties can be added:
- Within the ETL scripts
- This is the standard place to add properties. If the value of a property comes from the source file/database, it should be assigned within the ETL scripts.
- When the graph is being filtered (`filter_kg_and_remap_predicates.py`)
- This is a good place to add properties if they are based on a property assigned in an ETL script, but need to be assigned across ETL scripts. Currently, we assign predicates this way. There is a one-to-one mapping guide from source relations (ex. `HMDB:in_pathway`) to Biolink predicates (ex. `biolink:participates_in`). This is useful because if a relation-predicate pair changes, it is fast and easy to rebuild: there is no need to rebuild a source file (a `kg2-{source}.json` file) if a mapping changes. It also prevents a relation (such as `owl:sameAs`, which is assigned as a source relation in multiple scripts) from being mapped two different ways to a Biolink predicate. This generally involves creating a YAML file that supplies mappings from the property assigned in the ETL script to the new property; a hypothetical snippet in that style appears after this list.
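For concreteness, here is a hypothetical snippet in the style the predicate-remapping YAML uses. The command keys (`delete`, `keep`, `invert`, `rename`) and the `rename` payload of `[predicate_label, predicate_curie]` come from the `filter_kg_and_remap_predicates.py` code shown later on this page; the specific entries are illustrative, not copied from the real file:

# Illustrative entries only -- each maps a source relation CURIE to a command
HMDB:in_pathway:
  rename: ['participates_in', 'biolink:participates_in']
owl:sameAs:
  keep: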
- Update the `make_node` and `make_edge` functions in `kg2_util.py`.
- This ensures that nodes and edges are uniform throughout KG2.
- Currently, all nodes and edges must have the same properties (even if some of their values are null) because of the TSV import method we use for Neo4j.
- There are ways that we could restructure the build process and our Neo4j import method so that this is not necessary, but our current perspective is that the work required does not return a big enough reward. However, here are instructions for how that could be done down the road, in case that is desired in the future: https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin-import/#import-tool-syntax (see "Example 1").
Editing `make_node`:
def make_node(id: str,
iri: str,
name: str,
category_label: str,
update_date: str,
provided_by: str):
if '-' in category_label:
raise ValueError('dash character detected in category_label argument to function kg2_util.make_node: ' + category_label)
return {'id': id,
'iri': iri,
'name': name,
'full_name': name,
'category': convert_biolink_category_to_curie(category_label),
'category_label': category_label.replace(' ', '_'),
'description': None,
'synonym': [],
'publications': [],
'creation_date': None,
'update_date': update_date,
'deprecated': False,
'replaced_by': None,
'provided_by': provided_by,
- 'has_biological_sequence': None}
+ 'has_biological_sequence': None,
+ 'molecular_weight': None}
- A `molecular_weight` property with a default value of `None` is added when nodes are created by `make_node`.
Editing `make_edge`:
def make_edge(subject_id: str,
object_id: str,
relation_curie: str,
relation_label: str,
provided_by: str,
update_date: str = None):
edge = {'subject': subject_id,
'object': object_id,
'relation_label': relation_label,
'relation': relation_curie,
'negated': False,
'publications': [],
'publications_info': {},
'update_date': update_date,
- 'provided_by': provided_by}
+ 'provided_by': provided_by,
+ 'score': None}
edge_id = make_edge_key(edge)
edge["id"] = edge_id
return edge
- A `score` property with a default value of `None` is added when edges are created by `make_edge`.
- Update ETL scripts based on the newly added properties.
- For an additional property to be worth the costs associated with adding it, it is important that multiple sources contribute values to that new node/edge property.
- In some cases, it is relatively easy to update the ETL scripts so that they can contribute values to the new property. This is usually the case in scripts that anticipated the addition of a property later on. For example, `intact_tsv_to_kg_json.py` and `disgenet_tsv_to_kg_json.py` already have score information extracted, because the person who did the ETL anticipated that a score property would be added into KG2 later.
- In other cases, you may not have any idea if a source has information to contribute to a specific property. In these cases, you will have to go look in the source data file/database. This will both show you whether a source has the property and, if so, how to extract it.
- For XML sources (ex. HMDB, DrugBank, PathWhiz files), I would recommend converting the file (or a subset of the file) to pretty-printed JSON to inspect it. This is often much easier than trying to sort through the raw XML (see the sketch after this list).
- For TSV/CSV sources (ex. IntAct, DisGeNET, GO Annotations, NCBIGene, DGIdb, JensenLab), I would recommend converting the file into pretty-printed JSON by attaching header terms to their values in each line (creating a list of dictionaries, one for each row).
- For DAT sources (ex. UniProt, miRBase), I would recommend converting the file into a pretty-printed JSON file as well, using the functions built into those scripts.
- For UMLS and the ontologies, I wouldn't spend too much time trying to find new properties in that data. Due to the way we import the data (using ontobio), the data structure is pretty rigid and it would be difficult to extract any more out of it.
- For SQL-based sources (ex. SemMedDB, Reactome, ChEMBL, DrugCentral), I would recommend starting by inspecting their schemas on their respective websites. It takes quite a bit of time to learn a database, so it is important to weigh the costs vs. benefits before diving into extracting more information out of these.
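Here is a minimal sketch of the kind of throwaway inspection code I mean, covering the XML and TSV/CSV cases. It assumes the third-party `xmltodict` package for the XML case, and all file names are hypothetical:

import csv
import json

import xmltodict  # pip install xmltodict


def xml_to_pretty_json(xml_path: str, json_path: str):
    # Parse the XML document into nested dictionaries, then pretty-print it
    with open(xml_path, 'r') as xml_file:
        doc = xmltodict.parse(xml_file.read())
    with open(json_path, 'w') as json_file:
        json.dump(doc, json_file, indent=4)


def tsv_to_pretty_json(tsv_path: str, json_path: str):
    # Attach header terms to their values in each line, creating a list of
    # dictionaries (one per row), then pretty-print it
    with open(tsv_path, 'r') as tsv_file:
        rows = list(csv.DictReader(tsv_file, delimiter='\t'))
    with open(json_path, 'w') as json_file:
        json.dump(rows, json_file, indent=4)


xml_to_pretty_json('source_subset.xml', 'source_xml_inspect.json')
tsv_to_pretty_json('source.tsv', 'source_tsv_inspect.json')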
Example 1, `disgenet_tsv_to_kg_json.py` -- adding the `score` edge property:
def make_edges(input_file: str, test_mode: bool):
edges = []
count = 0
non_befree_count = 0
with open(input_file, 'r') as input_tsv:
tsvreader = csv.reader(input_tsv, delimiter='\t')
for line in tsvreader:
count += 1
if count == 1:
continue
if test_mode and non_befree_count >= TEST_MODE_LIMIT:
break
[subject_id,
_,
_,
_,
object_id,
_,
_,
_,
_,
score,
evidence_score,
created_date,
update_date,
pmid,
source] = line
if source != 'BEFREE':
non_befree_count += 1
subject_id = format_id(subject_id,
kg2_util.CURIE_PREFIX_NCBI_GENE)
object_id = format_id(object_id,
kg2_util.CURIE_PREFIX_UMLS)
predicate = kg2_util.EDGE_LABEL_BIOLINK_GENE_ASSOCIATED_WITH_CONDITION
edge = kg2_util.make_edge_biolink(subject_id,
object_id,
predicate,
DISGENET_KB_CURIE,
update_date)
publication = kg2_util.CURIE_PREFIX_PMID + ':' + pmid
edge['publications'] = [publication]
+ edge['score'] = score
edges.append(edge)
return edges
Example 2, `drugbank_xml_to_kg_json.py` -- adding the `molecular_weight` node property:
def format_node(drugbank_id: str,
description: str,
name: str,
update_date: str,
synonyms: list,
publications: list,
category_label: str,
creation_date: str,
- sequence: str):
+ sequence: str,
+ molecular_weight: str):
iri = DRUGBANK_BASE_IRI + drugbank_id
node_curie = kg2_util.CURIE_PREFIX_DRUGBANK + ":" + drugbank_id
node_dict = kg2_util.make_node(node_curie,
iri,
name,
category_label,
update_date,
DRUGBANK_KB_CURIE_ID)
node_dict["synonym"] = synonyms
node_dict["creation_date"] = creation_date
node_dict["description"] = description
node_dict["publications"] = publications
node_dict["has_biological_sequence"] = sequence
+ node_dict['molecular_weight'] = float(molecular_weight) if molecular_weight is not None else None
return node_dict
def format_edge(subject_id: str,
object_id: str,
predicate_label: str,
description: str,
publications: list = None):
relation_curie = kg2_util.predicate_label_to_curie(predicate_label,
DRUGBANK_RELATION_CURIE_PREFIX)
edge = kg2_util.make_edge(subject_id,
object_id,
relation_curie,
predicate_label,
DRUGBANK_KB_CURIE_ID,
None)
if description is not None:
edge["publications_info"] = {"sentence": description}
if publications is not None:
edge["publications"] = publications
return edge
def get_publications(references: list):
publications = []
if references is not None:
if references["articles"] is not None:
if references["articles"]["article"] is not None:
for publication in references["articles"]["article"]:
if isinstance(publication, dict) and \
publication["pubmed-id"] is not None:
publications.append(kg2_util.CURIE_PREFIX_PMID +
':' + publication["pubmed-id"])
return publications
def get_SMILES(calculated_properties: dict):
if calculated_properties is not None and isinstance(calculated_properties, dict):
properties = calculated_properties['property']
if properties is not None:
if isinstance(properties, list):
for property in properties:
if property['kind'] == "SMILES":
return property['value']
if isinstance(properties, dict):
if properties['kind'] == "SMILES":
return properties['value']
+
+
+def get_molecular_weight(experimental_properties: dict):
+ if experimental_properties is not None and isinstance(experimental_properties, dict):
+ properties = experimental_properties['property']
+ if properties is not None:
+ if isinstance(properties, list):
+ for property in properties:
+ if property['kind'] == 'Molecular Weight':
+ return property['value'], property['source']
+ if isinstance(properties, dict):
+ if properties['kind'] == 'Molecular Weight':
+ return properties['value'], properties['source']
+ return None, None
+
def make_node(drug: dict):
drugbank_id = get_drugbank_id(drug)
synonyms = []
drug_type = drug["@type"]
if drug_type == TYPE_SMALL_MOLECULE:
category = kg2_util.BIOLINK_CATEGORY_SMALL_MOLECULE
elif drug_type == TYPE_BIOTECH:
category = kg2_util.BIOLINK_CATEGORY_MOLECULAR_ENTITY
else:
print(f"Unknown type: {drug_type} for drug ID: {drugbank_id}; treating as chemical entity",
file=sys.stderr)
category = kg2_util.BIOLINK_CATEGORY_CHEMICAL_ENTITY
if drug["synonyms"] is not None:
if drug["synonyms"]["synonym"] is not None:
for synonym in drug["synonyms"]["synonym"]:
if isinstance(synonym, dict):
synonyms.append(synonym["#text"])
publications = get_publications(drug["general-references"])
smiles = get_SMILES(drug.get('calculated-properties', None)) # Per Issue #1273, if desired down the road
+ molecular_weight, molecular_weight_source = get_molecular_weight(drug.get('experimental-properties', None))
node = None
description = drug["description"]
if description is not None:
description = description.replace('\n', ' ').replace('\r', ' ')
if len(drugbank_id) > 0:
node = format_node(drugbank_id=drugbank_id,
description=description,
name=drug["name"],
update_date=drug["@updated"],
synonyms=synonyms,
publications=publications,
category_label=category,
creation_date=drug["@created"],
- sequence=smiles)
+ sequence=smiles,
+ molecular_weight=molecular_weight)
return node
- Update the `kg2_util.py` merge-dictionary code (if adding a node property) or the `filter_kg_and_remap_predicates.py` merge code (if adding an edge property).
Example 1, node merging:
def merge_two_dicts(x: dict, y: dict, biolink_depth_getter: callable = None):
ret_dict = copy.deepcopy(x)
for key, value in y.items():
stored_value = ret_dict.get(key, None)
if stored_value is None:
if value is not None:
ret_dict[key] = value
else:
if value is not None and value != stored_value:
if type(value) == str and type(stored_value) == str:
if value.lower() != stored_value.lower():
if key == 'update_date':
# Use the longer of the two update-date fields
# NOTE: this is not ideal; better to have actual
# dates (and not strings) so we can use the
# most recent date (see issue #980)
if len(value) > len(stored_value):
ret_dict[key] = value
elif key == 'description':
ret_dict[key] = stored_value + '; ' + value
elif key == 'ontology node type':
log_message("warning: for key: " + key + ", dropping second value: " + value + '; keeping first value: ' + stored_value,
output_stream=sys.stderr)
ret_dict[key] = stored_value
elif key == 'provided_by':
if value.endswith('/STY'):
ret_dict[key] = value
elif key == 'category_label':
if biolink_depth_getter is not None:
depth_x = biolink_depth_getter(CURIE_PREFIX_BIOLINK + ':' + convert_snake_case_to_camel_case(stored_value, uppercase_first_letter=True))
depth_y = biolink_depth_getter(CURIE_PREFIX_BIOLINK + ':' + convert_snake_case_to_camel_case(value, uppercase_first_letter=True))
if depth_y is not None:
if depth_x is not None:
if depth_y > depth_x:
ret_dict[key] = value
else:
ret_dict[key] = value
else:
if 'named_thing' != value:
if stored_value == 'named_thing':
ret_dict[key] = value
else:
log_message(message="inconsistent category_label information; keeping original category_label " + stored_value +
" and discarding new category_label " + value,
ontology_name=str(x.get('provided_by', 'provided_by=UNKNOWN')),
node_curie_id=x.get('id', 'id=UNKNOWN'),
output_stream=sys.stderr)
continue
elif key == 'category':
if biolink_depth_getter is not None:
depth_x = biolink_depth_getter(stored_value)
depth_y = biolink_depth_getter(value)
if depth_y is not None:
if depth_x is not None:
if depth_y > depth_x:
ret_dict[key] = value
else:
ret_dict[key] = value
else:
if not value.endswith('NamedThing'):
if stored_value.endswith('NamedThing'):
ret_dict[key] = value
else:
log_message(message="inconsistent category information; keeping original category " + stored_value +
" and discarding new category " + value,
ontology_name=str(x.get('provided_by', 'provided_by=UNKNOWN')),
node_curie_id=x.get('id', 'id=UNKNOWN'),
output_stream=sys.stderr)
continue
elif key == 'name' or key == 'full_name':
if value.replace(' ', '_') != stored_value.replace(' ', '_'):
stored_desc = ret_dict.get('description', None)
new_desc = y.get('description', None)
if stored_desc is not None and new_desc is not None:
if len(new_desc) > len(stored_desc):
ret_dict[key] = value
elif new_desc is not None:
ret_dict[key] = value
log_message(message='Warning: for ' + x.get('id', 'id=UNKNOWN') + ' original name of ' + stored_value +
' is being overwritten to ' + value,
output_stream=sys.stderr)
elif key == 'has_biological_sequence':
if stored_value is None and value is not None:
ret_dict[key] = value
+ elif key == 'molecular_weight':
+ # I'm not sure if this is exactly what we'd want to happen in a merge for this. This is strictly an example of the type of edits that need to be made.
+ if stored_value is None and value is not None:
+ ret_dict[key] = value
+ elif stored_value is not None and value is not None and stored_value != value:
+ if int(stored_value) == int(value):
+ ret_dict[key] = int(stored_value)
+ else:
+ ret_dict[key] = None
else:
log_message("warning: for key: " + key + ", dropping second value: " + value + '; keeping first value: ' + stored_value,
output_stream=sys.stderr)
elif type(value) == list and type(stored_value) == list:
if key != 'synonym':
ret_dict[key] = sorted(list(set(value + stored_value)))
else:
if len(stored_value) > 0:
first_element = {stored_value[0]}
elif len(value) > 0 and len(stored_value) == 0:
first_element = {value[0]}
else:
first_element = set()
ret_dict[key] = list(first_element) + sorted(filter(None, list(set(value + stored_value) - first_element)))
elif type(value) == list and type(stored_value) == str:
ret_dict[key] = sorted(list(set(value + [stored_value])))
elif type(value) == str and type(stored_value) == list:
ret_dict[key] = sorted(list(set([value] + stored_value)))
elif type(value) == dict and type(stored_value) == dict:
ret_dict[key] = merge_two_dicts(value, stored_value, biolink_depth_getter)
elif key == 'deprecated' and type(value) == bool:
ret_dict[key] = True # special case for deprecation; True always trumps False for this property
else:
log_message(message="invalid type for key: " + key,
ontology_name=str(x.get('provided_by', 'provided_by=UNKNOWN')),
node_curie_id=x.get('id', 'id=UNKNOWN'),
output_stream=sys.stderr)
assert False
return ret_dict
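To make the sketched `molecular_weight` merge rule concrete, here is a hypothetical check of how it behaves when only one of two duplicate nodes supplies a value (toy dictionaries; real nodes carry many more keys):

node_a = {'id': 'CHEMBL.COMPOUND:CHEMBL25', 'molecular_weight': None}
node_b = {'id': 'CHEMBL.COMPOUND:CHEMBL25', 'molecular_weight': 180.16}
merged = merge_two_dicts(node_a, node_b)
# The stored None is filled in by the concrete value from the second node
assert merged['molecular_weight'] == 180.16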
Example 2, edge merging:
existing_edge = new_edges.get(edge_key, None)
if existing_edge is not None:
existing_edge['provided_by'] = sorted(list(set(existing_edge['provided_by'] + edge_dict['provided_by'])))
existing_edge['publications'] += edge_dict['publications']
existing_edge['publications_info'].update(edge_dict['publications_info'])
+ existing_edge['score'] = (existing_edge['score'] + edge_dict['score']) / 2 # This is an example, not necessarily how score merging should be handled
else:
new_edges[edge_key] = edge_dict
- Update `kg_json_to_tsv.py`.
- If the new property is of a list type, make sure to accommodate it accordingly in `kg_json_to_tsv.py`:
- The header should contain the `[]` symbol after the type (ex. `synonym:string[]`) to indicate that it is an array.
- The Python list should be turned into a semicolon-delimited list (see the `--array-delimiter` command line option for `neo4j-admin import`) using code like this: `str(value).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")`. See below for it in use:

elif key == "synonym":
    value = truncate_node_synonyms_if_too_large(node[key], node['id'])
    value = str(value).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")
elif key == "publications":
    value = str(node[key]).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")
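As a quick illustration with hypothetical values, the replace chain flattens a Python list into the semicolon-delimited form that `neo4j-admin import` expects:

value = ['heart attack', 'myocardial infarction']
flat = str(value).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")
print(flat)  # heart attack; myocardial infarction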
Example 1, updating edge properties:
def check_all_edges_have_same_set(edgekeys_list):
"""
:param edgekeys_list: A list containing keys for an edge
"""
# Supported_ls is a list of properties that edges can have
supported_ls = ["relation_label",
"negated",
"object",
"provided_by",
"publications",
"publications_info",
"relation",
"subject",
"update_date",
"predicate",
"predicate_label",
- "id"]
+ "id",
+ "score"]
for edgelabel in edgekeys_list:
if edgelabel not in supported_ls:
raise ValueError("relation_label not in supported list: " + edgelabel)
...
def edges(graph, output_file_location):
"""
:param graph: A dictionary containing KG2
:param output_file_location: A string containing the path to the
TSV output directory
"""
# Generate list of output file names for the edges TSV files
edges_file = output_files(output_file_location, "edges")
# Create dictionary of edges from KG2
edges = graph["edges"]
# Open output TSV files
tsvfile = open(edges_file[0], 'w+')
tsvfile_h = open(edges_file[1], 'w+')
# Set loop (edge counter) to zero
loop = 0
# Set up TSV files to be written to
tsvwrite = tsv.writer(tsvfile, delimiter="\t", quoting=tsv.QUOTE_MINIMAL)
tsvwrite_h = tsv.writer(tsvfile_h, delimiter="\t",
quoting=tsv.QUOTE_MINIMAL)
for edge in edges:
# Increase edge counter by one each loop
loop += 1
# Add all edge property labels to a list in the same order and test
# to make sure they are the same
edgekeys = list(sorted(edge.keys()))
check_all_edges_have_same_set(edgekeys)
# Add an extra property of "predicate" to the list so that predicates
# can be a property and a label
edgekeys.append('predicate')
edgekeys.append('subject')
edgekeys.append('object')
# Create list for values of edge properties to be added to
vallist = []
for key in edgekeys:
# Add the value for each edge property to the value list
# and limit the size of the publications_info dictionary
# to avoid Neo4j buffer size error
value = edge[key]
if key == "publications_info":
value = limit_publication_info_size(key, value)
elif key == 'provided_by':
value = str(value).replace("', '", "; ").replace("['", "").replace("']", "")
elif key == 'relation_label': # fix for issue number 473 (hyphens in relation_labels)
value = value.replace('-', '_').replace('(', '').replace(')', '')
elif key == 'publications':
value = str(value).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")
vallist.append(value)
# Add the edge property labels to the edge header TSV file
# But only for the first edge
if loop == 1:
edgekeys = no_space('provided_by', edgekeys, 'provided_by:string[]')
edgekeys = no_space('predicate', edgekeys, 'predicate:TYPE')
edgekeys = no_space('subject', edgekeys, ':START_ID')
edgekeys = no_space('object', edgekeys, ':END_ID')
edgekeys = no_space('publications', edgekeys, "publications:string[]")
+ edgekeys = no_space('score', edgekeys, 'score:float')
tsvwrite_h.writerow(edgekeys)
tsvwrite.writerow(vallist)
# Close the TSV files to prevent a memory leak
tsvfile.close()
tsvfile_h.close()
Example 2, updating node properties:
def nodes(graph, output_file_location):
"""
:param graph: A dictionary containing KG2
:param output_file_location: A string containing the
path to the TSV output directory
"""
# Generate list of output file names for the nodes TSV files
nodes_file = output_files(output_file_location, "nodes")
# Create dictionary of nodes from KG2
nodes = graph["nodes"]
# Open output TSV files
tsvfile = open(nodes_file[0], 'w+')
tsvfile_h = open(nodes_file[1], 'w+')
# Set loop (node counter) to zero
loop = 0
# Set up TSV files to be written to
tsvwrite = tsv.writer(tsvfile, delimiter="\t",
quoting=tsv.QUOTE_MINIMAL)
tsvwrite_h = tsv.writer(tsvfile_h, delimiter="\t",
quoting=tsv.QUOTE_MINIMAL)
# Set single loop to zero and get list of node properties, which will go
# in the header, to compare other nodes to
single_loop = 0
for node in nodes:
single_loop += 1
if single_loop == 1:
nodekeys_official = list(sorted(node.keys()))
nodekeys_official.append("category")
for node in nodes:
# Increase node counter by one each loop
loop += 1
# Add all node property labels to a list in the same order
nodekeys = list(sorted(node.keys()))
nodekeys.append("category")
# Create list for values of node properties to be added to
vallist = []
# Set index in list of node properties to zero, to be iterated through
key_count = 0
for key in nodekeys:
if key != nodekeys_official[key_count]:
# Add a property from the header list of node properties
# if it doesn't exist and make the value for that property " "
nodekeys.insert(key_count, nodekeys_official[key_count])
value = " "
elif key == "synonym":
value = truncate_node_synonyms_if_too_large(node[key], node['id'])
value = str(value).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")
elif key == "publications":
value = str(node[key]).replace("', '", "; ").replace("'", "").replace("[", "").replace("]", "")
elif key == "description" and node[key] is not None:
value = shorten_description_if_too_large(node[key], node['id'])
else:
# If the property does exist, assign the property value
value = node[key]
# Add the value of the property to the property value list
vallist.append(value)
# Increase the index count by one
key_count += 1
# Add the node property labels to the node header TSV file
# But only for the first node
if loop == 1:
nodekeys = no_space('id', nodekeys, 'id:ID')
nodekeys = no_space('publications', nodekeys, "publications:string[]")
nodekeys = no_space('synonym', nodekeys, "synonym:string[]")
nodekeys = no_space('category', nodekeys, ':LABEL')
+ nodekeys = no_space('molecular_weight', nodekeys, 'molecular_weight:float')
tsvwrite_h.writerow(nodekeys)
tsvwrite.writerow(vallist)
# Close the TSV files to prevent a memory leak
tsvfile.close()
tsvfile_h.close()
For these steps, I am going to go through the process of adding a property after the ETL steps. As an example, I am going to use adding a `knowledge_source` node and edge property. While this may seem specific, the general process should be similar for other properties added post-ETL.
- Make a mapping file from an existing property (or properties) to the new property/properties.
This will likely be a YAML file. Below is an example snippet for adding the `knowledge_source` property:
DGIdb:
infores_curie: infores:dgidb
knowledge_type: aggregator_knowledge_source
DisGeNET:
infores_curie: infores:disgenet
knowledge_type: aggregator_knowledge_source
DOID:doid.owl:
infores_curie: infores:disease-ontology
knowledge_type: knowledge_source
- Edit `filter_kg_and_remap_predicates.py` to assign the new property/properties.
def make_arg_parser():
arg_parser = argparse.ArgumentParser(description='filter_kg.py: filters and simplifies the KG2 knowledge graph for the RTX system')
arg_parser.add_argument('predicateRemapYaml', type=str, help="The YAML file describing how predicates should be remapped to simpler predicates")
+ arg_parser.add_argument('inforesRemapYaml', type=str, help="The YAML file describing how provided_by fields should be remapped to Translator infores curies")
arg_parser.add_argument('curiesToURIFile', type=str, help="The file mapping CURIE prefixes to URI fragments")
arg_parser.add_argument('inputFileJson', type=str, help="The input KG2 graph, in JSON format")
arg_parser.add_argument('outputFileJson', type=str, help="The output KG2 graph, in JSON format")
arg_parser.add_argument('versionFile', type=str, help="The text file storing the KG2 version")
arg_parser.add_argument('--test', dest='test', action='store_true', default=False)
arg_parser.add_argument('--dropSelfEdgesExcept', required=False, dest='drop_self_edges_except', default=None)
arg_parser.add_argument('--dropNegated', dest='drop_negated', action='store_true', default=False)
return arg_parser
if __name__ == '__main__':
args = make_arg_parser().parse_args()
predicate_remap_file_name = args.predicateRemapYaml
+ infores_remap_file_name = args.inforesRemapYaml
curies_to_uri_file_name = args.curiesToURIFile
input_file_name = args.inputFileJson
output_file_name = args.outputFileJson
test_mode = args.test
drop_negated = args.drop_negated
drop_self_edges_except = args.drop_self_edges_except
if drop_self_edges_except is not None:
assert type(drop_self_edges_except) == str
drop_self_edges_except = set(drop_self_edges_except.split(','))
predicate_remap_config = kg2_util.safe_load_yaml_from_string(kg2_util.read_file_to_string(predicate_remap_file_name))
+ infores_remap_config = kg2_util.safe_load_yaml_from_string(kg2_util.read_file_to_string(infores_remap_file_name))
map_dict = kg2_util.make_uri_curie_mappers(curies_to_uri_file_name)
[curie_to_uri_expander, uri_to_curie_shortener] = [map_dict['expand'], map_dict['contract']]
graph = kg2_util.load_json(input_file_name)
new_edges = dict()
relation_curies_not_in_config = set()
+ provided_by_curies_not_in_config_nodes = set()
+ provided_by_curies_not_in_config_edges = set()
record_of_relation_curie_occurrences = {relation_curie: False for relation_curie in
predicate_remap_config.keys()}
command_set = {'delete', 'keep', 'invert', 'rename'}
for relation_curie, command in predicate_remap_config.items():
assert len(command) == 1
assert next(iter(command.keys())) in command_set
relation_curies_not_in_nodes = set()
- nodes_dict = {node['id']: node for node in graph['nodes']}
+ nodes_dict = dict()
+ for node_dict in graph['nodes']:
+ node_id = node_dict['id']
+ provided_by = node_dict['provided_by']
+ infores_curie_dict = infores_remap_config.get(provided_by, None)
+ if infores_curie_dict is None:
+ provided_by_curies_not_in_config_nodes.add(provided_by)
+ else:
+ infores_curie = infores_curie_dict['infores_curie']
+ node_dict['knowledge_source'] = infores_curie
+ nodes_dict[node_id] = node_dict
edge_ctr = 0
for edge_dict in graph['edges']:
edge_ctr += 1
if edge_ctr % 1000000 == 0:
print('processing edge ' + str(edge_ctr) + ' out of ' + str(len(graph['edges'])))
if drop_negated and edge_dict['negated']:
continue
relation_label = edge_dict['relation_label']
predicate_label = relation_label
relation_curie = edge_dict['relation']
predicate_curie = relation_curie
if record_of_relation_curie_occurrences.get(relation_curie, None) is not None:
record_of_relation_curie_occurrences[relation_curie] = True
pred_remap_info = predicate_remap_config.get(relation_curie, None)
else:
# there is a relation CURIE in the graph that is not in the config file
relation_curies_not_in_config.add(relation_curie)
pred_remap_info = {'keep': None}
assert pred_remap_info is not None
invert = False
get_new_rel_info = False
if pred_remap_info is None:
assert relation_curie in relation_curies_not_in_config
else:
if 'delete' in pred_remap_info:
continue
remap_subinfo = pred_remap_info.get('invert', None)
if remap_subinfo is not None:
invert = True
get_new_rel_info = True
else:
remap_subinfo = pred_remap_info.get('rename', None)
if remap_subinfo is None:
assert 'keep' in pred_remap_info
else:
get_new_rel_info = True
if get_new_rel_info:
predicate_label = remap_subinfo[0]
predicate_curie = remap_subinfo[1]
if invert:
edge_dict['relation_label'] = 'INVERTED:' + relation_label
new_object = edge_dict['subject']
edge_dict['subject'] = edge_dict['object']
edge_dict['object'] = new_object
edge_dict['predicate_label'] = predicate_label
if drop_self_edges_except is not None and \
edge_dict['subject'] == edge_dict['object'] and \
predicate_label not in drop_self_edges_except:
continue # see issue 743
edge_dict['predicate'] = predicate_curie
if predicate_curie not in nodes_dict:
predicate_curie_prefix = predicate_curie.split(':')[0]
predicate_uri_prefix = curie_to_uri_expander(predicate_curie_prefix + ':')
if predicate_uri_prefix == predicate_curie_prefix:
relation_curies_not_in_nodes.add(predicate_curie)
+ provided_by = edge_dict['provided_by']
+ infores_curie_dict = infores_remap_config.get(provided_by, None)
+ if infores_curie_dict is None:
+ provided_by_curies_not_in_config_edges.add(provided_by)
+ infores_curie = None  # avoid silently reusing the previous edge's curie
+ else:
+ infores_curie = infores_curie_dict['infores_curie']
- edge_dict['provided_by'] = [edge_dict['provided_by']]
+ edge_dict['provided_by'] = [provided_by]
+ edge_dict['knowledge_source'] = [infores_curie]
edge_key = edge_dict['subject'] + ' /// ' + predicate_label + ' /// ' + edge_dict['object']
existing_edge = new_edges.get(edge_key, None)
if existing_edge is not None:
existing_edge['provided_by'] = sorted(list(set(existing_edge['provided_by'] + edge_dict['provided_by'])))
+ existing_edge['knowledge_source'] = sorted(list(set(existing_edge['knowledge_source'] + edge_dict['knowledge_source'])))
existing_edge['publications'] += edge_dict['publications']
existing_edge['publications_info'].update(edge_dict['publications_info'])
else:
new_edges[edge_key] = edge_dict
del graph['edges']
+ del graph['nodes']
+ graph['nodes'] = list(nodes_dict.values())
del nodes_dict
graph['edges'] = list(new_edges.values())
del new_edges
for relation_curie in record_of_relation_curie_occurrences:
if not record_of_relation_curie_occurrences[relation_curie]:
print('relation curie is in the config file but was not used in any edge in the graph: ' + relation_curie, file=sys.stderr)
for relation_curie in relation_curies_not_in_nodes:
print('could not find a node for relation curie: ' + relation_curie)
relation_curies_not_in_config_for_iteration = list(relation_curies_not_in_config)
for relation_curie_not_in_config in relation_curies_not_in_config_for_iteration:
if not relation_curie_not_in_config.startswith(kg2_util.CURIE_PREFIX_BIOLINK + ':'):
print('relation curie is missing from the YAML config file: ' + relation_curie_not_in_config,
file=sys.stderr)
else:
relation_curies_not_in_config.remove(relation_curie_not_in_config)
+ for provided_by_curies_not_in_config_node in provided_by_curies_not_in_config_nodes:
+ print('provided_by node curie is missing from the YAML config file: ' + provided_by_curies_not_in_config_node,
+ file=sys.stderr)
+ for provided_by_curies_not_in_config_edge in provided_by_curies_not_in_config_edges:
+ print('provided_by edge curie is missing from the YAML config file: ' + provided_by_curies_not_in_config_edge,
+ file=sys.stderr)
if len(relation_curies_not_in_config) > 0:
print("There are relation curies missing from the yaml config file. Please add them and try again. Exiting.", file=sys.stderr)
exit(1)
+ if len(provided_by_curies_not_in_config_nodes) > 0:
+ print("There are nodes provided_by curies missing from the yaml config file. Please add them and try again. Exiting.", file=sys.stderr)
+ exit(1)
+ if len(provided_by_curies_not_in_config_edges) > 0:
+ print("There are edges provided_by curies missing from the yaml config file. Please add them and try again. Exiting.", file=sys.stderr)
+ exit(1)
update_date = datetime.now().strftime("%Y-%m-%d %H:%M")
version_file = open(args.versionFile, 'r')
build_name = str
for line in version_file:
test_flag = ""
if test_mode:
test_flag = "-TEST"
build_name = "RTX KG" + line.rstrip() + test_flag
break
build_node = kg2_util.make_node(kg2_util.CURIE_PREFIX_RTX + ':' + 'KG2',
kg2_util.BASE_URL_RTX + 'KG2',
build_name,
kg2_util.SOURCE_NODE_CATEGORY,
update_date,
kg2_util.CURIE_PREFIX_RTX + ':')
build_info = {'version': build_node['name'], 'timestamp_utc': build_node['update_date']}
pprint.pprint(build_info)
graph["build"] = build_info
graph["nodes"].append(build_node)
kg2_util.save_json(graph, output_file_name, test_mode)
del graph
- Update `kg_json_to_tsv.py`.
The instructions are the same as above (see ETL-Based Property Add Section 4).
- Update `run-simplify.sh` and `master-config.shinc`.
If you've updated the parameters of `filter_kg_and_remap_predicates.py`, you will need to update `run-simplify.sh` (which runs that script) to pass the new parameters. In addition, you will want to update `master-config.shinc` to define the new parameter.
if [ -z ${test_suffix+x} ]; then test_suffix=""; fi
BUILD_DIR=~/kg2-build
VENV_DIR=~/kg2-venv
CODE_DIR=~/kg2-code
umls_dir=${BUILD_DIR}/umls
umls_dest_dir=${umls_dir}/META
s3_region=us-west-2
s3_bucket=rtx-kg2
s3_bucket_public=rtx-kg2-public
s3_bucket_versioned=rtx-kg2-versioned
s3_cp_cmd="aws s3 cp --no-progress --region ${s3_region}"
mysql_conf=${BUILD_DIR}/mysql-config.conf
curl_get="curl -s -L -f"
curies_to_categories_file=${CODE_DIR}/curies-to-categories.yaml
curies_to_urls_file=${CODE_DIR}/curies-to-urls-map.yaml
predicate_mapping_file=${CODE_DIR}/predicate-remap.yaml
+infores_mapping_file=${CODE_DIR}/kg2-provided-by-curie-to-infores-curie.yaml
ont_load_inventory_file=${CODE_DIR}/ont-load-inventory${test_suffix}.yaml
umls2rdf_config_master=${CODE_DIR}/umls2rdf-umls.conf
rtx_config_file=RTXConfiguration-config.json
biolink_model_version=2.1.0
${VENV_DIR}/bin/python3 -u ${CODE_DIR}/filter_kg_and_remap_predicates.py ${test_flag} --dropNegated \
   --dropSelfEdgesExcept interacts_with,positively_regulates,inhibits,increase \
- ${predicate_mapping_file} ${curies_to_urls_file} ${input_json} ${output_json} \
+ ${predicate_mapping_file} ${infores_mapping_file} ${curies_to_urls_file} ${input_json} ${output_json} \
${local_version_filename}
- (Optional) Add a validation script for the mapping file.
The purpose of this is to verify our own local mapping file against Biolink mappings. However, since Biolink doesn't yet contain infores mappings, this will be developed later for this specific example. In the meantime, a structural starting point is sketched below.
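Here is a minimal sketch of the structural checks such a script might start with. It assumes only the two field names used in the YAML snippet above and the mapping file name from `master-config.shinc`; this is not an existing script in the repository:

import sys

import yaml

VALID_KNOWLEDGE_TYPES = {'knowledge_source', 'aggregator_knowledge_source'}

with open('kg2-provided-by-curie-to-infores-curie.yaml', 'r') as mapping_file:
    mappings = yaml.safe_load(mapping_file)

errors = []
for provided_by, info in mappings.items():
    # Every entry should carry a well-formed infores CURIE and a known type
    if not info.get('infores_curie', '').startswith('infores:'):
        errors.append(provided_by + ': missing or malformed infores_curie')
    if info.get('knowledge_type') not in VALID_KNOWLEDGE_TYPES:
        errors.append(provided_by + ': unexpected knowledge_type')

for error in errors:
    print(error, file=sys.stderr)
if len(errors) > 0:
    sys.exit(1)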