I'm aiming to generate protein graphs in bulk in order to then perform unsupervised clustering on them. I would also like to repeat this process on several different proteomes.
I would also like to apply several intermediate steps (e.g. select a subgraph of radius r for each graph; select a subgraph by RSA threshold).
So far, I have seen that ProteinGraphDataset retrieves PDB files from a list of ids (either UniProt or PDB accession codes) and downloads from PDB or AF2, and the 'intermediate steps' can be achieved by supplying functions to the graph_transformation_funcs parameter.
However, I would like to use a subset of a proteome (list of IDs) and an already existing set of .pdb files in a directory (as opposed to downloading them again). Would a more elegant solution be possible, along the lines of the existing command line interface?
I was thinking that some sort of 'pipeline' could be written as a CLI command, perhaps by providing:
- path to file containing list of protein IDs
- path to directory containing structures (also where new ones will be downloaded if required)
- which database to use if UniProt IDs used (e.g. swissprot or AF2)
- path to config.yml file for graph construction
- path to graph_processing.yml file detailing a list of functions to apply (e.g. subgraph selection)
- output path for graphs (can specify format, e.g. nx.Graph or pyg)
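A minimal sketch of how the inputs listed above might map onto a command, using only `argparse` from the standard library. Every flag name here (`--ids`, `--structure-dir`, `--database`, `--config`, `--processing`, `--output`, `--format`) and the program name are hypothetical; none of them are taken from the existing graphein CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All flag names are hypothetical illustrations, not real graphein CLI options.
    parser = argparse.ArgumentParser(prog="protein-graph-pipeline")
    parser.add_argument("--ids", required=True,
                        help="file containing one protein ID per line")
    parser.add_argument("--structure-dir", required=True,
                        help="directory of existing structures; new downloads go here")
    parser.add_argument("--database", choices=["swissprot", "af2"], default="af2",
                        help="database to use when the IDs are UniProt accessions")
    parser.add_argument("--config", help="config.yml for graph construction")
    parser.add_argument("--processing",
                        help="graph_processing.yml listing functions to apply")
    parser.add_argument("--output", required=True, help="output path for graphs")
    parser.add_argument("--format", choices=["nx", "pyg"], default="nx",
                        help="output graph format")
    return parser


# Example invocation with an explicit argument list instead of sys.argv.
args = build_parser().parse_args([
    "--ids", "proteome_subset.txt",
    "--structure-dir", "structures",
    "--config", "config.yml",
    "--processing", "graph_processing.yml",
    "--output", "graphs",
])
```

Keeping the construction config and the processing file as separate arguments would let the same processing steps be reused across proteomes by changing only `--ids` and `--structure-dir`.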
This is just my naive idea for now, I haven't fleshed out exactly how it would work; but maybe a way to describe 'transformations' in a processing.yml file in a similar way to the ProteinGraphConfig parser?
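One way the 'transformations in a file' idea could work is a name-to-function registry: each entry in the processing file names a function and its parameters, and the pipeline resolves them into callables. The sketch below is an assumption throughout: `REGISTRY`, `register`, `build_transforms` and `apply_transforms` are hypothetical names, the spec is written as a Python list standing in for a parsed YAML file, and plain lists of floats stand in for graphs so the example runs without any dependencies.

```python
from functools import partial
from typing import Any, Callable, Dict, List

# Hypothetical registry: maps names used in the processing file to callables.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that records a function under a name usable in the spec."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[name] = func
        return func
    return decorator


# Toy stand-ins for graph transformations; a "graph" is just a list of floats.
@register("drop_below")
def drop_below(values: List[float], threshold: float) -> List[float]:
    return [v for v in values if v >= threshold]


@register("keep_first_n")
def keep_first_n(values: List[float], n: int) -> List[float]:
    return values[:n]


def build_transforms(specs: List[Dict[str, Any]]) -> List[Callable[[Any], Any]]:
    """Bind each spec's parameters, leaving a one-argument callable per step."""
    return [partial(REGISTRY[s["name"]], **s.get("params", {})) for s in specs]


def apply_transforms(graph: Any, transforms: List[Callable[[Any], Any]]) -> Any:
    """Apply the transformations in the order they were listed."""
    for transform in transforms:
        graph = transform(graph)
    return graph


# What a parsed processing file might look like (structure is an assumption).
specs = [
    {"name": "drop_below", "params": {"threshold": 0.2}},
    {"name": "keep_first_n", "params": {"n": 2}},
]
result = apply_transforms([0.1, 0.5, 0.3, 0.9], build_transforms(specs))
```

Because each resolved step takes a single argument, a list built this way has the shape of a list of single-argument functions, which is what the graph_transformation_funcs parameter mentioned above is described as accepting.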
I think a framework that allows people to script pipelines (like the one I am trying to make) from the command line would allow for ease of experimentation and simplicity, compared to making it all in python using the low-level functions.
Would appreciate any thoughts on this!
To address your immediate problem, I think you can try just passing the filenames (no extension) as the pdb_code arg in ProteinGraphDataset. The download is only triggered if the files are not found in the DATA_DIR/raw directory, so if you place your PDBs there it should behave the way you want.
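To see in advance which IDs would be found locally and which would be missing, a check like the following could be run against the directory. This is a standalone sketch, not graphein's own lookup logic: `split_local_and_missing` is a hypothetical helper, and it assumes files are named `<ID>.pdb`.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def split_local_and_missing(
    ids: Iterable[str], structure_dir: str, suffix: str = ".pdb"
) -> Tuple[List[str], List[str]]:
    """Partition IDs by whether <structure_dir>/<id><suffix> already exists."""
    directory = Path(structure_dir)
    local: List[str] = []
    missing: List[str] = []
    for protein_id in ids:
        if (directory / f"{protein_id}{suffix}").is_file():
            local.append(protein_id)
        else:
            missing.append(protein_id)
    return local, missing


# Demo in a temporary directory holding a single placeholder structure file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "P12345.pdb").touch()
    local, missing = split_local_and_missing(["P12345", "Q67890"], tmp)
```

Here `local` holds the IDs whose files are already present and `missing` holds those that would still need to be fetched.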
I can provide some support and help implement some of this if you're keen to build this feature. I don't have the bandwidth at the moment to pick this up on my own though.
Sure, I've already built something like this for my own use case so would be happy to figure out an elegant way to make it generalisable and add it to the graphein CLI. Will let you know if I'm stuck!