This is a repository for classifying protein domains of the PFam dataset available on Kaggle. It is a mirror of a cleaned private repository written for a coding challenge.
The repository contains several directories:
- `code/`: main code for running the trainings and getting the results of the neural networks
- `latex/`: directory containing the LaTeX source files as well as the `main.pdf` report
- `proteinclass/`: Python library written and used in `code/`, using PyTorch for the deep learning part
The library is written in an object-oriented fashion and is inspired by the pytorch-template made by @victoresque on GitHub and the PlasmaNet project. The content of each file is described below:
- `dataloader.py`: contains the `ProteinDataset` class which overloads the base `Dataset` class from PyTorch and reads the PFam database
- `log.py`: contains functions for creating logging objects
- `model.py`: contains the neural network architectures. A base class is first written which inherits from the `nn.Module` class of PyTorch; this base class holds a method for computing the number of trainable parameters, and the other three classes inherit from it (a minimal sketch of this pattern is shown after this list)
- `multiple_train.py`: script to launch multiple trainings sequentially
- `pproc.py`: routines for post-processing the `metrics.h5` file generated during training
- `predict.py`: script for inference on a given set of sequences
- `train.py`: performs the actual training of the neural networks
- `trainer.py`: contains the `Trainer` class instantiated in `train.py`
- `util.py`: contains common functions used across the library
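To make the base-class pattern described for `model.py` concrete, here is a minimal sketch. The names `BaseModel`, `count_parameters` and `SimpleNet` are hypothetical placeholders for illustration; they are not the repository's actual classes:

```python
import torch.nn as nn


class BaseModel(nn.Module):
    """Hypothetical base class holding utilities shared by all architectures."""

    def count_parameters(self) -> int:
        # Number of trainable parameters of the network
        return sum(p.numel() for p in self.parameters() if p.requires_grad)


class SimpleNet(BaseModel):
    """One of the concrete architectures inheriting from the base class."""

    def __init__(self, n_tokens: int, n_labels: int, embed_dim: int = 32):
        super().__init__()
        self.embedding = nn.Embedding(n_tokens, embed_dim)
        self.fc = nn.Linear(embed_dim, n_labels)

    def forward(self, x):
        # x: (batch, sequence_length) of integer-encoded amino acids
        emb = self.embedding(x).mean(dim=1)  # average over the sequence
        return self.fc(emb)
```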
The library is called via executables created in `setup.py`:
```python
entry_points={
    'console_scripts': [
        'train_network=proteinclass.train:main',
        'train_networks=proteinclass.multiple_train:main',
        'plot_metrics=proteinclass.pproc:main',
        'predict=proteinclass.predict:main'
    ],
},
```
Each of these executables is run via a configuration file written in YAML format (for example `code/results/200labels/simple.yml` for the `train_network` executable). These configuration files are converted into Python dictionaries which are parsed throughout the code, as sketched below.
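A minimal sketch of this conversion using the standard PyYAML package (the path below is the example configuration mentioned above):

```python
import yaml  # PyYAML

# Load the YAML configuration into a plain Python dictionary
with open('code/results/200labels/simple.yml', 'r') as f:
    config = yaml.safe_load(f)

# Nested YAML sections become nested dictionaries that the code can parse,
# e.g. config['model'] or config['trainer'] depending on the file's layout.
```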
To install the library you need Python 3.8 or above. Run the following command at the root of the repository (preferably in a virtual environment):
```bash
pip install -e .
```
If some packages are missing, a snapshot of a working Python 3.9 environment on macOS is provided in `requirements.txt`, in which case the following command should be run:
```bash
pip install -r requirements.txt
```
To run PyTorch on GPUs, a version of PyTorch built with CUDA support must be installed. Please refer to the PyTorch installation instructions for a proper setup. For example, for PyTorch 1.10.1 with CUDA 11.1 on Linux, run the following command in a terminal:
```bash
pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```
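Once installed, the CUDA build can be checked from Python (a quick sanity check, not part of the repository):

```python
import torch

print(torch.__version__)           # e.g. 1.10.1+cu111
print(torch.cuda.is_available())   # True if PyTorch sees a CUDA-capable GPU
```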
To run the code, the dataset from https://www.kaggle.com/googleai/pfam-seed-random-split should be downloaded and placed inside the `code/` directory so that there is a `code/input/random_split` directory.
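The layout can be verified with a short snippet; the `train`/`dev`/`test` split names below are an assumption based on the Kaggle dataset's published structure, not something documented by this repository:

```python
import os

# Verify the dataset was extracted to the location the code expects
root = os.path.join('code', 'input', 'random_split')
for split in ('train', 'dev', 'test'):   # assumed split directories
    path = os.path.join(root, split)
    print(f"{path}: {'found' if os.path.isdir(path) else 'missing'}")
```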
After that, the 200-labels training can be launched by executing `run.sh` inside `results/200labels/`:
```bash
./run.sh
```
The trainings using 5000 and 15000 labels can be performed by executing `run.sh` inside the `results/MoreLabels/` directory.
All the figures of the first part of the report can be plotted by running the `main.py` script inside `code/dataset_analysis`:
```bash
cd code/dataset_analysis
python main.py
```