
Google Summer of Code


Timeline

| Year | Student            | Mentor(s)                              |
|------|--------------------|----------------------------------------|
| 2017 | Achilles Rasquinha | Dr. Akram Mohammed, Dr. Tomas Helikar  |
| 2018 | Rupav Jain         | Dr. Akram Mohammed, Achilles Rasquinha |

Google Summer of Code 2017


by Achilles Rasquinha


Overview


CancerDiscover is a complete end-to-end Machine Learning Pipeline (from Data Preprocessing to Model Deployment) dedicated to DNA Microarray data analysis and modelling. The toolkit includes an Affymetrix CEL to ARFF converter, predefined Search and Evaluator algorithm combinations for Feature Selection, and support for the SLURM Workload Manager. While the pipeline itself has been written primarily in Perl, its dependencies include R, Java and a bit of Bash. This made the overall pipeline quite heavy in terms of dependencies, making it difficult to deploy, especially on Windows machines.

The primary requirement for this summer was a neat Graphical User Interface built on top of CancerDiscover. Unlike many GSoC projects, which involve major contributions to existing code bases, GSoC '17 opened the way for a new, neatly rewritten OSS project - candis (a portmanteau of the words cancer and discover) - as a major extension and upgrade to the existing pipeline.

Work this Summer


Community Bonding Phase

5th May 2017 - 30th May 2017

During the initial meeting of the Community Bonding Phase, Dr. Akram expressed a keen requirement to have the pipeline parse more than just two feature vectors (CancerDiscover's ARFF parser was, at the time, incapable of doing this). This limited users' freedom to build better prediction models and thereby gain a better understanding of the data passed through the pipeline. candis now ensures that the pipeline is capable of parsing data sets with any number of dimensions/features (Commit 7304753).

Another concern raised during this phase was to build a Rich Internet Application (RIA) instead of a Qt-based Graphical User Interface (as initially proposed). This would create a better ecosystem for the to-be-built application, with great flexibility in design, server-based processing and easy deployment. We settled on Python as our default language for rewriting the pipeline, a Flask-based server (under the candis.app.server module), and React (under the candis.app.client module) as our front-end framework for the RIA. candis currently has a dedicated sub-module for the RIA under the candis.app Python module.

In short, we were able to go from this (a prototype submitted during the Application Phase)

[Screenshot: candis GUI prototype]

...to this

[Screenshot: candis RIA]

First Phase

30th May 2017 - 30th June 2017

Behind the Curtains

IO Handlers

candis currently comprises two IO Handlers, namely:

| Object   | Extension | Purpose                                                                       |
|----------|-----------|-------------------------------------------------------------------------------|
| CData    | .cdata    | Parsing, Viewing, Pre-processing, Converting and Serializing input datasets.  |
| Pipeline | .cpipe    | Configuring, Initiating, Manipulating and Running the ML Pipeline.            |

The primary goal of these IO handlers is to work flexibly with the RIA as well as the Command Line Interface. A candis.CData object instance acts as a wrapper around the input data set file and caters to Genome Data Files (in this case, Affymetrix CEL files) (NOTE: extending this to other file formats is open for contribution). Currently, a CData object instance handles Quality Control checks (Background Correction, Normalization, PM (Perfect Match) Correction and Summarization) in order to generate expression set values. There's not much magic here: we simply take the Bioconductor library affy as a dependency to pre-process an AffyBatch (a list of CEL files), and use rpy2, which acts as an interface between Python and R, to let candis drive that pre-processing from Python (Commit 4ae5621).
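
For the curious, here's a minimal sketch (not candis's actual internals; the CEL-file path is a placeholder) of how rpy2 can drive affy from Python:

>>> from rpy2.robjects.packages import importr
>>> affy  = importr('affy')                              # Bioconductor's affy package
>>> batch = affy.ReadAffy(celfile_path='path/to/cels')   # read CEL files into an AffyBatch
>>> eset  = affy.rma(batch)                              # background-correct, normalize and summarize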

R down, Perl/Bash to go.

Draw the Curtains

Viewing and manipulating CDATA files on the RIA required a rich, Excel-like viewer and editor widget. I implemented four primary React components - FileEditor, DataEditor and FileViewer, wrapped around a multi-purpose Modal component. We used react-data-grid as an extension to the above components.

Check out the demos on YouTube:

File Editor
File Viewer

Curtains, and ReST.

I love ReST, and candis's app is very much ReST-driven. Since our pipeline consists of chunked stages, it's best to have chunked routes too. candis currently exposes the following major routes:

| Route           | Purpose                                                  |
|-----------------|----------------------------------------------------------|
| /api/data       | Reading, Writing, Resource Discovery (Fetching Files)    |
| /api/pipeline   | Running Pipelines and Querying Status                    |
| /api/preprocess | Querying currently available Preprocessing methods      |
| /api/featselect | Querying currently available Feature Selection methods  |
| /api/model      | Querying currently available learning algorithms        |
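
As a rough illustration (the handler and its return values below are made up, not candis's actual code), each such route maps to a small Flask view that answers one chunk of the pipeline:

>>> from flask import Flask, jsonify
>>> app = Flask(__name__)
>>> @app.route('/api/preprocess', methods = ['GET'])
... def preprocess():
...     # respond with the currently available pre-processing methods
...     return jsonify(status = 'success', data = ['rma', 'mas5'])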

I believe there's need for improvement here. One, the pipeline currently runs as a single-threaded process; the routes /api/preprocess, /api/featselect and /api/model should therefore incorporate suitable IO-based actions. Two, the overall overhead within /api/pipeline should be reduced (I've witnessed the application leave errors unhandled, thus leaving the client in a confused state; moreover, pipelines surely require better thread management and throughput).

The problem with ReST is that there's no standardized response structure (unlike GraphQL), so I went ahead and built one. candis's Response object is a combination of the JSend and Google JSON Style Guide specifications (Commit b81fabc).
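
To give a feel for it (field names here are illustrative assumptions, not candis's exact schema), a JSend-flavoured response would look something like:

>>> response = {
...     'status': 'success',                    # JSend: 'success' | 'fail' | 'error'
...     'code'  : 200,                          # HTTP status echoed in the body
...     'data'  : { 'files': ['sample.cdata'] } # payload on success
... }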

Why ReST-driven? Primarily because Machine Learning Pipelines for DNA Microarrays are computationally heavy in terms of both memory and speed. A server-side, execution-based application structure makes candis easily deployable. In fact, an instance of candis.app is being continuously deployed on Heroku.

Custom Configuration

Customizing candis to one's needs is easy. I like the way the addict library revamps Python's dict object with a JavaScript Object-like interface. I worked extensively on building candis's Config data structure, which comprises configuration parameters for the CLI, the RIA and even the Pipeline (Commit df86a89). A candis.Config object works much like Python's dict, except that it is structured as an n-ary tree.

Each leaf node of the tree holds a configuration value. A leaf node is denoted by an uppercase attribute whereas each internal node is denoted by a capitalized attribute. - excerpts from the Documentation

This is inspired by CancerDiscover's Configuration.txt file, which helps users customize pipelines. In the case of candis, however, you customize the entire application's state. Custom Configuration comes along with a cache manager, and thanks to it, candis.Cache can now customize your configuration with values present within your $HOMEPATH/.candis/config.json file (Commit 88fbf331).
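
For instance (using a configuration path that appears later on this page), reading a leaf is just attribute access all the way down:

>>> import candis
>>> candis.CONFIG.Pipeline.FEATURE_SELECTION  # 'Pipeline' is an internal node, 'FEATURE_SELECTION' a leaf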

High-Level Abstraction

The canonical way of importing candis is as follows:

>>> import candis

That's it! You've got access to pretty much everything candis has to offer.

candis is built with high abstraction and modularity in mind, which means high-level APIs are a few LOCs away to help you build powerful ML models. For instance, converting a CData instance to ARFF is as easy as:

>>> cdata = candis.CData.load('path/to/filename.cdata')
>>> cdata.toARFF('path/to/filename.arff')

or how about re-configuring and running a pipeline:

>>> pipeline = candis.Pipeline(config = { 'preprocess': { 'background_correction': 'rma' }})
>>> pipeline.run(cdata)

The same goes for registering widgets. Almost all necessary tools (to build the pipeline) can be accessed via the ToolBox component.

[Screenshot: candis Toolbox]

As one can see, the toolbox consists of various instances of a Compartment object, each of which in turn consists of various Tool widgets. Registering new compartments and tools can be done within the compartments metadata object. You could even have them registered asynchronously! Compartment's fetcher prop will go ahead and build that for you (Commits - df86a898, 939f4de, 3e392e2, d9fdae5, 4be1f06).

One scope for improvement here would be to create a standardized data response for tools fetched asynchronously. There should be a direct mapping between tools and pipeline stages; as of now, there isn't any.

Second Phase

30th June 2017 - 28th July 2017

Behind the Curtains

Currently, the RIA represents a Pipeline as a sequence of stages. I believe this is open for exploration, primarily because ML Pipelines can be viewed as Graphs too (which had been my initial intuition and prototype). A data-flow paradigm is in fact ideal; but for now, sequences of stages it is.

CancerDiscover uses a pre-defined lookup of Feature Selection algorithm combinations provided by WEKA. candis makes this far more robust and "tweakable" by letting you register the algorithm combinations to be used within your $HOMEPATH/.candis/config.json file. The current structure is as follows:

>>> import random
>>> random.choice(candis.CONFIG.Pipeline.FEATURE_SELECTION)
{'evaluator': {'name': 'CfsSubsetEval'},
 'search': {'name': 'BestFirst', 'options': ['-D', '1', '-N', '5']},
 'use': False}

As one can see, you pass not just the Evaluator and Search class names but also a set of desired parameters (options). This, however, isn't available on the RIA yet, and is therefore open for contribution. There must be a neat way of passing parameter metadata upfront and values back.

Also, registering models goes the same way:

>>> random.choice(candis.CONFIG.Pipeline.MODEL)
{'label': 'k-Nearest Neighbor', 'name': 'lazy.IBk', 'use': False}

Adding the options parameter would help you tweak models too.

WEKA in Python

candis uses python-weka-wrapper (which in turn uses the python-javabridge library for running and accessing the Java Virtual Machine) to utilize (almost) everything WEKA has to offer. For macOS + Python 3 users, a quick warning: candis uses a bleeding-edge version of python-javabridge (check out python-javabridge's Issue #111).
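
To sketch what this looks like in practice (a minimal example, not candis's exact code; the classifier name matches the lazy.IBk entry from the MODEL config above):

>>> import weka.core.jvm as jvm
>>> from weka.classifiers import Classifier
>>> jvm.start()                                              # python-javabridge boots the JVM
>>> knn = Classifier(classname = 'weka.classifiers.lazy.IBk') # k-Nearest Neighbor
>>> jvm.stop()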

Future Work

  • Currently, candis parses Affymetrix CEL files alone. Dr. Akram expressed a keen desire to include more DNA Microarray IO handlers in the candis.ios module.
  • I'd initially worked on querying, searching and downloading data sets from the National Center for Biotechnology Information (NCBI) repository. I'd kept this aside in order to focus more on the RIA. The work (in progress) can be found under the candis.data.entrez module. This acts as an API wrapper around NCBI's extremely powerful server-side API - Entrez.
  • I'd like a toDataFrame method attached to the CData object (which returns a pandas.DataFrame; a rough sketch follows below). This would open an array of opportunities (no pun intended) to perform Data Analysis on DNA Microarrays within Python (no wrappers, no dependencies, just pure Python).
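
Something along these lines, perhaps (purely hypothetical; the cdata.data and cdata.attributes names below are assumptions, not candis's actual attributes):

>>> import pandas as pd
>>> def to_data_frame(cdata):
...     # rows are feature vectors, columns are named after the data set's attributes
...     return pd.DataFrame(cdata.data, columns = [attr.name for attr in cdata.attributes])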

Acknowledgements

candis, although written by me, is a brainchild of Dr. Akram and his team. I can't thank him enough for the immense support he provided throughout the bonding and development phases. He's zealous when it comes to building neat interfaces for the Bioinformatics community (there's a dire need for them), and we hope candis meets these goals. Right from providing domain knowledge to building the end application, Dr. Akram was there to guide me. Without his help and support (resources, guidance and mentorship), candis wouldn't have ended up being cutting-edge. He puts a great amount of faith in his students to achieve the desired results, and I thank him immensely for having that faith in me to get this almost production-ready by the end of the programme. I'd also like to thank Dr. Tomas for giving me the green light to build a server-based application, and for his constructive inputs during the initial stages of development.