This project was conducted in the winter term 2021/22 at the Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany. Its goal is to perform analyses on metadata of COVID-19 sequencing data in order to identify sampling biases.
The EBI metadata was retrieved from the ENA Portal by
- selecting the data type "Nucleotide sequences" and clicking "next",
- selecting "Taxonomy and related", filtering for "NCBI Taxonomy", typing "Severe acute respiratory syndrome coronavirus 2", including subordinate taxa, and double-clicking "next" to go to "Fields",
- selecting all metadata fields and clicking "Search".
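The same selection can probably also be expressed as a request to the ENA Portal API instead of clicking through the web form. The following is a minimal sketch, not the way the metadata for this project was actually retrieved. It only builds the request URL. The endpoint, the `result=sequence` data type, the `tax_tree(...)` query for a taxon including subordinate taxa, the taxonomy ID 2697049, and the three example field names are assumptions that should be checked against the ENA Portal API documentation. `build_ena_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed values, not taken from this project: the public ENA Portal API
# search endpoint and the NCBI taxonomy ID of SARS-CoV-2.
ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"
SARS_COV_2_TAX_ID = 2697049


def build_ena_query_url(fields):
    """Build a search URL for nucleotide sequences of SARS-CoV-2 and its subordinate taxa."""
    params = {
        "result": "sequence",                          # assumed name of the "Nucleotide sequences" data type
        "query": f"tax_tree({SARS_COV_2_TAX_ID})",     # taxon including subordinate taxa
        "fields": ",".join(fields),                    # metadata columns to return
        "format": "tsv",
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


# Illustrative field names only; the project selected all available fields.
url = build_ena_query_url(["accession", "country", "collection_date"])
print(url)
```

The resulting URL could then be downloaded with any HTTP client. Selecting all fields, as described above, would mean passing the full list of field names from the API documentation.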
The definitions of the included columns can be found in definitions_EBI_metadata.json, taken from the API documentation of the ENA Portal (see pages 36 to 42).
The GISAID metadata was retrieved as a JSON file via internal API access to the GISAID database. Rough documentation of the included columns can be found in the submission protocol of the GISAID portal.
The data used to integrate the number of infections (per country or per U.S. State) in our analyses is provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It can be found in the corresponding GitHub repository of the JHU CSSE.
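To illustrate how infection counts per country can be derived from this data, here is a minimal sketch. It assumes the wide time-series layout used in the JHU CSSE repository: one row per province or country, a `Country/Region` column, and one column of cumulative case counts per date, with the latest date last. The function name and the sample values are made up for illustration and are not real data. The sketch uses only the standard library.

```python
import csv
import io
from collections import defaultdict


def latest_cases_per_country(csv_text):
    """Sum the most recent cumulative case count over all rows of each country."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    country_idx = header.index("Country/Region")
    totals = defaultdict(int)
    for row in reader:
        # The last column holds the cumulative count for the latest date.
        totals[row[country_idx]] += int(row[-1])
    return dict(totals)


# Tiny made-up sample in the assumed layout (the numbers are not real data).
sample = """Province/State,Country/Region,Lat,Long,1/1/22,1/2/22
,Germany,51.0,9.0,100,120
Hubei,China,30.9,112.2,50,50
Beijing,China,40.1,116.4,10,15
"""
print(latest_cases_per_country(sample))  # {'Germany': 120, 'China': 65}
```

Countries that are reported per province are summed into a single total, which is the granularity needed to relate sample counts to infection counts per country.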
The project is divided into several Jupyter Notebooks and Python scripts dealing with either EBI or GISAID metadata. The analyses for each of the two metadata files can be found in a separate folder:
- EBI Metadata Analyses:
  - preprocess_metadata.ipynb removes empty columns and cleans the provided metadata. It has to be executed before the following notebooks because it also adds several columns needed for later analysis.
  - create_general_insights.ipynb generates a profiling report as HTML about all available columns and produces general insights, e.g. regarding specific biases or columns.
  - analyze_missing_values.ipynb investigates if missing values are introduced by certain countries and creates PNGs disaggregating how many missing values exist in the different countries per column.
  - analyze_us_states.ipynb extracts the U.S. States from the "region" column and analyzes the number of samples per U.S. State in relation to the respective number of inhabitants and infections of the state.
- GISAID Metadata Analyses:
  - preprocess_metadata.ipynb converts the given JSON file to a TSV file, cleans the resulting TSV and saves it as a CSV file. It has to be executed before the following notebooks because it also adds columns needed for later analysis.
  - convert_json_to_tsv.py provides the JSON-to-TSV conversion as a single script for external use.
  - create_general_insights.ipynb produces a general overview of the available features, sequence counts per country and available virus variants.
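The JSON-to-TSV conversion mentioned for the GISAID metadata can be sketched as follows. This is not the code of convert_json_to_tsv.py but a minimal illustration. It assumes that the input contains one flat JSON object per line, and the helper name `json_lines_to_tsv` as well as the example keys are hypothetical.

```python
import csv
import io
import json


def json_lines_to_tsv(json_lines, out_file):
    """Write newline-delimited JSON records as TSV and return the column names."""
    records = [json.loads(line) for line in json_lines if line.strip()]
    # Union of all keys in first-seen order, so that no record loses a field.
    columns = list(dict.fromkeys(key for rec in records for key in rec))
    writer = csv.DictWriter(
        out_file,
        fieldnames=columns,
        delimiter="\t",
        restval="",          # fields missing from a record become empty cells
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return columns


# Made-up records with hypothetical keys, only to show the behaviour.
lines = ['{"id": "a1", "country": "Germany"}', '{"id": "b2", "lineage": "X"}']
buffer = io.StringIO()
json_lines_to_tsv(lines, buffer)
print(buffer.getvalue())
```

Collecting the column names over all records first costs one pass over the data but keeps the header complete when records do not all share the same keys. Nested JSON values would need to be flattened before writing, which this sketch does not attempt.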
Each notebook that reads a CSV does so in its first cell and then runs several analyses on it, each of which is introduced with a heading in a Markdown cell. The first cell must therefore be executed before the cells of any analysis. Once it has been executed, it does not have to be executed again for the other analyses in the same notebook, because an analysis at most enriches the loaded CSV data and never changes the original data.