Merge pull request #50 from bigbio/dev

Major PR [DO NOT MERGE]
bigbio · Sep 11, 2024 · a3eff82
2 parents 16f0361 + b9a4d7f
commit a3eff82

Showing 77 changed files with 13,656 additions and 1,674 deletions.
diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml
@@ -47,5 +47,4 @@ jobs:
     - name: Test with pytest
       run: |
         python setup.py install
-        cd quantmsio
         python -m unittest
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -38,5 +38,4 @@ jobs:
     - name: Test with pytest
       run: |
         python setup.py install
-        cd quantmsio
-        python -m unittest
+        pytest -v -s tests/
diff --git a/.gitignore b/.gitignore
@@ -3,4 +3,5 @@ dist
 build
 .idea
 data
-docs/.vscode
+docs/.vscode
+__pycache__
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
diff --git a/MANIFEST.md b/MANIFEST.in
@@ -5,4 +5,4 @@ include README.md
 include LICENSE
 
 # Include the data files
-recursive-include quantms_io *.xml *.yml
+recursive-include quantmsio *.xml *.yml *.tsv
diff --git a/README.md b/README.md
@@ -6,60 +6,14 @@
 [![Documentation Status](https://readthedocs.org/projects/quantmsio/badge/?version=latest)](https://quantmsio.readthedocs.io/en/latest/?badge=latest)
 [![PyPI version](https://badge.fury.io/py/quantmsio.svg)](https://badge.fury.io/py/quantmsio)
 
-[quantms](https://quantms.org) is a nextflow pipeline for the analysis of quantitative proteomics data. The pipeline is based on the [OpenMS](https://www.openms.de/) framework and [DIA-NN](https://github.com/vdemichev/DiaNN); and it is designed to analyze large scale experiments. The main outputs of quantms workflow are the following: 
-
-- [mzTab](https://github.com/HUPO-PSI/mzTab) files with the identification and quantification information.
-- [MSstats](https://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf) input file with the peptide quantification values needed for the MSstats analysis.
-- [MSstats](https://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf) output file with the differential expression values for each protein. 
-- The input [SDRF](https://github.com/bigbio/proteomics-sample-metadata) of the pipeline if available. 
-
-While all the previous formats are well-known standards and popular formats in the proteomics community; they are difficult to use in big data analysis projects. In addition, these file formats are difficult to extend and provide multiple views of the underlying data. For example, in mzTab it is extremely hard for big datasets to retrieve the identified peptides and features and the corresponding intensities. At the same time it is difficult to get the protein quantification values for a given sample.  
-
-Here, we aim to formalize and develop a more standardized format that enables better representation of the identification and quantification results but also enables new and novel use cases for proteomics data analysis. The main use cases for the format are:  
-
-- Fast and easy visualization of the identification and quantification results.
-- Easy integration with other omics data.
-- Easy integration with sample metadata.
-- AI/ML model development based on identification and quantification results.
-- Easy data retrieval for big datasets and large-scale collections of proteomics data.
-
->**Note**: We are not trying to replace the mzTab format, but to provide a new format that enables AI-related use cases. Most of the features of the mzTab format will be included in the new format.  
-
-## Data model
-
-quantms.io could be seen as a **multiple view** representation of a proteomics data analysis results. Each view of the format can be serialized in different formats depending on the use case. the **data model** of quantms.io defines two main things, the **view** and how the view is **serialized**. 
-
-- The **data model view** defines the structure, the fields and properties that will be included in a view for each peptide, psms, feature or protein, for example.    
-- The **data serialization** defines the format in which the view will be serialized and what features of serialization will be supported, for example compression, indexing or slicing.
-
-| view         | file class        | serialization format | definition                                                      | example                                                                                                       |
-|:-------------|:------------------|:---------------------|:----------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------|
-| psm          | psm_file          | _parquet_            | [psm](docs/psm.rst)                                             | [psm example](docs/include/PXD002854-80934754-57c1-47e2-9951-787ef703a484.psm.parquet)                        |
-| feature      | feature_file      | _parquet_            | [feature](docs/feature.rst)                                     | [feature example](docs/include/PXD004683-219a8a0a-d6a8-44c9-9e51-1851876d2f69.feature.parquet)                |
-| absolute     | absolute_file     | _tsv_                | [absolute](docs/ae.rst)                                         | [absolute example](docs/include/PXD004683-quantms.tsv)                                                        |
-| differential | differential_file | _tsv_                | [differential](docs/differential.rst)                           | [differential example](docs/include/PXD004683-219a8a0a-d6a8-44c9-9e51-1851876d2f69.differential.tsv)          |
-| sdrf         | sdrf_file         | _tsv_                | [metadata](https://github.com/bigbio/proteomics-sample-metadata)| [sdrf example](https://github.com/bigbio/proteomics-sample-metadata/tree/master/annotated-projects/PXD000612) |
-|project | - | _json_ | [project](docs/project.rst) | -- |
-
-> **Note**: Views can be extended and new views can be added to the format.
-
-### Introduction to quantms.io
-
-A quantms.io file is a collection of views, and they are aggregated into a folder `.qms` and inside that folder a file collect `project.json` MUST be present. Please read about the [project view](docs/project.rst) for more information. 
-
-The introduction to the format, concepts and more details topics about serialization can be read in the introduction to the format [here](docs/introduction.rst).
-
-## How to contribute
-
-External contributors, researchers and the proteomics community are more than welcome to contribute to this project.
-
-Contribute with the specification: you can contribute to the specification with ideas or refinements by adding an issue into the [issue tracker](https://github.com/bigbio/proteomics-quant-formats/issues) or performing a PR.
+The specification of quantms.io is a community-driven effort to create a standard for the representation of quantitative proteomics data. The specification is designed to be simple to implement and to be able to represent the most common types of quantitative proteomics data. The specification is designed to be extensible, and the full specification can be found in [the specification document](docs/README.adoc).
 
 ## Core contributors and collaborators
 
 The project is run by different groups:
 
 - Yasset Perez-Riverol (PRIDE Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
+- Ping Zheng (Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China)
 
 IMPORTANT: If you contribute with the following specification, please make sure to add your name to the list of contributors.