
Commit

Add initial public release
twrightsman committed Apr 11, 2024
0 parents commit 8a046c5
Showing 107 changed files with 40,614 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
results/*/*.ipynb filter=lfs diff=lfs merge=lfs -text
results/*/figures/*.png filter=lfs diff=lfs merge=lfs -text
50 changes: 50 additions & 0 deletions .github/workflows/manuscript.yaml
@@ -0,0 +1,50 @@
name: Render manuscript on change

on:
push:
paths:
- 'docs/manuscript/**.typ'
- 'docs/manuscript/pixi.lock'
- 'results/*/figures/*'
workflow_dispatch:

env:
PIXI_VERSION: 0.15.1

jobs:
render:
name: Render manuscript
runs-on: ubuntu-22.04
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
lfs: true
- name: Cache pixi binary
id: cache-pixi
uses: actions/cache@v4
with:
path: ~/.local/bin/pixi
key: cache-pixi-${{ env.PIXI_VERSION }}
- if: ${{ steps.cache-pixi.outputs.cache-hit != 'true' }}
name: Install pixi
run: |
mkdir --parents $HOME/.local/bin
wget https://github.com/prefix-dev/pixi/releases/download/v${PIXI_VERSION}/pixi-x86_64-unknown-linux-musl --output-document=$HOME/.local/bin/pixi
chmod u+x $HOME/.local/bin/pixi
- name: Cache rattler cache
uses: actions/cache@v4
with:
path: ~/.cache/rattler
key: cache-rattler-manuscript-${{ hashFiles('docs/manuscript/pixi.lock') }}
- name: Build environment
run: $HOME/.local/bin/pixi install
working-directory: docs/manuscript
- name: Render manuscript
run: $HOME/.local/bin/pixi run build
working-directory: docs/manuscript
- name: Upload manuscript as artifact
uses: actions/upload-artifact@v4
with:
name: manuscript
path: docs/manuscript/manuscript.pdf
19 changes: 19 additions & 0 deletions .gitignore
@@ -0,0 +1,19 @@
tmp/

# Pixi
.pixi/

# Python
*.egg-info/
__pycache__/
*.py[cod]

# Vim
*.swp
.idea/

# IDE
.vscode/

# Jupyter
.ipynb_checkpoints/
7 changes: 7 additions & 0 deletions LICENSE
@@ -0,0 +1,7 @@
Copyright © 2024 Travis Wrightsman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
6 changes: 6 additions & 0 deletions README.md
@@ -0,0 +1,6 @@
# Current genomic deep learning architectures generalize across grass species but not alleles

## Dependencies

- [Pixi](https://pixi.sh)
- [git-lfs](https://git-lfs.github.com)
1 change: 1 addition & 0 deletions data/.gitignore
@@ -0,0 +1 @@
genomes/
700 changes: 700 additions & 0 deletions data/samples.tsv

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/manuscript/.gitignore
@@ -0,0 +1,2 @@
manuscript.pdf
.pixi/
5 changes: 5 additions & 0 deletions docs/manuscript/README.md
@@ -0,0 +1,5 @@
# Manuscript

```shell
pixi run build
```
12 changes: 12 additions & 0 deletions docs/manuscript/content/00-abstract.typ
@@ -0,0 +1,12 @@
Non-coding regions of the genome are just as important as coding regions for understanding the mapping from genotype to phenotype.
Interpreting deep learning models trained on RNA-seq is an emerging method to highlight functional sites within non-coding regions.
Most of the work on RNA abundance models has been done within humans and mice, with little attention paid to plants.
Here, we benchmark four genomic deep learning model architectures with genomes and RNA-seq data from 18 species closely related to maize and sorghum within the Andropogoneae.
The Andropogoneae are a tribe of C4 grasses that have adapted to a wide range of environments worldwide since diverging 18 million years ago.
Hundreds of millions of years of combined evolution across these species have produced a large, diverse pool of training alleles from species sharing a common physiology.
As model input, we extracted 1,026 base pairs upstream of each gene’s translation start site.
We held out maize as our test set and two closely related species as our validation set, training each architecture on the remaining Andropogoneae genomes.
Within a panel of 26 maize lines, all architectures predict expression across genes moderately well but poorly across alleles.
DanQ consistently ranked highest or second highest among all architectures, yet performance was generally very similar across architectures despite orders-of-magnitude differences in size.
This suggests that state-of-the-art supervised genomic deep learning models are able to generalize moderately well across related species but not sensitively separate alleles within species, the latter of which agrees with recent work within humans.
We are releasing the preprocessed data and code for this work as a community benchmark to evaluate new architectures on our across-species and across-allele tasks.
30 changes: 30 additions & 0 deletions docs/manuscript/content/01-introduction.typ
@@ -0,0 +1,30 @@
= Introduction

Non-coding regions of the genome are well-known to be as important as coding regions for understanding how genotype determines phenotype @bib-human-annotation-heritability @bib-maize-chromatin-heritability.
Though tools like AlphaFold2 @bib-AlphaFold2 have dramatically improved our ability to study coding sequence, similarly performing tools do not yet exist for non-coding regions.
Nevertheless, over the last decade deep learning models have rapidly improved performance in predicting non-coding genomic features such as chromatin accessibility @bib-Basenji @bib-a2z, transcription factor binding @bib-BPNet @bib-maize-kmer-grammar, and RNA abundance @bib-Borzoi @bib-Enformer directly from DNA sequence.
These models can then be queried to highlight functional non-coding sites, which can be useful for filtering large sets of variants down to promising genome editing targets.
Further, since most of the modeling work has been done on human and mouse data, there is a need to benchmark their performance in plants.

Models that predict RNA abundance from sequence are particularly attractive due to the relatively cheap cost and standardized protocols of RNA-seq.
However, there is room for improvement in these models across a number of areas.
While RNA abundance models have shown high performance across genes, recent work in humans @bib-personal-transcriptome-variation has highlighted their lack of sensitivity across individuals.
Some expression model architectures @bib-Enformer @bib-Borzoi include coding sequence in the input, which is known to lead to overfitting on gene family instead of true regulatory sequence differences @bib-hai-jacob-cnn.
There is also a tendency to maximize data when training these models, without actually measuring the rate of diminishing returns for each additional observation.
Finally, while multiple species have been included in some training sets, it is common to test on a set of held-out chromosomes within the training species, rather than testing on a completely held-out species.

Deep learning models benefit from large and diverse training sets of different tissues and genotypes, which are rarely available outside model species.
To train RNA expression models on larger sample sizes, we leveraged new long-read genomes and RNA-seq data from 15 wild species of the Andropogoneae tribe.
Diverging around 17.5 million years ago @bib-andropogoneae-divergence, the Andropogoneae includes globally staple crop plants such as maize, sorghum, and sugarcane.
Millions of years of evolution within the tribe have provided a large, diverse pool of training alleles.
Sorghum and maize diverged around 12 million years ago (Mya), on the order of the human-chimpanzee split (6--10 Mya), but have a 10-fold higher rate of nucleotide divergence @bib-maize-sorghum-divergence @bib-human-chimp-divergence.

We tested four sequence-based genomic deep learning architectures, DanQ @bib-DanQ, HyenaDNA @bib-HyenaDNA, FNetCompression @bib-FNetCompression, and a smaller version of Enformer @bib-Enformer, on their ability to predict across species and alleles.
DanQ is one of the earliest genomic deep learning architectures, leveraging a long short-term memory recurrent layer to learn the syntax and grammar of motifs detected by a convolutional layer.
Enformer is a massive transformer architecture with a context size near 100 kilobases that is among the best performing models for human expression prediction.
HyenaDNA is a novel architecture capable of handling long context windows of up to a million base pairs.
FNetCompression combines a fast Fourier transform with multi-head attention to efficiently learn from sequences of up to 10 kilobases with a few orders of magnitude fewer parameters than the other architectures.

We aimed to investigate, from a plant perspective, two major open questions in expression modeling from sequence:
1) How well do current sequence-based deep learning architectures generalize across species?
and 2) How sensitive are these models across individuals?
105 changes: 105 additions & 0 deletions docs/manuscript/content/02-results.typ
@@ -0,0 +1,105 @@
= Results

#figure(
image(
"figures/methods.svg",
width: 80%
),
caption: [
Methods overview:
a) data splits;
b) models, features, and targets;
c) orthogroup-guided splitting;
d) metrics (across- vs. within-gene performance)
],
) <fig_methods>

== Current genomic deep learning architectures generalize across species

We trained all four architectures on genomic sequence and RNA-seq data from 15 species within the Andropogoneae clade (@fig_methods#text[a]).
Our validation set consisted of the two sampled species closest to _Zea mays_, _Tripsacum zopilotense_ and _Zea diploperennis_, all three of which fall within the Tripsacinae subtribe that diverged a few (0.6--4) million years ago @bib-andropogoneae-divergence @bib-portrait-genus.
Our test set was the 26 inbred parents of the maize NAM population @bib-NAM @bib-NAM-genomes, held out until hyperparameters were frozen.
As input, we extracted 1,026 base pairs upstream of the translation start site to match HyenaDNA's "tiny" configuration (@fig_methods#text[b]).
We trained and evaluated all architectures on two regression tasks, maximum expression across tissues and absolute expression in leaf, as well as two classification tasks, expressed in any tissue and expressed in leaf (@fig_methods#text[b]).
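The promoter-extraction step described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the function name `extract_promoter`, the 0-based `cds_start` coordinate convention, and the N-padding at contig edges are all assumptions.

```python
def extract_promoter(chrom_seq: str, cds_start: int, strand: str, length: int = 1026) -> str:
    """Return `length` bp immediately upstream of the translation start site.

    `cds_start` is the 0-based index of the first base of the start codon on
    `chrom_seq` (uppercase). Windows running off the contig edge are padded
    with 'N' so every example has a fixed length.
    """
    if strand == "+":
        start = cds_start - length
        seq = chrom_seq[max(start, 0):cds_start]
        return "N" * (length - len(seq)) + seq
    # On the minus strand, "upstream" lies to the right of the start codon;
    # take the window and reverse-complement it.
    comp = str.maketrans("ACGTN", "TGCAN")
    seq = chrom_seq[cds_start + 1:cds_start + 1 + length]
    padded = seq + "N" * (length - len(seq))
    return padded.translate(comp)[::-1]
```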

#figure(
image(
"figures/performance_across.svg",
width: 60%
),
caption: [
a--d)
Model performance across all genes and data splits.
Each subfigure shows the performance of all architectures on a single task.
Error bars represent one standard deviation from the mean in each direction.
],
) <fig_performance_across>

Though benchmarking of sequence-based models has been done within humans @bib-benchmark-human and across species in the training set @bib-Basenji @bib-FloraBERT, there has been little evaluation on entirely held out species.
To establish a baseline in plants, we measured performance of all architectures, tasks, genes, and data splits (@fig_performance_across#text[a--d]).
Rankings by Spearman correlation on the test set were inconsistent, except that DanQ performed best or tied closely with the best across all tasks.
Remarkably, DanQ performs only slightly worse (@fig_performance_across#text[a;] $Delta r = 0.09$) than Enformer in a recent within-human single tissue benchmark @bib-personal-transcriptome-variation despite predicting on an unobserved species.
Despite moderate Spearman and Pearson correlations (@fig_performance_across_prsn), DanQ's predictions on the test set are still underwhelming (@fig_DanQ_pred).
We observed test set auROC scores in the any expression task slightly lower than previous results @bib-hai-jacob-cnn on promoter expression classification models trained and tested only within maize.
Taken together, these results suggest that modern genomic deep learning architectures are capable of generalizing almost as well across closely related species as they do within species.
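The across-gene metrics used throughout (Spearman correlation for the two regression tasks, auROC for the two classification tasks) can be computed from first principles; a minimal NumPy sketch is below. The real analysis presumably uses scipy/scikit-learn; this version handles no ties in the rank step and computes auROC via the Mann-Whitney pair-counting identity.

```python
import numpy as np

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def auroc(labels, scores):
    """auROC as the probability a positive example outscores a negative one."""
    labels = np.asarray(labels, bool)
    scores = np.asarray(scores, float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```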

== Data quantity matters more than composition for modeling RNA abundance across species

#figure(
image(
"figures/performance_within_ablation.svg",
width: 95%
),
caption: [
a)
Validation set performance of DanQ on the maximum expression task across varying training set sizes and compositions.
Points on the lines are mean Pearson correlation across replicate training runs.
The standard deviation across replicates is shaded around the mean line.
The exponent scale of the bottom axis is denoted in bold on the right.
b--e)
Distributions of model performance within orthogroups for each task.
Architectures are sorted from highest (left) average within-orthogroup performance to lowest (right).
Bars within the violins represent the mean of the distribution.
],
) <fig_performance_within_ablation>

Despite the growing number of plant genomes with transcriptomic data @bib-plant-gene-atlas, each genome added to the training set increases training time and may give diminishing returns.
We measured changes in DanQ's performance on progressively larger fractions of the training data, iteratively adding sequences from a set of genomes or randomly from across all training genomes.
Pearson correlation on the validation set rises until approximately 200,000 observations when it begins to show diminishing returns for larger training set sizes (@fig_performance_within_ablation#text[a]).
However, the slope remains positive between the half-size and full-size data points, suggesting room for improvement with further observations.
Comparing iterative addition of whole genomes against random sampling of an equivalent number of alleles from the entire training set, substantial differences appear only when using fewer than 8 genomes, with random sampling performing worse than 4 whole genomes.
The ablation results clearly support the use of further data to achieve higher performance across genes, which can come from sequencing additional related species.
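The random-sampling arm of the ablation can be sketched as nested random subsets, so each larger training fraction contains the smaller ones and differences between points reflect added data rather than resampling noise. The fractions, seed, and function name here are illustrative assumptions.

```python
import random

def nested_random_subsets(examples, fractions=(0.125, 0.25, 0.5, 1.0), seed=42):
    """Shuffle once, then take nested prefixes: the 25% subset is contained
    in the 50% subset, which is contained in the full set, and so on."""
    rng = random.Random(seed)  # fixed seed so replicate runs share subsets
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return {f: shuffled[: int(len(shuffled) * f)] for f in fractions}
```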

== Current architectures poorly generalize across individuals of an inbred maize panel

Recent work @bib-personal-transcriptome-variation has shown that current models poorly explain expression variation across individuals.
Since our test set is a collection of maize alleles with an order of magnitude more diversity than humans @bib-maize-HapMap2, we looked at the distribution of test set performance within each orthogroup and expected to see similarly low or even lower performance.
We considered only orthogroups containing at least 20 orthologs, to ensure sufficient sample size for calculating correlation or auROC.
We saw much lower average within-orthogroup Spearman and Pearson correlations as well as auROC compared to the global across-gene metrics, except for the any expression task (@fig_performance_within_ablation#text[b--e;] @fig_performance_within_prsn), which also shows clear differences between architectures.
The average within-orthogroup Spearman correlation in the single tissue regression task ($r = 0.092$) is double that observed with Enformer ($r = 0.045$) in humans.
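The within-orthogroup evaluation reduces to grouping genes by orthogroup, dropping small groups, and averaging a per-group metric. A generic sketch of that helper is below; the function name and pluggable `metric` argument are illustrative, though the minimum-group-size filter mirrors the 20-ortholog threshold used here.

```python
from collections import defaultdict

def mean_within_group_metric(group_ids, y_true, y_pred, metric, min_n=20):
    """Average `metric(truth, prediction)` computed separately within each
    group (e.g. orthogroup), skipping groups below the minimum sample size."""
    groups = defaultdict(lambda: ([], []))
    for g, t, p in zip(group_ids, y_true, y_pred):
        groups[g][0].append(t)
        groups[g][1].append(p)
    vals = [metric(t, p) for t, p in groups.values() if len(t) >= min_n]
    return sum(vals) / len(vals) if vals else float("nan")
```

The same helper works for Spearman correlation, Pearson correlation, or auROC by swapping the `metric` callable.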

#figure(
image(
"figures/interpretation.png",
width: 95%
),
caption: [
Left: Predicted versus observed $attach(log, br: 10)$ expression change in leaf between all NAM ortholog promoter allele pairs within an orthogroup.
Percentages in the middle of each quadrant display the proportion of non-zero data points in that quadrant.
Right: Saliency map for DanQ trained on maximum expression task.
The mean across all B73 genes is plotted as a line, with a single standard deviation shaded above and below.
],
) <fig_interpretation>

As an alternative allele-level comparison, we also looked at how well DanQ predicted expression differences between all pairs of maize ortholog promoter alleles within an orthogroup.
We observed a general positive relationship between the two, but there is still quite a bit of noise in the predictions (@fig_interpretation, left).
The Pearson and Spearman correlation coefficients between the observed and predicted fold changes were only 0.22 and 0.08, respectively.
Strikingly, allele pairs that are two orders of magnitude apart in expression level are still sometimes predicted with the incorrect direction.
Despite current architectures showing promising across gene performance in unobserved species, they still struggle across shorter evolutionary timescales, similar to what was seen in humans @bib-personal-transcriptome-variation.
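The pairwise comparison above can be sketched as follows: for every allele pair in an orthogroup, compute the observed and predicted log10 expression changes and check whether their signs agree (the quadrant proportions in the figure). The `pseudocount` used to keep zero-expression alleles finite is an assumed detail, not taken from the manuscript.

```python
import math
from itertools import combinations

def pairwise_log10_fold_changes(expr_obs, expr_pred, pseudocount=1e-3):
    """Return observed and predicted log10 fold changes for all allele pairs,
    plus the fraction of non-zero pairs predicted with the correct direction."""
    obs_fc, pred_fc = [], []
    for i, j in combinations(range(len(expr_obs)), 2):
        obs_fc.append(math.log10((expr_obs[i] + pseudocount) / (expr_obs[j] + pseudocount)))
        pred_fc.append(math.log10((expr_pred[i] + pseudocount) / (expr_pred[j] + pseudocount)))
    nonzero = [(o, p) for o, p in zip(obs_fc, pred_fc) if o * p != 0]
    agree = sum(1 for o, p in nonzero if o * p > 0)
    direction_acc = agree / len(nonzero) if nonzero else float("nan")
    return obs_fc, pred_fc, direction_acc
```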

== The maximum expression regression model focuses on the core promoter region

Based on theory and prior interpretation work on expression models @bib-AgroNT, we hypothesized our expression models would also pay most attention to the region surrounding the transcription start site.
Looking at the average saliency map for DanQ across all B73 genes on the maximum expression task, we see that DanQ indeed focuses on the core promoter region and the 5' UTR (@fig_interpretation, right).
There is relatively high variance in the gradient around the transcription start site, tapering off with increasing distance, though decaying more slowly in the UTR than in the promoter.
This hyperfocus on the core promoter region could be why DanQ and other architectures struggle to distinguish expression differences in highly related sequences, since functional mutations are less likely to accumulate in this highly constrained region.
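Saliency maps of this kind are the gradient of the model output with respect to the one-hot input, reduced to one value per position. A framework-free finite-difference sketch is below to show the idea; a real pipeline would use autograd (e.g. PyTorch) on the trained DanQ model, and the max-over-channels reduction is one common convention, not necessarily the authors'.

```python
import numpy as np

def saliency_map(model, onehot, eps=1e-4):
    """Finite-difference saliency: |d model(x) / d x| per input entry,
    reduced to one value per position by taking the max over channels.
    `model` maps a (length, channels) array to a scalar prediction."""
    flat = onehot.astype(float)
    grads = np.zeros_like(flat)
    base = model(flat)
    for idx in np.ndindex(flat.shape):
        bumped = flat.copy()
        bumped[idx] += eps  # perturb one position/channel entry at a time
        grads[idx] = (model(bumped) - base) / eps
    return np.abs(grads).max(axis=1)
```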
