This is the repository for the 2022-2023 Harvey Mudd Clinic Team in collaboration with the Harvard Center for Computational Biomedicine (CCB).
- Description
- Getting Started
- Dependencies
- Assignment 1
- Assignment 2
- Assignment 3
- Assignment 4
- Assignment 5
- Assignment 6
- Authors
- Acknowledgements
At Harvard CCB, researchers are pioneering the study of various biological and spatial genomic datasets using computational methods. These high-resolution datasets, collected using imaging techniques, can be quite large. Most existing workflows rely mainly on Python and R, which are hard to use effectively on such memory-intensive datasets. We aim to leverage relational database queries in SQL to improve scalability, add the flexibility to analyze larger datasets, and eventually find additional underlying spatial relationships in the original data.
To run our scripts and follow along with our process, you'll need to have the following installed.
- Python
- Some Python packages:
  - pandas
  - tqdm
- Azure Data Studio
- Git
- Docker
Assignment 1 is an introduction to SQL Server consisting of a Coursera course on Relational Databases and a few corresponding exercises.
For a breakdown of each step in assignment 1, see the assignment 1 README.
Assignment 2 focuses on a few exercises with queries in SQL Server in order to gain practice in using the tools we learned about in assignment 1. The assignment uses some flight data and asks us to use queries to find information such as which plane logged the most flight miles.
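As a rough illustration of the kind of query this assignment involves, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for SQL Server so that it runs anywhere. The table name `flights`, the columns `tail_number` and `distance_miles`, and the sample rows are all hypothetical and are not taken from the assignment data.

```python
import sqlite3


def plane_with_most_miles(rows):
    """Return (tail_number, total_miles) for the plane with the most miles.

    `rows` is a list of (tail_number, distance_miles) tuples. The schema is
    a hypothetical stand-in for the assignment's flight data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights (tail_number TEXT, distance_miles REAL)")
    conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)
    # Sum the distance per plane, then keep only the largest total.
    # SQLite uses LIMIT 1; in SQL Server the same idea is written SELECT TOP 1.
    result = conn.execute(
        "SELECT tail_number, SUM(distance_miles) AS total_miles "
        "FROM flights "
        "GROUP BY tail_number "
        "ORDER BY total_miles DESC "
        "LIMIT 1"
    ).fetchone()
    conn.close()
    return result


# Made-up sample rows, for illustration only.
sample = [("N101", 500.0), ("N202", 1200.0), ("N101", 900.0), ("N303", 300.0)]
print(plane_with_most_miles(sample))
```

The aggregation pattern (`GROUP BY` followed by ordering on the aggregate) carries over to SQL Server; only the row-limiting syntax differs.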
For a breakdown of each step in assignment 2, see the assignment 2 README.
Assignment 3 consists of two subtasks: the first to read and present on recent reviews in spatially-resolved omics profiling, and the second to practice working with spatial omics data in SQL Server. This repository will focus only on the second subtask.
For a breakdown of each step in this subtask of assignment 3, see the assignment 3 README.
Assignment 4 serves as a transition into working with spatial data. We are tasked with analyzing two tables: one containing weather data along with the latitude and longitude of each weather station, and one containing geographical information. Our goal is to answer questions such as which stations in Massachusetts are the windiest, or which station in Washington is the rainiest, by performing spatial intersect queries on the tables.
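To make the idea of a spatial intersect concrete, the sketch below tests which station points fall inside a region polygon and then picks the windiest one. It is plain Python rather than SQL Server, and it uses a standard ray-casting point-in-polygon test, which is not necessarily what the database does internally. The field names (`lon`, `lat`, `wind_speed`), the rectangular region, and the station values are all made up for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon`.

    `polygon` is a list of (x, y) vertices in order. Points exactly on the
    boundary are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def windiest_station_in(region, stations):
    """Return the station inside `region` with the highest wind speed, or None."""
    inside = [s for s in stations if point_in_polygon(s["lon"], s["lat"], region)]
    return max(inside, key=lambda s: s["wind_speed"], default=None)


# Hypothetical rectangular region (not a real state boundary) and stations.
region = [(-73.5, 41.2), (-69.9, 41.2), (-69.9, 42.9), (-73.5, 42.9)]
stations = [
    {"name": "A", "lon": -71.0, "lat": 42.3, "wind_speed": 12.0},
    {"name": "B", "lon": -72.5, "lat": 42.1, "wind_speed": 18.5},
    {"name": "C", "lon": -122.3, "lat": 47.6, "wind_speed": 25.0},  # outside
]
print(windiest_station_in(region, stations)["name"])
```

Station C has the highest wind speed overall but lies outside the region, so the spatial filter excludes it before the maximum is taken.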
For a breakdown of each step of assignment 4, see the assignment 4 README.
Assignment 5 finally brings our attention to spatial transcriptomics data in SQL Server. We are given multiple subtasks, such as creating a new gene-cell-molecule count table, reshaping that table into a gene expression matrix, and creating convex hulls around every molecule in a given cell.
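The first two subtasks can be sketched as follows, again using Python's built-in `sqlite3` as a stand-in for SQL Server. The `molecules` table, its columns `cell_id` and `gene`, and the sample rows are hypothetical. The reshape is done in Python here for simplicity; in SQL Server one option would be the `PIVOT` operator, though this README does not say which approach the team used.

```python
import sqlite3


def count_table(molecules):
    """Build a gene-cell-molecule count table.

    `molecules` is a list of (cell_id, gene) tuples, one per detected
    molecule. Returns rows of (gene, cell_id, n_molecules).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE molecules (cell_id INTEGER, gene TEXT)")
    conn.executemany("INSERT INTO molecules VALUES (?, ?)", molecules)
    rows = conn.execute(
        "SELECT gene, cell_id, COUNT(*) AS n_molecules "
        "FROM molecules "
        "GROUP BY gene, cell_id "
        "ORDER BY gene, cell_id"
    ).fetchall()
    conn.close()
    return rows


def expression_matrix(counts):
    """Reshape long-format counts into a nested dict: matrix[gene][cell_id]."""
    genes = sorted({gene for gene, _, _ in counts})
    cells = sorted({cell for _, cell, _ in counts})
    # Start from zeros so gene-cell pairs with no molecules are still present.
    matrix = {gene: {cell: 0 for cell in cells} for gene in genes}
    for gene, cell, n in counts:
        matrix[gene][cell] = n
    return matrix


# Made-up molecules: (cell_id, gene).
sample = [
    (1, "geneA"), (1, "geneA"), (1, "geneB"),
    (2, "geneA"), (2, "geneB"), (2, "geneB"), (2, "geneB"),
]
counts = count_table(sample)
print(counts)
print(expression_matrix(counts))
```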
For a breakdown of each step of assignment 5, see the assignment 5 README.
You may also follow along in our assignment 5 notebook.
Assignment 6 continues the ideas of Assignment 5, but with a significantly larger dataset of tissue images from 26 mouse hypothalamuses. This dataset is not currently publicly available but was provided for our use. With this larger dataset, we repeated the objectives of Assignment 5 on an institutional computer cluster: we created a molecule count table and generated convex hulls around the molecules belonging to cells in the first z-slice.
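For readers unfamiliar with convex hulls, the sketch below shows the geometric idea in plain Python: keep only the molecules in one z-slice, group them by cell, and compute a hull per cell. It uses Andrew's monotone chain algorithm, chosen here for brevity; SQL Server's spatial types provide their own convex hull operations (for example the `STConvexHull` method), and this README does not specify which the team used. The tuple layout `(cell_id, x, y, z)`, the assumption that the first z-slice is `z == 0`, and the sample coordinates are all hypothetical.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop while the last two points and p do not make a left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def hulls_by_cell(molecules, z_slice=0):
    """Map cell_id -> convex hull of its molecules in the given z-slice."""
    by_cell = defaultdict(list)
    for cell_id, x, y, z in molecules:
        if z == z_slice:
            by_cell[cell_id].append((x, y))
    return {cell: convex_hull(pts) for cell, pts in by_cell.items()}


# Made-up molecules: (cell_id, x, y, z).
molecules = [
    (1, 0, 0, 0), (1, 2, 0, 0), (1, 2, 2, 0), (1, 0, 2, 0),
    (1, 1, 1, 0),  # interior point, dropped from the hull
    (1, 5, 5, 1),  # different z-slice, ignored
    (2, 0, 0, 0), (2, 1, 0, 0), (2, 0, 1, 0),
]
print(hulls_by_cell(molecules))
```

Filtering by z-slice before grouping keeps each hull two-dimensional, which matches the description of working only with the first z-slice.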
For a breakdown of each step of assignment 6, see the assignment 6 README.
You may also follow along in our assignment 6 notebook.
Tim Buchheim
Ludwig Geistlinger
Robert Gentleman
Rafael Goncalves
Tyrone Lee
Jeffrey Moffitt
Nathan Palmer
Sunil Poudel
Sam Pullman
Chris Stone