dlo edited this page Dec 4, 2014 · 6 revisions

Goals

  • Handle most edge cases
  • Use as many of Git's internals as possible (and avoid introducing new command vocabulary!)
  • Allow for "prettified" views for specific filetypes (e.g., for images, use something like jp2a to render an ASCII preview that can be viewed in source control)
  • Future-proof
  • No symlinks
  • As little configuration as possible

General Idea

When you stage a file into Git from your local filesystem, we only need the file's hash to retrieve that file from a remote backend (like S3 or Rackspace Cloud Files). We can't depend upon the Git hash, since Git is only going to store the SHA of the file as Git sees it (i.e., after it's cleaned).
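
A minimal sketch of the key computation, assuming SHA-256 as the content hash (the filename and contents here are arbitrary):

```shell
# The backend key is a plain hash of the raw working-tree bytes,
# computed before Git's clean filter rewrites the file into a stub.
f=$(mktemp)
printf 'raw image bytes' > "$f"
sha256sum "$f" | cut -d' ' -f1    # the key used to address the remote object
```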

Overview

git notes are the perfect conduit for tracking the information we care about, since 1) they can be attached to arbitrary Git data and 2) they don't interfere with the working tree. bigstore will attach notes to the blobs of the cleaned objects.

Here's what we'll be tracking:

  • Who uploaded what file to what backend (s3, rackspace), and when.
  • Who downloaded what file when.

I'll go over each command in further detail below.

git bigstore init

Set up the repo and check if any remote notes already exist.
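
One possible shape for this step, sketched in a scratch repository (registering the clean/smudge filters via git config is an assumption; the description above only requires setting up the repo and checking for remote notes):

```shell
# Work in a scratch repository for the demo; the real command runs
# inside your existing repo.
cd "$(mktemp -d)" && git init -q

# Register the clean/smudge filters referenced by .gitattributes
# (these are standard Git filter configuration keys).
git config filter.bigstore.clean  "git bigstore filter-clean"
git config filter.bigstore.smudge "git bigstore filter-smudge"

# Local object store for the real file contents.
mkdir -p .git/bigstore/objects

# With a remote configured, check for existing bigstore notes:
#   git ls-remote origin refs/notes/bigstore
```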

git bigstore filter-clean

Read the first line of the input. If the line is just "bigstore", the file is already cleaned; pass stdin straight through to stdout.

If it needs to be cleaned, calculate the hash of the file's contents, and output the following to stdout:

bigstore
sha256
96e31e44688cee1b0a56922aff173f7fd900440f

Copy the object's contents to ".git/bigstore/objects/96/e31e44688cee1b0a56922aff173f7fd900440f".
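
The steps above might be sketched as a shell function, assuming SHA-256 and the three-line stub shown (`bigstore_clean` is a hypothetical name, not the actual implementation):

```shell
# Buffer stdin, hash it, stash the real contents, and emit the stub.
bigstore_clean() {
    tmp=$(mktemp)
    cat > "$tmp"
    if [ "$(head -n 1 "$tmp")" = "bigstore" ]; then
        cat "$tmp"    # already cleaned: pass through unchanged
    else
        hash=$(sha256sum "$tmp" | cut -d' ' -f1)
        dir=".git/bigstore/objects/$(printf %s "$hash" | cut -c1-2)"
        mkdir -p "$dir"
        cp "$tmp" "$dir/$(printf %s "$hash" | cut -c3-)"
        printf 'bigstore\nsha256\n%s\n' "$hash"
    fi
    rm -f "$tmp"
}
```

Note that cleaning an already-cleaned file is a no-op, which is what lets Git run the filter repeatedly without corrupting anything.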

git bigstore filter-smudge

Read the first line of the file. If the line is just "bigstore", read the next two lines: the name of the hashing function, then the hash itself.

Assume the hash is "96e31e44688cee1b0a56922aff173f7fd900440f". We then check if a file exists at ".git/bigstore/objects/96/e31e44688cee1b0a56922aff173f7fd900440f". If not, just output stdin to stdout. If it does exist, output the contents of this file to stdout.
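
A matching sketch of the smudge side, assuming the three-line stub layout emitted by the clean filter (`bigstore_smudge` is a hypothetical name):

```shell
# Inspect stdin; if it is a stub and the object is present locally,
# restore the real contents, otherwise pass the input through.
bigstore_smudge() {
    tmp=$(mktemp)
    cat > "$tmp"
    if [ "$(head -n 1 "$tmp")" = "bigstore" ]; then
        hash=$(sed -n 3p "$tmp")
        obj=".git/bigstore/objects/$(printf %s "$hash" | cut -c1-2)/$(printf %s "$hash" | cut -c3-)"
        if [ -f "$obj" ]; then
            cat "$obj"    # restore the real file contents
        else
            cat "$tmp"    # not downloaded yet: keep the stub
        fi
    else
        cat "$tmp"        # not a bigstore file
    fi
    rm -f "$tmp"
}
```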

git bigstore push

The key here is to avoid pushing something that already exists on the remote backend.

Fetch bigstore notes from remote and merge in if necessary.

git fetch origin refs/notes/bigstore:refs/notes/bigstore-remote
git notes --ref=bigstore merge -s cat_sort_uniq refs/notes/bigstore-remote

For each file that needs to be uploaded (we parse .gitattributes to find the paths whose filter is set to bigstore), check the corresponding note to see if there is an entry indicating the file has already been uploaded to the backend we want to push to.

(optional) Check the remote API to see if the file exists there, just to be sure.

Upload the file. Once this is completed, append the action taken to the blob's notes. For an initial upload to Amazon S3, this might look like:

git notes --ref bigstore append a2d55dfc4acfc3c887a79cc1a6b24b6819415873 -m "1365047350.708227	upload	s3	Dan Loewenherz <[email protected]>"

Now update the notes remotely.

git push origin refs/notes/bigstore:refs/notes/bigstore
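
The per-file upload logic above might look roughly like this (`push_one` is a hypothetical helper; the actual transfer to the backend is elided):

```shell
# push_one PATH: upload one cleaned file unless the notes say it's done.
push_one() {
    path=$1; backend=s3
    tab=$(printf '\t')
    blob=$(git rev-parse ":$path")                 # blob SHA of the cleaned stub
    hash=$(git cat-file -p "$blob" | sed -n 3p)    # backend key (used by the real upload)
    # Skip if the notes already record an upload to this backend.
    if git notes --ref bigstore show "$blob" 2>/dev/null \
            | grep -q "${tab}upload${tab}${backend}${tab}"; then
        return 0
    fi
    # ... upload the object keyed by "$hash" from .git/bigstore/objects here, then:
    git notes --ref bigstore append "$blob" -m "$(printf '%s\tupload\t%s\t%s <%s>' \
        "$(date +%s)" "$backend" "$(git config user.name)" "$(git config user.email)")"
}
```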

git bigstore pull

This command takes no arguments. The goal is to sync everything, i.e., to download everything stored remotely.

Data integrity here is really important, so we have to run commands in a way that prevents needless overwriting or corruption of data. I think the following steps will make this happen.

First, fetch bigstore notes from remote and merge them locally.

git fetch origin refs/notes/bigstore:refs/notes/bigstore-remote
git notes --ref=bigstore merge -s cat_sort_uniq refs/notes/bigstore-remote

Go through all files matched by the bigstore filter patterns in .gitattributes. Check whether the relevant hashes are already in .git/bigstore/objects. If not, download the file from the remote backend. For every download, append something like the following to the object's notes. In this case, a2d55dfc4acfc3c887a79cc1a6b24b6819415873 is the object's SHA.

git notes --ref bigstore append a2d55dfc4acfc3c887a79cc1a6b24b6819415873 -m "1365048117.588694	download	s3	Dan Loewenherz <[email protected]>"

When all downloads are completed, update the remote notes.

git push origin refs/notes/bigstore:refs/notes/bigstore
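
The per-file download check can be sketched the same way (`pull_one` is a hypothetical helper; the backend download itself is elided):

```shell
# pull_one PATH: fetch one file's contents if not already present locally.
pull_one() {
    path=$1; backend=s3
    blob=$(git rev-parse ":$path")                 # blob SHA of the cleaned stub
    hash=$(git cat-file -p "$blob" | sed -n 3p)    # backend key from the stub
    obj=".git/bigstore/objects/$(printf %s "$hash" | cut -c1-2)/$(printf %s "$hash" | cut -c3-)"
    if [ -f "$obj" ]; then
        return 0                                   # already downloaded
    fi
    # ... download "$hash" from the backend into "$obj" here, then:
    git notes --ref bigstore append "$blob" -m "$(printf '%s\tdownload\t%s\t%s <%s>' \
        "$(date +%s)" "$backend" "$(git config user.name)" "$(git config user.email)")"
}
```

Checking the local object store first is what makes pull safe to re-run: a file already on disk is never re-fetched or overwritten.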

git bigstore log

Displays a history of the given path. We first go through the Git history of the path, and pull out the blob for each revision. Then, for each blob, we parse the notes and display them in a pretty way.

$ git bigstore log my/static/image.png
added, pushed to s3 (11 hours ago) <Dan Loewenherz>
pulled from s3 (8 hours ago) <Somebody Cool>
pulled from s3 (2 hours ago) <Another Person>
updated, pushed to s3 (11 hours ago) <Dan Loewenherz>

Here's how we get all the relevant blobs for the path.

$ git log --pretty=format:'%T' -- "$1" \
  | while read TREE; do
      git ls-tree -r "$TREE" | grep "	$1$" | awk '{ print $3 }'
    done

And to get the relevant notes for each blob:

$ git notes --ref bigstore show $SHA
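
Each note line is tab-separated (timestamp, action, backend, user), so pretty-printing is a small transform (a sketch; relative-time formatting like "2 hours ago" is omitted, and `format_note` is a hypothetical name):

```shell
# Format one note line: fields are timestamp, action, backend, user.
format_note() {
    awk -F'\t' '{
        verb = ($2 == "upload") ? "pushed to" : "pulled from"
        printf "%s %s %s\n", verb, $3, $4
    }'
}
```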