diff --git a/README.md b/README.md
index 275481a..215616a 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,158 @@
 
 A toolkit for working with FASTQ files, written in Rust.
 
+Currently `fqtk` contains a single tool, `demux`, for demultiplexing FASTQ files based on sample barcodes.
+`fqtk demux` can be used to demultiplex one or more FASTQ files (e.g. a set of R1, R2 and I1 FASTQ files) with any number of sample barcodes at fixed locations within the reads.
+It is highly efficient and multi-threaded for high performance.
+
+Usage for `fqtk demux` follows:
+
+```console
+Performs sample demultiplexing on FASTQs.
+
+The sample barcode for each sample in the metadata TSV will be compared against
+the sample barcode bases extracted from the FASTQs, to assign each read to a
+sample. Reads that do not match any sample within the given error tolerance
+will be placed in the ``unmatched_prefix`` file.
+
+FASTQs and associated read structures for each sub-read should be given:
+
+- a single fragment read (with inline index) should have one FASTQ and one read
+  structure
+- paired end reads should have two FASTQs and two read structures
+- a dual-index sample with paired end reads should have four FASTQs and four read
+  structures given: two for the two index reads, and two for the template reads.
+
+If multiple FASTQs are present for each sub-read, then the FASTQs for each
+sub-read should be concatenated together prior to running this tool (e.g.
+`zcat s_R1_L001.fq.gz s_R1_L002.fq.gz | bgzip -c > s_R1.fq.gz`).
+
+Read structures are made up of `<number><operator>` pairs much like the `CIGAR`
+string in BAM files. Four kinds of operators are recognized:
+
+1. `T` identifies a template read
+2. `B` identifies a sample barcode read
+3. `M` identifies a unique molecular index read
+4. `S` identifies a set of bases that should be skipped or ignored
+
+The last `<number><operator>` pair may be specified using a `+` sign instead of
+number to denote "all remaining bases".
+This is useful if, e.g., fastqs have
+been trimmed and contain reads of varying length. Both reads must have template
+bases. Any molecular identifiers will be concatenated using the `-` delimiter
+and placed in the given SAM record tag (`RX` by default). Similarly, the sample
+barcode bases from the given read will be placed in the `BC` tag.
+
+Metadata about the samples should be given as a headered metadata TSV file with
+two columns: 1. `sample_id` - the id of the sample or library. 2. `barcode` - the
+expected barcode sequence associated with the `sample_id`.
+
+The read structures will be used to extract the observed sample barcode, template
+bases, and molecular identifiers from each read. The observed sample barcode
+will be matched to the sample barcodes extracted from the bases in the sample
+metadata and associated read structures.
+
+An observed barcode matches an expected barcode if all the following are true:
+
+1. The number of mismatches (edits/substitutions) is less than or equal to the
+   maximum mismatches (see --max-mismatches).
+2. The difference between number of mismatches in the best and second best
+   barcodes is greater than or equal to the minimum mismatch delta
+   (`--min-mismatch-delta`). The expected barcode sequence may contain Ns,
+   which are not counted as mismatches regardless of the observed base (e.g.
+   the expected barcode `AAN` will have zero mismatches relative to both the
+   observed barcodes `AAA` and `AAN`).
+
+## Outputs
+
+All outputs are generated in the provided `--output` directory. For each sample
+plus the unmatched reads, FASTQ files are written for each read segment
+(specified in the read structures) of one of the types supplied to
+`--output-types`.
+
+FASTQ files have names of the format:
+
+{sample_id}.{segment_type}{read_num}.fq.gz
+
+where `segment_type` is one of `R`, `I`, and `U` (for template, barcode/index
+and molecular barcode/UMI reads respectively) and `read_num` is a number starting
+at 1 for each segment type.
+
+In addition a `demux-metrics.txt` file is written that is a tab-delimited file
+with counts of how many reads were assigned to each sample and derived metrics.
+
+## Example Command Line
+
+As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index
+reads both reading a sample barcode, as well as an in-line 8bp sample barcode in
+read one, the command line would be:
+
+fqtk demux \
+    --inputs r1.fq.gz i1.fq.gz i2.fq.gz r2.fq.gz \
+    --read-structures 8B92T 8B 8B 100T \
+    --sample-metadata metadata.tsv \
+    --output output_folder
+
+Usage: fqtk demux [OPTIONS] --inputs ... --read-structures ... --sample-metadata --output
+
+Options:
+  -i, --inputs ...
+          One or more input fastq files each corresponding to a sequencing (e.g. R1, I1)
+
+  -r, --read-structures ...
+          The read structures, one per input FASTQ in the same order
+
+  -b, --output-types ...
+          The read structure types to write to their own files (Must be one of T, B,
+          or M for template reads, sample barcode reads, and molecular barcode reads)
+
+          [default: T]
+
+  -s, --sample-metadata
+          A file containing the metadata about the samples
+
+  -o, --output
+          The output directory into which to write per-sample FASTQs
+
+  -u, --unmatched-prefix
+          Output prefix for FASTQ file(s) for reads that cannot be matched to a sample
+
+          [default: unmatched]
+
+      --max-mismatches
+          Maximum mismatches for a barcode to be considered a match
+
+          [default: 1]
+
+  -d, --min-mismatch-delta
+          Minimum difference between number of mismatches in the best and second best barcodes
+          for a barcode to be considered a match
+
+          [default: 2]
+
+  -t, --threads
+          The number of threads to use.
+          Cannot be less than 3
+
+          [default: 8]
+
+  -c, --compression-level
+          The level of compression to use to compress outputs
+
+          [default: 5]
+
+  -S, --skip-reasons
+          Skip demultiplexing reads for any of the following reasons, otherwise panic.
+
+          1. `too-few-bases`: there are too few bases or qualities to extract given the
+             read structures. For example, if a read is 8bp long but the read structure
+             is `10B`, or if a read is empty and the read structure is `+T`.
+
+  -h, --help
+          Print help information (use `-h` for a summary)
+
+  -V, --version
+          Print version information
+```
+
 ## Installing
 
 ### Installing with `conda`