Keep fastq id #2

ilveroluca · 2014-07-03T10:36:21Z

I wanted to propose this change to the other developers. There are a couple of little problems that emerge when using FastqLoader and FastqStorer to manipulate fastq files whose id lines don't follow the Illumina format:

- there's no way to retrieve the original id line from the fastq records being read;
- there's no way to write an arbitrary fastq id line either.

Here is an example where this is an issue in practice: I wanted to convert base quality values from Illumina to Sanger encoding in a few gigabytes of fastq files provided by a sequencing center. The id lines were not in the "structured" Illumina format, so I had no way to manipulate the data while keeping the same read ids. With this patch that's now possible.
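To make the use case concrete, here is a minimal Python sketch of that kind of conversion. It is not SeqPig code: the function names are hypothetical, and it assumes the input qualities are Phred scores with an ASCII offset of 64 (Illumina 1.3+ style), re-encoded with the Sanger offset of 33. The point is that the id line is carried through untouched.

```python
def illumina_to_sanger(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Assumption: the input uses Phred scores with ASCII offset 64.
    """
    converted = []
    for ch in qual:
        score = ord(ch) - 64  # Phred score under the assumed Illumina offset
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 quality character")
        converted.append(chr(score + 33))  # same score, Sanger offset
    return "".join(converted)


def convert_record(id_line: str, seq: str, qual: str) -> str:
    """Format a four-line fastq record, keeping the id line verbatim.

    `id_line` is expected to include its leading '@'.
    """
    return f"{id_line}\n{seq}\n+\n{illumina_to_sanger(qual)}\n"
```

For instance, `convert_record("@read_1/1", "ACG", "hhh")` keeps `@read_1/1` as the first line regardless of whether it follows the Illumina id layout.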

For the most part, this change is backwards compatible. It adds a new field "id" to the tuples generated by FastqLoader, which can easily be ignored. There is one small thing, though, that might cause some surprises: if the "id" field exists in the tuple passed to FastqStorer, it will be used as the id line rather than constructing one from the metadata ("id" will be passed to FastqOutputFormat as the key, rather than passing null). I suspect this isn't a big issue, but I wanted to get some feedback before committing.
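The storer-side rule described above can be sketched roughly as follows. This is an illustration in Python, not the actual FastqStorer or FastqOutputFormat code; the dictionary keys other than "id" are hypothetical names, and the metadata-based line is a simplified stand-in for the Illumina-style id.

```python
def choose_id_line(record: dict) -> str:
    """Pick the id line for a fastq record.

    If the record carries a non-null "id", use it verbatim; otherwise
    fall back to building one from the metadata fields.
    """
    explicit_id = record.get("id")
    if explicit_id is not None:
        return explicit_id

    # Hypothetical metadata field names, for illustration only.
    coords = ("instrument", "run_number", "flow_cell_id",
              "lane", "tile", "xpos", "ypos")
    location = ":".join(str(record[name]) for name in coords)
    return f"{location} {record['read']}"
```

A script that wants the old behaviour could drop or null out the "id" field before storing, so that the fallback branch is taken.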

AndreSchumacher · 2014-07-22T20:17:32Z

@ridvandongelci did you have a chance to look at this?

ridvandongelci · 2014-07-23T06:54:23Z

Yes, it is OK for me. I was wondering what you think.

AndreSchumacher · 2014-07-24T06:10:08Z

It looks good to me, except that previously the Fastq and Qseq loaders used the same schema, which simplified the manual (e.g., 5.2.3 FastqLoader and QseqLoader).

@ilveroluca any chance we could keep both loader/storer pairs similar by adding this to Qseq, too?

ilveroluca · 2014-07-24T09:02:24Z

Ah...I didn't think of that. I could change the schemas used by QseqLoader
and QseqStorer. The question is, what should "id" mean for the qseq
format? It doesn't exist there.

An option would be to have QseqLoader create an artificial id by joining
the coordinate values (the first columns, up to the read number).
QseqStorer on the other hand would have to ignore the value and throw it
out.

I don't know...it feels like a wart. What if we argued that people should
be projecting their data anyways? :-)
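For reference, the artificial id floated above might look something like this minimal Python sketch. It is purely illustrative: the function name and the "_" separator are assumptions, and it reads "up to the read number" as including the read number, i.e. the first eight of the eleven tab-separated qseq columns.

```python
def qseq_artificial_id(qseq_line: str, sep: str = "_") -> str:
    """Join the leading qseq columns (machine through read number) into an id."""
    fields = qseq_line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 qseq columns, got {len(fields)}")
    # Columns 1-8: machine, run, lane, tile, x, y, index, read number.
    return sep.join(fields[:8])
```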

AndreSchumacher · 2014-07-25T05:39:02Z

OK, then let's not shoehorn this onto Qseq if it has no id. @ilveroluca, could you just update doc/seqpig_reference.tex to break out the Qseq loader/storer part and copy-paste it into its own section, so that the manual stays accurate? Thanks!!

ilveroluca added 2 commits July 3, 2014 12:12

Return original id line with fastq tuple

c5bdb60

If present in the tuple, use the "id" field as the fastq id line

6aa96b1

ilveroluca added 2 commits August 7, 2014 10:09

Update comment

6f7b3e8

Update SeqPig reference to document new fastq "id" field

3da8b77
