Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skipping Lines without side effects #20

Open
JoachimTuchel opened this issue Jan 19, 2021 · 3 comments
Open

Skipping Lines without side effects #20

JoachimTuchel opened this issue Jan 19, 2021 · 3 comments

Comments

@JoachimTuchel
Copy link

JoachimTuchel commented Jan 19, 2021

I am not sure if the change I am suggesting here is going beyond CSV Reading. I will explain anyways, since it is not a big change to NeoCSVReader but made my life a lot easier.

First, let me explain the problem:
We need to read transactions from a number of sources that claim to export CSV data. In fact, most of the sources export a body of CSV data augmented with a few header lines and some also add a few lines to the end of the file. This is mostly information that displays nicely in Excel, and may or may not be CSV at all. Most of the time it is not, or it has another number of columns than the "payload CSV" (I'll stick to this term from now on, and I mean the real CSV data in the file.

In most cases, all we know is the number of header lines and the number of footer lines after the payload. The total number of lines in the file is of course variable.
What sounds easy, is easy in many cases. Open a ReadStream, ignore the first X lines and start Parsing CSV ... until we are Y lines before the end of the file....

And this is where things get complicated.
Skipping a line sounds easy, but in CSV, you may have quoted fields which contain a line break. So you cannot simply use a Stream's #nextLine. It would cut a line off at every crlf or lf or cr. This leads to a total mess.

NeoCSVReader has #readHeader, which handles this quoted stuff quite easily. But #readHeader has a side effect: it sets the fieldCoun based on the number of fields it found in a line.

So what I needed was a way to skip a line, respecting possible line breaks in quoted fields and do nothing else.

It turns out that is easy. I extracted just the reading part from #readHeader into its own method and I use it from #readHeader but also for my skipping purposes.

So I changed #readHeader to:

readHeader
	"Read a record, presumably a header and return the header field names.
	This should normally be called only at the beginning and only once.
	This sets the fieldCount (but fieldAccessors overrides fieldCount)."

	| names |
	names := self readNextLineAsArray.
	fieldCount := names size.
	^names

by extracting the Array streamContents: part to its own method named #readNextLineAsArray:

readNextLineAsArray
	^Array streamContents: 
			[:out | 
			[self atEnd or: [self readEndOfLine]] whileFalse: 
					[out nextPut: self readField.
					(self readSeparator and: [self atEnd or: [self peekEndOfLine]]) 
						ifTrue: [out nextPut: emptyFieldValue]]]

So now I have a way to read any line in a Reader's stream, and I can use the contents or ignore them.
There is just one more thing I had to do to make my case work: I needed a way to let NeoCSVReader reset the position of its Stream:

resetStream
        "go back to psotion 0 of the readStream, e.g. when parsing the Stream in a two-pass process"
	readStream reset.

These three changes to NeoCSVReader allowed me to parse an only partially conforming CSV file and skip the header and footer lines correctly. Here is some pseudo-code for this import job:

importFrom: stream
       "this is more or less pseudo code"

	reader := self csvReader on: stream.
	
	"Count the number of lines in the Stream, taking into account quoted fields with crlf inside 
	This is not a good solution for huge files, of course"
	
	noOfLines := 0.
	[reader atEnd] whileFalse: [
            reader readNextLineAsArray. "We don't care about the contents in this case" 
            noOfLines := noOfLines +1].
	reader resetStream.
	
	"Ignore the header lines"
	self noOfHeaderLines timesRepeat: [reader readNextLineAsArray].

	noOfPayloadLines := noOfLines - (self noOfHeaderLines ifNil: [0])- (self noOfFooterLines  ifNil: [0]).



	resultObjects := OrderedCollection new.
	
	[noOfPayloadLines timesRepeat: [resultObjects add: reader next]  ]
		on: Error
		do: [:ex |
			Error signal: 'CSV import failed with: ' , ex description.
			ex exitWith: nil].
 

What do you think about these changes? Do they make sense for more than my case? Would it make sense to include such an extension to NeoCSVReader or is this too much of noise around CSV parsing?

@svenvc
Copy link
Owner

svenvc commented Jan 20, 2021

Hi,

Thx for the detailed report.
I am very busy for the rest of the month, but I will get back to this issue later.

Sven

@svenvc
Copy link
Owner

svenvc commented May 13, 2021 via email

@svenvc
Copy link
Owner

svenvc commented May 13, 2021

There is now the following commit:

0acc227

which should address a couple of issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants