Byte Order Mark characters included in output? #33

Open
BillyTom opened this issue Mar 20, 2014 · 1 comment

Comments

@BillyTom

The CSV file I am importing is encoded in UTF-8 and starts with the byte order mark "EF BB BF", which shows up as "" when the bytes are decoded with a single-byte encoding such as Windows-1252. (see http://de.wikipedia.org/wiki/Byte_Order_Mark)

These are non-printing characters and generally don't show up in the output. However, they can make a difference if you are making a string comparison.

For example, my first column in the first row looks like this:

array(12) {
  [0]=>
  string(16) "location-ID"
  [1]=>
  string(5) "value"
  [2]=>
    ...

As you can see, the character count is a bit off because of the non-printing characters. Other columns are not affected. Only the very first column in the very first row shows this behaviour.

I've tried several different config options (->setToCharset('UTF-8') etc.) in order to strip those unwanted characters, but none of them worked.

My csv-file contains several special characters like äöü or ß which are all displayed correctly, so I am positive that the input is decoded correctly.

It is not a big deal to manually remove those unwanted characters in the interpreter, but I was wondering if this was a bug in goodby/csv.
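
For anyone hitting the same thing, the manual removal mentioned above could look roughly like the sketch below. It assumes PHP 7 or later, and `stripUtf8Bom()` is a hypothetical helper name, not something goodby/csv provides. It only shows why the comparison fails and how stripping the three bytes from the first field fixes it.

```php
<?php
// Hypothetical helper (not part of goodby/csv): remove a leading UTF-8 BOM, if present.
function stripUtf8Bom(string $value): string
{
    $bom = "\xEF\xBB\xBF"; // the three BOM bytes EF BB BF

    // Only strip when the value actually starts with the BOM.
    if (strncmp($value, $bom, 3) === 0) {
        return substr($value, 3);
    }

    return $value;
}

// Simulated first field of the first row, as it might arrive from the parser.
$header = "\xEF\xBB\xBF" . 'location-ID';

var_dump($header === 'location-ID');               // bool(false): the BOM bytes are still there
var_dump(stripUtf8Bom($header) === 'location-ID'); // bool(true)
```

Since only the first column of the first row is affected, the helper would only need to be applied to that one field.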

@judgej

judgej commented Apr 17, 2014

In a similar fashion, I am looking for support to generate the BOM characters when exporting. Those three characters seem to be the only way to tell MS Excel what encoding the file uses. I'll raise it as a separate issue when I have more details, but I'm noting it here so it does not get lost. For export, setFromCharset() could be used to set what the (optional) BOM looks like, and it should not need to be paired with a setToCharset() call if no conversion is needed.
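
Until something like that exists in the library, a workaround might be to write the BOM by hand before the CSV rows. The sketch below is only an illustration under assumptions: it uses PHP's built-in `fputcsv()` instead of the goodby/csv exporter, `writeCsvWithBom()` is a hypothetical name, and it assumes PHP 7.1 or later with data that is already UTF-8.

```php
<?php
// Hypothetical helper (not part of goodby/csv): write rows as CSV, preceded by a UTF-8 BOM.
function writeCsvWithBom($handle, array $rows): void
{
    // Emit the three BOM bytes once, at the very start of the output.
    fwrite($handle, "\xEF\xBB\xBF");

    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly rather than relying on defaults.
        fputcsv($handle, $row, ',', '"', '\\');
    }
}

// Write to an in-memory stream so the result can be inspected.
$handle = fopen('php://memory', 'w+');
writeCsvWithBom($handle, [
    ['location-ID', 'value'],
    ['1', 'äöü'],
]);

rewind($handle);
$contents = stream_get_contents($handle);
fclose($handle);

// The output starts with the BOM, followed directly by the header row.
var_dump(substr($contents, 0, 3) === "\xEF\xBB\xBF");
```

The same idea should carry over to a file handle opened with `fopen($path, 'w')`.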
