Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code to preserve original linefeeds (issue #121) #131

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

techtonik
Copy link
Contributor

This is pseudo code - it may work or may not. I haven't got chance to test it yet

@techtonik techtonik mentioned this pull request Oct 19, 2015
@takluyver
Copy link
Contributor

Nice, I like this idea

# detect linefeeds
lineends = {'\n':0, '\r\n':0, '\r':0}
lines = []
for line in open(filename, 'rb'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this work the same way on Python 2 and 3, use:

io.open(filename, newline='')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But file needs to be opened in binary more or else the information about lineends will be lost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the newline='' means it will do no conversion of line endings. Binary mode has a much bigger effect on Python 3, because it means you're dealing with bytes rather than strs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But binary mode guarantees that Python 3 won't bail out with UnicodeDecodeError. How to address that? I can't know the file encoding before opening it as Python 3 requires.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. There are ways around that, like using errors='ignore' to skip characters that can't be decoded, but it may be easier to use binary mode here. In that case, though, you'll need to update the checks below to use bytes, e.g. if line.endswith(b'\r\n').

@techtonik
Copy link
Contributor Author

Well, Python 3 is a pain. But it should work for now on Python 2. I may be able to finish this later.

newline = [x for x in lineends if lineends[x] != 0][0]
if os.linesep != newline:
with open(filename, 'wb') as f:
for line in lines:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using the lines that you read before the file was rewritten, so this will undo the changes modernize made. You need to base it on new_text, or re-read the file after it is changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! Fixed in 7787615

@techtonik
Copy link
Contributor Author

I am not sure I can work on a test. Not today. Basically, the test need to create two files - one with LF linefeeds and another with CRLF. And after fix is applied, check that linefeeds didn't change. So either produce 4 files (input and output) or copy/paste the function being tested to the test case (which is no good). The files are also can not be committed to Git, because it messes with linefeeds.

@takluyver
Copy link
Contributor

I added machinery for testing line endings in #130 (actually, it's a modification of @daira's machinery in #129). We can reuse that for this, although the actual tests will need to be slightly different.

CRLF = '\r\n'
CR = '\r'
else:
LF = bytes('\n', encoding='ascii')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use the b prefix, like b'\n'. It's valid syntax on Python 2.6 and above, which is what we support. And then you don't need an if/else, because on Python 2, b'\n' == '\n'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@daira
Copy link
Contributor

daira commented Oct 20, 2015

This isn't correct as-is, but I like the basic approach. I will work on it tomorrow.

@graingert
Copy link
Member

@techtonik I think this feature would be best in https://github.com/jreese/fissix

@techtonik
Copy link
Contributor Author

@graingert I see this project became derived from fissix, which need more docs to understand what is latest lib2to3 and what are enhancements.

@graingert
Copy link
Member

@techtonik not much has changed, but you can see what enhancements are applied here:
https://github.com/jreese/fissix/blob/main/scripts/update.sh

Everything goes via black + a git merge from CPython master

@jreese maybe a changelog will help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants