Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug for header field getting malformed when it extends over multiple lines #111

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 19 additions & 3 deletions tests/test_http.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,12 +28,28 @@ def test_headers_raw_dict_none(self):
self.assertIsNone(headers_dict_to_raw(None))

def test_headers_raw_to_dict(self):
raw = b"Content-type: text/html\n\rAccept: gzip\n\r\
Cache-Control: no-cache\n\rCache-Control: no-store\n\n"
dct = {b'Content-type': [b'text/html'], b'Accept': [b'gzip'],
raw = b'\r\n'.join((b"Content-type: text/html",
b"Accept: gzip",
b"Cache-Control: no-cache",
b"Cache-Control: no-store"))
dct = {b'Content-type': [b'text/html'], b'Accept': [b'gzip'],
b'Cache-Control': [b'no-cache', b'no-store']}
self.assertEqual(headers_raw_to_dict(raw), dct)

def test_headers_raw_to_dict_multiline(self):
raw = b'\r\n'.join((b'Content-Type: multipart/related;',
b' type="application/xop+xml";',
b'\tboundary="example"',
b'Cache-Control: no-cache'))
dct = {
b'Content-Type': [
b'\r\n'.join((b'multipart/related;',
b' type="application/xop+xml";',
b'\tboundary="example"'))
],
b'Cache-Control': [b'no-cache']}
self.assertEqual(headers_raw_to_dict(raw), dct)

def test_headers_dict_to_raw(self):
dct = OrderedDict([
(b'Content-type', b'text/html'),
Expand Down
12 changes: 11 additions & 1 deletion w3lib/http.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,17 @@ def headers_raw_to_dict(headers_raw):

if headers_raw is None:
return None
headers = headers_raw.splitlines()

headers = []
for line in headers_raw.split(b'\r\n'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why limit line boundaries to \r\n?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that the specification (RFC 2616 cited above or RFC7230 which appears more recent) states that in HTTP messages, lines are folded specifically by CRLF (\r\n) and not by LF (\n). Not following this had lead to an issue where a header field value containing a LF (\n) was accidentally interpreted as a multiline field value. Without leading at least one space or tab char in the following line, however, the header ended up being interpreted as malformed and the lines were simply discarded when parsed out.

This is rather pathological and use of LF within a field value feels like a bad idea, but it doesn't appear to violate the specification so some HTTP messages with that structure exist.

Copy link
Member

@Gallaecio Gallaecio Aug 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave .splitlines() nonetheless, though, given the current information we have, to be lenient with servers that may not follow the standard.

If later it turns out by allowing certain forms of line break we are misinterpreting header values, then we can change this accordingly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose that the function headers_raw_to_dict take in an optional parameter strict with the default value that does not change the existing behavior. This way, the code works as is, but would be ready for those who want a stricter interpretation of the HTTP standard via a simple switch.

I think exposing this switch to the users requires a similar change at the user-facing API, but that should be done in a separate PR, if desired.

if line.startswith(b' ') or line.startswith(b'\t'):
try:
headers[-1] += (b'\r\n' + line)
except IndexError:
raise ValueError('Malformed raw headers')
else:
headers.append(line)

headers_tuples = [header.split(b':', 1) for header in headers]

result_dict = {}
Expand Down