A way to get the row/column of a Reader #109

jonas-schievink · 2018-01-23T23:45:40Z

xml-rs provides TextPosition for this, while quick-xml only has Reader::buffer_position() to get the byte position. It would be useful for error reporting if there was a way to compute the row and column from this (preferably without overhead if unused).

The text was updated successfully, but these errors were encountered:

jonas-schievink · 2018-01-24T21:01:20Z

This could be implemented somewhere along these lines:

impl<B: Seek + BufRead> Reader<B> {
    fn row_col(&self) -> (usize, usize) {
        // seek to 0, count lines until `buffer_position` is reached
    }
}

(add a proper return type and a better name to taste)

Of course, this is pretty slow as it re-reads the entire content.

tafia · 2018-01-30T05:30:14Z

This looks like a good solution.
With probably a function renaming to show that this is very expensive.

tafia · 2018-03-01T08:26:22Z

Out of curiousity, @jonas-schievink do you plan on doing a PR?

jonas-schievink · 2018-03-01T15:11:37Z

Not right now, no. If anyone wants to implement this, feel free!

DanielKeep · 2018-03-16T05:43:57Z

I implemented a PositionRead wrapper type that builds a table for mapping byte offsets to line:column positions transparently, then added some extra methods to Reader for doing the lookup. That way, you don't pay anything if you don't use it.

That works fine; the problem is that buffer_position (which I was using to do the lookup) is rubbish for this. Depending on the exact state of the parser, it can end up pointing to before the event, at the start of the event, or even inside of it. That's to say nothing of errors where you just don't have a byte offset to do lookup on.

As such, I've had to abandon quick-xml for my current project. I need to be able to report back to users on where problems are, and I don't have time to do open-heart surgery on quick-xml to improve offset tracking.

I've uploaded my work-in-progress if anyone has a use for it.

tafia · 2018-03-16T06:00:09Z

Thanks for the report.
The buffer_position may indeed need some rework. Do you have some particular tests that are failing at hand?

DanielKeep · 2018-03-16T06:10:25Z

test_multiline_pos in tests/unit_tests.rs was where I more or less threw up my hands. Before the second event, the Reader reports a buffer position of 10, which puts it after the opening < of the <?xml ...

The test_start_end_comment_pos test also has a fudge in it, where the position of the third event should be 1:27, but gets reported as 1:26 because the parser is still at the start of the whitespace before the <a /> element.

At a bare minimum, I think Reader would have to start tracking "last event position" separately from buffer (parsing) position. Maybe also use it for the position of errors. To do it properly, it would also have to start attaching correct positions to things that don't get distinct events (i.e. if there's an error in an attribute value, where is it?), but at that point, we're talking possibly significant API changes.

tafia · 2018-03-16T06:31:02Z

Great. Thank you

tafia · 2018-03-22T05:58:44Z

@DanielKeep I have created #123 which fixes an offset by one for buffer_position. On your branch, the 2 basic pos tests are passing but not the multiline_pos. I have tried to fix it but I think the errors come either from your test or from your changes.

I understand that you may not have time to spend on it but in case, I'll let it open for the moment and merge it later.

Thanks again for your feedback!

EDIT: All your comments about a proper way to manage buffer position are still valid. This is just a fix for the current situation. I'll try to find a way to explicitly manage positions (probably before managing row/columns).

bbigras · 2018-10-04T21:13:02Z

Any progress on this?

Themayu · 2023-02-11T00:26:16Z

I've begun work on something similar, which can possibly be used as a building block for this - adding Span objects to the different quick_xml::event::Bytes* structs to get where they are located in the underlying data source, if applicable. This might be usable as a base to compute the line and column of that event from.

dralley · 2023-02-11T05:30:09Z

That sounds like a decent improvement. Separately on the question of calculating a line + column, it may not be terribly expensive to do SIMD-accelerated passes over the buffer continuously to keep track of how many lines have been seen previously. That would enable calculating the line + column without needing to "Seek" or have the entire document buffered, as you could start counting from the beginning of the current buffer and add it to what has been seen previously.

One issue to look out for is encodings. Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,

Mingun · 2023-02-11T15:33:21Z

Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Spans in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.

Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,

I do not think that this will be a problem because although we work with bytes, we ensure that our boundaries are correct (=not within a character). I think, that users of spans in any case will expect that this numbers are indexes in the original byte stream.

Themayu · 2023-02-11T16:42:05Z

Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Spans in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.

To clarify, does this mean that my work in #552 is unnecessary? Or would you like me to continue with it?

Mingun · 2023-02-11T17:58:13Z

Yes, as I can see, the parts that you've done in #552 I already implemented and I have feelings that they are worked right (but I didn't create tests for that yet, that's the one reason why I didn't make a PR).

I think, it would be better to use my approach, namely that parts:

span is used in PartialEq implementation but ignored in tests where it is not important
Option not needed, span 0..0 is used instead (this should decrease size of span field, and because we can expect that there would be many spans it is better to use as little space as we can)

You are free to finish my work if you like (you can rebase your branch over mine and update your PR if you wish). I'm not ready to finish it myself for now.

I do not insist on using feature-flag, I introduced it because spans can occupy some space and if you does not use them it is just waste of memory, but maybe the savings are not so great.

So the things that's should be implemented:

spans in attributes
spans in errors
tests
benchmark it to see if we need to hide it under feature-flag

Themayu · 2023-02-11T19:21:29Z

(you can rebase your branch over mine and update your PR if you wish).

How would I go about rebasing a branch on my fork over a branch on a different fork? Sorry, I'm not entirely well-versed in git usage; I've sort of been learning it as I go along.

span is used in PartialEq implementation but ignored in tests where it is not important

The primary reason for ignoring span in PartialEq was so that the addition of spans to the crate wouldn't break previously-working comparisons between Bytes* instances that came from different locations, such as comparing one from a reader with one constructed via new(), in transitive dependencies (if used in the dependent crate / other transitive dependencies) or even the dependent crate (if used in transitive dependencies). That it also made the tests work was a happy side-effect, but not the main intention.

Mingun · 2023-02-11T19:42:50Z

If you on Windows, then you're probably using TortoiseGit, then just use RMB-click on my commit in commit log and select Rebase "parser_position_tracking" onto this...(G). The other GUI clients should have the similar ways of doing that. If your prefer command-line, then I think you already known how to find required information.

wouldn't break previously-working comparisons between Bytes* instances that came from different locations,

I cannot imagine a reason why someone would be need to compare two events to compare their content or at all. I think we should not worry about that. PartialEq implemented mostly for our own purposes for tests. Of course, if you known the reason...

Mingun added enhancement help wanted labels May 21, 2022

phillord mentioned this issue Jul 8, 2022

Errors should contain buffer positions #411

Closed

Mingun linked a pull request Feb 11, 2023 that will close this issue

Parser position tracking #552

Draft

22 tasks

Mingun mentioned this issue Jul 25, 2023

Error Location similar to Serde JSON or TOML Spanned? #625

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A way to get the row/column of a Reader #109

A way to get the row/column of a Reader #109

jonas-schievink commented Jan 23, 2018

jonas-schievink commented Jan 24, 2018

tafia commented Jan 30, 2018

tafia commented Mar 1, 2018

jonas-schievink commented Mar 1, 2018

DanielKeep commented Mar 16, 2018

tafia commented Mar 16, 2018

DanielKeep commented Mar 16, 2018

tafia commented Mar 16, 2018

tafia commented Mar 22, 2018 •

edited

Loading

bbigras commented Oct 4, 2018

Themayu commented Feb 11, 2023 •

edited

Loading

dralley commented Feb 11, 2023 •

edited

Loading

Mingun commented Feb 11, 2023 •

edited

Loading

Themayu commented Feb 11, 2023 •

edited

Loading

Mingun commented Feb 11, 2023

Themayu commented Feb 11, 2023 •

edited

Loading

Mingun commented Feb 11, 2023

A way to get the row/column of a Reader #109

A way to get the row/column of a Reader #109

Comments

jonas-schievink commented Jan 23, 2018

jonas-schievink commented Jan 24, 2018

tafia commented Jan 30, 2018

tafia commented Mar 1, 2018

jonas-schievink commented Mar 1, 2018

DanielKeep commented Mar 16, 2018

tafia commented Mar 16, 2018

DanielKeep commented Mar 16, 2018

tafia commented Mar 16, 2018

tafia commented Mar 22, 2018 • edited Loading

bbigras commented Oct 4, 2018

Themayu commented Feb 11, 2023 • edited Loading

dralley commented Feb 11, 2023 • edited Loading

Mingun commented Feb 11, 2023 • edited Loading

Themayu commented Feb 11, 2023 • edited Loading

Mingun commented Feb 11, 2023

Themayu commented Feb 11, 2023 • edited Loading

Mingun commented Feb 11, 2023

tafia commented Mar 22, 2018 •

edited

Loading

Themayu commented Feb 11, 2023 •

edited

Loading

dralley commented Feb 11, 2023 •

edited

Loading

Mingun commented Feb 11, 2023 •

edited

Loading

Themayu commented Feb 11, 2023 •

edited

Loading

Themayu commented Feb 11, 2023 •

edited

Loading