way_nodes #18
Replies: 9 comments
-
In the new model we'll still have to deal with the problem of having to create a "fake" history, so that people can access the OSM history data with the new model. I believe we did something like this when we removed segments. So we have to go through the data and pretend we had the new model all along and "invent" the version numbers as they would have been. This will be rather cumbersome, but we only have to do it once. (We could also just ignore this and tell people they have to use the old software for the old data and start with version 1 again with the new data, but I expect some people will not be too happy about this.)
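To make the "invent the version numbers" step a bit more concrete, here is a minimal sketch (nothing that exists in any codebase); it assumes we already have the edit timestamps of a way and of its member nodes available, and simply interleaves them into one synthetic version sequence:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WayVersion:
    way_id: int
    version: int        # synthesized version number
    timestamp: str      # ISO timestamp of the change that caused this version

def synthesize_way_history(way_id: int,
                           real_way_edit_times: List[str],
                           member_node_edit_times: List[str]) -> List[WayVersion]:
    """Pretend the new model existed all along: every edit to a member node
    becomes an extra way version, interleaved with the real way edits."""
    timeline = sorted(set(real_way_edit_times) | set(member_node_edit_times))
    return [WayVersion(way_id, version, ts)
            for version, ts in enumerate(timeline, start=1)]

# Example: one real way edit plus two later node moves -> three synthesized versions.
for v in synthesize_way_history(
        42,
        real_way_edit_times=["2013-01-01T00:00:00Z"],
        member_node_edit_times=["2014-06-01T12:00:00Z", "2015-03-10T08:30:00Z"]):
    print(v)
```

In reality the merge would also have to deal with same-timestamp edits and deleted nodes, which is where most of the cumbersome work mentioned above would go.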
-
Yes, the old approach („we have removed past history data from the database“) won't work today; rewriting the complete history seems unavoidable. But this has rather heavy side effects. I'm looking at this primarily from an Overpass API point of view, in particular the attic use case, which powers Achavi, but also OSMCha to display changesets. As it works now, you would start with a 2012 planet file and subsequently apply minutely diffs covering about 10 years of changes. This process typically takes a fair amount of time (think "weeks" rather than "days"). Using a full history planet instead unfortunately doesn't work as of today. For this use case to continue to work, we would have to recreate millions of minutely/hourly/daily diffs, pretending the new data model existed all the way back to 2012, publish them on https://planet.openstreetmap.org/ - and then start fresh with a new database. By the way, we'd also have to recreate the changeset diff files, as the "number of changes" field per changeset would no longer be in sync due to the deleted nodes and synthesized way versions.
-
At least for the way node topic, we could probably provide something useful for consumers in an API 0.7 format while the rest of the world is still on API 0.6. I don't think it's required for everyone to switch at the same time, and I don't think it would be feasible either.
This needs some extra disk space (fairly cheap compared to the overall effort). However, data consumers would benefit very early in the process, without needing to switch to something new again further down the road. They can also gradually switch to the new version and provide early feedback. If things go really wrong, stopping the process at step 4 would still be possible without causing major harm to the overall ecosystem. Also worth noting: we would never delete untagged nodes in API 0.6 and push that change to consumers (it's way too much data!). Instead, the new format doesn't include untagged nodes right from the start. Consumers will need to start fresh with the new format. // cc: @gravitystorm
-
First of all, thank you for the observation. I remember that at SotM we came to the conclusion that forcibly counting up way versions when their nodes get new versions would be possible, i.e. that this would be an acceptable computation effort on the API consistency check side. I take this as an assumption for the following. We could then first start to retrospectively compute the proper version numbers for the ways under the new scheme and then continue to use them. I.e.
Incremental changes with reduced complexity have a better chance to succeed than a big bang. If there is upfront communication that there will be a number of changes, along with a rough roadmap, then we will surely overwhelm fewer people. Version numbers, as opposed to time and trust, are an unlimited resource.
-
@drolbr: I'm not exactly sure I'm following your reasoning. If we have to compute a new version numbering way by way anyway in step 4 (which I assume includes all current and historic versions of all ways), what would be the added benefit of forcing this constraint on editing applications so early in the process, if we can do all of this by an automated process instead? I'm a bit worried that you're planning with 3 different API versions in step 4. I think I really need to start looking for another maintainer for CGImap ;)
-
Because we can. The comment is not a high-level implementation, but considerations about how much we can disentangle things to avoid complexity. The dependency is that as long as we accept non-conforming changesets, we cannot have the final renumbering for the ways. I.e. as long as non-conforming changesets are allowed, we must ask the history data consumers to be prepared for renumbering over and over again, and keep some renumbering logic within the database. With tons of edge cases, many of which would never materialize except at a moment when we cannot respond to them. Basically we have:
The only two other approaches I see so far are:
- Offline numbering with a forced downtime. As it looks for now, this simply doesn't have enough support in the community, because too few mappers make the connection between impaired history viewers and way version numbering. This may kill the mood for making changes to the API at all.
- Jumping way versions on changeset upload. Our API spec allows us to do that, but I expect it to be more effort, because it must either be fast enough to do it for every way in the changeset on the fly on the server side, or have some bookkeeping and a strategy for if and when to recalculate a way's history.
-
What I can say is that in my proposal, my main focus is to provide a solution for consumers like rendering engines or routing engines, which don't really care about a way version number but want a small planet file as early as possible. Also, it would give applications like Overpass API enough time to prepare a separate full planet database well ahead of the switch to the next API version. This can all happen without touching anything else, but it needs a second copy of the current database tables, rewritten way versions, and logic to keep both in sync. I think you can do that, e.g. in CGImap inside the same database transaction, without facing an endless number of special cases. I think our approaches are fairly different and need much more discussion. We should take a closer look at our different stakeholders, what issues they have, and most importantly, how we address their issues. It's also important to have a good understanding of effort, impact, and timelines for each stakeholder, to make an informed decision about what works best for everyone. One point where I definitely agree with you is that we want to do the area stuff later (I forgot to mention that earlier on). What I'd really like to see is more maintainers of different producing and consuming apps, and of course the Rails port, taking part in the discussion. I'm not exactly clear on what they think about those proposals, and unfortunately, none of the SotM discussions are documented anywhere. Needless to say, there's also no recording available, e.g. for the BoF session.
-
Thank you for your comment. I suggest the following two next steps:
Now to the most likely incomplete list of things to do and a rough estimate of effort:
- Catering for pure geodata consumers, i.e. producing a current planet and minute updates where old states don't matter. Version numbers and short-lived data should not bother these consumers at all, and even if they do, these consumers must be prepared to accept jumping version numbers. Resolving the nodes to coordinates is somewhat computationally expensive, but no special cases are involved, and several implementations for that task exist.
- Enforcing way versions for node changes during upload. First of all, there are two implementations of the code: I do expect that the OWG wants to keep the Rails port in addition to CGImap, which does the real work. There might be a third implementation in the form of the individual element uploads, which probably has substantially different code from the bulk upload. One half of the work is to check that all ways referred to in the bulk upload get extra versions if their nodes are touched. The other, more expensive half is to resolve the backlinks from nodes to ways and create new versions for them. There are a couple of possible solutions on the database level for this, although no straightforwardly preferred one. However, few to no special cases exist, only the usual suspects of deleted objects on either end, but the database schema may have constraints that prevent these special cases from existing.
- Renumbering way versions. This is the corner-case monster. The underlying reason is that the coordinates of a way are simply not a properly defined concept in the data model. There are different node versions with the exact same timestamp: shall they trigger multiple way versions? What if different nodes change at the same timestamps? (See the sketch after this list.) Then, given that the timestamps are treated as totally reliable but were quite off before CGImap, there are many way states that existed as a result of applying a minute diff but no longer exist even in a history built according to the doctrine of timestamps, because of out-of-order timestamps for the referred nodes. In the end, the way versions are something that is not computed by implied logic but rather set by a single source of truth, to overcome the ambiguities mentioned above. I do think this should be live for as little time as possible, and run once ahead of time to check whether the outcome produces new artifacts on top of the historic ones, then re-run to produce the final way version numbers once we are assured that we have fewer artifacts afterwards than before. If this is run live along with uploads, then there is a certain chance that bugs in this code will derail data users (editors and consumers) by producing unexpected and/or impossible states.
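As a tiny illustration of the same-timestamp ambiguity from the last item (not taken from any implementation), one possible policy is to collapse all member node edits that share a timestamp into a single synthesized way version:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input: (node_id, node_version, timestamp) for all member nodes of one way.
NodeEdit = Tuple[int, int, str]

def group_node_edits_by_timestamp(edits: List[NodeEdit]) -> Dict[str, List[NodeEdit]]:
    """One possible policy: all node edits sharing a timestamp collapse into a single
    synthesized way version. Whether that is the right policy is exactly the open question."""
    grouped: Dict[str, List[NodeEdit]] = defaultdict(list)
    for edit in edits:
        grouped[edit[2]].append(edit)
    return dict(grouped)

edits = [
    (1001, 2, "2016-05-01T10:00:00Z"),
    (1002, 3, "2016-05-01T10:00:00Z"),   # same timestamp, different node
    (1001, 3, "2016-05-02T09:15:00Z"),
]
# Two distinct timestamps -> two synthesized way versions under this policy, not three.
print(len(group_node_edits_by_timestamp(edits)))
```

Whether one, two or three synthesized versions are "correct" here is exactly the kind of decision that needs a single source of truth rather than implied logic.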
-
The reason I mentioned geodata consumers as one of the very first potential users is that they don't need to expose version numbers to end users (which would be confusing if different tools reported different version numbers for the same thing). This doesn't mean that we shouldn't aim for planet files and minutely diffs, which are equally useful for other tools, such as history viewers. I don't really like the concept of jumping version numbers: our optimistic locking depends on those version numbers, and it's very easy to break things.
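For readers less familiar with the 0.6 upload semantics, a toy illustration of why jumping version numbers and optimistic locking don't mix; the `Conflict` class below is only a stand-in for the HTTP 409 the API returns on a version mismatch:

```python
class Conflict(Exception):
    """Stand-in for the HTTP 409 the API returns on a version mismatch."""

def apply_edit(current_version: int, edit_version: int) -> int:
    # Optimistic locking: the editor must send the version it last downloaded.
    if edit_version != current_version:
        raise Conflict(f"expected version {current_version}, got {edit_version}")
    return current_version + 1

# If the server silently bumped a way from 5 to 7 (a "jumping" version),
# an editor still holding version 5 can no longer upload its change:
try:
    apply_edit(current_version=7, edit_version=5)
except Conflict as e:
    print(e)   # expected version 7, got 5
```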
Yes, our current approach is that we want maximum compatibility, and all new CGImap features need to be implemented in the Rails port as well. CGImap is the acceleration layer, the Rails port the reference implementation. That doesn't mean that both implementations are exactly the same (they aren't).
Single element upload hasn't been implemented in CGImap so far. It could easily be added by wrapping the single element XML in an OsmChange message on the fly and processing that message via the diff upload code path. If we decide to have any 0.7 logic in CGImap only, that would be one of the very first topics to look at. We might also decide that we want to discontinue the single object endpoints altogether, as they cause various issues in this scenario. In particular, they could lead to an inflation of new way version numbers if people update one node after the other and the API has no choice but to create a new way version for each API request. It would be best if people uploaded as much data as possible at once, ideally with as few uploads as possible. In general, the Rails port reuses the single object processing for the diff upload by iterating over all objects in the message. CGImap works differently: it looks at a number of objects in the OsmChange message and processes them all at once. That's the reason why objects in older (Rails generated) changesets have increasing timestamps (see https://www.openstreetmap.org/api/0.6/changeset/34535555/download as an example).
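A minimal sketch of the wrapping idea, assuming the single-element payload is a plain 0.6 XML fragment; the function below is hypothetical and not CGImap code:

```python
import xml.etree.ElementTree as ET

def wrap_in_osmchange(element_xml: str, action: str = "modify") -> str:
    """Wrap a single <node>/<way>/<relation> element in an osmChange envelope,
    so the existing diff-upload code path can process it."""
    osm_change = ET.Element("osmChange", version="0.6", generator="single-element-wrapper")
    action_el = ET.SubElement(osm_change, action)   # <create>, <modify> or <delete>
    action_el.append(ET.fromstring(element_xml))
    return ET.tostring(osm_change, encoding="unicode")

node = '<node id="123" version="2" changeset="456" lat="51.5" lon="-0.1"/>'
print(wrap_in_osmchange(node))
# <osmChange version="0.6" ...><modify><node id="123" ... /></modify></osmChange>
```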
When deleting nodes, we already check today whether those nodes are still referenced by ways, so performance-wise this shouldn't be an issue. Maybe we need an additional exclusive lock on those ways, in case we want to generate new way versions when nodes are being moved and need a consistent view of the data.
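Roughly, the existing reference check plus the hypothetical extra lock could look like the following against the 0.6 apidb schema (assuming the usual `current_ways` / `current_way_nodes` tables and a psycopg2-style cursor); this is only a sketch, not what CGImap actually runs:

```python
def lock_ways_referencing_nodes(cur, node_ids):
    """Find ways that still reference any of the given nodes and lock their rows,
    so that new way versions could be created against a consistent view."""
    cur.execute(
        """
        SELECT id
          FROM current_ways
         WHERE id IN (SELECT way_id
                        FROM current_way_nodes
                       WHERE node_id = ANY(%s))
           FOR UPDATE
        """,
        (list(node_ids),))
    return [row[0] for row in cur.fetchall()]
```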
That's a good question, and I'm not in a position to answer it right away. Today's most prominent creator of such nodes is StreetComplete; in the past it was frequently wheelmap_visitor. However, all those nodes typically have tags and tend to be POIs. Unless they were also part of a way, they would be out of scope for this exercise. I'm not sure if there are many examples of such nodes where multiple node versions share the same timestamp and would have an impact on way versions. Needs further investigation.
I mentioned the slowly moving timestamp example for one changeset upload earlier on. One thing we definitely don't want to do is to create separate way versions for each intermediate timestamp. Those timestamps are merely an artifact of a slow Rails implementation, whereas the mapper only wanted to move a given way by 100m. I think it would be important to distill the original intention out of the data, and create new way versions where it is reasonable. This will be very challenging, and I'm expecting several iterations until this yields the desired results. I see your point about making those changes a bit more explicit by forcing new way versions. Still, ways only refer to node ids without a version, and the timestamp doesn't have to be the same for node and way changes. In the end, it's still somewhat of a heuristic you need to apply here, even with forced way versions. Besides, the API would already know which ways are impacted when moving nodes by looking up the respective ways in the APIDB 0.6 database. Based on this information alone, you could already create new ways in a separate APIDB 0.7 in this case. Let's consider the forced way versions for a moment, which you want to introduce when moving untagged nodes. I see two alternative options here:
Based on this assumption, the forced way version won't add any new information which the API doesn't already have when analyzing the upload. The API can simply generate the forced way version on its own, without bothering the editing apps with it. I think we really need an in-depth analysis of actual data to better assess how bad this issue really is, and which approaches actually work.
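A sketch of the "the API can generate the forced way version on its own" idea, with in-memory maps standing in for the real tables; all names here are made up for illustration:

```python
from typing import Dict, Iterable, Set

def bump_ways_for_moved_nodes(moved_node_ids: Iterable[int],
                              node_to_ways: Dict[int, Set[int]],
                              way_versions: Dict[int, int],
                              ways_in_upload: Set[int]) -> Dict[int, int]:
    """Server-side forced way versions: every way that references a moved node
    and was not already modified in this upload gets exactly one extra version."""
    bumped: Dict[int, int] = {}
    for node_id in moved_node_ids:
        for way_id in node_to_ways.get(node_id, set()):
            if way_id in ways_in_upload or way_id in bumped:
                continue                    # this way already gets a new version
            way_versions[way_id] += 1
            bumped[way_id] = way_versions[way_id]
    return bumped

# Node 7 moves; way 300 references it but is not part of the upload -> bumped once.
print(bump_ways_for_moved_nodes(moved_node_ids=[7],
                                node_to_ways={7: {300}},
                                way_versions={300: 4},
                                ways_in_upload=set()))   # {300: 5}
```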
That's why I want to take editors completely out of the picture until the consumer side of the house has reasonably stabilized. There's simply too much risk if we touch both editors and consumers at the same time. Forcing way versions is an incompatible, breaking API change, which we must avoid at this stage. If you start with the consumers, you can start small with a selected group of applications which are willing to experiment and reload everything if things go wrong (yes, this will happen, no matter how much testing you do upfront). I doubt that it will be sufficient to tell consumers to "just" renumber their objects. In the end, they also need to delete lots of nodes. We would not send those out via minutely diffs, but simply omit them right from the beginning in the new format. My current assumption is that reloading is unavoidable in many cases. If clever consumers download two planets (0.6 and 0.7), do a side-by-side comparison, and update their data based on that, it would be their responsibility to make sure it works. Once the process has stabilized further, more consumers might want to take advantage of the new format. If consumers are happy, and editors have implemented the new API in the meantime and have done lots of testing in a test environment, we also bring them to the new world.
-
I believe at one point the study proposes to apply this logic to the way_nodes table, which also includes all previous versions of a way.
As node versions may change independently of way versions, it's not clear to me how the concept of generating new way versions should be applied in this case.
Renumbering past way versions, and introducing new (artificial) way versions, or discarding some of the node versions are the only options I can think of. However, both have rather hefty side effects, which are not discussed in much detail in the study.
Applying the concept to the current (v1 only?) plus any future changes only might defeat the purpose of getting rid of billions of untagged nodes. Maybe in reality it would be good enough to leave non-current versions alone, and only focus on the current version?
Hence my question: what would be the suggested approach to deal with this way/node version mismatch?
By the way, Overpass API (almost) solves this issue by introducing timestamp-based intermediate versions every time a node is moved while the way version stays the same. I'm sure this approach wouldn't work for the API.
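For readers unfamiliar with that trick, a rough sketch of the idea (not the actual Overpass implementation): geometry states are keyed by timestamp within an unchanged way version, so a node move simply adds another timestamped intermediate state:

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical store: way id -> chronologically ordered (timestamp, way_version, geometry).
WayStates = Dict[int, List[Tuple[str, int, list]]]

def record_node_move(states: WayStates, way_id: int, timestamp: str, geometry: list) -> None:
    """The way version stays the same, but a new timestamped intermediate state is appended."""
    last_version = states[way_id][-1][1]
    states[way_id].append((timestamp, last_version, geometry))

def geometry_at(states: WayStates, way_id: int, timestamp: str) -> list:
    """Return the geometry that was valid at the given point in time."""
    entries = states[way_id]
    idx = bisect_right([ts for ts, _, _ in entries], timestamp) - 1
    return entries[idx][2]

states: WayStates = {42: [("2020-01-01T00:00:00Z", 3, [(51.50, -0.10), (51.51, -0.11)])]}
record_node_move(states, 42, "2020-06-01T00:00:00Z", [(51.50, -0.10), (51.52, -0.12)])
print(geometry_at(states, 42, "2020-03-01T00:00:00Z"))  # still the January geometry
```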