Speed up OSM processing some more#7583
Conversation
# Conflicts: # application/src/main/java/org/opentripplanner/osm/model/OsmEntity.java
|
@VillePihlava @abyrd Could I interest you in reviewing this one as well? |
|
I can take a look |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev-2.x #7583 +/- ##
==========================================
Coverage 72.54% 72.54%
- Complexity 21266 21271 +5
==========================================
Files 2380 2380
Lines 86400 86427 +27
Branches 8486 8489 +3
==========================================
+ Hits 62675 62701 +26
+ Misses 20728 20727 -1
- Partials 2997 2999 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
VillePihlava
left a comment
There was a problem hiding this comment.
Nice improvements. Should private final Map<String, String> tags; in OsmEntity be defined as non-nullable? This way the use of isTagless would always indicate a performance concern.
| @Nullable | ||
| private final Map<String, String> tags; |
There was a problem hiding this comment.
Would it make sense to make this non-nullable? Constantly calling the isTagless function feels redundant. Something like a static default empty map that this gets initialized to when no tags are present. Then the isTagless function can be used in places where it is necessary for performance reasons, not for null checking.
There was a problem hiding this comment.
This is a micro(?)-optimization. If you use a Map you still have to call toLowerCase and compute the hash of the string.
There was a problem hiding this comment.
I know that using null is not super-idiomatic in our current code base (I'm its biggest hater) but in this case it's worth it for performance.
There was a problem hiding this comment.
If it makes sense for performance reasons, I think its fine. Maybe a comment should be added every time it is used as a performance optimization?
There was a problem hiding this comment.
Another way would be to create a custom map implementation that null checks behind the scenes. Do whatever you think makes sense
| @Nullable | ||
| private final Map<String, String> tags; |
There was a problem hiding this comment.
If it makes sense for performance reasons, I think its fine. Maybe a comment should be added every time it is used as a performance optimization?
| @Nullable | ||
| private final Map<String, String> tags; |
There was a problem hiding this comment.
Another way would be to create a custom map implementation that null checks behind the scenes. Do whatever you think makes sense
VillePihlava
left a comment
There was a problem hiding this comment.
LGTM! Lets consider at some point whether the tag.toLowerCase() calls are really necessary.
| geometry = GeometryUtils.getGeometryFactory().createLineString( | ||
| segmentCoordinates.toArray(new Coordinate[0]) | ||
| ); | ||
| geometry = GeometryUtils.makeLineString(segmentCoordinates.toArray()); |
There was a problem hiding this comment.
If only Java had compound value types... any day now.
| /** | ||
| * Does the entity contain any tags at all? | ||
| */ | ||
| final boolean isTagless() { |
There was a problem hiding this comment.
My understanding from other review comments is that the nullable tags field and this function exist largely to avoid String.toLowercase calls. As far as I know, OSM tags are case-sensitive and the vast majority of tag keys are lower case. Any non-lower-case version of a typically lower-case key would be considered a new nonstandard key and show up as such in the tag stats. The case insensitivity in OTP appears to be an adaptation to poor data quality in the early era of OTP, and may date all the way back to #500. I suspect OSM data quality has increased significantly in the intervening 15 years.
Would it be possible to performance-test a version using a canonical empty map instead of null and eliminating all toLowerCase calls and all null checks? I do wonder if it would be just as fast and significantly simpler.
There was a problem hiding this comment.
Okay, this appears to bother both you and @VillePihlava quite a bit, so let me give a long form explanation.
For well over a decade null was used in the OSM code to mean "no tags". I started using the empty map in my previous OSM PR #7537 but later noticed that there was a performance problem with it. For this reason, I re-established the previous state of using null. I understand that it's not super-idiomatic to how we use collections nowadays. I don't have exact numbers but I could see a small speed up when switching back to null.
Besides, it's a well-encapsulated variable and the null never escapes this class.
Your point about lowercasing is correct, but when I tried switching it off I noticed that a test failed so I never pursued that avenue any further. There are a handful of tags we use like ref:IFOPT (https://www.openstreetmap.org/way/27558650) that are frequently written in the incorrect case and the lowercasing is still needed somewhere.
It could probably be removed in the hottest inner loop, but it requires some risk management and careful investigation.
In any case, that would be a functional change that I would like to keep separate from a performance optimization that aims to keep everything as it was before.
Given all of the above, I politely request that you let me keep the method/null value as-is.
If you feel that it ought to be done, I can do that in a follow-up.
There was a problem hiding this comment.
There is one more aspect to isTagless: it is used in some complex methods that check a variety of tags to short-circuit the computation. This is a significant speed-up.
There was a problem hiding this comment.
No, use of null references doesn't particularly bother me. My review was a comment, not a request for changes to your PR. I had the impression complexity was being introduced just to continue supporting case-insensitive behavior in a place that is by specification supposed to be case-sensitive, and if the case-insensitive behavior was unnecessary maybe the problem could be eliminated. Since you've already considered this and decided case-insensitive behavior still needed, no change is necessary.
Summary
Just as in my previous PR, I am rewriting some hot code paths in the OSM processing code.
The optimizations boil down to:
Streams with some manual processing, thereby lowering allocationsOsmNode, where the vast majority doesn't have any tags at all.CoordinateAll of this yields about a 20-25% speed up.