
[mkgmap-dev] Increasing the performance of whole-planet splits.

From Chris Miller chris_overseas at hotmail.com on Thu Jun 10 23:21:37 BST 2010

Hello Scott,

> Hello! I would like to submit the first of a set of performance
> optimizations to the
> splitter that improve the performance and reduce memory usage.

Fantastic!

> Perhaps the biggest change is that I've made the splitter accept my
> binary OSM format.

It'll be a few days before I have time to look at this in much detail and 
give you my thoughts but from what I can see it sounds promising. We actually 
use protobufs at my job for shunting around vast amounts of data so I'm pretty 
familiar with the pros and cons of them. One concern is that even if your 
binary format becomes reasonably standardised in the OSM world we'll still 
need to maintain .osm support for the foreseeable future. Perhaps that can 
be tackled by swapping out the existing cache format with your protobuf format. 
Even then I'm not sure about the longer term; tackling some of the 
harder problems with the splitter will require more than just a linear protobuf 
or cache format. Data structures such as B-tree indexes and so on will likely 
be required, though I'm sure we can cross that bridge when we come to it.

> some other optimizations and patches. The performance numbers I quote
> assume that I was using the binary format; I am unsure what benefit
> they have on trunk.

From what I've seen so far of your patches, most of them look beneficial 
to the trunk too :)

> In IntIntMap, the current offset value for rehashing is way too low,

This is a very good catch, I'll get this checked in. Your 'node too many' 
fix sounds like a good one too, as do the 'infinite recursion' fixes. I'll 
get these checked in as soon as I've had a chance to test them.
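For anyone following along, here's a rough sketch of why that offset 
matters in an open-addressing map like IntIntMap (hypothetical code, not 
the splitter's actual implementation, and I'm reading the 'offset' as the 
collision probe step): if the step is too small, colliding keys bunch into 
long contiguous runs and lookups degrade towards a linear scan.

    class IntIntMapSketch {
        private static final int FREE = Integer.MIN_VALUE; // empty-slot sentinel
        private final int[] keys = new int[1 << 20];       // power-of-two capacity
        private final int mask = keys.length - 1;

        IntIntMapSketch() {
            java.util.Arrays.fill(keys, FREE);
        }

        // Find the slot for 'key', probing on collision. An odd step plus a
        // power-of-two table means the probe sequence visits every slot; a
        // tiny fixed offset would instead pile colliding keys together.
        private int indexOf(int key) {
            int index = (key ^ (key >>> 16)) & mask; // initial slot
            int step = (key >>> 16) | 1;             // secondary offset, forced odd
            while (keys[index] != FREE && keys[index] != key) {
                index = (index + step) & mask;
            }
            return index;
        }
    }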

I'm not so sure about your lazy HashMap initialisation in Element.java, though 
it does highlight a bug that was introduced by the recent r109 multithreaded 
patch! Prior to patch r109, Element objects were never recreated, rather 
.reset() was called to just reset their state for each new node/way/rel. 
r109 changed that to create new instances each time, hence some of the object 
churn you were probably seeing that led you to your patch. I'll have a think 
about what to do here and probably run a few benchmarks to see what works best 
(eg keeping one current Node/Way/Rel per thread might be best).
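To sketch that last idea (hypothetical code; Node, reset() and startNode() 
stand in for the splitter's real Element classes): each worker thread holds 
one long-lived element and reset() wipes it between OSM entities, so the 
per-element allocation churn disappears without sharing mutable state 
across threads.

    class NodePool {
        static class Node {
            double lat, lon;
            final java.util.Map<String, String> tags =
                    new java.util.HashMap<String, String>();
            void reset() { tags.clear(); } // clear per-element state for reuse
        }

        // One reusable Node per worker thread, mirroring the pre-r109
        // reset() behaviour without any cross-thread sharing.
        private static final ThreadLocal<Node> CURRENT = new ThreadLocal<Node>() {
            @Override protected Node initialValue() { return new Node(); }
        };

        static Node startNode(double lat, double lon) {
            Node node = CURRENT.get();
            node.reset();    // reuse the same instance, no new allocation
            node.lat = lat;
            node.lon = lon;
            return node;
        }
    }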

[As an aside, note that in your lazy HashMap patch, Collections.EMPTY_SET 
results in an unchecked assignment warning. That of course is no big deal, 
easily fixed with Collections.<String, String>emptyMap().entrySet(). The 
sting though is that the call to Collections.EMPTY_SET.iterator() actually 
creates a new iterator object each time(!). Better would be to create a single 
static immutable iterator and return that]
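Something along these lines is what I have in mind (just a sketch; the 
tagsIterator() name and the null check on the tags map are assumptions 
about Element's internals):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.NoSuchElementException;

    class TagIterators {
        // One shared empty iterator is safe to return to every caller
        // because it holds no state.
        private static final Iterator<Map.Entry<String, String>> EMPTY =
                new Iterator<Map.Entry<String, String>>() {
                    public boolean hasNext() { return false; }
                    public Map.Entry<String, String> next() {
                        throw new NoSuchElementException();
                    }
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };

        static Iterator<Map.Entry<String, String>> tagsIterator(
                Map<String, String> tags) {
            return tags == null ? EMPTY : tags.entrySet().iterator();
        }
    }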

> My multithread changes reduced the real and user CPU time by 30-50% on
> a uk extract. Other tests, e.g., a whole planet, show almost no
> effect. I think this is because a lot of my changes increase the
> efficiency of code that is parallelized, which paradoxically ends up
> reducing the relative benefits of parallelization.

This sounds promising, I'll look at what you've done further in the next 
couple of days and hopefully get it committed this weekend.

> The single biggest remaining slowdown is that for each point, the
> splitter sends each Node to each area (represented by an OSMWriter),
> which writes it in the output file if it is in the bounds of the
> region (with overlap), or does nothing otherwise. On a whole planet,
> with 840 areas and 550M nodes, that is over 450 billion bounds-checks
> that are not run in parallel. An index that can map a node to the
> regions that may contain it could avoid 99% of these checks.

This is something I've been aware of for a long time but haven't found the 
time to tackle yet. One idea I have is to generate an r-tree in memory and 
run the node comparisons against that to find out what area(s) each node 
falls into. I'd estimate it would reduce the number of comparisons required 
from 840/node to fewer than 10. I suspect that would be a win over generating 
a node-to-area index, which would be expensive to create initially, plus expensive 
in terms of disk access (or alternatively, require lots of RAM). Also, with 
a bit of effort these r-tree comparisons could be parallelised too.
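To make the idea concrete, here's a deliberately crude sketch using a 
fixed 1-degree grid rather than a real r-tree (all names hypothetical). 
The principle is the same: bucket the areas up front so each node is only 
bounds-checked against a handful of candidates instead of all 840.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class AreaGrid {
        static class Area {
            double minLat, minLon, maxLat, maxLon;
            Area(double minLat, double minLon, double maxLat, double maxLon) {
                this.minLat = minLat; this.minLon = minLon;
                this.maxLat = maxLat; this.maxLon = maxLon;
            }
            boolean contains(double lat, double lon) {
                return lat >= minLat && lat < maxLat
                        && lon >= minLon && lon < maxLon;
            }
        }

        private final Map<Long, List<Area>> cells =
                new HashMap<Long, List<Area>>();

        // 1-degree grid cell, offset so both indices are non-negative.
        private static long cellKey(double lat, double lon) {
            long row = (long) Math.floor(lat) + 90;
            long col = (long) Math.floor(lon) + 180;
            return row * 360 + col;
        }

        // Register an area in every grid cell its bounding box touches.
        void add(Area area) {
            for (int lat = (int) Math.floor(area.minLat);
                    lat <= (int) Math.floor(area.maxLat); lat++) {
                for (int lon = (int) Math.floor(area.minLon);
                        lon <= (int) Math.floor(area.maxLon); lon++) {
                    long key = cellKey(lat, lon);
                    List<Area> list = cells.get(key);
                    if (list == null)
                        cells.put(key, list = new ArrayList<Area>());
                    list.add(area);
                }
            }
        }

        // The only areas worth bounds-checking for this node.
        List<Area> candidates(double lat, double lon) {
            List<Area> list = cells.get(cellKey(lat, lon));
            return list != null ? list : java.util.Collections.<Area>emptyList();
        }
    }

A real r-tree would cope better with wildly uneven area sizes, but even 
something this simple cuts the checks per node from 840 down to however 
many areas overlap a single cell.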

I'll follow up in a while once I've had a bit more time to investigate and 
respond. Many many thanks for your work on all this, very much appreciated!

Chris
