[mkgmap-dev] Increasing the performance of whole-planet splits.

Sat Jun 12 02:52:59 BST 2010

On Fri, Jun 11, 2010 at 6:52 PM, Chris Miller <chris_overseas at hotmail.com>wrote:

> Hi Scott,
>

> This probably explains why
> no one from the mkgmap/splitter community noticed or commented on your
> original
> post on the osm-dev list - if they're anything like me they don't pay too
> much attention to what's going on there so I guess it got missed).
>
>
It was ready to get feedback, but not yet ready IMHO to be used.

As far as the splitter is concerned, I'd be happy to use it internally
> instead
> of the current cache format given the speed boost it appears to offer. I've
> always considered the cache to be a short lived/temporary thing anyway,
> purely
> as an aid to speeding up the split, so anything that helps that is a win.
>

> Using a binary format between the splitter and mkgmap would speed things
> up immensely too, in fact we've discussed it a bit in the past on the
> mailing
> list. Before we jump in too deep however, can you comment on how well your
> format has been accepted/intergrated into Osmosis or any other osm
> projects,
> and how stable the file format now? (ie, has it been battle tested? ;)

No use that I know of. I got busy after my initial emails, so I haven't had
time to clean it up and package it as a true osmosis plugin. After I do
that, then the chance that it may get more use increase, especially if other
tools are engineered to use it. I've been using it for my own use for the
last couple of months.

> If
> it looks like becoming a standard/supported format in the osm community
> then
> supporting it as both input and output formats in the splitter is a bit of
> a no-brainer, likewise as an input format for mkgmap (Steve and co, correct
> me if I'm wrong!). I admit I'm a bit wary of the complexity of the format
> and the potential maintenance effort required, especially if the format
> isn't
> used elsewhere in the osm community other than splitter and mkgmap.
>
>

> I've had a look at this patch - I'd be interested to hear your thinking
> behind
> the changes you made (because I'm too lazy to try and get my head around
> it by studying the code!). Are you sure it's behaving as you intended?

Yes.

> The
> --max-threads parameter is being ignored, instead you're creating as many
> threads as there are areas in a pass (up to 255). This seems to result in
> fairly unpredictable bursts in activity, sometimes quite agressive, but
> generally
> most of the threads are sitting idle.

Yes, to some extent the activity is variable. However it seems that the
current design has the worker threads effectively busywaiting, trying each
of the queues to see if there is data in it to process?

> In my benchmarks I'm seeing your patch
> run a bit quicker and with higher CPU utilisation than the r109 code. It's
> about 20% faster for a UK extract, 9% faster for the planet (on a core i7).
> Note that WanMil's original patch uses (cores - 1) for maxthreads.
>
> I suspect the reason the planet doesn't do so well as a smaller area is
> because
> the planet has more areas in the split and so takes longer to fill up a
> given
> 'bundle' to trigger some processing.

I suspect it is from a different cause. A planet split has proportionally
more serial code than a UK extract. On a planet, summing over each pass,
each node is sent to all 800 areas to see if it is to be included. On UK,
each node is only sent to the 20 or so areas. Now that reading is so fast, I
strongly suspect the system is bottlenecked in this serial code.

This reduces the amount of CPU used
> on average, since we'll get more idle periods (when many areas are still
> filling their buffers), but the maximum possible work being done when
> buffers
> fill up is still throttled by the number of cores.
>
>
I don't believe so. On a core i7, the system is rarely using more than 2
cores, which indicates that parallel code cannot be a bottleneck. In the
steady state, a bundle can be processed in a few ms each (my core2 duo can
do 500k nodes/sec), the aggregate unprocessed work of all incomplete bundles
is a second or two of CPU time. This work is not avoided, only delayed. In
the steady-state, this is lost in the noise. Reducing the bundle size will
only change the delay before the work is done, at a cost of increasing the
number of memory barriers in the concurrent queue code.

I know people leave the splitter
> running as a background task on their desktop PC so if we're periodically
> bursting 10+ threads at full CPU I can imagine they might get a bit annoyed
> at times. Thoughts?
>
>
>From so many cores being idle, I don't see how we could be getting bursts of
10+ threads unless the serial code gets sped up. This could be a risk if the
density-map idea goes in and massively reduces the serial work. If this
worries you, maybe hold off on this part of the patch until after the
densitymap is in, or put it in and be prepared to revert later?

A couple of ideas I have here... perhaps reducing the bundle size will help
> keep the workers fed (especially when there is a larger number of areas),
> and also using a threadpool of max-threads to do the work on, rather than
> one thread per area, would help prevent CPU starvation for other
> applications,
> without adversely effecting performance?

If CPU starvation is a problem, can't the splitter be set to run with a
lower priority?

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.mkgmap.org.uk/pipermail/mkgmap-dev/attachments/20100611/a734f0de/attachment.html