
[mkgmap-dev] splitter - performance and memory

From Lambertus osm at na1400.info on Sun Jul 19 21:48:31 BST 2009

By all means go ahead; I think many people would welcome a Splitter that 
can handle the whole world. Steve Ratcliffe also announced that he would 
be working on a new version, but he will no doubt respond himself.

If your expectation of less than 50% memory usage is correct, then you 
should be able to handle the whole world on your 6GB machine, since North 
and South America (~2/3 of the data) can be split using 8GB of RAM with 
the current splitter.

Chris Miller wrote:
> First off, let me say a quick thanks to everyone involved in this 
> project. I only discovered it recently, but I wish I had found it much 
> earlier; it really is incredible to see what has been done so far.
> 
> I've downloaded the complete planet-090715.osm.bz2 file and have been 
> looking at splitting it. I read the description and limitations of the 
> splitter.jar tool but decided to give it a go anyway, since I have a 
> 64-bit OS with 6GB of RAM. Unfortunately it still failed with a 
> -Xmx5200m heap. I have a 16GB machine at work that I could try it on, 
> but instead I decided to take a look at the source code to see if 
> there's any possibility of reducing the memory requirements.
> 
> I've only spent a short time looking at the code, but as far as I can 
> tell the whole first step (computing the areas.list file) uses far 
> more memory than it actually needs. The SplitIntMap (which is what takes 
> up all the memory) isn't required here, for two reasons. First, the 
> code never retrieves entries via .get(); it only uses an iterator, so a 
> list/array would suffice. Second, the node IDs aren't used in this 
> stage, so we don't even need to parse them, let alone hold on to them. 
> Assuming we replace the SplitIntMap with a wrapper around an int[] (or 
> multiple int[] to mitigate the double-memory-on-copy problem), we'd be 
> looking at memory savings of >50%.
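>
> To make that concrete, here's the sort of wrapper I have in mind 
> (untested sketch; the class name is made up rather than existing 
> splitter code):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Append-only int store built from fixed-size chunks, so growing
>     // never copies existing data (unlike doubling a single int[]).
>     class IntChunkList {
>         private static final int CHUNK_SIZE = 1 << 20; // 1M ints = 4MB/chunk
>         private final List<int[]> chunks = new ArrayList<int[]>();
>         private int size;
>
>         void add(int value) {
>             if (size % CHUNK_SIZE == 0)
>                 chunks.add(new int[CHUNK_SIZE]); // grow by one chunk only
>             chunks.get(size / CHUNK_SIZE)[size % CHUNK_SIZE] = value;
>             size++;
>         }
>
>         int get(int i) {
>             return chunks.get(i / CHUNK_SIZE)[i % CHUNK_SIZE];
>         }
>
>         int size() {
>             return size;
>         }
>     }
>
> Dropping the node-ID keys and the map's bucket overhead is where the 
> saving of more than 50% would come from; sequential iteration is all 
> the first step needs.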
> 
> Does that make sense, or have I missed something? If it sounds sensible 
> I'd be happy to have a go at implementing it. Also, given the nature of 
> the algorithm, it wouldn't hurt performance much if the lat/long values 
> were written out to disk rather than held in memory, which would make 
> splitting the whole dataset feasible even on a 32-bit machine.
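>
> For illustration, the disk variant could be as simple as this (untested 
> sketch; the temp-file format of two ints per node is invented here):
>
>     import java.io.*;
>
>     // Stream lat/lon pairs to a temp file while parsing, then read
>     // them back sequentially for the area computation.
>     public class CoordSpill {
>         public static void main(String[] args) throws IOException {
>             File tmp = File.createTempFile("latlon", ".tmp");
>             DataOutputStream out = new DataOutputStream(
>                     new BufferedOutputStream(new FileOutputStream(tmp)));
>             out.writeInt(52521713); // example lat in fixed-point units
>             out.writeInt(13411400); // example lon
>             out.close();
>
>             DataInputStream in = new DataInputStream(
>                     new BufferedInputStream(new FileInputStream(tmp)));
>             int lat = in.readInt();
>             int lon = in.readInt();
>             in.close();
>             System.out.println(lat + "," + lon);
>             tmp.delete();
>         }
>     }
>
> Reading 8-byte records back sequentially through a buffered stream is 
> fast, so the extra pass shouldn't dominate the runtime.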
> 
> I haven't yet looked at possibilities for tuning the second step, but I 
> assume that some sort of map/lookup is still required there. I figure 
> there are a few options: perform multiple passes, processing a subset of 
> the splits at a time (limited by the total number of nodes we can hold 
> in memory); optimise the existing data structures further; page some of 
> them out to disk; etc.
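>
> The multi-pass option might look roughly like this (sketch only; Area 
> and splitPass() are placeholders, not actual splitter code):
>
>     import java.io.File;
>     import java.util.List;
>
>     abstract class MultiPassSplitter {
>         static class Area { /* bounding box, output writer, ... */ }
>
>         // Writes output only for the areas in the given batch.
>         abstract void splitPass(File planetFile, List<Area> batch);
>
>         // Process the areas in batches small enough that each batch's
>         // nodes fit in memory, re-reading the planet file per batch.
>         void split(File planetFile, List<Area> areas, int areasPerPass) {
>             for (int first = 0; first < areas.size(); first += areasPerPass) {
>                 List<Area> batch = areas.subList(first,
>                         Math.min(first + areasPerPass, areas.size()));
>                 splitPass(planetFile, batch);
>             }
>         }
>     }
>
> It trades time for memory: one full read of the planet file per batch 
> instead of a single pass overall.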
> 
> I was also thinking a little about performance. Given the enormous size 
> of the full .osm file, I'd suggest a move away from SAX to a pull 
> parser (http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). 
> It's even faster than SAX and uses very little memory; in my job we use 
> it to parse many GB of XML daily with very good results. Another idea is 
> to parallelise the code by running parts of the split on different 
> threads to take advantage of multi-core CPUs. Possibly the best gain 
> would be in writing the output files, since gzip compression is fairly 
> CPU-intensive.
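>
> A pull-parser loop over the nodes would look something like this 
> (untested sketch against the XPP3/MXP1 API linked above):
>
>     import java.io.FileReader;
>     import org.xmlpull.v1.XmlPullParser;
>     import org.xmlpull.v1.XmlPullParserFactory;
>
>     // Pull the <node> lat/lon attributes out of an .osm file,
>     // roughly what the current SAX handler does in the first step.
>     public class PullNodes {
>         public static void main(String[] args) throws Exception {
>             XmlPullParser xpp =
>                     XmlPullParserFactory.newInstance().newPullParser();
>             xpp.setInput(new FileReader(args[0]));
>             for (int ev = xpp.getEventType();
>                     ev != XmlPullParser.END_DOCUMENT; ev = xpp.next()) {
>                 if (ev == XmlPullParser.START_TAG
>                         && "node".equals(xpp.getName())) {
>                     double lat = Double.parseDouble(
>                             xpp.getAttributeValue(null, "lat"));
>                     double lon = Double.parseDouble(
>                             xpp.getAttributeValue(null, "lon"));
>                     // accumulate lat/lon here
>                 }
>             }
>         }
>     }
>
> The application drives the loop instead of being called back, so there 
> is no handler state to juggle and the hot path stays in one method.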
> 
> What do people think? I'm happy to work on the above, though I must 
> confess up front that my spare time is quite limited, so please don't 
> expect too much too soon!
> 
> Chris
> 
> 
> 
> _______________________________________________
> mkgmap-dev mailing list
> mkgmap-dev at lists.mkgmap.org.uk
> http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev



