[mkgmap-dev] java.lang.AssertionError while building index from unicode tiles

From Gerd Petermann gpetermann_muenchen at hotmail.com on Mon Oct 18 09:12:55 BST 2021

Hi Ticker,

thanks for looking into this. I have no clue how to test whether the index really works with those characters, as I don't know how to type them. If I understood you correctly, mkgmap isn't able to sort the city names, so I wonder how the index can be of any use. I assume we have the same problem for other names, such as those for highways, POIs etc.?

Gerd

________________________________________
From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
Sent: Monday, 18 October 2021 09:58
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] java.lang.AssertionError while building index from unicode tiles

Hi

Although 2 16-bit items (surrogate pairs in UTF-16 speak) are required
to represent many Chinese characters, this isn't the significant
problem in this case.
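
For readers unfamiliar with surrogate pairs, here is a minimal Java
sketch of the difference between 16-bit units and code points. The
class and method names are invented for illustration and are not
mkgmap code.

```java
// Illustrative only: SurrogateDemo is a hypothetical class, not part of mkgmap.
class SurrogateDemo {
    // Number of 16-bit UTF-16 units Java uses to store the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                 // U+4E2D, fits in one 16-bit unit
        String supplementary = "\uD840\uDC00"; // U+20000, stored as a surrogate pair

        System.out.println(utf16Units(bmp) + " unit(s), "
                + codePoints(bmp) + " code point(s)");
        System.out.println(utf16Units(supplementary) + " unit(s), "
                + codePoints(supplementary) + " code point(s)");
        // The first unit of the pair is a high surrogate, not a character.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));
    }
}
```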

Problem is that resources/sort/cp65001.txt doesn't give ordering to
lots of characters; it looks like it covers only about 10,500 of the
1,112,064 possible code-points. Many of these non-ordered characters
are being used by the names in the tile in question.

The basic handling for other codings (e.g. cp125*) treats a missing
sort entry as a reason to ignore the character: it won't be
represented in the output, so there is no point in considering it in
the sorting.

This isn't the case with Unicode, as all characters should show. More
importantly for this crash, stable sorting is required for
de-duplication of some of the index structures, and this isn't
happening because characters are being ignored.
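
A rough sketch of how ignoring characters can undermine sorting and
de-duplication: if every character without a sort entry is dropped,
two different names can end up with the same comparison key. The sort
table here is a hypothetical stand-in (it only knows 'a' to 'z'), and
none of the names below come from mkgmap.

```java
// Illustrative only: IgnoredCharsDemo and its methods are hypothetical.
class IgnoredCharsDemo {
    // Assumption: a toy sort table that only has entries for 'a'..'z'.
    static boolean hasSortEntry(char c) {
        return c >= 'a' && c <= 'z';
    }

    // Builds a comparison key by dropping every character without a sort entry.
    static String keyIgnoringUnknown(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (hasSortEntry(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String first = "\u5317\u4EAC";  // two characters with no sort entry
        String second = "\u4E0A\u6D77"; // two different characters, also unknown

        // The names differ, but their keys are both empty and compare equal,
        // so a sort cannot order them and a key-based duplicate check merges them.
        System.out.println(first.equals(second));
        System.out.println(keyIgnoringUnknown(first).equals(keyIgnoringUnknown(second)));
    }
}
```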

Assuming the actual ordering of unspecified code-points doesn't really
matter, I propose to change the logic slightly so undefined Unicode is
sorted on its 16-bit value after the range of known sorts.
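
The idea could be sketched roughly as follows. This is a simplified,
single-level comparison under stated assumptions, not the actual
mkgmap Sort code: the table contents, MAX_KNOWN and all names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: FallbackOrderDemo is a hypothetical class, not part of mkgmap.
class FallbackOrderDemo {
    // Assumption: a tiny sort table standing in for the real sort resource.
    static final Map<Character, Integer> KNOWN = new HashMap<>();
    static final int MAX_KNOWN = 3; // highest position used in the table
    static {
        KNOWN.put('a', 1);
        KNOWN.put('b', 2);
        KNOWN.put('c', 3);
    }

    // Known characters use their table position; unknown ones are placed
    // after every known position, ordered by their 16-bit value.
    static int position(char c) {
        Integer p = KNOWN.get(c);
        return (p != null) ? p : MAX_KNOWN + 1 + c;
    }

    // Compares unit by unit, so no character is ever ignored.
    static int compare(String x, String y) {
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            int d = Integer.compare(position(x.charAt(i)), position(y.charAt(i)));
            if (d != 0) {
                return d;
            }
        }
        return Integer.compare(x.length(), y.length());
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(
                Arrays.asList("\u4E0A\u6D77", "b", "\u5317\u4EAC", "a"));
        names.sort(FallbackOrderDemo::compare);
        System.out.println(names);
    }
}
```

With a total order like this, distinct names never compare as equal,
which is what the de-duplication relies on.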

I also need to make SortKey generation consistent in a similar way, fix
some of the uniqueness tests to be consistent with the sort, and verify
that the size of mdr5 is >= mdr25, so that this type of problem is
detected before it is exposed when mdr25 indexes can't be represented
in the same number of bytes as mdr5 indexes.

Ticker


On Sun, 2021-10-17 at 11:16 +0100, Ticker Berkin wrote:
> Hi
>
> It is most likely that this problem is because Chinese requires 2
> UTF16 chars to encode many of its characters - see
>
> https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
>
> I think it is only --index processing where this is a problem in
> mkgmap.
>
> I'll investigate more
>
> Ticker
>
>
> _______________________________________________
> mkgmap-dev mailing list
> mkgmap-dev at lists.mkgmap.org.uk
> https://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev



