
[mkgmap-dev] Copyright & License file reader improvements

From Marko Mäkelä marko.makela at iki.fi on Tue Dec 27 20:29:48 GMT 2016

On Tue, Dec 27, 2016 at 06:07:11PM -0000, Mike Baggaley wrote:
>Hi Gerd, please find attached a small patch that improves the loading of
>copyright and license data when the --copyright-file and --license-file
>options are used. It will attempt to load the data using ANSI, UTF-8, UTF-16
>and the default code page. If it fails, more information is provided as to
>the reason why.

I am not Gerd, and I am not that active with mkgmap any more, but I have 
some interest in character encodings.

I had a quick look at the patch. It first tries ASCII (which is a proper 
subset of UTF-8), then UTF-8, UTF-16 and the default code page.

I do not think that there is any need to try ASCII separately. Any valid 
ASCII input is also valid UTF-8.
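
To illustrate the subset relationship, here is a minimal Python sketch. 
Python is used only for illustration (mkgmap itself is written in Java), 
and the helper name decodes_as is hypothetical, not something from the 
patch:

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding (strict decoding)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# All 128 ASCII byte values decode to the same characters under UTF-8.
all_ascii = bytes(range(128))
ascii_is_utf8 = all_ascii.decode("ascii") == all_ascii.decode("utf-8")

# The reverse does not hold: valid UTF-8 with a non-ASCII character
# is rejected by a strict ASCII decoder.
utf8_sample = b"caf\xc3\xa9"  # "cafe" with an acute accent, encoded as UTF-8
```

So a separate ASCII pass can never accept anything that the UTF-8 pass 
would reject.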

If the input is not valid UTF-8, things get tricky. I am not sure if 
UTF-16 is a good thing to try. Here is an example where 6 ASCII 
characters (which could be part of a non-ASCII, non-UTF-8 input) get 
misinterpreted as 3 Chinese characters when read as UTF-16 (taken as 
big-endian here, because there is no byte order mark):

$ echo -n foobar|recode utf16..utf8;echo
景潢慲

Because of this, I would omit the UTF-16 pass altogether. If UTF-16 
input is truly needed, the default code page could be set to it.
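
The same misreading can be reproduced in a short Python sketch (again 
only an illustration, not mkgmap code). It assumes big-endian UTF-16 
without a byte order mark, which matches the recode output above:

```python
from itertools import product

data = b"foobar"

# Each pair of ASCII bytes is taken as one 16-bit code unit:
# 0x66 0x6f -> U+666F, 0x6f 0x62 -> U+6F62, 0x61 0x72 -> U+6172
as_utf16 = data.decode("utf-16-be")
print(as_utf16)  # 景潢慲

# Any pair of ASCII bytes forms a code unit in the range 0x0000-0x7F7F,
# which never touches the surrogate range 0xD800-0xDFFF, so a strict
# UTF-16 decoder accepts every even-length ASCII input.
for hi, lo in product(range(128), repeat=2):
    bytes([hi, lo]).decode("utf-16-be")  # raises UnicodeDecodeError if invalid
all_ascii_pairs_decode = True
```

In other words, a UTF-16 pass rarely fails on input that is not UTF-16, 
so it cannot serve as a reliable detection step.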

Also, text in some non-UTF-8 superset of ASCII could accidentally look 
like valid UTF-8. For example, the bytes 0xc2 0xa0 represent the 
two-character string U+00C2 U+00A0 in ISO 8859-1. But the same bytes 
are also the valid UTF-8 encoding of the single character U+00A0.
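
A minimal Python sketch of that ambiguity (illustration only): both 
decoders accept the same two bytes without error and return different 
text.

```python
raw = bytes([0xC2, 0xA0])

# ISO 8859-1 maps every byte to the code point with the same value,
# so this yields two characters: U+00C2 followed by U+00A0.
as_latin1 = raw.decode("iso-8859-1")

# In UTF-8, 0xC2 0xA0 is the two-byte encoding of the single
# character U+00A0 (no-break space).
as_utf8 = raw.decode("utf-8")

print(len(as_latin1), len(as_utf8))  # 2 1
```

Neither decoder reports a problem, so trying encodings in sequence 
cannot tell which reading was intended.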

I think that if multiple input formats are supported (which would be 
against the Unix philosophy of keeping programs simple), the selection 
must be explicit, by some command line switch that chooses to use the 
default code page instead of UTF-8.
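
As a rough sketch of what explicit selection could look like, here is a 
Python illustration. The function decode_text and its parameter 
use_default_code_page are hypothetical names made up for this example. 
They do not correspond to any existing mkgmap option, and mkgmap itself 
is written in Java:

```python
import locale


def decode_text(data: bytes, use_default_code_page: bool = False) -> str:
    """Decode strictly as UTF-8, or as the platform default encoding
    when the caller explicitly asks for it. No guessing between the two."""
    if use_default_code_page:
        encoding = locale.getpreferredencoding(False)
    else:
        encoding = "utf-8"
    try:
        return data.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError as err:
        # Report which encoding was expected and where decoding failed.
        raise ValueError(
            f"input is not valid {encoding}: bad byte at offset {err.start}"
        ) from err
```

With this shape, a file in the wrong encoding produces a clear error 
instead of silently garbled text, and the user decides which encoding 
applies.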

In my opinion, the current code is good as it is. Because mkgmap already 
deals with mostly UTF-8 input (the OSM data), I think it is consistent 
to assume that all text files are encoded in UTF-8.

Best regards,

	Marko

