[mkgmap-dev] Copyright & License file reader improvements

Wed Dec 28 13:10:00 GMT 2016

Hi Marko (and anyone else who is interested),

I'm happy to take note of your comments on ANSI and UTF-16; however, I disagree with your assessment that the code is good at present. The OSM data files may well be in UTF-8 format, but they are not (usually) generated by the user, and they have an XML header indicating the encoding. The copyright and license files are text files provided as inputs by the user and hence carry no encoding information. I can see no reason why user-generated text files should be expected to be UTF-8 encoded just because the OSM data files are encoded that way.

There is no mention of UTF-8 anywhere in the documentation, and when a copyright or license file is not UTF-8 encoded we get an error message that doesn't explain why the file failed to load. This is a recipe for frustrated users and questions about why files won't load.

It seems to me that if only a single encoding is to be supported, then mkgmap should expect the file to use the default code page, which is how a text editor will normally save the file unless specifically asked to do otherwise. Of course, some systems may well be set up with UTF-8 as the default. Until very recently, the copyright file was not expected to be in UTF-8 format.

I suggest that one of the following options might be the way to go:

1:
Load the two files using the default code page.
If there is a failure, include the reason for the failure in the exit exception message.

2:
Update the documentation to indicate that UTF-8 must be used for the license and copyright files.
If there is a failure, include the reason for the failure in the exit exception message.

3:
Use the existing --code-page option to also determine the code page for the copyright and license files. If not specified, use the default code page.
If there is a failure, include the reason for the failure in the exit exception message.
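
As a rough illustration of option 3, here is a minimal Java sketch. It is not the attached patch and not existing mkgmap code: the class and method names (TextFileLoader, chooseCharset, decodeStrict, load) are hypothetical, and it assumes the value of --code-page has already been mapped to a Java charset name, which is not shown. The point is that a strict decoder reports malformed input, so the reason for the failure can be put in the exception message.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class TextFileLoader {

    // Use the named charset if one was given, otherwise the platform default.
    static Charset chooseCharset(String charsetName) {
        return charsetName == null ? Charset.defaultCharset() : Charset.forName(charsetName);
    }

    // Decode strictly: malformed or unmappable input raises an exception
    // instead of being silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Read the whole file and name both the file and the charset on failure.
    static String load(Path file, String charsetName) throws IOException {
        Charset cs = chooseCharset(charsetName);
        try {
            return decodeStrict(Files.readAllBytes(file), cs);
        } catch (CharacterCodingException e) {
            throw new IOException("Cannot read " + file + " as " + cs.name() + ": " + e, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TextFileLoader.java <file> [charset-name]
        System.out.println(load(Paths.get(args[0]), args.length > 1 ? args[1] : null));
    }
}
```

With this approach a file saved in the wrong encoding fails with a message that names the charset that was tried, rather than a bare load error.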

I am happy to rework the patch for any of the above, but will wait for further comments/feedback before proceeding.

Regards,
Mike

-----Original Message-----
From: Marko Mäkelä [mailto:marko.makela at iki.fi] 
Sent: 27 December 2016 20:30
To: Development list for mkgmap <mkgmap-dev at lists.mkgmap.org.uk>
Subject: Re: [mkgmap-dev] Copyright & License file reader improvements

On Tue, Dec 27, 2016 at 06:07:11PM -0000, Mike Baggaley wrote:
>Hi Gerd, please find attached a small patch that improves the loading of
>copyright and license data when the --copyright-file and --license-file
>options are used. It will attempt to load the data using ANSI, UTF-8, UTF-16
>and the default code page. If it fails, more information is provided as to
>the reason why.

I am not Gerd, and I am not that active with mkgmap any more, but I have 
some interest in character encodings.

I had a quick look at the patch. It first tries ASCII (which is a proper 
subset of UTF-8), then UTF-8, UTF-16 and the default code page.

I do not think that there is any need to try ASCII separately. Any valid 
ASCII input is also valid UTF-8.

If the input is not valid UTF-8, things get tricky. I am not sure if 
UTF-16 is a good thing to try. Here is an example where 6 ASCII 
characters (which could be part of a non-ASCII, non-UTF-8 input) get 
misinterpreted as 3 Chinese glyphs in UTF-16:

$ echo -n foobar|recode utf16..utf8;echo
景潢慲

Because of this, I would omit the UTF-16 pass altogether. If UTF-16 
input is truly needed, the default code page could be set to it.
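
The same effect can be reproduced in Java, the language mkgmap is written in. This is only a sketch: the class name Utf16Misread and the method asUtf16 are made up for the illustration, and big-endian UTF-16 without a byte order mark is assumed, which matches the recode output above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Utf16Misread {

    // Take the ASCII bytes of the text and decode them as big-endian UTF-16.
    static String asUtf16(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // "foobar" is the bytes 66 6f 6f 62 61 72, which pair up as
        // U+666F U+6F62 U+6172: three CJK characters, and no decoding error.
        String misread = asUtf16("foobar");
        System.out.println(misread.length() + " characters: " + misread);
    }
}
```

Any even number of ASCII bytes decodes this way without complaint, so a UTF-16 pass cannot tell such input from genuine UTF-16.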

Also, some non-UTF-8 superset of ASCII could accidentally look like 
valid UTF-8. For example, the bytes 0xc2 0xa0 could represent the 
two-character string U+00C2 U+00A0 in ISO 8859-1. But the same bytes 
could also be interpreted as the single UTF-8 encoded character U+00A0.
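
A short Java sketch of this ambiguity follows; the class name AmbiguousBytes and its methods are invented for the example. Both decodings succeed, so the bytes alone do not say which one was intended.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class AmbiguousBytes {
    static final byte[] BYTES = {(byte) 0xC2, (byte) 0xA0};

    // ISO 8859-1 maps every byte to one character: U+00C2 followed by U+00A0.
    static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    // UTF-8 reads the two bytes as one sequence encoding the single character U+00A0.
    static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("ISO 8859-1: " + asLatin1().length() + " characters");
        System.out.println("UTF-8:      " + asUtf8().length() + " character");
    }
}
```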

I think that if multiple input formats are supported (which would be 
against the Unix philosophy of keeping programs simple), the selection 
must be explicit, by some command line switch that chooses to use the 
default code page instead of UTF-8.

In my opinion, the current code is good as it is. Because mkgmap already 
deals with mostly UTF-8 input (the OSM data), I think it is consistent 
to assume that all text files are encoded in UTF-8.

Best regards,

	Marko