[mkgmap-dev] New branch for default typ file

Wed Dec 18 11:06:19 GMT 2019

Hi Randolph

This topic should probably become a new thread.

You shouldn't confuse the encoding of the Java source text (rules
determined by the Java language) with how a Java program reads a text
file into its internal character format (however the programmers want
to do it, though the Java library supplies converters for almost all
character sets/encodings).
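
To illustrate the second half of that point, here is a minimal sketch
of a program choosing the converter itself. The class and method names
are made up for the example and are not from mkgmap; the same two
bytes decode differently depending on the charset the program picks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of mkgmap.
class DecodeDemo {
    // Decode raw file bytes with a charset chosen by the program,
    // independent of how the Java source file itself is encoded.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the bytes C3 A9.
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};
        // Decoded as UTF-8 the bytes form one character ...
        System.out.println(decode(raw, StandardCharsets.UTF_8).length());
        // ... decoded as ISO-8859-1 each byte becomes its own character.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```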

I agree that the text file input processing of mkgmap should allow for
a BOM in all cases and use it to determine the correct unicode input
decoding. There are various possible input files with a mix of
character set/encoding determination and BOM acceptance.

From a quick look at the various text inputs, I find:

style components: In the default style, all are pure 7-bit ASCII,
except inc/address, which contains some UTF-8 encoded characters.

road-name-config: this is read as UTF-8.

TYP: This checks for a UTF-8 BOM as the first character on a line, and,
if not there, looks for a line starting with 'CodePage=' and uses what
follows, with cp65001 taken to mean UTF-8. It has some logic to default
to cp1252 and some other convolutions.
There are many incorrect assumptions in this handling, the main one
being that CodePage is there to determine the output charset, which can
be determined from the main mkgmap map options anyway.

-c options.cfg: I haven't studied the logic for this, but it probably
uses the character set/encoding determined by Java from the
environment; on Unix, perhaps from $LANG, with a typical value of
"en_GB.UTF-8".

command line parameters: ditto

copyright/licence-file: not looked at

delete-tags-file: not looked at

other files: ?
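
As a rough illustration of the TYP behaviour described above (BOM
check, then a 'CodePage=' line, with cp65001 meaning UTF-8 and cp1252
as the default), something like the following captures the idea. This
is a sketch only, not the actual mkgmap code: the class and method
names are invented, it only looks for the BOM at the very start of the
data, and it assumes the CodePage value is a plain number.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; this is not the mkgmap TYP reader.
class TypCharsetSketch {
    // Returns a charset name that a caller could pass to Charset.forName().
    static String charsetNameFor(byte[] raw) {
        // A UTF-8 BOM is the three bytes EF BB BF.
        if (raw.length >= 3 && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB && (raw[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII
        // keyword can be searched for before the real charset is known.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        for (String line : text.split("\r?\n")) {
            if (line.startsWith("CodePage=")) {
                String value = line.substring("CodePage=".length()).trim();
                // Code page 65001 is taken to mean UTF-8.
                return value.equals("65001") ? "UTF-8" : "cp" + value;
            }
        }
        return "cp1252"; // default when nothing else is found
    }

    public static void main(String[] args) {
        byte[] sample = "[_id]\nCodePage=65001\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(charsetNameFor(sample));
    }
}
```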

Most of these areas could benefit from a unified way of determining the
input character set and encoding, but we need to beware of backward
compatibility, where users have their own components in a code-page
relevant to their area.

I suggest something like the following, in order:

1/ Look for a BOM for any of the unicode encodings near the start of
the file; not necessarily the first character, because, without
changing the next level of the file parser, it might need to be in a
comment.

2/ Look for a 1st or 2nd line of the format:
{comment-indicator} -*- coding: {charset} -*-
where {comment-indicator} is typically a '#', and {charset}, for
unicode, represents the encoding as well. This method is used by Python
and was common on unix systems and recognised by many text editors
before UTF-8 became ubiquitous.

3/ Default to UTF-8 or the environmental default depending on context,
to be compatible with current handling.
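
The three steps above could be sketched roughly as follows. This is
only an outline under some assumptions: the class and method names are
hypothetical, the BOM is only looked for at offset 0 (not "near the
start"), UTF-32 BOMs are left out, and the caller passes in whichever
default suits the context for step 3.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed detection order; not mkgmap code.
class InputCharsetSketch {
    // Matches "-*- coding: NAME -*-" anywhere in a line.
    private static final Pattern CODING = Pattern.compile(
            "-\\*-\\s*coding:\\s*([A-Za-z0-9][A-Za-z0-9._-]*)\\s*-\\*-");

    static Charset detect(byte[] raw, Charset fallback) {
        // Step 1: a BOM at the very start of the data.
        Charset fromBom = bomCharset(raw);
        if (fromBom != null) {
            return fromBom;
        }
        // Step 2: a coding declaration on the 1st or 2nd line. The head
        // is decoded as ISO-8859-1 so that every byte maps to one char.
        String head = new String(raw, 0, Math.min(raw.length, 512),
                StandardCharsets.ISO_8859_1);
        String[] lines = head.split("\r?\n", 3);
        for (int i = 0; i < Math.min(2, lines.length); i++) {
            Matcher m = CODING.matcher(lines[i]);
            if (m.find() && Charset.isSupported(m.group(1))) {
                return Charset.forName(m.group(1));
            }
        }
        // Step 3: the context-dependent default.
        return fallback;
    }

    private static Charset bomCharset(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(b, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(b, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "# -*- coding: ISO-8859-1 -*-\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample, StandardCharsets.UTF_8));
    }
}
```

Note that Java's UTF-8 decoder does not drop a BOM, so after decoding
with the detected charset the caller would still need to skip a
leading U+FEFF.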

Ticker

On Tue, 2019-12-17 at 15:20 -0600, Randolph J. Herber wrote:
> Dear Sirs:
> There has been a thread of discussion of whether there should be a
> Byte Order Mark (BOM) at the beginning of a UTF-8 file.
> This discussion is complicated by the fact that some of the
> developers work on Unix, Linux, BSD, iOS, Solaris and Windows. These
> operating systems have UTF-8 handling libraries written at different
> times and to different Unicode standards. Originally the Unicode
> standard said that UTF-8 should not have a BOM character at the
> beginning of a file. Later Unicode changed the standard to say that a
> BOM is permissible, not required and not recommended. Microsoft added a BOM
> to the beginning of UTF-8 files before doing so was permissible to
> ease the problem of recognizing a UTF-8 file. This broke the other
> operating systems' handling of UTF-8. Microsoft petitioned for the
> permissibility of a BOM to avoid changing their file handling.
> At this time, I believe that all programs should use Unicode and not
> Microsoft code pages. I have had problems with Microsoft code pages
> since MSDOS days.
> Splitter and mkgmap are written in Java. Java still follows the
> original Unicode standard of no BOM at the beginning of a UTF-8 text
> file. This is a "not to be fixed" situation per the Java language
> developers. This situation results in problems with Java,
> particularly in a Microsoft Windows environment.
> The code fragments below provide Java solutions for writing a BOM at
> the beginning of UTF-8 text files so that Microsoft native text
> editors can handle them and, on reading a text file, provide an
> automatic way of ignoring an optional BOM by checking for the BOM
> after file opening.
> A test for execution in a Windows environment is provided below if
> one decides to add a BOM only on Microsoft Windows.
> I have not downloaded the splitter and mkgmap sources and searched
> for the appropriate places in their sources to apply the changes. I
> feel the main splitter and mkgmap developers are placed better to
> make these changes. This is the reason that I did not provide patches
> to the sources.
> Randolph J. Herber.
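
For illustration only, the idea described in the quoted message
(optionally write a BOM, ignore one on reading, and test for Windows)
might look like the sketch below. It is not Randolph's code; the class
and method names are hypothetical, and it works on whole byte arrays
rather than on open files to keep the example short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are invented for this example.
class BomSketch {
    static final char BOM = '\uFEFF';

    // True when the JVM reports a Windows operating system.
    static boolean onWindows() {
        return System.getProperty("os.name", "").startsWith("Windows");
    }

    // Encode text as UTF-8, optionally with a leading BOM (EF BB BF).
    static byte[] encodeUtf8(String text, boolean withBom) {
        String s = withBom ? BOM + text : text;
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes and drop one leading BOM if it is present.
    static String decodeUtf8(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // Add a BOM only on Windows, as the quoted message suggests.
        byte[] out = encodeUtf8("CodePage=65001", onWindows());
        System.out.println(decodeUtf8(out));
    }
}
```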