
[mkgmap-dev] TYP files and character encoding

From Ticker Berkin rwb-mkgmap at jagit.co.uk on Tue Jan 14 09:43:10 GMT 2020

Hi Gerd

Here is an updated patch that closes the file, although I find many files
in mkgmap that don't have an explicit close(); I presume .finalize()
will close them eventually.
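
For reference, a minimal sketch of closing a reader explicitly with try-with-resources rather than leaving it to finalization, which is not guaranteed to run promptly. The class and method names here are illustrative only and are not taken from the patch:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FirstLine {
    /** Returns the first line of the file, or null if the file is empty. */
    static String firstLine(Path file) throws IOException {
        // The reader is closed when the try block exits,
        // whether normally or because of an exception.
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return br.readLine();
        }
    }
}
```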

I'll do another patch for other text file handling, using
StandardCharsets where possible, fixing the TokenScanner message for bad
characters when the charset is not utf-8 and, if reasonable, allowing a
BOM even if the file is opened as utf-8 anyway.
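
Roughly what I mean by allowing a BOM: Java's UTF-8 decoder passes a leading BOM through as the character U+FEFF, so the reader has to drop it itself. This is only a sketch; the class and method names are made up for illustration and are not the planned patch:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class BomSkippingReader {
    /** Opens the stream as UTF-8 and skips a leading BOM if one is present. */
    static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        br.mark(1);                    // remember the start of the stream
        if (br.read() != '\uFEFF') {   // first char is not a BOM (or stream is empty)
            br.reset();                // so go back and keep it
        }
        return br;
    }
}
```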

Ticker

On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> Hi Ticker,
> 
> thanks for the patch.
> 
> Please review TypCompiler.CharsetProbe.  BufferedReader br is not
> closed. Is that intended?
> 
> I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> sources. I think it would be good to use StandardCharsets.UTF_8 where
> possible and unify the rest.
> 
> Gerd
> 
> ________________________________________
> From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on behalf of
> Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> Sent: Monday, 13 January 2020 11:34
> To: Development list for mkgmap
> Subject: Re: [mkgmap-dev] TYP files and character encoding
> 
> Hi Gerd
> 
> I've updated this patch with changes to TypCompiler CharsetProbe:
> 
> 1/ looks for unicode BOM in various encodings near start of file.
> 2/ looks for line containing "-*- coding: charset -*-" near start of
> the file.
> 3/ retains the check for "CodePage=" coding for compatibility.
> 4/ in the absence of the above, sets the reading charset to utf-8 if
> the file is valid utf-8, otherwise to Cp1252.
> 5/ fixes the bad character message from the scanner to say what the
> charset really is rather than saying "utf-8" regardless.
> 6/ removes the logic that checks whether String... lines, read in the
> charset it is currently trying, can be encoded in the presumed output
> CodePage.
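
A minimal sketch of the fallback in step 4/ only (valid utf-8, otherwise Cp1252), assuming the file contents are already available as a byte array. The class and method names are illustrative and are not taken from the patch; the BOM, "coding:" and "CodePage=" checks from steps 1/ to 3/ are left out:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetGuess {
    /** Returns UTF-8 if the bytes decode cleanly as UTF-8, otherwise windows-1252. */
    static Charset guess(byte[] data) {
        // REPORT makes the decoder throw on bad input instead of
        // silently substituting a replacement character.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return Charset.forName("windows-1252");
        }
    }
}
```

Pure ASCII input is valid utf-8, so it is reported as utf-8 here; that is harmless because the two charsets agree on the ASCII range.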
> 
> The final result of this patch should be that:
> 
> a/ No existing usage is broken
> b/ Two methods of indicating the charset/encoding of the file that are
> commonly used by text editors are now recognised.
> Previously, just the UTF-8 BOM was detected.
> c/ Typ files can, and should from now on, be written in utf-8
> d/ labels for languages not supported in the --code-page of the output
> img just generate a warning in mkgmap.log.x
> 
> Ticker
> 
> 
> On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
> > Hi Gerd
> > 
> > Attached is a patch that:
> > 
> > Doesn't use the 'CodePage=' command in the typ-file to determine the
> > output character encoding of the typ-file; rather, it uses the main
> > map encoding from the --code-page argument.
> > 
> > log.warn's any typ labels that can't be encoded in the --code-page,
> > rather than just giving up with a message like:
> > > TYP file cannot be written in code page 1252
> > 
> > The message:
> > > WARNING: SortCode in TYP txt file different from command line
> > > setting
> > that was written directly to System.out is changed to a log.warn, and
> > it shouldn't happen anyway now.
> > 
> > For the moment, the 'CodePage=' command in the typ-file is, under some
> > circumstances, used to determine the encoding of the typ-file itself,
> > and I've left this alone for compatibility with existing usage.
> > Sometime in January I'll provide a better method for this.
> > 
> > Ticker
> > 
> > 
> > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > > 
> > > I think it is best to continue with the ideas for typ-files that:
> > > 
> > > 1/ they can be in any character set and we just need a better way of
> > > working out the correct one - see my posting earlier today.
> > > 
> > > 2/ it can include as many languages as anyone can be bothered to add,
> > > and so has to be in a character set that allows the languages to be
> > > added, implying unicode for a common one (more particularly, UTF-8)
> > > 
> > > 3/ the codepage= statement should be redundant and ignored for
> > > controlling the output character set, which should be taken from the
> > > map, but its use for determining the input coding might need to be
> > > kept for a while for compatibility.
> > > 
> > > 4/ the messages my hack generates should be turned into 1 warning or
> > > information message per language or maybe suppressed altogether. If
> > > someone is generating a map with a character set that doesn't support
> > > a particular language, they really won't care that the data for other
> > > languages that have an incompatible representation with their
> > > language won't be there.
> > > 
> > > Ticker
> > > 
> > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > Hi Ticker,
> > > > 
> > > > I think I understand now why we didn't have a default typ file ;)
> > > > If I got that right I should revert the changes in r4395 and mkgmap
> > > > should not allow or warn loudly when a typ file with a different
> > > > codepage is merged?
> > > > Or should we force the usage of unicode codepage?
> > > > Or is it possible to compile mapnik.txt with cp 1252 (or any other)
> > > > in a way that only those lines which contain non-matching characters
> > > > are ignored?
> > > > 
> > > > Gerd
> > > > 
> > > > 
> > > > ________________________________________
> > > > > From: mkgmap-dev <mkgmap-dev-bounces at lists.mkgmap.org.uk> on
> > > > > behalf of Ticker Berkin <rwb-mkgmap at jagit.co.uk>
> > > > > Sent: Wednesday, 18 December 2019 19:46
> > > > > To: mkgmap development
> > > > > Subject: [mkgmap-dev] TYP files and character encoding
> > > > 
> > > > Hi
> > > > 
> > > > A couple of problems with typ-files and unicode.
> > > > 
> > > > > With 'Codepage=65001' the final contents of the labels in mapnik.typ
> > > > > that is included with the composite map is unicode, but if the map is
> > > > > codepage 1252, the unicode characters with the top bit set are simply
> > > > > displayed as if in 1252.
> > > > > 
> > > > > Removing the codepage statement from mapnik.txt and making fixes
> > > > > elsewhere to ensure that the file is read correctly as utf-8 and then
> > > > > generating a map with --code-page=1252, it gives the error:
> > > > 
> > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > >  ../svn/trunk/resources/typ-files/mapnik.txt:
> > > >  (thrown in TypCompiler.makeMap())
> > > >  TYP file cannot be written in code page 1252
> > > > 
> > > > > Changing the exception handling in imgfmt/app/typ/TypElement.java,
> > > > > so that makeLabelBlock() reads as
> > > > ...
> > > >     CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > >     try {
> > > >         ByteBuffer buffer = encoder.encode(cb);
> > > >         out.put((byte) tl.getLang());
> > > >         out.put(buffer);
> > > >         out.put((byte) 0);
> > > > >     } catch (CharacterCodingException ignore) {
> > > > >         // Report the label that cannot be encoded and skip it
> > > > >         // ignore.printStackTrace();
> > > > >         String name = encoder.charset().name();
> > > > >         System.out.println("Cannot represent String=" +
> > > > >             tl.getLang() + "," + tl.getText() +
> > > > >             " in CodePage=" + name);
> > > > >         // throw new TypLabelException(name);
> > > > >     }
> > > > ...
> > > > 
> > > > It gives output like:
> > > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
> > > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
> > > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
> > > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
> > > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
> > > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
> > > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > > Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
> > > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > > Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
> > > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
> > > > > Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
> > > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > > Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > > Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
> > > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > > 
> > > > > Which makes sense if codepage 1252 doesn't handle Polish (language
> > > > > code hex 0x15, decimal 21).
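
A quick way to check this kind of thing is to ask the encoder directly. The sketch below uses a hypothetical helper, not mkgmap code, and tests whether a label can be represented in a given charset before trying to write it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LabelEncodingCheck {
    /** True if every character of the label can be encoded in the charset. */
    static boolean fits(String label, Charset cs) {
        // A fresh encoder per call: CharsetEncoder is not thread-safe.
        return cs.newEncoder().canEncode(label);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "Pla\u017ca" is the Polish word for beach; U+017C is not in cp1252.
        System.out.println(fits("Pla\u017ca", cp1252));
        System.out.println(fits("Pla\u017ca", StandardCharsets.UTF_8));
    }
}
```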
> > > > 
> > > > > NB the non-ASCII characters in the above are messed up by my cutting
> > > > > and pasting.
> > > > > 
> > > > > Checking the French, on my Garmin device, the type descriptions now
> > > > > display accents correctly.
> > > > 
> > > > Ticker
> > > > 
> > > > _______________________________________________
> > > > mkgmap-dev mailing list
> > > > mkgmap-dev at lists.mkgmap.org.uk
> > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: typCodePage_v3.patch
Type: text/x-patch
Size: 12253 bytes
Desc: not available
URL: <http://www.mkgmap.org.uk/pipermail/mkgmap-dev/attachments/20200114/f1dd06de/attachment-0001.bin>