[gui] Unicode filenames on Macintosh HFS+ and international interoperability

Philippe Verdy verdy_p at wanadoo.fr
Mon Jan 27 04:26:07 EST 2003


I have received a rather strange report. It describes an
internationalization problem caused by a bug (or "feature"?) in the Apple
version of Java on Macintosh (on both MacOS 8/9 and MacOSX), which also
causes interoperability problems between various versions of MacOS, or with
Windows and Unix.

This mail is quite long. Sorry but it requires extensive comments to
understand why this is an issue, and why the proposed patch is necessary and
safe. Its solution affects both the core and the GUI of LimeWire.

It is related to a very controversial choice made by Apple: filenames on
HFS+ are encoded with Unicode, but are first transformed to an Apple
HFS+-specific "canonical" form, which resembles the Unicode NFD (decomposed)
form but is not produced by the standard NFD algorithm. This choice should
have been kept internal to the filesystem, but it is exposed to
applications, which are not prepared to handle a single character encoded
as several Unicode characters.

The problem is that, when listing files in a directory on an HFS+ volume,
MacOS and MacOSX report this decomposed form to the application. The main
effect is that the OS handles filenames differently if the file is
created on an HFS+ filesystem (this bug does not affect HFS filesystems,
which use a single-byte encoding such as MacRoman, or other legacy
national encodings where characters are stored internally in their
composed form).

In some cases, it simplifies the code in applications that want to perform
caseless compares of strings, as they may simply ignore the separate
non-spacing accents, and transform the base character only. But it also
produces bugs, because one cannot safely convert a base character and keep
its separate accent to produce a sequence encoding a valid character.

The problem with this HFS+ bug is that all applications are exposed to
different Unicode encodings for what is theoretically the same filename (if
you store a file with composed characters on an HFS+ volume, the HFS+ driver
in MacOS or MacOSX will force the decomposition of the given filename, and
so will use the same physical filename as if it had initially been given in
the decomposed form).

This is documented in:
http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
with an exact conversion table in:
http://developer.apple.com/technotes/tn/tn1150table.html
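
To make the difference between the two forms concrete, here is a minimal
sketch that prints the UTF-16 code units of a composed filename and of its
decomposed counterpart. It assumes a modern JVM: java.text.Normalizer
(Java 6 and later) did not exist in the JVMs discussed in this mail, and
standard NFD is used only as an approximation of the HFS+-specific
decomposition. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class ComposedVsDecomposed {
    // Renders each UTF-16 code unit as U+XXXX so invisible differences show up.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // As typed by a user: LATIN SMALL LETTER E WITH ACUTE (one character).
        String composed = "caf\u00E9.txt";
        // Assumption: standard NFD stands in for what an HFS+ listing returns.
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(codeUnits(composed));
        System.out.println(codeUnits(decomposed));
        // The two strings look alike on screen but are not equal.
        System.out.println(composed.equals(decomposed));
    }
}
```

In the decomposed string, the single U+00E9 is replaced by the pair U+0065
U+0301 (base letter plus COMBINING ACUTE ACCENT), so the string is one code
unit longer.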

There is absolutely no problem in the way filenames are stored on the
physical HFS+ filesystem (although this encoding does not follow the
industry practice of preferring the NFC normalization form, which is more
compact and more easily interoperable with legacy non-Unicode encodings).
The problem is in the way MacOS or MacOSX exposes filenames to applications
(including the Apple Java VM): if the application uses the new Unicode
interface (and not the legacy MacRoman interface, which cannot store
filenames with the full set of Unicode characters), then it will have to
handle that form.

Unfortunately, the GUI part of the OS (the Macintosh Finder) is exposed to
this issue, like other applications. The TrueType font renderer on
MacOS/MacOSX is limited: it does not handle the additional OpenType
properties stored in TrueType fonts to describe composition rules, and thus
cannot correctly display characters whose accents are now encoded as
separate non-spacing marks, because it cannot place and render those marks.
As a result, this internal OS bug must be caught by applications, which must
perform the reverse composition of the characters that HFS+ forcibly
decomposed.

All applications on MacOS or MacOSX are concerned, as soon as they use an
HFS+ volume! Additionally, TrueType fonts for MacOSX have some *spacing*
glyphs associated with non-spacing accents, but TrueType fonts for MacOS do
not contain these additional glyphs. So if a Unicode string containing a
filename as returned by the OS file API for HFS+ volumes is used to display
the filename directly in the application, these "supplementary" characters
will appear as a "?" after the base letter.

The other impact is that users cannot successfully share on Gnutella all
the files stored on HFS+ volumes. The keyboard driver will most likely let
users enter strings directly in the composed form: for example, if a user
types the "é" key of a French keyboard, the application will see a single
character in its input field (here: LATIN SMALL LETTER E WITH ACUTE).
Sending such search strings on Gnutella will allow them to download files
from Windows and Unix users, or from MacOS users sharing files on an HFS
(not HFS+) volume.

But if a Mac user shares the file "café.txt", stored on his HFS+ volume as
"cafe?.txt" where "?" is the non-spacing accent, all other users on the Gnet
(including MacOS and MacOSX users!) will be incapable of finding it: this
HFS+-specific representation breaks the QRP hashing mechanism, because the
string is modified in an invisible way. If a Windows or Unix user does find
it, he will be able to download it. But neither MacOS nor MacOSX users will
be able to download this file from their Mac!
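
The mismatch can be reproduced with a few lines of Java. This is only a
sketch: the naive substring match below is a hypothetical stand-in for
whatever keyword matching or hashing a servent performs (it is not the QRP
hash), and java.text.Normalizer with standard NFC is an assumption that
requires a modern JVM and only approximates the HFS+-specific table.

```java
import java.text.Normalizer;

final class SearchMismatch {
    // Naive keyword match, standing in for a servent's real matching logic.
    static boolean matches(String fileName, String query) {
        return fileName.contains(query);
    }

    // Assumption: standard NFC approximates the required recomposition.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String query = "caf\u00E9";        // typed on a keyboard: composed
        String shared = "cafe\u0301.txt";  // as listed from an HFS+ volume

        // Compared character by character, the query is not in the name.
        System.out.println(matches(shared, query));
        // After recomposing the shared name, the same query matches.
        System.out.println(matches(nfc(shared), query));
    }
}
```

Any mechanism that compares or hashes the raw characters sees two different
strings, which is why the recomposition has to happen before the filename
is published to the network.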

Many servents will not display the filename correctly (it may be shown as
"cafe?.txt" in the results list, where "?" is a question mark character or
a small rectangular box, even though the composed "é" character is present
in Macintosh fonts!). This is extremely tricky for all users, because
decomposed strings are rarely handled correctly by their OSes.

Apple should fix its OSes (or its Java VM port in MRJ), but for now we have
no other choice than fixing what we get from java.io.File.getName(): we must
recompose all the characters that were decomposed by the HFS+ driver. In
recent versions of MacOSX, Apple has only worked on fixing its TrueType font
renderer, so that it supports the OpenType extensions present in fonts.
These extensions allow displaying strings that use decomposed characters,
and correctly handling scripts that use alternate contextual glyphs for the
same character. Examples include Arabic, with initial/medial/final/isolated
forms of characters that can also carry diacritics used for vowels and voice
marks, and nearly all Indic languages, where vowel signs and "virama" often
form special ligatures with their base consonant or with the next base
character. They also include English, with sequences like "ffi" which use
special ligatures as typographic enhancements, and German, Old English, and
Old French, where some "s" use the long form similar to an "f" without the
crossbar. This long form also produces ligatures when followed by a "t"
(there exist two sorts of ligatures for "st" and "St") or by another "s"
("ss" in that case uses a ligature between the long form of the first "s"
and the small form of the second one to produce the "Eszett" ligature
commonly used in modern German as a mark of traditional typography).

Complete support of OpenType is now part of Windows XP, and is also present
in a limited way in previous versions of Windows, or as an add-on for
internationalization of Internet Explorer. But without it, or on Unix and
Linux (where support for OpenType and even TrueType is missing in X11
implementations!), it is absolutely required to use precomposed characters
when displaying Unicode strings, i.e. the NFC form as documented in the
Unicode.org reference (the NFD form is only recommended for internal
management of strings, but should not be used for any final output, such as
display, printing, storage, or communication).

Unfortunately, the Apple technote does not seem complete, as it omits some
diacritics used in Japanese that HFS+ also decomposes. We also cannot use
any standard NFD/NFC table, because HFS+ does not behave the way defined in
these standard normalization forms. That's why NTFS and FAT32 on Windows do
not attempt to modify the normalization form of any filename: normalization
is left to the application, which most likely will use strings coming from
supported input methods that always return precomposed characters when
possible.

The Apple technote currently indicates that this recomposition will be
necessary for 913 composed characters, plus the 11,172 composed Hangul
syllables. The latter are computed algorithmically from their component
"Jamos": a Jamo is a base vowel or consonant, or one of the vowel signs and
conjoining vowels or consonants of the Hangul script used to write Korean.
Jamos are written in the same visual box to form syllables, and a Unicode
"syllable" precomposes only 2 or 3 Jamos in a single character. Unicode
cannot represent all syllables with a single code, so these Hangul syllables
are just special presentation forms for common syllables, so that fonts can
be made to represent most usages of the Hangul script in modern Korean.
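
The algorithmic recomposition of Hangul can be sketched as follows, using
the arithmetic defined in the Unicode Standard for conjoining Jamos: a
leading consonant (L), a vowel (V) and an optional trailing consonant (T)
map to one precomposed syllable. The class name is hypothetical, and this
sketch handles only modern Jamos in the U+1100 block, not the HFS+ table
for other scripts.

```java
final class HangulComposer {
    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant
    static final int V_BASE = 0x1161; // first vowel
    static final int T_BASE = 0x11A7; // one before the first trailing consonant
    static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

    /** Recomposes Jamo sequences (L V or L V T) into precomposed syllables. */
    static String compose(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int l = c - L_BASE;
            if (l >= 0 && l < L_COUNT && i + 1 < s.length()) {
                int v = s.charAt(i + 1) - V_BASE;
                if (v >= 0 && v < V_COUNT) {
                    int syllable = S_BASE + (l * V_COUNT + v) * T_COUNT;
                    i += 2;
                    if (i < s.length()) {
                        int t = s.charAt(i) - T_BASE;
                        // t == 0 would mean "no trailing consonant", so skip it.
                        if (t > 0 && t < T_COUNT) {
                            syllable += t;
                            i++;
                        }
                    }
                    out.append((char) syllable);
                    continue;
                }
            }
            // Not the start of a composable sequence: copy unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

For example, HangulComposer.compose("\u1112\u1161\u11AB") yields the single
character U+D55C, while text without Jamos passes through unchanged.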

Correcting this issue will require changing all calls to File.getName() so
that the result is processed and all decomposed characters returned by this
call are recomposed. This is needed only on MacOS and MacOSX; for
compatibility, it should not be done on Windows and Unix/Linux.

There are two ways to do this: create a derived class of File, and check
in that class the Unicode encoding of the names returned by java.io.File;
or write a separate utility class that recomposes characters everywhere in
the code where File.getName() is used. In both cases, this should be done
only after checking that we are running on MacOS or MacOSX (the only
platforms where this recomposition is safe, because there a file can always
be reopened using recomposed characters, even though it was stored with
decomposed characters).
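
The second option (a separate utility class guarded by a platform check)
could look roughly like the sketch below. Everything here is an assumption
for illustration: the class and method names are hypothetical and not
existing LimeWire code, the check relies on the "os.name" system property
starting with "Mac", and standard NFC from java.text.Normalizer (modern
JVMs only) stands in for the HFS+-specific recomposition table.

```java
import java.io.File;
import java.text.Normalizer;

final class MacFileNames {
    /** True when the given os.name value reports a Mac OS variant. */
    static boolean isMac(String osName) {
        return osName != null && osName.startsWith("Mac");
    }

    /** Recomposes only on Mac; other platforms get the name untouched. */
    static String fix(String name, String osName) {
        if (!isMac(osName)) {
            return name;
        }
        // Stand-in: standard NFC rather than the HFS+-specific table.
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    /** Intended to replace file.getName() at call sites. */
    static String getName(File file) {
        return fix(file.getName(), System.getProperty("os.name"));
    }
}
```

Passing the OS name as a parameter to fix() keeps the platform decision in
one place and lets the behavior be checked on any machine, for example
MacFileNames.fix("cafe\u0301.txt", "Windows XP") returns its input
unchanged.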

Another user has already tried on his Mac a small patch that corrects just
a few characters commonly found in ISO-8859-1 (for example E+ACUTE or
A+DIERESIS) and in Japanese (Katakana characters with voice marks). This has
already been implemented in other servents for Mac written in Java (XNap and
others...).
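
A table-driven patch of that kind might be sketched like this. It is not
the actual patch mentioned above, nor the XNap code: the class name and the
handful of entries are hypothetical examples, and the sketch only handles a
base character followed by exactly one combining mark.

```java
import java.util.HashMap;
import java.util.Map;

final class SmallRecomposer {
    // Key: base character + combining mark; value: precomposed character.
    private static final Map<String, Character> PAIRS = new HashMap<>();
    static {
        PAIRS.put("e\u0301", '\u00E9');      // e + COMBINING ACUTE ACCENT
        PAIRS.put("E\u0301", '\u00C9');      // E + COMBINING ACUTE ACCENT
        PAIRS.put("a\u0308", '\u00E4');      // a + COMBINING DIAERESIS
        PAIRS.put("A\u0308", '\u00C4');      // A + COMBINING DIAERESIS
        PAIRS.put("\u30AB\u3099", '\u30AC'); // KATAKANA KA + voiced mark = GA
        PAIRS.put("\u30CF\u309A", '\u30D1'); // KATAKANA HA + semi-voiced = PA
    }

    /** Replaces every known base+mark pair by its precomposed character. */
    static String recompose(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (i + 1 < s.length()) {
                Character composed = PAIRS.get(s.substring(i, i + 2));
                if (composed != null) {
                    out.append(composed.charValue());
                    i += 2;
                    continue;
                }
            }
            // Unknown pair or last character: copy unchanged.
            out.append(s.charAt(i));
            i++;
        }
        return out.toString();
    }
}
```

A complete version would need one entry per combination listed in the Apple
technote, and would have to deal with base characters that carry several
combining marks.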

A more complete set is needed to support Hebrew and Arabic (cantillation
marks), and many European languages, and the recomposition is absolutely
required to support Korean. I am working with that Japanese user to fix this
issue, using the table reported in the Apple technote (which does not
clearly explain what impact it can have for interoperability of applications
or volumes).

I have started a class that recomposes characters that were unwisely
decomposed by HFS+. This class will recompose the 913 combinations indicated
in the Apple technote, and the Korean Hangul syllables. The effect can be
immediate when simply displaying on Mac OS the files present in the shared
library in LimeWire: without this patch, LimeWire will not display the
directory content correctly (this is not really a bug in LimeWire, or in
Java, but really a bug in MacOS and MacOSX related to the integration of the
Mac-specific HFS+ filesystem, that neither Java nor LimeWire currently
corrects)!



