As part of UseUTF8, TWiki will by default expect and use Unicode NFC (Normalisation Form C), as Linux and the W3C standards do.
Support for converting NFD (Normalisation Form D) to NFC is still to be determined. It would be needed for MacOSXFilesystemEncodingWithI18N and for any plugins that return data in NFD, e.g. from a database. See UseUTF8 and UnicodeMac for details.
Decomposition
Decomposed characters are examples of Unicode composite character sequences: sequences of codepoints that must be handled as a single character in most cases, often a letter followed by accents or other diacritical marks. One effect of normalisation is to ensure that the codepoints in such a sequence appear in a well-defined order. For example, an acute accent might always be placed before a cedilla (fictitious example), so that there is only one possible sequence of codepoints for the character as shown on the page.
Precomposed characters use a single codepoint for a character that would be represented as a sequence of composite character codepoints in decomposed form.
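The difference between the two forms can be seen in a short Python sketch. It uses Python's standard `unicodedata` module purely for illustration, not TWiki's own code, and the sample character é is an arbitrary choice.

```python
import unicodedata

# The same visible character, encoded in two different ways.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one codepoint)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two codepoints)

# The strings render identically but are not equal codepoint by codepoint.
same_codepoints = precomposed == decomposed

# Normalising converts between the two forms.
as_nfd = unicodedata.normalize("NFD", precomposed)
as_nfc = unicodedata.normalize("NFC", decomposed)

print(len(precomposed), len(decomposed), same_codepoints)
print(as_nfd == decomposed, as_nfc == precomposed)
```

A plain string comparison treats the two encodings as different strings, which is why one form has to be chosen as the default.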
Normalisation
Normalisation essentially takes an input sequence of Unicode codepoints and transforms it into a canonical sequence, using well-defined rules as to which characters must be decomposed and which precomposed, and how codepoints are ordered within a composite character sequence.
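As a rough illustration of the ordering rule, the Python sketch below (standard `unicodedata` module; the Vietnamese letter ậ is just a convenient sample) feeds the same letter with its two combining marks in both possible orders and checks that normalisation yields a single canonical sequence.

```python
import unicodedata

# "a" with a circumflex (U+0302) and a dot below (U+0323), marks in either order.
circumflex_first = "a\u0302\u0323"
dot_first = "a\u0323\u0302"

# NFD puts the combining marks into one canonical order.
nfd_a = unicodedata.normalize("NFD", circumflex_first)
nfd_b = unicodedata.normalize("NFD", dot_first)

# NFC additionally composes the sequence into a single precomposed codepoint.
nfc_a = unicodedata.normalize("NFC", circumflex_first)
nfc_b = unicodedata.normalize("NFC", dot_first)

print(circumflex_first == dot_first)   # raw input sequences differ
print(nfd_a == nfd_b, nfc_a == nfc_b)  # normalised sequences agree
print(len(nfc_a), hex(ord(nfc_a)))
```

Both inputs end up as U+1EAD (LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW) under NFC, so strings from different sources can be compared directly once normalised.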
Some browsers support NFD (IE5.5) and some don't (Konqueror 3.1.1 and Mozilla Firefox 0.8). Most importantly, use of NFD in web pages or XML is against W3C standards, so for some UTF-8 TWiki sites it will be important to convert NFD to NFC (mainly an issue for filenames with UnicodeMac, and to some extent for topic contents).
Even Mac-only sites will require conversion from NFD, since NFC is much more convenient for use within TWiki and across different systems. Using NFC makes it possible to process and compare Unicode strings for most European languages without considering more than one codepoint at a time (e.g. regexes will work on a single codepoint for ä), while still enabling users and third party data sources to encode the same character in different ways (e.g. in Vietnamese using two combining characters for accents in different orders).
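The regex point can be sketched as follows. This is an illustrative Python example rather than anything from TWiki itself, and the pattern and sample word are assumptions chosen for the demonstration.

```python
import re
import unicodedata

# A pattern that expects each umlaut to be exactly one codepoint.
umlaut = re.compile("[\u00e4\u00f6\u00fc]")   # ä, ö, ü in precomposed form

nfc_word = "M\u00e4rz"                              # "März" with precomposed ä
nfd_word = unicodedata.normalize("NFD", nfc_word)   # "a" + COMBINING DIAERESIS

hits_nfc = umlaut.findall(nfc_word)   # the character class matches ä directly
hits_nfd = umlaut.findall(nfd_word)   # no match: ä is now two codepoints

# "." matches one codepoint, so a four-character pattern only fits the NFC form.
dot_nfc = re.fullmatch("M.rz", nfc_word) is not None
dot_nfd = re.fullmatch("M.rz", nfd_word) is not None

print(hits_nfc, hits_nfd, dot_nfc, dot_nfd)
```

With NFD input the same patterns silently stop matching, so normalising to NFC before matching keeps existing single-codepoint regexes working.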
NFC also simplifies conversion to legacy non-Unicode character sets such as ISO-8859-*, even if some data sources (e.g. plugins) use the decomposed forms, i.e. with combining characters for accents etc. See MacOSXFilesystemEncodingWithI18N for more on this topic.
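A minimal Python sketch of why this matters for legacy charsets: ISO-8859-1 has no combining characters, so a decomposed string cannot be encoded until it has been converted to NFC. The sample string is an assumption for illustration.

```python
import unicodedata

decomposed = "re\u0301sume\u0301"   # "résumé" using combining acute accents

# Encoding the decomposed form fails: U+0301 does not exist in ISO-8859-1.
try:
    decomposed.encode("iso-8859-1")
    nfd_encodable = True
except UnicodeEncodeError:
    nfd_encodable = False

# After conversion to NFC each accented letter is one Latin-1 character.
latin1_bytes = unicodedata.normalize("NFC", decomposed).encode("iso-8859-1")

print(nfd_encodable, latin1_bytes)
```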
It seems that MacOS X transparently converts all NFD filenames back into ISO-8859-1 if that's the network charset. When TWiki is in UTF-8 mode, however, there would be no transparent conversion from NFD, since Apple expects applications to deal with NFD directly: if a browser expects NFC and the server sends NFD (particularly for attachments), the user won't see properly rendered accents.
UTF-8 sites that know all users and data sources will be NFC based (e.g. no MacOS clients or servers) will not have to worry about this.
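For sites that cannot make that assumption, one possible approach is to normalise names at the boundary before they are sent to a browser or compared with stored names. The Python sketch below only illustrates the idea; the helper names `to_nfc` and `needs_conversion` and the sample filename are hypothetical and not part of TWiki.

```python
import unicodedata

def to_nfc(name: str) -> str:
    """Hypothetical helper: return the NFC form of a filename or topic name."""
    return unicodedata.normalize("NFC", name)

def needs_conversion(name: str) -> bool:
    """Hypothetical helper: True if the name is not already in NFC."""
    return name != to_nfc(name)

# An attachment name as an NFD filesystem might report it (assumed sample).
nfd_name = "Cafe\u0301Menu.txt"
nfc_name = to_nfc(nfd_name)

print(needs_conversion(nfd_name), needs_conversion(nfc_name))
print(len(nfd_name), len(nfc_name))
```

Names that are already in NFC pass through unchanged, so applying the conversion on an all-NFC site should be harmless.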
Normalisation is also involved in UnicodeCollation.
-- RichardDonkin