Unicode Support for MacOS X

Supporting Unicode on MacOS X is very difficult to do alongside UseUTF8 for other platforms, because MacOS X uses two Apple-specific variants of Unicode NFD (see UnicodeNormalisation). Hence it is out of scope for UseUTF8 at present.

Summary: MacOS X likes to 'break apart' some accented characters, Korean characters, etc, into their component parts - e.g. the single codepoint for 'ü' is turned into two codepoints, 'u' followed by a combining umlaut (diaeresis). This is called decomposed Unicode, formally known as Unicode Normalisation Form D (NFD). Most other platforms and standards bodies (Linux, Windows, the W3C, etc) don't do this decomposition, leaving such 'precomposed' (Normalisation Form C, NFC) characters as they are. This difference causes major interoperability problems for TWiki and many other applications when supporting Unicode.
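
To make the difference concrete, here is a small illustration in Python (chosen only for brevity; TWiki itself is Perl). It uses the standard Unicode NFD from Python's unicodedata module, which is an assumption for demonstration purposes: the Apple-specific NFD variants discussed below differ from standard NFD for some codepoints. The helper name codepoints is hypothetical.

```python
import unicodedata

# U+00FC LATIN SMALL LETTER U WITH DIAERESIS as one precomposed codepoint (NFC).
precomposed = "\u00fc"

# Standard Unicode NFD splits it into 'u' plus U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", precomposed)


def codepoints(text):
    """Return the codepoints of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(precomposed))    # ['U+00FC']
print(codepoints(decomposed))     # ['U+0075', 'U+0308']

# The two forms look identical on screen but do not compare equal,
# and their UTF-8 encodings have different lengths (2 bytes vs 3 bytes).
print(precomposed == decomposed)  # False
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3

# Normalising back to NFC recovers the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A Korean syllable decomposes into its component jamo in the same way.
print(len(unicodedata.normalize("NFD", "\ud55c")))  # 3
```

Any byte-wise or codepoint-wise comparison of a topic name, attachment name or URL will therefore fail when one side is NFC and the other is NFD, even though both render as the same text.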

  • Details:
    • MacOS X as a server encodes filenames on the default HFS+ filesystem in a unique flavour of Unicode, using one of two Apple-specific NFD normalisation types, depending on MacOS version (see UnicodeNormalisation), whereas the rest of the world (e.g. W3C, Linux, Windows) uses NFC normalisation. MacOS actually stores filenames in a 16-bit encoding of Unicode, but the encoding is not the issue: NFD is the problem. Many Mac applications and OS tools use Apple's NFD, but some use NFC (e.g. the Terminal window, bash shell, etc).
      • This means that simply getting TWiki to create Unicode filenames on most Mac servers will cause some issues - TWiki will try to create NFC UTF-8 filenames, which get converted to either UCS-2 or UTF-16 (16-bit Unicode encodings), but using NFD not NFC. The risk is that NFD is then presented back to the web or email client, and in most cases the I18N characters aren't displayed properly unless TWiki has converted on the fly from NFD to NFC. This may "just work" without any extra code, but needs testing.
      • MacOSXFilesystemEncodingWithI18N has the gory details from when this happened with ISO-8859-1 filenames.
    • MacOS X clients pose similar problems - they can use a different Apple-NFD variant to a MacOS server (depends on MacOS version). This means that files can potentially be lost through the interaction of a Mac server and a Mac client (e.g. syncing files to and from a server) - see Wikipedia UTF-8 page's section on normalisation.
      • Attachments from Mac clients using NFD are the biggest problem. Topic contents using NFD will also be a problem, as some non-Mac browsers don't support NFD.
      • It's not clear whether Safari / Webkit does normalisation to NFC - but this bug was fixed in 2006, so it seems to use NFC.
      • Compatibility with non-Mac servers is the biggest issue.
    • A mixed environment (either client or server is non-Mac, or the MacOS versions are sufficiently different between client and server) will be more of a challenge, as in some cases conversion to/from decomposed (NFD) UTF-8 will be necessary.
    • Batch migration:
      • data conversion: iconv on Mac, Darwin and MacPorts includes a utf-8-mac encoding which is decomposed (Apple-variant NFD) UTF-8. However, iconv on other platforms doesn't seem to support this.
      • pathname conversion: convmv has some support in this area for batch filename conversion to/from standard NFD - however, it's not clear whether it is usable with MacOS's two non-standard Apple NFD variants.
      • recent versions of Samba and rsync include iconv support, enabling pathnames to be converted to/from utf-8-mac in various situations
    • See Git email list thread for some information amongst the flames. This thread on MacFUSE (which enables new filesystems on Macs) has some discussion too. Simple test script for MacOS filename UnicodeNormalisation behaviour.
      • Even worse: there are two Apple variants of NFD and you can't just use the standard NFD when converting filenames, meaning a lot more code to support this.
      • Apple recommend in this technote on HFS+ that you don't use any standard library/platform implementation of UnicodeNormalisation, as it can't be guaranteed to (a) be identical to the Apple-NFD variant in use by your version of MacOS and (b) never change in the future. This increases the amount of work dramatically...
    • MacOS X's UFS and NFS filesystems [[http://developer.apple.com/mac/library/qa/qa2001/qa1173.html don't force use of NFD]] but Finder-created filenames on UFS/NFS may be in NFD - one option may be to mandate use of UFS for a Mac server's TWiki files, so only client-created NFD problems must be solved. You can host a UFS filesystem within a large file on an HFS+ filesystem using mount -o loop ... or hdiutil (see this).
    • RCS, grep and other tools used with TWiki may have problems unless TWiki normalises all pathnames to a single form.
    • The LimeWire development proposal for Unicode support has some good thinking about how to address this issue. Linus Torvalds' investigation of UTF-8 NFD support for git is useful as well as entertaining.
    • The Subversion thread is also useful: it mentions various cases and filesystems. SVN in the MacPorts repository has a fixed version as a build option.
    • My view: Apple have really shot themselves in the foot with this - it's caused problems with almost all cross platform applications using Unicode, including Samba, git, rsync, 7zip, LimeWire, Gnutella, XBMC, iTunes, etc, and has caused people to lose data in some situations.
  • Short term:
    • Document that MacOS clients and servers should not be used with UseUTF8
    • If possible, generate warnings if a MacOS client or server (or iPhone?) is detected.
    • Consider detecting the 'forces NFD' behaviour of the filesystem used for TWiki directories at TWiki installation time - if the filesystem is UFS or another NFD-free filesystem, we can work in the sane NFC mode, with luck.
  • Strategy (when eventually implemented):
    • Use UTF-8 in NFC as the core format for all TWiki data and pathnames - accept and generate this on all platforms, and all interfaces
    • When using existing filenames on a MacOS X server, convert from NFD pathname format to NFC
      • Investigate which codepoints are converted differently by the two Apple-NFD variants, compared to the selected Unicode version's NFD - might be possible to do a supplementary 'fixup normalisation' to generate correct NFC.
    • For Mac clients, convert their NFD inputs (form edits, URLs, etc) into NFC for use by TWiki. Don't convert NFC data into NFD for output to Mac clients, as they can handle NFC.
    • This limits the damage to NFD pathnames on the Mac-based TWiki server - everywhere else, NFC is used for all input/output, processing and storage, even on Mac servers.
    • We should never need to convert NFC to NFD ourselves, because MacOS always accepts NFC and converts it to NFD itself in many cases, particularly on HFS+.
    • To be investigated: possible use of NFD format when doing searches that ignore both case and accented characters? Probably useful, but can be done as needed for searching/sorting without affecting core format - can simply convert to NFD on the fly (for traditional TWiki searching) or could use it as the indexed form of words with TWiki search engines such as Plucene.
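
The Strategy above could be sketched roughly as follows. This is an illustration in Python rather than TWiki's Perl, and the helper names (to_nfc, find_on_disk, search_key) are hypothetical, not part of TWiki. It assumes the standard Unicode normalisation in Python's unicodedata module is good enough; as noted in the Details, that is not guaranteed to match either Apple-NFD variant, so a supplementary fixup step may still be needed.

```python
import unicodedata


def to_nfc(text):
    """Normalise incoming text (form edits, URLs, filenames) to NFC, the core format."""
    return unicodedata.normalize("NFC", text)


def find_on_disk(wanted_name, directory_listing):
    """Return the on-disk entry that matches wanted_name when both are compared in NFC.

    directory_listing stands in for the names returned by a directory read;
    on an HFS+ server these may be decomposed. Returns None if nothing matches.
    """
    wanted = to_nfc(wanted_name)
    for entry in directory_listing:
        if to_nfc(entry) == wanted:
            return entry
    return None


def search_key(text):
    """Build a key for searches that ignore both case and accents.

    Decompose to NFD on the fly, drop combining marks, then casefold.
    The stored data stays in NFC; only the search key uses NFD.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# A listing as an NFD-forcing filesystem might return it ('u' + combining diaeresis).
listing = ["readme.txt", "Mu\u0308nchen.txt"]

# TWiki asks for the NFC name and still finds the decomposed entry.
print(find_on_disk("M\u00fcnchen.txt", listing) == "Mu\u0308nchen.txt")  # True

# Accent- and case-insensitive matching via the NFD-based key.
print(search_key("M\u00fcnchen") == search_key("MUNCHEN"))  # True
```

The point of the design is that NFD only ever appears at two edges: pathnames read back from a Mac server, and the throwaway search key. Everything stored or sent to clients stays in NFC.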
Since the iPhone is MacOS X based, it is very likely to have the same problem, making it a problematic TWiki client for Unicode users. However, the approach of using only NFC, as described in the Strategy above, should work for it too.
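
The install-time check suggested under Short term (detecting whether the filesystem 'forces NFD') might look something like the probe below. This is only a sketch under stated assumptions: it is written in Python rather than TWiki's Perl, the function name filesystem_changes_nfc_names is hypothetical, and it only reports whether a name written in NFC is listed back unchanged. It does not identify which Apple-NFD variant is in use.

```python
import os
import tempfile
import unicodedata


def filesystem_changes_nfc_names(directory):
    """Return True if a filename created in NFC is listed back in a different form.

    A scratch subdirectory is created inside directory, so the probe tests
    the filesystem that actually backs it, and is removed afterwards.
    """
    probe_name = unicodedata.normalize("NFC", "nfd-probe-\u00fc.tmp")
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        # Create an empty file whose name contains a precomposed character.
        with open(os.path.join(scratch, probe_name), "w"):
            pass
        listed = os.listdir(scratch)
    # If the exact NFC name is not in the listing, the filesystem rewrote it.
    return probe_name not in listed


# Example: probe the system temporary directory. In a real installer this
# would be pointed at the TWiki data and pub directories instead.
print(filesystem_changes_nfc_names(tempfile.gettempdir()))
```

A result of False suggests TWiki could work in plain NFC mode on that filesystem; True suggests pathnames read from disk need converting back to NFC before use.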

-- Contributors: RichardDonkin - 2009-10-19

Discussion

Thanks for sharing this, Richard. I was not aware of this issue. Yes, support for UTF-8 on Mac is out of scope for the initial implementation.

-- PeterThoeny - 2009-10-20

Topic revision: r5 - 2009-10-23 - RichardDonkin
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.