
Feature Proposal: Anchor names in non-ISO-8859 charsets to be similar to ISO-8859

Motivation

This is about %TOC% entries. From the same source string, a site employing a non-ISO-8859 charset such as UTF-8 generates a different anchor name than an ISO-8859 site does. This is the case even with an ASCII-only source string, which may cause problems: characters such as white space (0x20), ', ", <, and > may be present in an anchor name if the site charset is UTF-8.

This may not cause practical problems, but currently the anchor name for the section name "abc def" differs between an ISO-8859 site and a UTF-8 site, which doesn't sound right.

Description and Documentation

If the site charset is a superset of ASCII encoding-wise (e.g. UTF-8), character cleansing similar to that done for ISO-8859 needs to take place. By "character cleansing" I mean the following processing in core/TWiki/Render.pm:makeAnchorName().

    if( !defined($TWiki::cfg{Site}{CharSet}) ||
          $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?/i ) {
        $anchorName =~ s/[^$TWiki::regex{mixedAlphaNum}]+/_/g;
    }
As you can see, the character cleansing doesn't happen with non-ISO-8859 charsets.

"Superset of ASCII encoding-wise" means that ASCII characters are encoded as they are and that the encoding of a non-ASCII character contains no byte in the ASCII range.

  • UTF-8 meets the criteria of "superset of ASCII encoding-wise" because all non-ASCII characters consist only of bytes in the range 0x80 to 0xff.
  • EUC-JP and EUC-KR meet the criteria too.
  • Shift_JIS and Big5 don't, because some characters have a byte in the ASCII range as their second byte.
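The criteria above can be checked for a single character with a minimal sketch like the following. It assumes the core Encode module; has_ascii_byte() is a hypothetical helper written for illustration and is not part of TWiki.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of TWiki): returns 1 if encoding the
# non-ASCII character $char in $charset yields any byte in 0x00-0x7f.
sub has_ascii_byte {
    my ( $charset, $char ) = @_;
    my $bytes = encode( $charset, $char );    # byte string in the target charset
    return $bytes =~ /[\x00-\x7f]/ ? 1 : 0;
}

my $so = "\x{30BD}";    # KATAKANA LETTER SO, encoded as 0x83 0x5C in Shift_JIS

for my $charset (qw(UTF-8 euc-jp shiftjis)) {
    printf "%-8s %s\n", $charset,
      has_ascii_byte( $charset, $so ) ? 'contains an ASCII byte' : 'high bytes only';
}
```

For this character, only the Shift_JIS encoding contains an ASCII-range byte (0x5C, the backslash), which is why a byte-wise substitution on ASCII punctuation is unsafe there.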

Character cleansing for non-ISO-8859 charsets would be as follows. These lines are meant to immediately follow the lines shown above.

    elsif ( $TWiki::cfg{Site}{CharSet} =~ /^utf.*8$|euc/i ) {
        $anchorName =~ s/[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+/_/g;
    }
For an ASCII-only string, this yields the same result as the first branch. Non-ASCII characters are left untouched, which differs from the first branch, but this should be better: on a site with a non-ISO-8859 charset, it is not unlikely for an entire source string to consist only of non-ASCII characters.
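To compare the two branches outside TWiki, here is a minimal standalone sketch. cleanse_anchor() is a hypothetical function name, the charset is passed as an argument instead of being read from $TWiki::cfg{Site}{CharSet}, and the ASCII-only class a-zA-Z0-9 is an assumed stand-in for $TWiki::regex{mixedAlphaNum}.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the cleansing step; the real code
# lives in TWiki::Render::makeAnchorName().
sub cleanse_anchor {
    my ( $anchorName, $charset ) = @_;
    if ( !defined $charset || $charset =~ /^iso-?8859-?/i ) {
        # Assumption: a-zA-Z0-9 stands in for $TWiki::regex{mixedAlphaNum}
        $anchorName =~ s/[^a-zA-Z0-9]+/_/g;
    }
    elsif ( $charset =~ /^utf.*8$|euc/i ) {
        # Proposed branch: replace only non-alphanumeric ASCII bytes,
        # leaving bytes 0x80-0xff (non-ASCII characters) untouched
        $anchorName =~ s/[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+/_/g;
    }
    return $anchorName;    # other charsets (e.g. Shift_JIS) pass through unchanged
}

for my $charset ( 'iso-8859-1', 'utf-8' ) {
    print "$charset: ", cleanse_anchor( 'abc def', $charset ), "\n";
}
```

With this sketch, "abc def" becomes "abc_def" under both charsets, while a heading made up of UTF-8 encoded non-ASCII characters keeps its bytes in the second branch instead of being reduced to an underscore.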

Examples

Impact

Implementation

-- Contributors: HideyoImazu - 2012-08-28

Discussion

I think you are referring to the automatic anchor name of HTML headings used by TOC, done in TWiki::Render?

Enhancing that is fine as long as it does not break compatibility. We already enhanced the anchor in the past; at that time we introduced two anchors, a compatible (old) one and a new one. You could do this in the same way.

Reading the code, makeAnchorName() already does the filtering you propose. Maybe I am missing something?

-- PeterThoeny - 2012-08-29

To clarify why it should remain compatible: people sometimes use deep links by clicking on a TOC entry and sending that URL by e-mail or pasting it into other places.

-- PeterThoeny - 2012-08-29

I made things clearer in several sections above.

My proposal makes anchored URLs on the TWiki web the same in UTF-8 as in ISO-8859-?. It has no effect on sites employing ISO-8859-? as the site charset.

-- HideyoImazu - 2012-08-30

Accepted by JerusalemReleaseMeeting2012x08x31.

-- PeterThoeny - 2012-08-31

Topic revision: r7 - 2012-09-14 - HideyoImazu
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.