Bug: Apache 2.0 Breaks Non-UTF-8 Encoded URLs on Windows
Summary
When using Firefox or similar browsers that don't send UTF-8 URLs by default, international characters in WikiWords (as per InternationalisationEnhancements) don't work in Apache 2.0.52 on Windows.
This is partially fixed in Apache 2.0.54 for Windows, but problems still occur (see latest comments). It does not occur on other platforms.
Details
The problem is that whenever Apache sees non-UTF-8 URLs (e.g. ISO-8859-1 URL-encoded with % escapes), it tries to convert them from UTF-8 to UCS-2 (two-byte Unicode format) before passing them to TWiki. The conversion attempt includes the PATH_INFO (e.g. Codev/ThisTopic) that TWiki uses as the name of the topic, and fails because the URL is not valid UTF-8.
The result is that the TWiki code never sees the encoded URL and the page is inaccessible. The server gives a '500 internal server error' message and the Apache error.log has this line for an ISO-8859-1 test case:
(22)Invalid argument: utf8 to ucs2 conversion failed on this string: PATH_INFO=/Main/FromageD\xe9rap\xe9
This conversion is driven by new support for Windows Unicode filesystem APIs in Apache 2.0, and support for UTF-8 URLs (IRIs) on Windows, even though PATH_INFO at this point has nothing to do with the filesystem. (This is probably an Apache bug, since there's no obvious way to turn this off. I have not yet checked whether it has already been reported, but there are some similar possible bugs, e.g. ApacheBug:9223, ApacheBug:13029, ApacheBug:18805 and ApacheBug:20855, that result from Apache 2 assuming strings are UTF-8.)
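The failure can be illustrated outside Apache. The following is a minimal Python sketch, not Apache's actual code: it percent-decodes a path and checks whether the resulting bytes are well-formed UTF-8, which is the precondition the UTF-8 to UCS-2 conversion relies on. The function name path_is_valid_utf8 is made up for this illustration.

```python
from urllib.parse import unquote_to_bytes


def path_is_valid_utf8(escaped_path: str) -> bool:
    """Return True if the percent-decoded path is well-formed UTF-8.

    Hypothetical helper for illustration only; it mimics the condition
    under which a UTF-8 to UCS-2 conversion can succeed.
    """
    raw = unquote_to_bytes(escaped_path)
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True


# ISO-8859-1 escapes, as in the error.log line above
latin1_path = "/Main/FromageD%E9rap%E9"
# The same topic name with UTF-8 escapes
utf8_path = "/Main/FromageD%C3%A9rap%C3%A9"

print(path_is_valid_utf8(latin1_path))  # False
print(path_is_valid_utf8(utf8_path))    # True
```

The byte 0xE9 starts a three-byte UTF-8 sequence, but it is followed by the plain ASCII letter "r", so the ISO-8859-1 form cannot be decoded as UTF-8.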
--
RichardDonkin - 09 Dec 2004
Test case
1. Use any I18N WikiWord, e.g. FromageDérapé, from Firefox 1.0 (or IE/Opera configured to not use UTF-8 URLs).
- You may not even need TWiki installed, so if you have admin access to an Apache 2.0 server, do try this out
2. See the error message and check Apache error.log file
Example Error
When using PHP's URL encoding function to access files with international characters, a 404 'file not found' is returned. When the URL is first UTF-8 encoded and then URL encoded, it works fine. Note the encoding differences in the URL: R%EAve and R%C3%AAve.
Examples from log file:
xxx.xxx.xxx.xxx - - [01/Jan/2005:18:23:03 +0100] "GET /Idir/Deux%20Rives%2C%20un%20R%EAve/01%20-%20Pourquoi%20cette%20pluie%20%20.mp3 HTTP/1.0" 404 260 "-" "WinampMPEG/2.9" "-"
xxx.xxx.xxx.xxx - - [01/Jan/2005:21:12:16 +0100] "GET /Idir/Deux%20Rives%2C%20un%20R%C3%AAve/01%20-%20Pourquoi%20cette%20pluie%20%20.mp3 HTTP/1.0" 200 8164000 "-" "WinampMPEG/5.0" "-"
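The two escaped forms in the log can be reproduced with a short Python sketch (Python is used here only for illustration; the original report used PHP). It assumes the only difference between the two requests is the character encoding applied before percent-escaping. The helper name escape_path is hypothetical.

```python
from urllib.parse import quote


def escape_path(path: str, encoding: str) -> str:
    """Percent-escape a path after encoding it with the given charset."""
    return quote(path, safe="/", encoding=encoding)


name = "Rêve"
print(escape_path(name, "iso-8859-1"))  # R%EAve    (got the 404 above)
print(escape_path(name, "utf-8"))       # R%C3%AAve (got the 200 above)
```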
--
FrancisLee - 06 Jan 2005
Environment
--
RichardDonkin - 09 Dec 2004
Fix
Install Apache 2.0.54 or later for Windows, which includes the fix for ApacheBug:32730 (see the later comments below for a remaining PATH_INFO problem).
Follow up and Workarounds
Various possible workarounds are:
- Stay with Apache 1.3 on Windows (recommended in WindowsInstallCookbook and TWikiSystemRequirements)
- Set browsers to use only UTF-8 URLs (Firefox, Mozilla, etc)
- For Firefox, go to about:config in the URL bar, filter by utf, then double click network.standard-url.encode-utf8 to set it to true (or just edit user.js). Mozilla should be similar. NOTE: Firefox 1.0 has a bug that prevents this state being persistent, so you might need to reset it on every browser session - MozillaBug:261934 (fixed in Firefox 1.0.1)
UTF-8 URLs do work fine at least for page views and don't produce the same internal server error or message in the error log.
Thanks to HenningRuch for encouragement to download XAMPP and find this problem. See WebSearchProblemWithCygwinAndXAMPP for a XAMPP issue not related to I18N.
--
RichardDonkin - 10 Dec 2004, 30 April 2005
Here's the reply from Martin Duerst of the W3C, author of mod_fileiri and expert on IRIs (internationalised resource identifiers, which can map into UTF-8 URLs/URIs) - posted with his permission:
... Re using mod_fileiri to work around this bug ...
I haven't tried this out. I originally wrote mod_fileiri for Linux
servers, where it's unclear what encoding is used for directory and
file names. It's possible to configure mod_fileiri to have filenames
in UTF-8, and accept incoming requests in a legacy encoding and
redirect to the corresponding UTF-8 file. I made this possible so that
existing Web servers could be converted from using a legacy encoding
for their URIs to using UTF-8, while old URIs would still work.
The conditions for this to work are that only one legacy encoding
can be used, that there are no collisions between legacy-encoded
filenames and UTF-8 encoded filenames (unless there is a large
number of files with some weird names, that's usually not a problem
at all), and that the site still works with (usually permanent)
redirects.
Looking at the bug description, '500 internal server error' looks
scary. A legal (according to the URI spec) URI should not produce
an internal server error. Whether the URI is UTF-8 or not is
besides the point. If a conversion fails when the server is looking
for a file, this should just result in 'document not found'.
My guess is that because mod_fileiri is implemented to run
in the 'fixup' phase (if the file is found otherwise, no need
to use the module), an earlier '500 internal server error' will
not allow it to come into action.
In my opinion, Apache on Windows should be fixed to return
'file not found' rather than an internal server error for
non-UTF-8 files/directories. Everything else I think is a
serious bug. If you file a bug on this, please tell me,
and I'll support it. My guess is that if this is fixed,
mod_fileiri should work.
From a user viewpoint, using UTF-8 for the whole wiki should
solve the problem, and is a good choice for many other reasons.
And of course, Firefox and friends should be fixed to deal
with IRIs, too.
Some more about mod_fileiri from a followup message from Martin:
mod_fileiri can do this too, indeed it's what it was originally designed
for. It can also have the files in a legacy character encoding, and get
requests in that character encoding, but reply with a permanent redirect
to the UTF-8 version of that filename, and then reply with the actual
document to a UTF-8 version. I.e. you can pretend you already switched
to UTF-8 even if you haven't done so. Implementation of this working
mode was quite tricky, to avoid loops and other confusions.
Clearly, if you can install mod_fileiri on your Apache server, it's a good solution to the URL character encoding issue for all web applications, though probably not necessary for TWiki.
--
RichardDonkin - 13 Dec 2004
I've now logged this as ApacheBug:32730 - you can monitor or support this by signing up to Apache Bugzilla, but please read their bug writing guidelines first and be polite.
Note that MozillaBug:261934 makes it painful to work around this by setting Firefox 1.0 to UTF-8 encode all URLs - although the setting in about:config is persistent, it is ignored on startup, so you have to set it to false again, then true.
--
RichardDonkin - 16 Dec 2004
Fix record
I've looked at the Apache 2.0 code and commented on ApacheBug:32730 - the real fix was to stop Apache converting certain environment variables to Unicode.
The fix was to add PATH_INFO to the conditional added for ApacheBug:9223 at line 529 in mod_win32.c (SVN). Details on ApacheBug:32730.
--
RichardDonkin - 20 Dec 2004
FrancisLee had this problem with a non-TWiki application using Apache 2.0, see above #Example_Error. If anyone else has this problem, whether using TWiki or not, please comment here and on ApacheBug:32730 (and vote on the latter!)
--
RichardDonkin - 06 Jan 2005
Good news - Will Rowe of the Apache team has accepted the patch and has said he'll commit it to Apache 2.0.53-dev and 2.1-dev. See ApacheBug:32730 for his comments.
--
RichardDonkin - 07 Jan 2005
My first patch didn't quite work, see the bug report page for a revised patch that should fix this.
--
RichardDonkin - 10 Jan 2005
My new patch has now been applied to the Apache 2.1 code.
Please vote on ApacheBug:32730 to get this patch applied to 2.0!
--
RichardDonkin - 10 Feb 2005
It appears from the Apache SVN repository that ApacheBug:32730 was fixed in the 2.0.x branch (SVN r153677) - so Apache 2.0.54 should include this fix. For anyone who needs a fix before then, apply the latest patch from the Apache bug report page.
To check out the Apache HTTPD 2.0.x branch, use:
svn co http://svn.apache.org/repos/asf/httpd/httpd/branches/2.0.x httpd-2.0.x
--
RichardDonkin - 31 Mar 2005
Apache 2.0.54 is now released, including the fix for ApacheBug:32730 - this Apache release is recommended for anyone using TWiki I18N on Windows. From the Apache changelog for 2.0.54:
- mod_win32: Ignore both PATH_INFO as well as PATH_TRANSLATED to avoid hiccups from additional path information passed in non-utf-8 format.
[Richard Donkin <rd9 at donkin.org>]
--
RichardDonkin - 28 Apr 2005
I still have the problem with Win32 Apache 2.0.54 and ActivePerl 5.8.4.
cgi-bin/printenv.pl shows that the environment variable PATH_INFO is already garbled when the Perl code starts.
Maybe prep_string() in mod_win32.c doesn't work with my default code page.
Is there a way to get around PATH_INFO, e.g. extracting path info from REQUEST_URI?
--
KaoruMaeda - 01 Jun 2005
I added this in setlib.cfg and now it seems to be working.
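# -------------- Only needed to work around an Apache 2.0 bug on Win32
# Rebuild PATH_INFO from the raw REQUEST_URI when the URL contains % escapes.
if (defined($ENV{'PATH_INFO'}) &&
    defined($ENV{'REQUEST_URI'}) &&
    defined($ENV{'SCRIPT_NAME'}) &&
    $ENV{'REQUEST_URI'} =~ /\%/) {
    my $req = $ENV{'REQUEST_URI'};
    my $scr = $ENV{'SCRIPT_NAME'};
    my $path = $req;
    # Strip the script name prefix, leaving the extra path and query string
    if ($path =~ s/^\Q$scr\E//) {
        $path =~ s/\?.*//;    # drop the query string
        # Decode %XX escapes (hex digits only) back to raw bytes
        $path =~ s/\%([0-9a-fA-F]{2})/chr(hex($1))/ge;
        $ENV{'PATH_INFO'} = $path;
    }
}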
I also changed the Encode::encode call in TWiki.pm. &FB_PERLQQ causes an error and it should be written as FB_PERLQQ().
--
KaoruMaeda - 14 Jun 2005
Not sure why this is not working for you - unfortunately I was not able to test my patch because Apache for Windows only builds with Visual Studio tools that I don't have. The Apache patch appears to be partially working, since the TWiki scripts actually run - previously they were prevented from running with an Internal Server Error (500).
I haven't got any time at present to look at this, having just moved house and started a new job, but if you could attach details of your testenv HTML output, relevant Apache log file entries, your Apache version and setup including default code page, and so on, that may help someone else to figure this out.
From looking at testenv output, specifically for its PATH_INFO test, the following environment variables should perhaps also not be converted to UCS-2 (though TWiki may not require all of these):
REDIRECT_URL (as mentioned above)
SCRIPT_URI
SCRIPT_URL
--
RichardDonkin - 17 Jun 2005
Re Kaoru's issue, this needs some more investigation, but at least Apache 2.0.54 or higher lets the TWiki code run, so I think this problem can be considered mostly resolved, with patch available. Updating InternationalisationIssues.
I've commented on a related bug, ApacheBug:34985, and have exchanged email with an Apache developer who works on the Windows version, but I don't think this bug affects TWiki users.
--
RichardDonkin - 12 Nov 2006