UTF-16 and UTF-32 are unhandled

Wide characters should probably be supported, or, at the very least, warned about.

Test case:

mkdir -p ikiwiki-utf-test/raw ikiwiki-utf-test/rendered
for page in txt mdwn; do
  echo hello > ikiwiki-utf-test/raw/$page.$page
  for text in 8 16 16BE 16LE 32 32BE 32LE; do
    iconv -t UTF$text ikiwiki-utf-test/raw/$page.$page > ikiwiki-utf-test/raw/$page-utf$text.$page;
  done
done
ikiwiki --verbose --plugin txt --plugin mdwn ikiwiki-utf-test/raw/ ikiwiki-utf-test/rendered/
www-browser ikiwiki-utf-test/rendered/ || x-www-browser ikiwiki-utf-test/rendered/
# rm -r ikiwiki-utf-test/ # some browsers rather stupidly daemonize themselves, so this operation can't easily be safely automated

BOMless LE and BE input is probably a lost cause.

Optimally, UTF-16 (which is ubiquitous in the Windows world) and UTF-32 should be fully supported, probably by converting to mostly-UTF-8 and using &#xXXXX; or &#DDDDD; XML escapes where necessary.

Suboptimally, UTF-16 and UTF-32 should be converted to UTF-8 where cleanly possible and a warning printed where impossible.

Reading the wikipedia pages about UTF-8 and UTF-16, all valid Unicode characters are representable in UTF-8, UTF-16 and UTF-32, and the only errors possible with UTF-16/32 -> UTF-8 translation are when there are encoding errors in the original document.

Of course, it's entirely possible that not all browsers support utf-8 correctly, and we might need to support the option of encoding into CESU-8 instead, which has the side-effect of allowing the transcription of UTF-16 or UTF-32 encoding errors into the output byte-stream, rather than pedantically removing those bytes.

An interesting question would be how to determine the character set of an arbitrary new file added to the repository, unless the repository itself handles character-encoding, in which case, we can just ask the repository to hand us a UTF-8 encoded version of the file.

-- Martin Rudat