[ocaml-i18n] Native UTF-8 strings
Richard Jones
rich at annexia.org
Wed Nov 26 01:52:46 PST 2003
On Wed, Nov 26, 2003 at 11:18:18AM +0900, Yamagata Yoriyuki wrote:
> We need a way to interact with the environment, like converting input
> string to Unicode (as pointed out by Rich), getting locale etc. It
> seems one of hardest part of all,
I think the real problem is that you have no way to know what encoding
your streams are using!
In very practical terms, is that file on disk ISO-8859-1, UTF-8 or
SJIS?
Perl solves this by allowing you to specify the encoding when opening
a file, e.g.:
  open FH, "<:utf8", "file";
where the "<:utf8" argument defines a translation "layer" between the
file and the program.
I don't think this is a very elegant solution.
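For comparison, here is a rough sketch of what declaring the encoding at read time might look like in OCaml. The names (encoding, latin1_to_utf8, input_line_as_utf8) are hypothetical and not part of any existing library; only ISO-8859-1 and UTF-8 are handled, and UTF-8 input is passed through without validation.

```ocaml
(* Hypothetical sketch: none of these names exist in the standard library. *)
type encoding = Latin1 | Utf8

(* Transcode ISO-8859-1 bytes to UTF-8.
   Bytes below 0x80 are copied; each byte >= 0x80 becomes two bytes. *)
let latin1_to_utf8 s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      let n = Char.code c in
      if n < 0x80 then Buffer.add_char b c
      else begin
        Buffer.add_char b (Char.chr (0xC0 lor (n lsr 6)));
        Buffer.add_char b (Char.chr (0x80 lor (n land 0x3F)))
      end)
    s;
  Buffer.contents b

(* Read one line from a channel, converting it to UTF-8 according to
   the encoding the caller declares (assumption: the caller knows it). *)
let input_line_as_utf8 enc ic =
  let line = input_line ic in
  match enc with
  | Utf8 -> line
  | Latin1 -> latin1_to_utf8 line
```

As with the Perl layer, the caller still has to know the encoding in advance; the sketch only moves the conversion to the point where the data enters the program.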
Anyway, leaving such issues aside, I think these steps would help:
* Have ML files be UTF-8 encoded by default. The current choice,
ISO-8859-1, is arbitrary, and essentially derived from the fact that
OCaml was developed in Europe. So if you're going to choose something
arbitrary, either strict US-ASCII or UTF-8 seems like a much more
sensible choice.
* Add \U escape sequences for string literals.
* Store strings internally either as wide chars or as UTF-8, and
change the String module accordingly to work on characters.
* For byte strings, have a separate bytestring type. Using type
'string' when you really want bytestrings always seemed to me to be a
hack.
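
The last two points could be sketched roughly as follows. This is only an illustration under stated assumptions: the module names Bytestring and Utf8_text are hypothetical, the text is stored internally as UTF-8, and of_utf8 trusts its input rather than validating it.

```ocaml
(* Hypothetical sketch: two abstract types so that raw bytes and text
   cannot be mixed up by accident. *)
module Bytestring : sig
  type t
  val of_string : string -> t
  val length : t -> int      (* length in bytes *)
end = struct
  type t = string
  let of_string s = s
  let length = String.length
end

module Utf8_text : sig
  type t
  val of_utf8 : string -> t  (* assumes valid UTF-8; no validation *)
  val length : t -> int      (* length in code points, not bytes *)
end = struct
  type t = string
  let of_utf8 s = s
  (* Count every byte that is not a UTF-8 continuation byte (10xxxxxx). *)
  let length s =
    let n = ref 0 in
    String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
    !n
end
```

With the five-byte UTF-8 encoding of "café", Bytestring.length reports 5 while Utf8_text.length reports 4, which is the distinction a character-oriented String module would have to make.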
Rich.
--
Richard Jones. http://www.annexia.org/ http://freshmeat.net/users/rwmj
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
C2LIB is a library of basic Perl/STL-like types for C. Vectors, hashes,
trees, string funcs, pool allocator: http://www.annexia.org/freeware/c2lib/