[Ocaml-i18n] proposal: message catalogue system
Benjamin Geer
ben at socialtools.net
Tue Dec 2 10:53:03 PST 2003
I should introduce myself briefly:
I'm a programmer with a background mostly in Java, and I started
programming in Caml this year. I have an M.A. in linguistics; I speak
English and French, and I'm learning Italian and Arabic.
I've been thinking about writing a message catalogue system for Caml.
My motivation is that I need it for a web application, but I'm keen to
have it be suitable for Caml programs in general. I emailed some ideas
to Matthieu Sozeau (author of OCamlI18n); he suggested that we continue
the discussion on this list. He pointed me to an article about Perl's
Maketext:
http://www.icewalkers.com/Perl/5.8.0/lib/Locale/Maketext/TPJ13.html
I think it makes a lot of good points. I think its main insight is that
a translation in a message catalogue can be thought of as a function
that returns a string in a particular language, often including some
data that was passed to it.
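To make that idea concrete, here is a minimal OCaml sketch of a translation as a function. The type name `translation` and the two function names are hypothetical, chosen only for illustration; they are not part of Maketext or of any existing Caml library.

```ocaml
(* A translation is a function from the message's arguments to a string.
   All names here are hypothetical and used only for illustration. *)
type translation = string list -> string

(* English version of a "cannot open file" message. *)
let cannot_open_en : translation = function
  | [file] -> Printf.sprintf "Cannot open file %s." file
  | _ -> invalid_arg "cannot_open_en: expected exactly one argument"

(* French version of the same message; same arguments, different text. *)
let cannot_open_fr : translation = function
  | [file] -> Printf.sprintf "Impossible d'ouvrir le fichier %s." file
  | _ -> invalid_arg "cannot_open_fr: expected exactly one argument"
```

The program passes the same data to whichever function matches the user's language; everything language-specific stays inside the function.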
Since it's a function, the next question is: what language should it be
written in? Locale::Maketext provides two languages: a simple 'bracket
notation', and Perl itself.
In Java, java.text.MessageFormat provides a bracket notation, with no
ability to fall back to anything more powerful. This looks to me like a
serious limitation. For example, its approach to plurals (in
java.text.ChoiceFormat) only allows you to specify different forms for
absolute ranges of quantities; this doesn't seem to be able to handle
Slavic-style plurals (see the Polish example in the GNU gettext manual:
http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150).
It seems to me that bracket notations are appealing because they allow
you to express the translation function as an *exemplar*, which is
simple and intuitive:
Your search returned {0} files in {1} directories.
However, the bracket notations provided by java.text.MessageFormat and
Locale::Maketext are not powerful enough to handle the more complex
logic that is necessary in order to generate plurals in some natural
languages. The solution to this problem in Maketext is to fall back to
using a different programming language entirely, in this case Perl. But
then the translator needs to learn two syntaxes; one of these is too
simple for the task at hand, and the other one is too complex. Wouldn't
it be nice if the translator could learn just one syntax, which allowed
him to express the translation as an exemplar, and which was also
powerful enough to handle the logic for plurals?
It seems to me that what's needed here is a template language. I've
written one, called CamlTemplate (http://saucecode.org/camltemplate).
For simple exemplars, it's as easy to use as bracket notation, e.g.:
Cannot open file ${a}.
Getting back to the issue of plurals, suppose your message just has to
say "n files", where n is a number. The English template could be:
#if (a == 1)
1 file
#else
${a} files
#end
The Polish one could be:
#if (a == 1)
1 plik
#elseif (a % 100 >= 5 && a % 100 <= 21)
${a} plików
#elseif (a % 10 >= 2 && a % 10 <= 4)
${a} pliki
#else
${a} plików
#end
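For comparison, the same two plural rules can be sketched as plain OCaml functions. This is only an illustration of the logic a template has to express; the function names `files_en` and `files_pl` are hypothetical, and the Polish rule follows the plural formula given for Polish in the GNU gettext manual.

```ocaml
(* English: singular for exactly 1, plural otherwise. *)
let files_en n =
  if n = 1 then "1 file" else Printf.sprintf "%d files" n

(* Polish: "plik" for 1; "pliki" when the number ends in 2, 3 or 4,
   except for numbers ending in 12, 13 or 14; "plików" otherwise. *)
let files_pl n =
  if n = 1 then "1 plik"
  else if n mod 10 >= 2 && n mod 10 <= 4
          && (n mod 100 < 10 || n mod 100 >= 20)
  then Printf.sprintf "%d pliki" n
  else Printf.sprintf "%d plików" n
```

The Polish rule depends on the last one or two digits rather than on an absolute range, which is why a range-based mechanism cannot express it.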
The article on Maketext suggests writing Perl functions to generate
plurals, and calling these functions from the translations. This could
be done in CamlTemplate as well, e.g. for English:
#macro quant(num, word)
${num}
#if (num == 1)
${word}
#else
${word}s
#end
#end
The English "n files" template would then become:
#quant(a, "file")
Of course you could expand this to handle the common irregular forms.
Since a CamlTemplate template can call Caml functions, simple
string-matching functions could be provided to do things like this:
#macro quant(num, word)
${num}
#if (num == 1)
${word}
#elseif (endsWith(word, "y"))
${stripSuffix(word, "y")}ies
#else
${word}s
#end
#end
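The Caml side of this could look roughly like the sketch below. The helpers `ends_with` and `strip_suffix` are hypothetical counterparts of the `endsWith` and `stripSuffix` functions used in the macro; how they would be registered with the template engine is not shown, and `quant` is included only to check the logic in plain OCaml.

```ocaml
(* True if [word] ends with [suffix]. *)
let ends_with word suffix =
  let lw = String.length word and ls = String.length suffix in
  lw >= ls && String.sub word (lw - ls) ls = suffix

(* Remove [suffix] from the end of [word]; return [word] unchanged
   if it does not end with [suffix]. *)
let strip_suffix word suffix =
  if ends_with word suffix
  then String.sub word 0 (String.length word - String.length suffix)
  else word

(* Same logic as the quant macro: number followed by the noun,
   pluralised with a naive English rule. *)
let quant num word =
  let noun =
    if num = 1 then word
    else if ends_with word "y" then strip_suffix word "y" ^ "ies"
    else word ^ "s"
  in
  Printf.sprintf "%d %s" num noun
```

The "y" rule is deliberately naive (it would turn "day" into "daies"), so a real catalogue would still need a table of exceptions.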
The next question is: how do you access a translation from a program?
What do you use for a message key? Gettext uses the message itself in
some natural language (the one the programmer used); it reads messages
directly from program source code. I think this has several drawbacks:
1. If the same message is used several times in the program, when it
changes, it must be changed in several places.
2. Representing the message as an exemplar might be complex in itself,
as in the examples above, thus complicating the program.
3. The programmer might not be the person who writes the messages;
having them in the source code is therefore an inconvenience,
particularly if they need to be written before the programmer starts coding.
The alternative is to use some arbitrary string as a key; this is the
approach taken by Java's java.util.ResourceBundle, which is typically used
together with java.text.MessageFormat. I think it's a more
maintainable approach, because messages can be changed without changing
program source code.
Another question is: how do we store message catalogues? The problem of
character encoding comes up right away. I suggest that we store them in
XML files, because XML has built-in support for dealing with encodings.
So I propose something like this:
<?xml version="1.0" encoding="UTF-8"?>
<catalog lang="en">
<macros>
#macro quant(num, word)
${num}
#if (num == 1)
${word}
#elseif (endsWith(word, "y"))
${stripSuffix(word, "y")}ies
#else
${word}s
#end
#end
</macros>
<messages>
<message key="disk_full">Disk ${a} is full.</message>
<message key="files_in_dirs">There are #quant(a, "file") in
#quant(a1, "directory").</message>
</messages>
</catalog>
This could just be the default way of storing them; there could also be
an interface allowing catalogues to be loaded from any other source.
So to get a translation in a Caml program, I'm proposing a function like
this:
val msg : key:string -> args:string list -> string
You could use it like this:
msg ~key:"files_in_dirs" ~args:[ file_count; directory_count ]
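As a rough sketch of what could sit behind such a function, the code below keeps a hash table from keys to rendering functions. This is an assumption made for illustration: in the actual proposal each entry would be a CamlTemplate template loaded from a catalogue file, and the table name `catalogue` and the fallback of returning the key for an unknown message are my own hypothetical choices.

```ocaml
(* Hypothetical in-memory catalogue: message key -> rendering function.
   Plain OCaml functions stand in for compiled templates here. *)
let catalogue : (string, string list -> string) Hashtbl.t =
  Hashtbl.create 16

let () =
  Hashtbl.replace catalogue "disk_full" (function
    | [disk] -> Printf.sprintf "Disk %s is full." disk
    | _ -> invalid_arg "disk_full: expected exactly one argument")

(* Look up the message by key and render it with the given arguments.
   An unknown key is returned as-is, so a missing translation stays
   visible instead of raising an exception. *)
let msg ~key ~args =
  match Hashtbl.find_opt catalogue key with
  | Some render -> render args
  | None -> key
```

With this in place, `msg ~key:"disk_full" ~args:["C"]` renders the English message, and the calling code never contains the message text itself.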
If you were using CamlTemplate in a web application, you could also call
this function in a template:
${msg("files_in_dirs", fileCount, directoryCount)}
The article about Maketext points out that you'll want to share some
functions between languages, or at least between different variants of
the same language. Currently in CamlTemplate all macros are global, i.e.
can be used by all templates. I'm thinking about adding a simple
namespace facility so that a template could be in, say, the "en.UK"
namespace; when it used the #quant macro, the template engine would look
for that macro in the "en.UK" namespace; if it didn't find it there, it
would look in the "en" namespace, and then in the default namespace.
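The lookup order could be sketched as follows. This is not existing CamlTemplate code: the table keyed by (namespace, macro name) pairs, the use of "" for the default namespace, and the function name `find_macro` are all assumptions made for the example.

```ocaml
(* Look for a macro in namespace [ns]; if it is not there, retry in the
   parent namespace ("en.UK" -> "en" -> ""), where "" stands for the
   default namespace. Returns None if no namespace defines the macro. *)
let rec find_macro table ns name =
  match Hashtbl.find_opt table (ns, name) with
  | Some m -> Some m
  | None ->
    if ns = "" then None
    else
      let parent =
        match String.rindex_opt ns '.' with
        | Some i -> String.sub ns 0 i   (* drop the last ".xx" component *)
        | None -> ""                    (* no dot left: go to the default *)
      in
      find_macro table parent name
```

A template in "en.UK" would then pick up a `quant` macro defined once in "en", while still being able to override it locally.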
I think that would take care of all the issues raised by the Maketext
article.
I'd love to hear some reactions to this proposal.
Ben