P18 Internationalizing Preprocessor

Documentation


Concept

P18 is a macro preprocessor capable of performing a message-based translation of its input files. The basic idea is similar to the gettext() approach popular for programs written in ANSI C. However, in contrast to the gettext() approach, the files are translated as a whole, once for each target language, so you get a set of translated files from a single source file.

The idea of gettext() is to mark all translatable strings in a program using a special macro named _ (a single underscore). A simple file scanner can extract all these messages and create a message file. The message file is then translated to various languages, and the _ macro calls a function named gettext(), which uses these translation files to translate the messages on the fly. (This is a bit oversimplified, but that's the basic idea.)

An input file for P18 uses a syntax similar to the invocation of the _ macro to mark all translatable messages. Here's an example of a p18ized HTML file:
## define LANGUAGE en
<html>
<head><title>_(Welcome)_</title></head>
<body>
<h3>_(Welcome)_</h3>
_(Blah...)_
</body>
</html>

The first line sets the variable LANGUAGE to "en", which indicates that the messages in the file are written in English. The pairs _( / )_ mark the translatable messages.

So far, this works well for static messages, but we'll run into problems if we want to translate a dynamic message string, e.g. a message string generated by a PHP script. To solve this, P18 provides a way of passing parameters to message strings. Inside the message string, these parameters may be referenced as "$n", where n is the position of the parameter. Example:
<h3>_(Pages $1 to $2){<?=$list_start?>}{<?=$list_end?>}_</h3>
Assuming the message string "Pages $1 to $2" is translated to "Seiten $1 bis $2", P18 will replace the above message escape with:
<h3>Seiten <?=$list_start?> bis <?=$list_end?></h3>
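
The substitution step can be sketched in a few lines of Python. This is only an illustration of the idea, not P18's actual code: the function name substitute_parameters is hypothetical, and leaving references to missing parameters untouched is an assumption.

```python
import re

def substitute_parameters(translated, params):
    """Replace each $n in a translated message with the n-th parameter (1-based)."""
    def replace(match):
        index = int(match.group(1))
        # ASSUMPTION: a reference to a parameter that was not supplied is left as-is.
        if 1 <= index <= len(params):
            return params[index - 1]
        return match.group(0)
    # The substituted text is not rescanned, so "$list_start" inside a
    # parameter is never mistaken for a parameter reference.
    return re.sub(r"\$(\d+)", replace, translated)

print(substitute_parameters("Seiten $1 bis $2",
                            ["<?=$list_start?>", "<?=$list_end?>"]))
```

Because the parameters are inserted by position, a translation is free to reorder them, e.g. to use $2 before $1.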

Language Identifiers

P18 language identifiers are made up of alphanumeric characters, underscores, and colons. Colons are used to separate the main language identifier from a list of localizations. The syntax for a language identifier is:

  language-code:localization-1:localization-2...

The list of localizations may be empty. Whenever a translation of a message to a language specified by a language identifier id is requested, P18 starts by looking for a translation matching the entire identifier. If no translation is found, P18 strips off localizations from the right of the identifier, one at a time, until either a translation is found or the entire identifier has been stripped away.
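
The lookup order described above can be sketched as follows. This is a minimal Python illustration, not P18 code; the function names and the identifier "de:AT:formal" are hypothetical.

```python
def lookup_candidates(identifier):
    """List the identifiers to try, from most to least specific."""
    parts = identifier.split(":")
    # Strip localizations from the right, one at a time.
    return [":".join(parts[:n]) for n in range(len(parts), 0, -1)]

def find_translation(translations, identifier):
    """Return the translation for the first matching candidate, or None."""
    for candidate in lookup_candidates(identifier):
        if candidate in translations:
            return translations[candidate]
    return None

print(lookup_candidates("de:AT:formal"))
print(find_translation({"de": "Willkommen"}, "de:AT:formal"))
```

With only a "de" translation available, a request for "de:AT:formal" falls back to it after "de:AT:formal" and "de:AT" fail to match.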

P18 itself does not care about your convention for building language identifiers. There are several conventions for encoding languages as language identifiers; the choice is up to you.

Message Types

It is possible to associate a message with a message type. Two messages are considered the same if both the message text and the message type match. The default message type is "TEXT".

A message type is a sequence of alphanumeric ASCII characters, underscores, and hyphens. You can use arbitrary identifiers for message types. However, some message types are recognized by P18 and treated in a special way. It is guaranteed that message types starting with an underscore will not be recognized by P18, so you may want to prefix your own message types with an underscore to avoid a collision with a recognized message type. See section Recognized Message Types for a list of recognized message types.

There are two ways of specifying the message type. The first is by setting the P18 variable TYPE, the other is by using a message type option in the message escape (see section P18 Syntax).

Message Options

It is possible to specify message options in the message escape. Message options may be used to specify the language identifier and/or the message type of a particular message. This way it is possible to have messages written in different languages or of different message types in a single P18 input file. See section P18 Syntax for details.

Message Variants

P18 requires all translations to be unique in all directions. However, sometimes the same message in one language requires different translations in another language. The most critical messages in this respect are single-word messages. The English message string "Top", for example, will have different German translations depending on the context. To work around this problem, P18 allows you to specify additional context information which is part of the message string, but is stripped away when the preprocessor is run. A message with such additional context information is called a message variant.

A message variant is specified by appending the additional context information to the message text, separated by two percent signs. So you might have the English message "Top", which translates to "Oben", and the message "Top%%of-page", which translates to "Zum Seitenanfang".
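
A small Python sketch of how a variant separates into visible text and context. It is illustrative only: split_variant is a hypothetical helper, and the dictionary stands in for whatever lookup structure P18 really uses.

```python
def split_variant(message):
    """Split 'text%%context' into (text, context); context is None if absent."""
    text, separator, context = message.partition("%%")
    return (text, context) if separator else (text, None)

# The full string, including the context, identifies the message, so the
# two entries below do not collide although both display as "Top".
translations = {
    "Top": "Oben",
    "Top%%of-page": "Zum Seitenanfang",
}

for message in translations:
    text, context = split_variant(message)
    print(repr(text), repr(context), "->", translations[message])
```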

Message Normalization

Some messages are too long to fit into a convenient line length, so they are broken into several lines, possibly with additional whitespace at the beginning of each line for indentation. As a consequence, two messages can read the same and differ only in the distribution of whitespace characters. It is desirable not to distinguish between such messages. To achieve this, P18 normalizes all messages before processing them. Parameters passed to a message are normalized using a weaker algorithm (called weak normalization): parameters may be broken into several lines, even with indentation, and the line breaks and indentation are normalized to spaces. However, all other whitespace characters are left unchanged.
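
The difference between the two kinds of normalization might look like the Python sketch below. The exact rules P18 applies are not spelled out here, so both functions are assumptions: normalize collapses every run of whitespace, while weak_normalize only touches line breaks and the indentation that follows them.

```python
import re

def normalize(message):
    # ASSUMPTION: every run of whitespace becomes a single space, and
    # leading and trailing whitespace is dropped.
    return " ".join(message.split())

def weak_normalize(parameter):
    # ASSUMPTION: a line break plus the indentation after it becomes one
    # space; all other whitespace is kept exactly as written.
    return re.sub(r"\r?\n[ \t]*", " ", parameter)

print(normalize("Pages $1\n    to   $2"))
print(weak_normalize("<?= $a\n      + $b ?>"))
```

Under these assumptions, a message wrapped and indented across two lines normalizes to the same string as its single-line form.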

The Translation Database

P18 uses a translation database for looking up message translations. The translation database is typically read on startup. The name of the database file may be specified on the command line or given as an option in a configuration file. It is also possible to load a translation database file later, using the db command (see section Commands).

In contrast to most internationalization schemes, P18 makes no distinction between source and target languages. A set of messages with different language identifiers is combined into a message set, and all languages present in the message set may be translated in all directions. This means that if you have a source file written in language A, you can change it to contain messages in language B without changing the translation database.
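
One way to picture a message set is as a mapping from language identifier to message text. The Python sketch below uses that model to show translation in both directions; the data layout and the translate function are assumptions made for illustration, and message types are ignored for brevity.

```python
def translate(message_sets, text, source_lang, target_lang):
    """Find the set containing `text` in source_lang; return its target_lang text."""
    for message_set in message_sets:
        if message_set.get(source_lang) == text:
            return message_set.get(target_lang)
    return None

# Hypothetical database content: one dict per message set.
message_sets = [
    {"en": "Welcome", "de": "Willkommen"},
    {"en": "Pages $1 to $2", "de": "Seiten $1 bis $2"},
]

print(translate(message_sets, "Welcome", "en", "de"))
print(translate(message_sets, "Willkommen", "de", "en"))
```

This model also suggests why translations must be unique in all directions: each text has to identify exactly one message set in its language.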

Typically, every message set is associated with a unique message set identifier. However, it is possible to create message sets without a message set identifier. Message set identifiers are useful when a translation file is generated (i.e. a file containing a translation template from one language to another). When the filled-in translation file is fed back into the translation database, the message set IDs may be used to identify the message sets for the translated messages, even if there have been minor modifications to the original messages (e.g. a fixed typo).

It is possible to change a message in the message database without changing the message set identifier. However, one has to be careful when doing this, since importing a translation file generated earlier may lead to false translations. As a general rule, one should create a new message set (with a new message set identifier) whenever the meaning of the message changes. On the other hand, if the modification is just a fixed typo, it is safe to retain the existing message set identifier.

Translation Files

A translation file is a file containing messages from a small set of languages. The file format for translation files is kept clear and simple, so it can be filled in by a translator. In most cases, a translation file is generated for a translation from one language to another. However, technically there is no distinction between a source language and a destination language in a filled-in translation file. A translation file may also contain more than two language identifiers, so it is possible to create a translation including a set of localizations.

When translating a set of files, the typical approach is to create an initial message database by scanning the source files. The resulting translation database then contains a message set for every message found, each message set containing only that single message. In most cases all messages found will be in the same language, but it is also possible to have source files written in different languages. The next step is to create translation files for all supported languages, and have these translation files filled in by a translator. The filled-in translation files are then fed back into the translation database.

It is likely that the set of messages changes over time. However, the modifications will only be done on the source files, in the respective source language. Before a release is made, P18 can be used to scan for new messages and insert them into the translation database. The maintainer will then again generate translation files for all supported languages, this time only listing the messages that don't have a translation.
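
Selecting the messages for such an incremental translation file amounts to finding the message sets that lack the target language. A hedged Python sketch, modeling each message set as a dict from language identifier to text (an assumption, not P18's storage format) and matching language identifiers exactly:

```python
def untranslated(message_sets, target_lang):
    """Return the message sets that have no message in target_lang."""
    return [s for s in message_sets if target_lang not in s]

# Hypothetical state after a rescan: "Blah..." was added since the last release.
message_sets = [
    {"en": "Welcome", "de": "Willkommen"},
    {"en": "Blah..."},
]

print(untranslated(message_sets, "de"))
```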

As you can see, the translation files are used for communication only. Translation files can (and should) be discarded as soon as they have been fed into the translation database.

The exact syntax of translation files is described in section Translation Files.

Macros and Variables

In addition to message escapes, the P18 preprocessor is capable of expanding variables and macros. In P18, variables and macros are almost the same thing. The distinction between macros and variables is made when the variable or macro is expanded. Expanding as a macro causes the expanded text to be parsed again by the preprocessor; expanding as a variable sends the expanded text straight to the output.

The expansion text of an expanded macro is parsed as if it was read from an input file. It is possible to use any preprocessor directive in the expanded text.

The syntax for expanding variables and macros imitates the variable substitution syntax used by the "config.status" script generated by a GNU autoconf style "configure" script, i.e. the name of the variable is enclosed in at signs (@). An example of expanding variables:
_(This is $1, version $2)[en/LATIN-1]{@PACKAGE@}{@VERSION@}_

Expanding a macro is similar: the macro parameters are passed to the macro as a comma-separated list in parentheses. Example:
@FEATURE(A)@
... code relevant only if feature A is enabled ...
@FEATURE_END()@
Note: There's a subtle difference between a macro call with no arguments and a macro call with a single empty argument. The call @MACRO()@ calls the macro MACRO with no arguments, whereas the call @MACRO( )@ calls it with a single empty argument (note the space character between the parentheses).
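
The distinction can be made concrete with a small argument splitter. This Python sketch is an assumption about the behavior described in the note, not P18's parser: parse_arguments is a hypothetical helper that receives the text between the parentheses, and trimming whitespace around each argument is assumed.

```python
def parse_arguments(raw):
    """Split the text between the parentheses of a macro call into arguments."""
    if raw == "":
        # @MACRO()@: nothing at all between the parentheses, so no arguments.
        return []
    # @MACRO( )@: the space yields one argument, which is empty once trimmed.
    return [argument.strip() for argument in raw.split(",")]

print(parse_arguments(""))      # call with no arguments
print(parse_arguments(" "))     # call with a single empty argument
print(parse_arguments("A, B"))  # call with two arguments
```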

Macro definitions are mapped to variable definitions using a simple mangling scheme. For macros with no parameters, the name is the same as for a variable; that is, you can use an empty pair of parentheses to expand a variable as a macro, causing the expanded text to be parsed again. For macros with parameters, the mangling is a bit more complex: the macro body is bound to a variable with a name constructed using the following scheme:

 macro-name$parameter1$parameter2...

If the macro takes a variable number of arguments, the mangled macro name ends with a single dollar sign ($). If default values were specified for some of the parameters, the default values are bound to variables named:

 mangled-macro-name$$parameter

Macros and variables can be defined using the ##define and/or ##macro directives (see section Syntax for details). The ##macro directive is just a convenient shorthand for a series of ##define directives defining the macro and its default parameters. ##macro also takes care of creating a correct mangling.
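
The mangling scheme described above can be sketched in Python as follows. The function names and the parameter name "name" are hypothetical, and how a variadic marker combines with named parameters is an assumption; the sketch simply appends the trailing dollar sign.

```python
def mangle(macro_name, parameters, variadic=False):
    """Build the variable name that a macro body is bound to."""
    mangled = "$".join([macro_name, *parameters])
    if variadic:
        # ASSUMPTION: the single trailing "$" follows any named parameters.
        mangled += "$"
    return mangled

def default_variable(mangled_name, parameter):
    """Build the variable name that holds a parameter's default value."""
    return mangled_name + "$$" + parameter

print(mangle("FEATURE_END", []))
print(mangle("FEATURE", ["name"]))
print(default_variable(mangle("FEATURE", ["name"]), "name"))
```

A macro without parameters mangles to its plain name, which matches the statement that such a macro shares its name with a variable.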

Conditionals

P18 provides a set of conditional directives, similar to the conditional directives provided by the ANSI C preprocessor. However, as the output of expanded macros is reparsed, conditional directives can also be used in macros. It is even possible to have unbalanced ##if/##endif conditionals in expanded macros (this is a likely implementation of the FEATURE and FEATURE_END macros used in the example above).

The conditional directives of P18 are ##if, ##else, and ##endif; see section Syntax for details.

Conditional Output

It is sometimes desirable to exclude some input files completely, based on the values of some variables from a "config.status" file for example. The P18 directive ##condition does that job. If the ##condition directive appears before the first output character in the file, and the condition specified evaluates to FALSE, then no output file is generated for the input file (see section Syntax for details).

Operation

The input files for the P18 preprocessor are typically specified using a configuration file (see section Configuration). The configuration file combines sets of input files into so-called input objects. Every input object has a symbolic name, and that name is used to reference the input object on the command line or in a P18 command.

It is also possible to specify input files or input directories on the command line. If this is done, the specified files are available through an input object named ARGS.

