P18 Internationalizing Preprocessor

Documentation


Encoding

P18 can process only 8-bit data files that contain no null characters and are compatible with the US-ASCII character set. Fortunately, this is true of most relevant character encodings. Input files that are not compatible with US-ASCII cannot be processed directly and must first be converted to an ASCII-compatible encoding. The following encodings can be handled without conversion:

Because UTF-8 can encode practically any text, the 8-bit limitation of P18 is rarely a limitation in practice. Future versions of P18 may support UCS-2, UCS-4, or other encodings that are not ASCII compatible by transforming the input to UTF-8 first and transforming the output back to the original encoding at the end.
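Until such support exists, the conversion has to be done outside P18. The following Python sketch illustrates the round trip under stated assumptions: it is not part of P18, the helper names `contains_null_bytes`, `to_utf8`, and `from_utf8` are hypothetical, and UTF-16 merely stands in for any encoding that is not ASCII compatible. The null-byte check covers only one of the requirements listed above.

```python
def contains_null_bytes(data: bytes) -> bool:
    """Hypothetical helper: True if the data holds at least one null byte."""
    return b"\x00" in data


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from source_encoding to UTF-8 before preprocessing."""
    return data.decode(source_encoding).encode("utf-8")


def from_utf8(data: bytes, target_encoding: str) -> bytes:
    """Re-encode preprocessed UTF-8 output back to the original encoding."""
    return data.decode("utf-8").encode(target_encoding)


original = "Grüße".encode("utf-16-le")   # ASCII letters carry a null byte here
converted = to_utf8(original, "utf-16-le")

print(contains_null_bytes(original))     # True
print(contains_null_bytes(converted))    # False
print(from_utf8(converted, "utf-16-le") == original)  # True
```

The preprocessing step itself would run between `to_utf8` and `from_utf8`; it is omitted here because it depends on how P18 is invoked.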

Recognized Message Types

Recognized message types never start with an underscore, and future versions of P18 won't define additional recognized message types starting with an underscore. You may want to prefix your own message types with an underscore to avoid a collision with a recognized message type.

The message type of a message determines the encoding, but may also determine some other aspects of how the message text is interpreted. The following message types are recognized by the current version of P18:

Java Encoding
Java uses the backslash as its escape character, just as P18 does. In P18, every backslash that is to appear in the output file has to be quoted, including backslashes that act as escape characters in the output file. When handling Java string literals, P18 recognizes the resulting double-backslash sequences and transforms these strings to UCS-4 internally. The only thing to remember when writing P18ized Java files is that escape sequences in Java string literals have to be introduced by a double backslash instead of a single one.
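As a rough illustration of the doubling rule, the following Python sketch rewrites the text of a Java string literal so that every backslash is doubled. The helper name `double_backslashes` is hypothetical, and the snippet does not reproduce P18's actual escape syntax; it shows only the quoting step described above.

```python
def double_backslashes(java_literal_text: str) -> str:
    """Hypothetical helper: double every backslash in Java literal text."""
    return java_literal_text.replace("\\", "\\\\")


# Raw string, so the backslashes below are literal characters.
java_text = r"Line one\nLine two\t\u00e9"
print(double_backslashes(java_text))
# prints: Line one\\nLine two\\t\\u00e9
```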

If an I18N escape of type JAVA contains non-ASCII characters, these characters are interpreted as ISO 8859-1 (Latin 1) characters.
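One practical consequence is that a source file saved as UTF-8 would presumably have its multi-byte characters read as several Latin 1 characters each. The Python sketch below only demonstrates that general decoding effect; it does not call P18, and the sample word is made up for illustration.

```python
latin1_bytes = "café".encode("latin-1")  # é is the single byte 0xE9
utf8_bytes = "café".encode("utf-8")      # é is the two bytes 0xC3 0xA9

# Read as Latin 1, the Latin 1 bytes give the intended text.
print(latin1_bytes.decode("latin-1"))    # café

# Read as Latin 1, the UTF-8 bytes turn é into two separate characters.
print(utf8_bytes.decode("latin-1"))      # cafÃ©
```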

Translation File Encodings

Translation files may be written in different encodings. Only the message text parts are written using the specified encoding; the meta information of the translation file is written in plain ASCII. The default encoding for translation files is ISO 8859-1 (Latin 1). The encoding of a translation file has to be roughly ASCII compatible (i.e. UTF-7 and HTML are considered ASCII compatible, while EBCDIC is certainly not). I don't recommend using UTF-7 for translation files (try it and you'll see why).

A translation file is written only if all characters of all messages can be encoded using the requested translation file encoding. For most translations, one of the ISO 8859 encodings will do. However, if you wish to translate Hebrew to Russian, for example, you'll probably want to use UTF-8 or HTML as the translation file encoding.
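The "all characters of all messages" condition can be sketched in Python as follows. This is an illustration of the rule, not P18's implementation: the function name `can_encode_all` and the sample messages are hypothetical, and Python's codec names stand in for whatever names P18 accepts.

```python
def can_encode_all(messages, encoding):
    """Return True only if every message is fully encodable in `encoding`."""
    try:
        for message in messages:
            message.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# One Hebrew source message and one Russian translation (made-up samples).
messages = ["שלום", "Привет"]

print(can_encode_all(messages, "iso-8859-8"))  # False: no Cyrillic letters
print(can_encode_all(messages, "iso-8859-5"))  # False: no Hebrew letters
print(can_encode_all(messages, "utf-8"))       # True
```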

Another possibility is to use the -f option of the db export command (forced export). This option tolerates unencodable characters when writing the translation file. This is useful if messages in the source language are likely to remain readable even if some special characters are missing (e.g. when translating from German to Greek, one might want to use the Greek encoding ISO 8859-7, even if the German umlaut characters can't be encoded correctly).
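The effect of a tolerant export can be approximated in Python with a lenient error handler. This is only a sketch: the source does not say how P18 represents unencodable characters, so the "?" substitution shown here is Python's behavior and an assumption, and `encode_forced` is a hypothetical name.

```python
def encode_forced(message: str, encoding: str) -> bytes:
    """Encode a message, replacing unencodable characters (here with '?')."""
    return message.encode(encoding, errors="replace")


# A German message written with the Greek encoding ISO 8859-7:
# ü and ß have no code point there, the plain letters survive.
print(encode_forced("Grüße", "iso-8859-7"))  # b'Gr??e'
```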

The translation file encoding can be specified using the -e option of the db export command (see section Commands).

