
UNICODE Strings


Detailed Description

Utility functions for handling UNICODE and extended UNICODE strings.

This module provides functions operating on UTF-8 encoded UNICODE strings and on extended UNICODE strings (UCX strings). An extended UNICODE string is a string containing variable references, each enclosed in a matching pair of ESC-STX and ESC-ETX sequences.


Files

file  unicode.h
 Public include file for the UNICODE utilities module.

Defines

#define SOBJ_UC_ASCII_ESC   ((char)0x1b)
 The ASCII ESC character.
#define SOBJ_UC_ASCII_STX   ((char)0x2)
 The ASCII STX character.
#define SOBJ_UC_ASCII_ETX   ((char)0x3)
 The ASCII ETX character.
#define SOBJ_UC_NEXT(P)
 Advance P to the next UTF-8 character.
#define SOBJ_UC_PREV(P)
 Rewind P to the previous UTF-8 character.

Enumerations

enum  sobj_uc_sig {
  SOBJ_UC_NONE = 0, SOBJ_UC_UTF8, SOBJ_UC_UTF16BE, SOBJ_UC_UTF16LE,
  SOBJ_UC_UTF32BE, SOBJ_UC_UTF32LE, SOBJ_UC_SCSU
}
 Enumeration of UNICODE signatures.

Functions

size_t sobj_uc_strlen (const char *uc_string)
 Compute the length of a UNICODE string.
size_t sobj_uc_strnlen (const char *uc_string, ptrdiff_t buffer_size)
 Compute the length of a UNICODE string.
enum sobj_uc_sig sobj_uc_string_sig (const char *string)
 Check if a string starts with a UNICODE encoding signature.
const char * sobj_uc_string_skip_sig (const char *uc_string)
 Skip the UNICODE UTF-8 signature.
const char * sobj_uc_string_index (const char *uc_string, size_t index)
 Find the character at the specified index.
unsigned sobj_uc_code (const char *uc_char)
 Return the UNICODE code point of a UTF-8 character sequence.
const char * sobj_ucx_next (const char *ucx)
 Skip to the next UTF-8 character or variable reference in a UCX string.
const char * sobj_ucx_next_bounded (const char *ucx, const char *ucx_end)
 Skip to the next UTF-8 character or variable reference in a UCX string.
const char * sobj_ucx_next_strict (const char *ucx)
 Skip to the next UTF-8 character or variable reference in a UCX string.
size_t sobj_ucx_strlen (const char *ucx)
 Compute the length of an extended UNICODE string.
size_t sobj_ucx_strnlen (const char *ucx, ptrdiff_t buffer_size)
 Compute the length of an extended UNICODE string.
const char * sobj_ucx_string_index (const char *ucx, size_t index)
 Find the character or variable reference at the specified index.
size_t sobj_ucx_string_replace (const char *ucx, ptrdiff_t position, ptrdiff_t length, const char *replace, char *buffer)
 Replace a substring of an extended UNICODE string.
bool sobj_ucx_string_ok (const char *ucx, ptrdiff_t ucx_size)
 Check if a UCX string is valid.
size_t sobj_uc_to_ucx (const char *uc, ptrdiff_t uc_size, char *buffer)
 Convert a UTF-8 encoded UNICODE string to a UCX string.
size_t sobj_ucx_string_quote (const char *ucx, char *buffer, ptrdiff_t buffer_size)
 Create a quoted string representation of an extended UNICODE string.
size_t sobj_ucx_string_resolve (const char *ucx, struct sobj_env *env, struct sobj_buffer *buffer)
 Resolve all variable references in an extended UNICODE string.


Define Documentation

#define SOBJ_UC_ASCII_ESC   ((char)0x1b)
 

The ASCII ESC character.

#define SOBJ_UC_ASCII_ETX   ((char)0x3)
 

The ASCII ETX character.

#define SOBJ_UC_ASCII_STX   ((char)0x2)
 

The ASCII STX character.

#define SOBJ_UC_NEXT(P)
 

Value:

do {                                                                    \
    ++(P);                                                              \
    while ((*(P) & 0xc0) == 0x80) ++(P);                                \
  } while (false)
Advance P to the next UTF-8 character.

Parameters:
P Pointer into a UNICODE string. This has to be a valid l-value.

#define SOBJ_UC_PREV(P)
 

Value:

do {                                                                    \
    --(P);                                                              \
    while ((*(P) & 0xc0) == 0x80) --(P);                                \
  } while (false)
Rewind P to the previous UTF-8 character.

Parameters:
P Pointer into a UNICODE string. This has to be a valid l-value.
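
As an illustration of how the two macros step through a string, the following sketch copies their definitions from this page and wraps them in two small helpers. The helper names demo_count and demo_last_offset are hypothetical and not part of the module. Neither macro performs bounds checking, so the sketch assumes well-formed, null-terminated UTF-8 input and a non-empty string when stepping backwards.

```c
#include <stdbool.h>
#include <stddef.h>

/* Definitions copied from this page. */
#define SOBJ_UC_NEXT(P)                                                 \
  do {                                                                  \
    ++(P);                                                              \
    while ((*(P) & 0xc0) == 0x80) ++(P);                                \
  } while (false)

#define SOBJ_UC_PREV(P)                                                 \
  do {                                                                  \
    --(P);                                                              \
    while ((*(P) & 0xc0) == 0x80) --(P);                                \
  } while (false)

/* Hypothetical helper: count the characters of a null-terminated,
   well-formed UTF-8 string by stepping with SOBJ_UC_NEXT. */
size_t demo_count(const char *s)
{
  size_t n = 0;

  while (*s != '\0') {
    SOBJ_UC_NEXT(s);
    ++n;
  }
  return n;
}

/* Hypothetical helper: return the byte offset of the last character.
   size is the string length in bytes and must be greater than 0,
   because SOBJ_UC_PREV would otherwise step in front of the buffer. */
size_t demo_last_offset(const char *s, size_t size)
{
  const char *p = s + size;

  SOBJ_UC_PREV(p);
  return (size_t)(p - s);
}
```

Because the macros modify their argument in place, they are applied to a local copy of the pointer here, which is why P has to be an l-value.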


Enumeration Type Documentation

enum sobj_uc_sig
 

Enumeration of UNICODE signatures.

Enumeration values:
SOBJ_UC_NONE  No UNICODE signature.
SOBJ_UC_UTF8  UTF-8.
SOBJ_UC_UTF16BE  UTF-16 (big endian).
SOBJ_UC_UTF16LE  UTF-16 (little endian).
SOBJ_UC_UTF32BE  UTF-32 (big endian).
SOBJ_UC_UTF32LE  UTF-32 (little endian).
SOBJ_UC_SCSU  SCSU compressed.


Function Documentation

unsigned sobj_uc_code (const char *uc_char)
 

Return the UNICODE code point of a UTF-8 character sequence.

Parameters:
uc_char Pointer to the UTF-8 character sequence.
Returns:
The character code represented by the specified character sequence. If the character sequence is not valid, the special value 0 is returned.
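
To make the return contract concrete, here is a minimal sketch of a UTF-8 decoder that follows the same convention of returning 0 for an invalid sequence. The function demo_uc_code is a hypothetical stand-in written for this illustration, not the module's implementation. It assumes null-terminated input and does not reject overlong encodings or surrogate code points, which the real function may or may not do.

```c
/* Hypothetical stand-in for sobj_uc_code(): decode one UTF-8 sequence.
   Returns 0 for sequences this sketch considers invalid. */
unsigned long demo_uc_code(const char *uc_char)
{
  const unsigned char *p = (const unsigned char *)uc_char;
  unsigned long code;
  int extra, i;

  if (p[0] < 0x80)
    return p[0];                              /* 1 byte: plain ASCII  */
  else if ((p[0] & 0xe0) == 0xc0) { code = p[0] & 0x1f; extra = 1; }
  else if ((p[0] & 0xf0) == 0xe0) { code = p[0] & 0x0f; extra = 2; }
  else if ((p[0] & 0xf8) == 0xf0) { code = p[0] & 0x07; extra = 3; }
  else
    return 0;               /* stray continuation byte or invalid lead */

  for (i = 1; i <= extra; ++i) {
    if ((p[i] & 0xc0) != 0x80)
      return 0;             /* sequence is truncated                   */
    code = (code << 6) | (p[i] & 0x3f);
  }
  return code;
}
```

Note that with this convention the code point 0 (the null byte) cannot be distinguished from an error.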

const char * sobj_uc_string_index (const char *uc_string, size_t index)
 

Find the character at the specified index.

Parameters:
uc_string The UNICODE string.
index The index of the requested UNICODE character.
Returns:
The function returns a pointer to the first byte of the requested UNICODE character. If the specified index points beyond the end of the string, a pointer to the terminating null-character is returned.

enum sobj_uc_sig sobj_uc_string_sig (const char *string)
 

Check if a string starts with a UNICODE encoding signature.

Parameters:
string The string to be checked.
Returns:
The function returns the UNICODE encoding signature ID of the signature found in the first bytes of the specified string. If no signature is found, the constant SOBJ_UC_NONE is returned.
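
The kind of check described above can be sketched as a table lookup over the standard byte order marks. The enumeration is copied from this page; the function demo_string_sig is a hypothetical stand-in, and the byte sequences in the table are the commonly published signatures, not values taken from the module's source. Unlike the documented function, the sketch takes an explicit size, because the UTF-16 and UTF-32 signatures may contain or be followed by null bytes.

```c
#include <stddef.h>
#include <string.h>

/* Enumeration copied from this page. */
enum sobj_uc_sig {
  SOBJ_UC_NONE = 0, SOBJ_UC_UTF8, SOBJ_UC_UTF16BE, SOBJ_UC_UTF16LE,
  SOBJ_UC_UTF32BE, SOBJ_UC_UTF32LE, SOBJ_UC_SCSU
};

/* Hypothetical stand-in for sobj_uc_string_sig(), with an added
   size argument (an assumption of this sketch). */
enum sobj_uc_sig demo_string_sig(const char *string, size_t size)
{
  static const struct {
    enum sobj_uc_sig sig;
    size_t           len;
    const char      *bytes;
  } table[] = {
    /* Longer signatures first: FF FE 00 00 must win over FF FE. */
    { SOBJ_UC_UTF32BE, 4, "\x00\x00\xfe\xff" },
    { SOBJ_UC_UTF32LE, 4, "\xff\xfe\x00\x00" },
    { SOBJ_UC_UTF8,    3, "\xef\xbb\xbf" },
    { SOBJ_UC_SCSU,    3, "\x0e\xfe\xff" },
    { SOBJ_UC_UTF16BE, 2, "\xfe\xff" },
    { SOBJ_UC_UTF16LE, 2, "\xff\xfe" },
  };
  size_t i;

  for (i = 0; i < sizeof table / sizeof table[0]; ++i)
    if (size >= table[i].len
        && memcmp(string, table[i].bytes, table[i].len) == 0)
      return table[i].sig;
  return SOBJ_UC_NONE;
}
```

The ordering of the table matters: a UTF-32 little endian signature begins with the same two bytes as the UTF-16 little endian one, so the longer pattern is tested first.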

const char * sobj_uc_string_skip_sig (const char *uc_string)
 

Skip the UNICODE UTF-8 signature.

The function checks if a UTF-8 encoded string starts with a UNICODE UTF-8 signature (0xef, 0xbb, 0xbf). If a UNICODE signature is found, the function skips the signature.

Parameters:
uc_string The UNICODE string.
Returns:
A pointer to the first byte following the UNICODE signature. If no UNICODE UTF-8 signature is found, the parameter uc_string is returned.

size_t sobj_uc_strlen (const char *uc_string)
 

Compute the length of a UNICODE string.

Parameters:
uc_string The UNICODE string, encoded as UTF-8.
Returns:
The number of UNICODE characters in the specified string.

size_t sobj_uc_strnlen (const char *uc_string, ptrdiff_t buffer_size)
 

Compute the length of a UNICODE string.

The function is a variant of sobj_uc_strlen(). It takes an extra argument specifying the size of the buffer holding the specified string. Use this function if the string is not null-terminated.

Parameters:
uc_string The UNICODE string, encoded as UTF-8.
buffer_size The size of the buffer holding the UNICODE string (in bytes).
Returns:
The number of UNICODE characters in the specified string.
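
Counting characters in a bounded UTF-8 buffer amounts to counting the bytes that are not continuation bytes. The sketch below shows that idea; demo_uc_strnlen is a hypothetical stand-in, not the module's code. It assumes that counting also stops at an embedded null byte and that a zero or negative buffer_size yields 0; the documentation above does not state either behaviour for the real function.

```c
#include <stddef.h>

/* Hypothetical stand-in for sobj_uc_strnlen(): count the UTF-8
   characters in at most buffer_size bytes. Stopping at a null byte
   is an assumption of this sketch. */
size_t demo_uc_strnlen(const char *uc_string, ptrdiff_t buffer_size)
{
  size_t n = 0;
  ptrdiff_t i;

  for (i = 0; i < buffer_size && uc_string[i] != '\0'; ++i)
    if ((uc_string[i] & 0xc0) != 0x80)   /* not a continuation byte */
      ++n;
  return n;
}
```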

size_t sobj_uc_to_ucx (const char *uc, ptrdiff_t uc_size, char *buffer)
 

Convert a UTF-8 encoded UNICODE string to a UCX string.

The function converts a UTF-8 encoded UNICODE string to an extended UNICODE string. Invalid UTF-8 sequences in the input string are discarded. Pairs of surrogate characters are replaced by the UNICODE character they represent; unmatched surrogate characters are discarded.

If the specified input string starts with a UTF-8 UNICODE signature, the signature is stripped.

Parameters:
uc The UTF-8 encoded input string.
uc_size The size of the buffer holding the string uc. Negative values indicate that the string uc is null-terminated.
buffer Pointer to the buffer receiving the UCX string. The UCX string written to buffer will be terminated with a null-byte. If this is a null-pointer, no output is generated.
Returns:
The function returns the number of bytes of output, not counting the terminating null-byte.
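
Replacing a surrogate pair by the character it represents relies on the standard UTF-16 surrogate arithmetic. The sketch below shows only that arithmetic; demo_combine_surrogates is a hypothetical helper, and how the module detects surrogates inside the UTF-8 input is not described on this page. Returning 0 for an unmatched pair is a convention of the sketch.

```c
/* Hypothetical helper: combine a high and a low surrogate into the
   code point they represent. Returns 0 if the two values do not form
   a valid pair (a convention of this sketch). */
unsigned long demo_combine_surrogates(unsigned long hi, unsigned long lo)
{
  if (hi < 0xd800 || hi > 0xdbff)
    return 0;                          /* not a high surrogate */
  if (lo < 0xdc00 || lo > 0xdfff)
    return 0;                          /* not a low surrogate  */
  return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}
```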

const char * sobj_ucx_next (const char *ucx)
 

Skip to the next UTF-8 character or variable reference in a UCX string.

The function skips over a UTF-8 character or variable reference in a specified extended UNICODE string.

Parameters:
ucx The extended UNICODE string. Note that the specified string has to be null-terminated.
Returns:
A pointer to the next UTF-8 character or variable reference.
Note:
If the extended UNICODE string ucx contains invalid character sequences (e.g. invalid UTF-8 sequences, unrecognized ESC sequences, or unbalanced variable reference delimiters), a runtime error message is generated and the bogus character is skipped as if it were a valid ASCII character.

The function will not catch all UTF-8 encoding errors.

See also:
sobj_ucx_next_bounded().

const char * sobj_ucx_next_bounded (const char *ucx, const char *ucx_end)
 

Skip to the next UTF-8 character or variable reference in a UCX string.

The function skips over a UTF-8 character or variable reference in a specified extended UNICODE string.

This is the bounds-checking variant of sobj_ucx_next().

Parameters:
ucx The extended UNICODE string.
ucx_end The end of the extended UNICODE string (i.e. pointer to the byte following the last byte of the string).
Returns:
A pointer to the next UTF-8 character or variable reference. If the end of the string is reached, the special value ucx_end is returned.
Note:
If the extended UNICODE string ucx contains invalid character sequences (e.g. invalid UTF-8 sequences, unrecognized ESC sequences, or unbalanced variable reference delimiters), a runtime error message is generated and the bogus character is skipped as if it were a valid ASCII character.

The function will not catch all UTF-8 encoding errors.

See also:
sobj_ucx_next().

const char * sobj_ucx_next_strict (const char *ucx)
 

Skip to the next UTF-8 character or variable reference in a UCX string.

The function skips over a UTF-8 character or variable reference in a specified extended UNICODE string. This is a variant of the sobj_ucx_next() function. In contrast to sobj_ucx_next(), this function returns a null-pointer when an invalid UNICODE character, escape sequence, or variable reference is encountered. Also, this function will not generate runtime error messages for invalid UCX strings.

Parameters:
ucx The extended UNICODE string.
Returns:
A pointer to the next UTF-8 character or variable reference. In case of an error, a null-pointer is returned.
Note:
The function will not catch all UTF-8 encoding errors.
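
The strict stepping behaviour described above can be sketched as follows. This is an illustration under several assumptions, not the module's implementation: a variable reference is taken to start with ESC-STX and end with the matching ESC-ETX, references are assumed to nest, and every other ESC sequence is treated as an error (the real module may recognize additional escape sequences). The names demo_ucx_next_strict and demo_ucx_count are hypothetical; the three character constants are copied from this page.

```c
#include <stddef.h>

/* Definitions copied from this page. */
#define SOBJ_UC_ASCII_ESC   ((char)0x1b)
#define SOBJ_UC_ASCII_STX   ((char)0x2)
#define SOBJ_UC_ASCII_ETX   ((char)0x3)

/* Hypothetical stand-in for sobj_ucx_next_strict(). Returns a pointer
   past one character or variable reference, or NULL on error. Calling
   it on the terminating null byte is treated as an error here. */
const char *demo_ucx_next_strict(const char *ucx)
{
  int depth = 0;

  if (*ucx == '\0')
    return NULL;
  if (*ucx != SOBJ_UC_ASCII_ESC) {
    /* Plain UTF-8 character: lead byte plus continuation bytes. */
    if ((*ucx & 0xc0) == 0x80)
      return NULL;                     /* stray continuation byte  */
    ++ucx;
    while ((*ucx & 0xc0) == 0x80)
      ++ucx;
    return ucx;
  }
  if (ucx[1] != SOBJ_UC_ASCII_STX)
    return NULL;                       /* reference must open here */

  /* Variable reference: scan to the matching ESC-ETX. */
  while (*ucx != '\0') {
    if (*ucx == SOBJ_UC_ASCII_ESC) {
      if (ucx[1] == SOBJ_UC_ASCII_STX)
        ++depth;
      else if (ucx[1] == SOBJ_UC_ASCII_ETX)
        --depth;
      else
        return NULL;                   /* unrecognized ESC sequence */
      ucx += 2;
      if (depth == 0)
        return ucx;
    } else {
      ++ucx;
    }
  }
  return NULL;                         /* unbalanced delimiters     */
}

/* Hypothetical helper: count characters and variable references,
   or return -1 if the string is not valid under the rules above. */
ptrdiff_t demo_ucx_count(const char *ucx)
{
  ptrdiff_t n = 0;

  while (*ucx != '\0') {
    ucx = demo_ucx_next_strict(ucx);
    if (ucx == NULL)
      return -1;
    ++n;
  }
  return n;
}
```

A caller loops until the terminating null byte is reached and stops as soon as a null-pointer comes back, which is the pattern the strict variant is designed for.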

const char * sobj_ucx_string_index (const char *ucx, size_t index)
 

Find the character or variable reference at the specified index.

Parameters:
ucx The extended UNICODE string.
index The index of the requested UNICODE character or variable reference.
Returns:
The function returns a pointer to the first byte of the requested UNICODE character or variable reference. If the specified index points beyond the end of the string, a pointer to the terminating null-character is returned.

bool sobj_ucx_string_ok (const char *ucx, ptrdiff_t ucx_size)
 

Check if a UCX string is valid.

The function checks if a specified UCX string is valid. A UCX string is invalid if it contains unrecognized escape sequences or unbalanced variable reference delimiters. Obvious UTF-8 encoding errors are also reported as an invalid UCX string (the function will not catch all possible UTF-8 encoding errors).

Parameters:
ucx The UCX string to be checked.
ucx_size The size of the UCX string (in bytes). A negative value indicates that the string is null-terminated. Note that if the string ucx is shorter than the specified size (i.e. contains a null-byte), an error is indicated.
Return values:
true The specified string is a valid UCX string.
false The specified string contains errors.

size_t sobj_ucx_string_quote (const char *ucx, char *buffer, ptrdiff_t buffer_size)
 

Create a quoted string representation of an extended UNICODE string.

The function creates a quoted string representation of an extended UNICODE string that is compatible with the simple object text serialization syntax. Variable references are quoted appropriately. The generated quoted string will contain 7-bit ASCII characters only.

Parameters:
ucx The extended UNICODE string.
buffer The buffer receiving the quoted string representation, including a terminating null-byte. If this is a null-pointer, no output is generated.
buffer_size The size of the output buffer buffer. If this is 0 or negative, no output is generated.
Returns:
The function returns the number of ASCII characters in the quoted string representation, not counting the terminating null-character.

size_t sobj_ucx_string_replace (const char *ucx, ptrdiff_t position, ptrdiff_t length, const char *replace, char *buffer)
 

Replace a substring of an extended UNICODE string.

The function replaces a specified substring of an extended UNICODE string with another extended UNICODE string.

Parameters:
ucx The original extended UNICODE string.
position The index of the first UNICODE character or variable reference to be replaced. If the value is negative or greater than the number of characters and references in the string ucx, the string replace is appended to ucx.
length The number of UNICODE characters or variable references to be replaced. If the value is negative or larger than the number of characters available, the entire rest of the string is replaced.
replace The replacement string.
buffer The output buffer receiving the modified string. If this is a null-pointer, no output is generated.
Returns:
The function returns the number of bytes written to the output buffer, not counting the terminating null-byte.
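
The rules for position and length can be summarized as a small range computation. The sketch below is one reading of the parameter descriptions above, not the module's code; demo_replace_start and demo_replace_count are hypothetical helpers, and total stands for the number of characters and variable references in ucx. In particular, it assumes that nothing is removed when the replacement is appended.

```c
#include <stddef.h>

/* Hypothetical helper: index of the first element to be replaced.
   An out-of-range position means "append", i.e. start at the end. */
size_t demo_replace_start(size_t total, ptrdiff_t position)
{
  if (position < 0 || (size_t)position > total)
    return total;
  return (size_t)position;
}

/* Hypothetical helper: number of elements to be replaced. A negative
   or too large length selects the entire rest of the string. When the
   replacement is appended, the rest is empty, so the result is 0. */
size_t demo_replace_count(size_t total, ptrdiff_t position,
                          ptrdiff_t length)
{
  size_t rest = total - demo_replace_start(total, position);

  if (length < 0 || (size_t)length > rest)
    return rest;
  return (size_t)length;
}
```

For a string of five elements, position 2 with length -1 therefore selects elements 2 to 4, while position -1 selects nothing and appends.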

size_t sobj_ucx_string_resolve (const char *ucx, struct sobj_env *env, struct sobj_buffer *buffer)
 

Resolve all variable references in an extended UNICODE string.

The function recursively resolves all variable references in an extended UNICODE string.

Parameters:
ucx The extended UNICODE string.
env The environment. This may be a null-pointer.
buffer The buffer receiving the resolved string, not including a terminating null-byte. The resolved string is appended to the buffer.
Returns:
The function returns the number of bytes of the resolved string.
Note:
The string created will not be null-terminated.

size_t sobj_ucx_strlen (const char *ucx)
 

Compute the length of an extended UNICODE string.

Parameters:
ucx The extended UNICODE string.
Returns:
The number of UNICODE characters and variable references in the specified string.

size_t sobj_ucx_strnlen (const char *ucx, ptrdiff_t buffer_size)
 

Compute the length of an extended UNICODE string.

The function is a variant of sobj_ucx_strlen(). It takes an extra argument specifying the size of the buffer holding the specified string. Use this function if the string is not null-terminated.

Parameters:
ucx The extended UNICODE string.
buffer_size The size of the buffer holding the extended UNICODE string (in bytes).
Returns:
The number of UNICODE characters and variable references in the specified string.


Generated on Sat Jul 23 16:07:33 2005 for sobject by  doxygen 1.3.9.1