
sobj_char.c File Reference


Detailed Description

Helper functions for character handling.

#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>
#include "sobject_p.h"


Functions

void sobj_html_cref_init (void)
 Initialize the HTML character reference tables.
int sobj_html_cref_code_compare (const void *, const void *)
 qsort() comparison function for HTML character reference table entries.
unsigned sobj_html_cref_lookup (const char *)
 Look up a named HTML character reference.
const char * sobj_char_html_cref (unsigned code)
 Return the character reference for the specified code point.
unsigned sobj_char_unquote (const char **s_ptr, ptrdiff_t length)
 Translate an escape sequence to a UNICODE character.
unsigned sobj_char_decode_utf8 (const char **p_ptr, ptrdiff_t length)
 Decode a single UTF-8 sequence to a UNICODE code point.
ptrdiff_t sobj_char_encode_utf8 (unsigned code, char *buffer)
 Encode a UNICODE code point to a UTF-8 sequence.

Variables

const char * html_cref [0x10000]
 A list of HTML character references for UNICODE characters.

Function Documentation

unsigned sobj_char_decode_utf8 (const char **p_ptr, ptrdiff_t length)

Decode a single UTF-8 sequence to a UNICODE code point.

The function reads a UTF-8 sequence and decodes it to a UNICODE code point. The UTF-8 sequence may be up to 6 bytes long, so the return value is a 32-bit unsigned integer. Note that the maximum UNICODE code point is 0x10ffff, so all 5 or 6 byte UTF-8 sequences decode to an invalid code point (beyond 0x1fffff, which is the maximum value representable by 4 byte sequences).

The function does not check if the value is a valid UNICODE code point, nor does it check for surrogate pairs.

Parameters:
p_ptr Pointer to a pointer to the first character of the sequence. If a valid UTF-8 sequence is decoded, *p_ptr will be set to point to the last character of the sequence.
length The length of the string being analyzed. The function will not read past the end of the string. Note that *p_ptr+length must be within the address space, so passing INT_MAX for an unknown string length does not work.
Returns:
The function returns the code point represented by the sequence, or 0 if an invalid sequence is found. A single ASCII character is always a valid UTF-8 sequence. The value returned may be a surrogate or an invalid UNICODE code point; the function does not check the value.
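
The decoding contract described above can be illustrated with a minimal sketch. This is not the sobject implementation: the name demo_decode_utf8 is hypothetical, the sketch assumes that *p_ptr is left unchanged when the sequence is invalid (the documentation does not say), and it assumes an unsigned type of at least 32 bits. Like the documented function, it does not reject surrogates, overlong forms, or values beyond 0x10ffff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the sobject source: decode one UTF-8 sequence
 * of up to 6 bytes. Returns 0 for an invalid or truncated sequence and
 * (by assumption) leaves *p_ptr unchanged in that case. On success,
 * *p_ptr points at the LAST byte of the sequence. */
unsigned demo_decode_utf8(const char **p_ptr, ptrdiff_t length)
{
    const unsigned char *p;
    unsigned code;
    int extra, i;

    assert(p_ptr != NULL && *p_ptr != NULL);
    if (length <= 0)
        return 0;
    p = (const unsigned char *)*p_ptr;

    /* The lead byte determines how many continuation bytes follow. */
    if (p[0] < 0x80)      { code = p[0];        extra = 0; }
    else if (p[0] < 0xC0) { return 0; }  /* stray continuation byte */
    else if (p[0] < 0xE0) { code = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { code = p[0] & 0x0F; extra = 2; }
    else if (p[0] < 0xF8) { code = p[0] & 0x07; extra = 3; }
    else if (p[0] < 0xFC) { code = p[0] & 0x03; extra = 4; }
    else if (p[0] < 0xFE) { code = p[0] & 0x01; extra = 5; }
    else                  { return 0; }  /* 0xFE and 0xFF never lead */

    if (extra >= length)
        return 0;                        /* would read past the end */
    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                    /* not a continuation byte */
        code = (code << 6) | (p[i] & 0x3Fu);
    }
    *p_ptr += extra;
    return code;
}
```

Because the pointer ends on the last byte of the sequence rather than one past it, a caller iterating over a string would advance the pointer by one more position after each call.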

ptrdiff_t sobj_char_encode_utf8 (unsigned code, char *buffer)

Encode a UNICODE code point to a UTF-8 sequence.

The function takes a (nonzero) UNICODE code point and translates it to a UTF-8 sequence. The function does not check if the code point is valid. For code points beyond 0x1fffff, the function will create non-UNICODE 5 and 6 byte UTF-8 sequences (as defined in ISO/IEC 10646-1). The generated sequences will be null-terminated.

Parameters:
code The code point. This must not be 0.
buffer Pointer to the buffer receiving the UTF-8 sequence, or a null-pointer. If not a null-pointer, the UTF-8 sequence is stored to buffer[]. In the general case, this should be 7 bytes long (6 bytes for the UTF-8 sequence plus the terminating null-character).
Returns:
The function returns the length of the UTF-8 sequence, not counting the terminating null-character.
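
The encoding rules can be sketched as follows. The function name demo_encode_utf8 is hypothetical and the code is an illustration under assumptions, not the sobject source: it assumes that a null buffer means "only compute the length", that unsigned has at least 32 bits, and that values above 0x7fffffff (which a 6 byte sequence cannot represent) are a caller error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the sobject source: encode a nonzero code
 * point as a null-terminated UTF-8 sequence of 1 to 6 bytes. Returns the
 * sequence length without the terminator. If buffer is NULL, nothing is
 * written (assumed behaviour). */
ptrdiff_t demo_encode_utf8(unsigned code, char *buffer)
{
    unsigned lead;
    int len, i;

    assert(code != 0 && code <= 0x7FFFFFFFu);

    /* Pick the length and the lead-byte marker from the value range. */
    if (code < 0x80u)           { len = 1; lead = 0x00; }
    else if (code < 0x800u)     { len = 2; lead = 0xC0; }
    else if (code < 0x10000u)   { len = 3; lead = 0xE0; }
    else if (code < 0x200000u)  { len = 4; lead = 0xF0; }
    else if (code < 0x4000000u) { len = 5; lead = 0xF8; }
    else                        { len = 6; lead = 0xFC; }

    if (buffer != NULL) {
        /* Fill continuation bytes from the end, 6 payload bits each. */
        for (i = len - 1; i > 0; i--) {
            buffer[i] = (char)(0x80u | (code & 0x3Fu));
            code >>= 6;
        }
        buffer[0] = (char)(lead | code);  /* remaining high bits */
        buffer[len] = '\0';
    }
    return len;
}
```

A 7 byte buffer, as recommended in the parameter description, is sufficient for every input this sketch accepts.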

const char * sobj_char_html_cref (unsigned code)

Return the character reference for the specified code point.

The function translates a UNICODE code point to an HTML character reference.

Parameters:
code The UNICODE code point.
Returns:
The function returns the character reference name (without the ampersand and the semicolon). If no character reference is available for the specified code point, a null-pointer is returned.

unsigned sobj_char_unquote (const char **s_ptr, ptrdiff_t length)

Translate an escape sequence to a UNICODE character.

The function reads an escape sequence (without the escape character) and returns the UNICODE code point for the referenced character. The following types of escape sequences are recognized:

  • C style escape sequences, including hex and octal escapes.
  • The sequences "e" and "E". These are interpreted as an ASCII ESC character.
  • Java style 16-bit UNICODE sequences ("u", followed by a four digit hex number).
  • 32-bit UNICODE characters. These are similar to Java style UNICODE characters, but are introduced by an uppercase "U". The code point is specified as an 8 digit hex number.
  • HTML style character references. These are introduced by an ampersand. Both numeric and named character references are allowed.
  • Unrecognized characters are simply unmasked.

If an escape sequence is not recognized, it is assumed to represent the first character of the sequence. E.g. an ampersand followed by an unrecognized character reference is interpreted as a masked ampersand.

Parameters:
s_ptr Pointer to a pointer to the first character of the escape sequence, not including the escape character. The function will set *s_ptr to point to the last character of the sequence. E.g. for the escape sequence "&uuml;", *s_ptr will be set to point to the ";".
length The length of the string containing the escape sequence. The function will not read past the end of the string.
Note:
  • Escape sequences yielding the code 0 are accepted by this function. The caller may have to do an explicit check for null-values returned by this function if null-values are not allowed in the specific context.
  • The function assumes that the character string referenced by *s_ptr is null-terminated.
Returns:
The function returns the code point of the recognized character. The function never fails.
Note:
The function may not be called on the end of the string (i.e. with length==0).
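
The fallback rule (an unrecognized sequence stands for its own first character) and the pointer convention can be illustrated with a reduced sketch. The names demo_unquote and demo_hex_value are hypothetical, and the sketch handles only a subset of the documented sequences: "n", "t", "e"/"E", "u" plus four hex digits, and "U" plus eight hex digits. Octal, "x" escapes and HTML character references are left out, and the behaviour for malformed hex digits is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: numeric value of a hex digit; the caller has
 * already verified the character with isxdigit(). */
static unsigned demo_hex_value(int c)
{
    if (isdigit(c))
        return (unsigned)(c - '0');
    return (unsigned)(tolower(c) - 'a' + 10);
}

/* Hypothetical sketch of a SUBSET of the documented behaviour, not the
 * sobject source. *s_ptr points just past the escape character; on
 * return it points at the last character consumed. Never fails:
 * anything unrecognized yields the first character itself. */
unsigned demo_unquote(const char **s_ptr, ptrdiff_t length)
{
    const unsigned char *s;
    unsigned code = 0;
    ptrdiff_t i, digits = 0;

    assert(s_ptr != NULL && *s_ptr != NULL && length > 0);
    s = (const unsigned char *)*s_ptr;

    switch (s[0]) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'e':
    case 'E': return 0x1B;            /* ASCII ESC */
    case 'u': digits = 4; break;      /* Java style, 16 bit */
    case 'U': digits = 8; break;      /* 32 bit variant */
    default:  return s[0];            /* simply unmask the character */
    }

    if (length < 1 + digits)
        return s[0];                  /* too short: treat as masked u/U */
    for (i = 1; i <= digits; i++) {
        if (!isxdigit(s[i]))
            return s[0];              /* malformed: treat as masked u/U */
        code = (code << 4) | demo_hex_value(s[i]);
    }
    *s_ptr += digits;                 /* now at the last hex digit */
    return code;
}
```

As the notes above point out, a result of 0 is a legitimate return value here as well (for example from "u0000"), so callers that cannot accept a null character have to check for it themselves.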

int sobj_html_cref_code_compare (const void *p1, const void *p2) [static]

qsort() comparison function for HTML character reference table entries.

void sobj_html_cref_init (void) [static]

Initialize the HTML character reference tables.

unsigned sobj_html_cref_lookup (const char *cref) [static]

Look up a named HTML character reference.

The function translates an HTML character reference name to a UNICODE code point. If the name is not recognized, the function returns 0.

Parameters:
cref A null-terminated string holding the character reference name (without the ampersand and the semicolon).
Returns:
The UNICODE code point of the character reference, or 0 if the character name is not recognized.
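
One plausible way to combine a qsort() comparison function, an initialization step and a name lookup is sketched below. The actual table layout in sobj_char.c is not documented on this page, so everything here is an assumption: the struct demo_cref, the five-entry table, and the choice to sort by name and search with bsearch() are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: reference name and its code point. */
struct demo_cref {
    const char *name;
    unsigned code;
};

/* Tiny illustrative table, deliberately unsorted. */
static struct demo_cref demo_table[] = {
    { "uuml", 0xFC }, { "amp", 0x26 }, { "lt", 0x3C },
    { "gt", 0x3E },   { "euro", 0x20AC }
};
#define DEMO_COUNT (sizeof demo_table / sizeof demo_table[0])

/* Comparison function usable by both qsort() and bsearch(). */
static int demo_name_compare(const void *p1, const void *p2)
{
    const struct demo_cref *a = p1;
    const struct demo_cref *b = p2;
    return strcmp(a->name, b->name);
}

/* Sort the table once so that lookups can use binary search. */
void demo_cref_init(void)
{
    qsort(demo_table, DEMO_COUNT, sizeof demo_table[0], demo_name_compare);
}

/* Return the code point for a reference name, or 0 if it is unknown.
 * demo_cref_init() must have been called first. */
unsigned demo_cref_lookup(const char *cref)
{
    struct demo_cref key;
    const struct demo_cref *hit;

    key.name = cref;
    key.code = 0;
    hit = bsearch(&key, demo_table, DEMO_COUNT, sizeof demo_table[0],
                  demo_name_compare);
    return hit != NULL ? hit->code : 0;
}
```

Returning 0 for an unknown name matches the documented contract; it works as a failure marker because no named character reference maps to code point 0.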


Variable Documentation

const char* html_cref[0x10000] [static]
 

A list of HTML character references for UNICODE characters.


Generated on Sat Jul 23 16:04:57 2005 for sobject by  doxygen 1.3.9.1