UnicodeBom

Convert Unicode content.

Unicode is an encoding of textual material. The purpose of this module is to interface an external encoding with a programmer-defined internal encoding. The internal encoding is declared via the template argument T, whilst the external encoding is either specified or derived.

Three internal encodings are supported: char, wchar, and dchar (UTF-8, UTF-16, and UTF-32 code units respectively). The methods herein operate upon arrays of this type. That is, decode() returns an array of type T, while encode() expects an array of type T.

Supported external encodings are as follows:

Encoding.Unknown
Encoding.UTF_8N
Encoding.UTF_8
Encoding.UTF_16
Encoding.UTF_16BE
Encoding.UTF_16LE
Encoding.UTF_32
Encoding.UTF_32BE
Encoding.UTF_32LE

These can be divided into non-explicit and explicit encodings. The non-explicit encodings are:

Encoding.Unknown
Encoding.UTF_8
Encoding.UTF_16
Encoding.UTF_32


Constructors

this
this(Encoding encoding)

Construct an instance using the given external encoding, which is one of the Encoding.xx values.

Members

Functions

decode
T[] decode(void[] content, T[] dst, size_t* ate)

Convert the provided content. The content is inspected for a BOM signature, which is stripped. An exception is thrown if a signature is present when, according to the encoding type, it should not be. Conversely, an exception is thrown if there is no known signature where the current encoding expects one to be present.
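The two error rules above can be modelled in a few lines. The following is a simplified Python sketch, not the module's D implementation: the name decode_sketch, the use of None to stand for a non-explicit encoding, and the choice of ValueError are all assumptions made for illustration.

```python
import codecs
from typing import Optional

# Known signatures; the 4-byte UTF-32 marks are listed before the
# 2-byte UTF-16 marks because UTF-32LE begins with the UTF-16LE bytes.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_sketch(content: bytes, explicit_codec: Optional[str] = None) -> str:
    """Hypothetical model: explicit_codec=None means 'non-explicit',
    so a signature is expected; otherwise a signature is an error."""
    found = next(
        ((bom, name) for bom, name in SIGNATURES if content.startswith(bom)),
        None,
    )
    if explicit_codec is None:
        if found is None:
            raise ValueError("unknown or missing BOM")
        bom, name = found
        # Strip the signature, then decode with the derived encoding.
        return content[len(bom):].decode(name)
    if found is not None:
        raise ValueError("explicit encoding does not permit a BOM")
    return content.decode(explicit_codec)
```

For example, decode_sketch(codecs.BOM_UTF8 + b"hi") derives UTF-8 from the signature, while decode_sketch(b"hi", "utf-8") accepts unsigned content because the encoding was stated explicitly.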

encode
void[] encode(T[] content, void[] dst)

Perform encoding of content. Note that the encoding must be of the explicit variety by the time this method is called.
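As an illustration of why only explicit encodings make sense when writing, here is a small Python sketch. It is a model under stated assumptions rather than the module's D code: the EXPLICIT table, the encode_sketch name, and the decision to raise ValueError for a non-explicit encoding are hypothetical.

```python
# Hypothetical mapping from the explicit Encoding names in this
# document to Python codec names that fix the byte order and emit no BOM.
EXPLICIT = {
    "UTF_8N": "utf-8",
    "UTF_16BE": "utf-16-be",
    "UTF_16LE": "utf-16-le",
    "UTF_32BE": "utf-32-be",
    "UTF_32LE": "utf-32-le",
}

def encode_sketch(content: str, encoding: str) -> bytes:
    """Encode content; reject encodings that do not pin down a format."""
    codec = EXPLICIT.get(encoding)
    if codec is None:
        # "UTF_16" alone does not say which byte order to write.
        raise ValueError("cannot write to a non-explicit encoding: " + encoding)
    return content.encode(codec)
```

With this model, "UTF_16BE" and "UTF_16LE" produce different byte sequences for the same text, whereas "UTF_16" is rejected because the output format would be ambiguous.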

Static functions

from
void[] from(T[] x, uint type, void[] dst, size_t* ate)

Convert from T into the given 'type'.

into
T[] into(void[] x, uint type, T[] dst, size_t* ate)

Convert from 'type' into the given T.

Inherited Members

From BomSniffer

encoding
Encoding encoding [@property getter]

Return the current encoding. This is either the originally specified encoding, or a derived one obtained by inspecting the content for a BOM. The latter is performed as part of the decode() method.

encoded
bool encoded [@property getter]

Return whether an encoding signature was located in the text (configured via setup()).

signature
const(void)[] signature [@property getter]

Return the signature (BOM) of the current encoding.

setup
void setup(Encoding encoding, bool found)

Configure this instance with Unicode converters.

test
const(Info)* test(void[] content)

Scan the BOM signatures looking for a match. The signatures are scanned in reverse order so that the longest match is found first.
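The reason for preferring the longest match is that the UTF-32LE signature (FF FE 00 00) begins with the UTF-16LE signature (FF FE). A minimal Python sketch of the idea follows; the table layout and the name sniff are assumptions for illustration and do not reflect the module's internal Info table.

```python
import codecs

# Shorter signatures first, longer ones last; scanning in reverse
# therefore tries the 4-byte UTF-32 marks before the 2-byte UTF-16 marks.
SIGNATURES = [
    ("UTF_8", codecs.BOM_UTF8),
    ("UTF_16BE", codecs.BOM_UTF16_BE),
    ("UTF_16LE", codecs.BOM_UTF16_LE),
    ("UTF_32BE", codecs.BOM_UTF32_BE),
    ("UTF_32LE", codecs.BOM_UTF32_LE),
]

def sniff(content: bytes):
    """Return (name, signature) for the first match, or None."""
    for name, bom in reversed(SIGNATURES):
        if content.startswith(bom):
            return name, bom
    return None
```

Scanning the same table front to back would report UTF-32LE content as UTF_16LE, since the shorter signature would match first.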

Detailed Description

The explicit encodings are:

Encoding.UTF_8N
Encoding.UTF_16BE
Encoding.UTF_16LE
Encoding.UTF_32BE
Encoding.UTF_32LE

The group of non-explicit encodings may be used to 'discover' an unknown encoding, by examining the first few bytes of the content for a signature. This signature is optional, but is often written so that the content is self-describing. When an encoding is unknown, using one of the non-explicit encodings will cause the decode() method to look for a signature and adjust itself accordingly. It is possible that a leading ZWNBSP character (U+FEFF) might be confused with the signature; current Unicode content is supposed to use the WORD JOINER character (U+2060) instead.
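The ZWNBSP ambiguity can be seen directly at the byte level: the signature is simply U+FEFF in the relevant encoding, so content that starts with that character is indistinguishable from signed content. The short Python check below illustrates this for UTF-8; the helper name is hypothetical.

```python
import codecs

ZWNBSP = "\ufeff"       # ZERO WIDTH NO-BREAK SPACE, same code point as the BOM
WORD_JOINER = "\u2060"  # the recommended replacement character

def starts_with_utf8_signature(data: bytes) -> bool:
    """True when the data begins with the UTF-8 signature bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

# Text that begins with ZWNBSP looks exactly like signed content ...
legacy = (ZWNBSP + "text").encode("utf-8")
# ... whereas text that begins with WORD JOINER does not.
modern = (WORD_JOINER + "text").encode("utf-8")
```

Here starts_with_utf8_signature(legacy) is true even though no signature was intended, while the WORD JOINER variant is not mistaken for one.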

The group of explicit encodings is for use when the content encoding is known. These *must* be used when converting back to an external encoding, since written content must be in a known format. Note that, during a decode() operation, the presence of a signature conflicts with these explicit varieties.

See
http://www.utf-8.com/
http://www.hackcraft.net/xmlUnicode/
http://www.unicode.org/faq/utf_bom.html/
http://www.azillionmonkeys.com/qed/unicode.html/
http://icu.sourceforge.net/docs/papers/forms_of_unicode/
