Fast Infoset

   
By Paul Sandoz, Alessandro Triglia and Santiago Pericas-Geertsen, June 2004


This article presents the concepts and ideas of the Fast Infoset specification and fast infoset documents.

The Fast Infoset standard draft (currently being developed as joint work by ISO/IEC JTC 1 and ITU-T) specifies a binary format for XML infosets that is an efficient alternative to XML. An instance of this binary format is called a fast infoset document. Fast infoset documents are analogous to XML documents. Each has a physical form and an XML infoset. Fast infoset documents are, given the results presented, faster to serialize and parse, and smaller in size, than the equivalent XML documents. Thus, fast infoset documents may be used whenever the size and processing time of XML documents are an issue.

The binary format is optimized to balance the needs of both document size and processing time. Fast infoset documents are useful in a number of domains from bandwidth- and resource-constrained mobile devices to high-bandwidth high-throughput systems. In general, smaller documents are possible at the expense of either increased processing or loss of self-description and dependence on a schema. Faster processing is possible at the expense of loss of self-description and dependence on a schema. For example, standard compression (LZH) or XML-specific compression techniques (XMill) may be applied to XML documents to obtain smaller document sizes, but this adds to the processing time, especially in the compression phase, and can affect server-side performance.




The Fast Infoset Standard and Abstract Syntax Notation One (ASN.1)

To facilitate interoperability, the Fast Infoset specification uses the existing and proven ASN.1 standards. The specification is being standardized as an ITU-T Recommendation | International Standard within ITU-T SG 17 and ISO/IEC JTC1 SC6.

ASN.1 is a formal language for abstractly describing messages that may be encoded using one of the set of ASN.1 Encoding Rules.

This article is consistent with terms specified in, and the general direction of, the latest draft of the Fast Infoset standard (as of publication of this article). Since the standard is still under development, this article should be considered a work in progress.

Additional material and ideas are also presented. Whenever such material relates to known issues or possible future enhancements of the standard, this will be stated. Further material, not directly related to the standard, is also presented.

The Fast Infoset specification specifies an ASN.1 Schema supporting the XML Information Set (an ASN.1 module). The ASN.1 types and components in the ASN.1 module describe information items and properties of those items.

A fast infoset document is an encoding of a fast infoset value (an ASN.1 value) whose ASN.1 type, defined in the ASN.1 module, corresponds to the document information item. The default encoding of a fast infoset document uses the Packed Encoding Rules with extensions.

Fast infoset documents may be serialized and parsed, just like XML documents. Figure 1 highlights the serialization and parsing process as follows:

  • The serialization process producing a fast infoset document is, conceptually, the result of constructing a fast infoset value from an XML infoset and encoding the fast infoset value to produce a fast infoset document.
  • The parsing process producing an XML infoset is, conceptually, the result of decoding a fast infoset document to produce a fast infoset value and constructing an XML infoset from the fast infoset value.


Figure 1. Conceptual serialization and parsing of fast infoset documents

The construction of a fast infoset value from an XML infoset or vice versa will result in the use of tables and indexing to compress common string information. This will reduce document size while still ensuring that the documents may be serialized and parsed efficiently.

The conceptual steps presented in Figure 1 can be optimized so that it is not necessary for a complete fast infoset value to exist in memory at any time during the process, so long as the resulting XML infoset (or the resulting fast infoset document) is the same as if the complete fast infoset value was constructed. This allows for efficient implementation where, for example, the XML infoset corresponds to a DOM document or a set of SAX events.

Related Standards

The Fast Infoset standard is one of a consistent set of ASN.1-based standards for improving the performance of XML processing.

Fast Web Services

Fast Web Services (also being standardized as an ITU-T Recommendation | ISO/IEC International Standard in parallel with Fast Infoset) makes use of fast infoset documents for carrying the content of a SOAP message (a SOAP header block, a child of a SOAP body, or a child of a SOAP fault detail) when:

  • A schema is not used or not available; or
  • It is stated in the WSDL, as an annotation of the SOAP binding, that the Fast Infoset should be used.


In addition, Fast Web Services defines a specific MIME media type, "application/soap+fastinfoset", for SOAP messages that are serialized as fast infoset documents. This is compatible with the SOAP HTTP binding.

SOAP 1.2 states:

"[a conforming implementation of the SOAP HTTP binding] MAY send requests and responses using other media types providing that such media types provide for at least the transfer of SOAP XML Infoset".

Mapping W3C XML Schema Definitions Into ASN.1

It is expected that a future amendment to ITU-T Rec. X.694 | ISO/IEC 8825-5, Mapping W3C XML Schema Definitions into ASN.1 (sometimes referred to as Fast Schema), will specify the use of fast infoset documents in the following cases:

  • A wildcard that is the term of a particle (xsd:any)
  • An element declaration whose type definition is "xsd:anyType"


Currently X.694 specifies the use of a character string in these cases.

Using X.694 in conjunction with the Fast Infoset specification will result in encodings that can be processed more efficiently and be smaller in size whenever the original schema contains wildcards or element declarations of type "xsd:anyType".

Advantages and Properties of the Fast Infoset

There are a number of generic advantages of fast infoset documents, and specific properties of the binary format, that contribute to faster parsing, faster serializing, and smaller document sizes when compared with equivalent XML documents.

Such generic advantages are:

  • No end-tags. The duplication of characters for end-tags is not required.
  • No escaping of character data. Checking each character to see if it needs to be escaped can be time-consuming, and replacing content may result in additional memory usage and copying.
  • Length-prefixing of content. Length-prefixing enables a decoder to allocate resources accurately and possibly reject content immediately if the length is considered too large.
  • Indexing of repeated strings. Indexing reduces the size of the document by replacing a commonly-used string with an integer. Element and attribute names are examples of such repeated strings.
  • Indexing of qualified names. This enables a decoder to associate an index with a qualified name and obtain a particular object (associated with that index), which was previously constructed and added to a table to be indexed. Thus, repeated calculation to obtain the [namespace name] given the [prefix] and [in-scope namespaces] of an element information item or an attribute information item is not required. This approach will also reduce the size of Fast Infoset documents since repeated [prefix] indexes are not required.
  • Embedding of binary content. Binary content does not need to be converted to and from the Base64 character representation.
  • Preservation of state for documents with similar vocabularies. Indexing can be preserved and reused for multiple documents that make use of the same information items.
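The length-prefixing and no-escaping advantages listed above can be sketched in a few lines of Java. This is only an illustration: the fixed 4-byte length prefix and the `MAX_LENGTH` limit are assumptions made for the sketch and are not the length encoding defined by the Fast Infoset draft.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixSketch {
    // Hypothetical limit: a decoder may refuse content longer than this.
    static final int MAX_LENGTH = 1024;

    // Writes the octet length first, then the raw UTF-8 octets (no escaping is needed).
    static byte[] write(String content) {
        byte[] utf8 = content.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length).putInt(utf8.length).put(utf8).array();
    }

    // Reads the length, rejects oversized content before allocating,
    // then reads exactly that many octets.
    static String read(byte[] encoded) {
        ByteBuffer in = ByteBuffer.wrap(encoded);
        int length = in.getInt();
        if (length < 0 || length > MAX_LENGTH || length > in.remaining()) {
            throw new IllegalArgumentException("content length " + length + " rejected");
        }
        byte[] utf8 = new byte[length];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Because the reader knows the length before it sees the content, characters such as "<" and "&" can be carried as-is, and an oversized chunk can be refused without buffering it.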


A number of specific, and advantageous, properties of the binary format for fast infoset documents are:

  • Huffman-style encoding of [children] information items. More-common information items that are among the [children] of an element information item or a document information item are encoded using fewer bits than less-common items. For example, an element information item and a chunk of character information items will be encoded in fewer bits than a processing instruction information item.
  • Byte alignment: Alignment on well-defined and known boundaries makes for more efficient implementation of encoders and decoders as well as ease of implementation.
  • Packing of indexes and length prefixes: The packing of integer values (associated with the indexing of a string or the length of content) makes for smaller sizes of Fast Infoset documents at little expense to the serialization and parsing of such documents.
  • Indefinite length of [children]: To support streaming (for example, to support SAX-based serializers) the [children] of an element information item or a document information item are encoded in a way that does not require that their number be known in advance.
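To give a feel for what packing of indexes and length prefixes means, the following Java sketch encodes a non-negative integer in as few octets as possible while staying byte-aligned. It uses a generic 7-bits-per-octet scheme chosen purely for illustration; it is an assumption of this sketch and not the bit layout specified by the Fast Infoset draft.

```java
import java.io.ByteArrayOutputStream;

final class PackedIndexSketch {
    // Encodes a non-negative int in 7-bit groups, low-order group first.
    // The high bit of each octet signals that another octet follows.
    static byte[] pack(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value >= 0x80) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reassembles the integer from its 7-bit groups.
    static int unpack(byte[] octets) {
        int value = 0;
        int shift = 0;
        for (byte octet : octets) {
            value |= (octet & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }
}
```

With this scheme, small indexes (the common case for frequently used names) occupy a single octet, and larger values grow one octet at a time.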


Supported XML Information Items and Properties

Fast infoset documents support all of the specified information items, and a subset of the properties of information items. This results in two minor issues:

  1. The excluded properties need to be computed from the information items constructed from a fast infoset document. This is no different from constructing an information set from parsing (and possibly validating) an XML document.
  2. Synthetic infosets (constructed by means other than a parse) may have no correspondence to fast infoset documents (this also applies to XML documents). In this case, resolution of such inconsistencies will be required (for example, resolving in-scope namespaces).


Table 1 presents a subset of information items (common items) with the corresponding supported properties highlighted in bold. Properties such as [parent], [owner-element] and [in-scope namespaces] can be computed from the supported properties of the same item or other items. Other properties such as [specified], [attribute type] and [references] can be determined from the DTD. Further properties such as [document element] and [all declarations processed] have fixed values.

A fast infoset document will also chunk character information items into a single ASN.1 type that represents a sequence of such items.

In addition to the information items specified in the XML Infoset recommendation, a further item, the octet information item, is supported. A fast infoset document will chunk octet information items into a single ASN.1 type that represents a sequence of such items.

Table 1. Information Items and supported properties (in bold)

Document Information Item     Element Information Item  Attribute Information Item
----------------------------  ------------------------  --------------------------
[children]                    [namespace name]          [namespace name]
[document element]            [local name]              [local name]
[notations]                   [prefix]                  [prefix]
[unparsed entities]           [children]                [normalized value]
[base URI]                    [attributes]              [specified]
[character encoding scheme]   [namespace attributes]    [attribute type]
[standalone]                  [in-scope namespaces]     [references]
[version]                     [base URI]                [owner element]
[all declarations processed]  [parent]

 
Octet Information Items

Since a fast infoset document is a binary encoding, it is possible to embed binary content in it directly as a sequence of octets. It is not necessary to represent the binary content as a Base64 string (as is usually done in the case of XML documents) and include the corresponding sequence of character information items in the infoset.

The Fast Infoset standard extends the XML Information Set to support the concept of an octet information item. According to this conceptual extension of the XML Information Set, there is an octet information item for each octet that appears in a fast infoset document. Each octet is a logically separate information item, but applications are free to chunk octets into larger groups (sequence of octets) as necessary or desirable.

An octet information item has the following properties:

  1. [octet] An octet (an integer value in the range 0 to 255).
  2. [parent] The element information item which contains this information item in its [children] property.


A sequence of character information items can be constructed, using the Base64 encoding, from a sequence of octet information items.
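As a minimal sketch of that construction, the Java code below converts a chunk of octets to Base64 characters and back. It assumes Java 8 or later for `java.util.Base64`; the class and method names are hypothetical and only serve the illustration.

```java
import java.util.Base64;

final class OctetChunkSketch {
    // Builds the character form of a chunk of octets using Base64.
    static String toCharacters(byte[] octets) {
        return Base64.getEncoder().encodeToString(octets);
    }

    // Recovers the original octets from the Base64 characters.
    static byte[] toOctets(String characters) {
        return Base64.getDecoder().decode(characters);
    }
}
```

Every 3 octets become 4 characters, which is the size and conversion cost that direct embedding of octets avoids.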

Tables and Indexing

The use of tables and indexing is the primary mechanism by which the Fast Infoset compresses many of the strings present in an infoset. Recurring strings may be replaced with an index (an integer value) which points to a string in a table.

A serializer will add the first occurrence of a common string to the string table, and then, on the next occurrence of that string, refer to it using an index into the table. A hash table can be used for efficient checking of strings (the string being the key to obtaining the index; every time a unique string is added to the hash table, the index of the table is incremented).
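The serializer side of this scheme might look like the following Java sketch. The class name and the convention of returning -1 for a first occurrence are assumptions made for illustration; they are not part of the specification.

```java
import java.util.HashMap;
import java.util.Map;

final class SerializerStringTable {
    // Maps each string seen so far to the index it was assigned.
    private final Map<String, Integer> indexes = new HashMap<>();

    // Returns the index of a previously seen string, or -1 if this is the
    // first occurrence (the string is then added and must be written literally).
    int lookupOrAdd(String value) {
        Integer index = indexes.get(value);
        if (index != null) {
            return index;
        }
        indexes.put(value, indexes.size()); // next free index
        return -1;
    }
}
```

The table size doubles as the next index, so indexes are assigned in order of first occurrence, starting at 0.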

A parser will add the first occurrence of a common string to the string table, and then, on the next occurrence of that string, obtain the string by using the index into the table. A simple array (which usually grows but may possibly shrink under certain conditions) can be used.
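On the parser side, a growable array is enough, as in this Java sketch (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

final class ParserStringTable {
    // Position in the list is the index assigned to the string.
    private final List<String> strings = new ArrayList<>();

    // Called when the document carries a literal string that is to be indexed.
    // Returns the index assigned to it.
    int add(String value) {
        strings.add(value);
        return strings.size() - 1;
    }

    // Called when the document carries an index instead of the string.
    String get(int index) {
        return strings.get(index);
    }
}
```

Lookup by index is a constant-time array access; no hashing or string comparison is needed while parsing.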

The indexing of qualified names (a tuple of the [namespace name], [prefix] properties of a namespace information item and the [local name] property of an element information item or an attribute information item) is a second level of indexing. A serializer will add the first occurrence of a qualified name tuple to the table (after adding each string to the string table if it has not occurred before) and then, on the next occurrence of that qualified name tuple, refer to it using an index into the table.
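A qualified name table for a serializer can be sketched in Java by keying a hash table on the whole tuple. The use of a three-element list as the key and the -1 return value for a first occurrence are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QualifiedNameTableSketch {
    // Key: ([namespace name], [prefix], [local name]); value: assigned index.
    private final Map<List<String>, Integer> indexes = new HashMap<>();

    // Returns the index of a previously seen tuple, or -1 if the tuple
    // was just added and must be written out in full.
    int lookupOrAdd(String namespaceName, String prefix, String localName) {
        List<String> key = Arrays.asList(namespaceName, prefix, localName);
        Integer index = indexes.get(key);
        if (index != null) {
            return index;
        }
        indexes.put(key, indexes.size());
        return -1;
    }
}
```

Two tuples that differ only in prefix receive different indexes, so a decoder can recover the exact [prefix] as well as the [namespace name] from a single integer.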

There are distinct string tables for the following strings:

  • [prefix] strings
  • [namespace name] strings
  • [local name] strings
  • [normalized value] strings
  • generic strings


The generic string table will be used for chunks of character information items.

Distinct string tables produce smaller indexes and therefore reduce the overall size of fast infoset documents. Besides, they allow the serializer to 'tune' the indexing according to the common properties of the strings in question. For example, prefixes will be short (often less than 5 characters) and namespace names will be URIs (often URLs with a common sequence of characters at the start of the string: "http://").

Mandatory and Optional Constrained Indexing

It is considered appropriate to make the indexing of certain strings mandatory, while allowing other types of strings to be optionally indexed with encoder constraints.

Strings corresponding to [namespace name], [local name], and [prefix] properties are very likely to repeat. For these strings, the overhead (size and processing time) of encoding a boolean that indicates the addition of a string to the table may be undesirable, especially when byte alignment and packing of index and length prefixes are important (an extra bit would result in less packing).

Strings corresponding to the [normalized value] property and to a chunk of character information items are less likely to repeat, hence it is considered appropriate to let the encoder choose whether to index such strings or not. The serializer can apply certain policies such as only indexing strings shorter than a given length or using contextual information to index common words in a language (potentially splitting strings into substrings such that indexing can be applied to the substrings).
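One such policy, indexing only strings shorter than a given length, is sketched below in Java. The class name, the sentinel return values, and the choice of limit are hypothetical and only illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

final class ConstrainedIndexingSketch {
    static final int ADDED = -1;        // first occurrence: written literally and added
    static final int NOT_INDEXED = -2;  // too long: written literally, never added

    private final int lengthLimit;
    private final Map<String, Integer> indexes = new HashMap<>();

    ConstrainedIndexingSketch(int lengthLimit) {
        this.lengthLimit = lengthLimit;
    }

    // Indexes only strings shorter than lengthLimit characters.
    int lookupOrAdd(String value) {
        if (value.length() >= lengthLimit) {
            return NOT_INDEXED;
        }
        Integer index = indexes.get(value);
        if (index != null) {
            return index;
        }
        indexes.put(value, indexes.size());
        return ADDED;
    }
}
```

Long, probably unique strings then never consume table entries, while short values that tend to repeat are still replaced by an index.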

Under certain circumstances, it may not be possible for a serializer to perform mandatory indexing.

Pre-Computed Tables and Base Tables

A pre-computed table, which may be considered an additional property of the document information item, can improve the performance of the parsing process and of the serializing process (for example, a serializer can refer to indexes without having to search for strings to create indexes, and a decoder may not need to dynamically grow arrays) but is unlikely to reduce the size of fast infoset documents (unless there is specific knowledge about what strings may occur as character information items).

A pre-computed table may be calculated from a schema (a W3C XML Schema or a DTD). Since this is a serializer's option, it is not necessary to specify what strings are indexed and the order of indexing.

A base table, which is a property of the pre-computed table, is a URI that identifies either an XML document or a fast infoset document, which may itself contain a pre-computed table (but no further base table). Either can provide a well-known vocabulary.

A caching mechanism may be used (with appropriate techniques to ensure that the cache is updated if such documents change), such that a fast infoset document may be reduced in size (because all or most strings are referred to by index and are not included in the document).
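The effect of a pre-computed table can be sketched in Java as follows. The vocabulary shown is a hypothetical list of local names assumed to be known to both serializer and parser (for example, derived from a schema or obtained through a base table); none of these names are defined by the specification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrecomputedTableSketch {
    // Hypothetical shared vocabulary, in the agreed index order.
    static final List<String> VOCABULARY = Arrays.asList("Envelope", "Header", "Body");

    // Serializer view: every vocabulary string already has an index,
    // so no literal string needs to be written for it.
    static Map<String, Integer> serializerTable() {
        Map<String, Integer> indexes = new HashMap<>();
        for (int i = 0; i < VOCABULARY.size(); i++) {
            indexes.put(VOCABULARY.get(i), i);
        }
        return indexes;
    }

    // Parser view: a fixed-size array that does not need to grow for these strings.
    static String[] parserTable() {
        return VOCABULARY.toArray(new String[0]);
    }
}
```

Both sides start from the same table, so the first occurrence of "Body" in a document can already be written as index 2.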

Examples of Indexing

Two examples are presented below.

The first example shows an XML document and the indexes that will be associated with its items (strings and qualified names). The index that will be used for each item (either a string or a qualified name) is shown as a number surrounded by curly brackets before the item. An indexed item is shown as a number surrounded by square brackets replacing the item. Every string that is a chunk of character information items will be indexed in the generic string table.

The XML document in this example is:

    <root>
      <tag>one</tag>
      <tag>two</tag>
      <anotherTag>one</anotherTag>
    </root>
                


The corresponding representation with indexed strings and indexed qualified names is presented below in a symbolic form (end tags are shown only for readability, since they are not required in fast infoset documents).

    {0}<root>
      {1}<tag>{0}one</tag>
      [1]<>{1}two</tag>
      {2}<anotherTag>[0]</anotherTag>
    </root>
                


The statement " [1]<> {1}two </tag> " means that the element information item has a qualified name that corresponds to index 1 in the qualified name table and that an index of 1 will be used for the chunk of character information items that is "two" in the generic string table.
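The indexes of this first example can be reproduced with a short Java sketch that keeps one table for qualified names and one for generic strings. The class, its methods, and the flat list of (element name, text) pairs standing in for the document are assumptions made for illustration; end tags are omitted from the output because they are not needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexingExampleSketch {
    // "{n}<name>" on first occurrence of a name, "[n]<>" on later occurrences.
    static String name(Map<String, Integer> names, String value) {
        Integer index = names.get(value);
        if (index != null) {
            return "[" + index + "]<>";
        }
        names.put(value, names.size());
        return "{" + (names.size() - 1) + "}<" + value + ">";
    }

    // "{n}text" on first occurrence of a string, "[n]" on later occurrences.
    static String text(Map<String, Integer> strings, String value) {
        Integer index = strings.get(value);
        if (index != null) {
            return "[" + index + "]";
        }
        strings.put(value, strings.size());
        return "{" + (strings.size() - 1) + "}" + value;
    }

    // Walks the elements of the example document in document order.
    static List<String> symbolicForm() {
        Map<String, Integer> names = new HashMap<>();
        Map<String, Integer> strings = new HashMap<>();
        String[][] elements = {
            {"root", null}, {"tag", "one"}, {"tag", "two"}, {"anotherTag", "one"}
        };
        List<String> lines = new ArrayList<>();
        for (String[] element : elements) {
            String line = name(names, element[0]);
            if (element[1] != null) {
                line += text(strings, element[1]);
            }
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : symbolicForm()) {
            System.out.println(line);
        }
    }
}
```

Running `main` prints `{0}<root>`, `{1}<tag>{0}one`, `[1]<>{1}two` and `{2}<anotherTag>[0]`, one per line, matching the symbolic form above without its end tags.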

The second example shows a SOAP 1.2 message, with one SOAP header block and a SOAP body containing a UBL document.

    <env:Envelope
        xmlns:env="http://www.w3.org/2003/05/soap-envelope">
      <env:Header>
        <xa:transaction
            xmlns:xa="http://example.org/transaction">
          <xa:id>3781</xa:id>
        </xa:transaction>
      </env:Header>
      <env:Body>
        <ors:OrderResponseSimple
            xmlns:ors="urn:oasis:names:tc:ubl:OrderResponseSimple:1.0:0.70"
            xmlns:cat="urn:oasis:names:tc:ubl:CommonAggregateTypes:1.0:0.70">
          <cat:ID>1</cat:ID>
          <cat:IssueDate>2003-02-03</cat:IssueDate>
          <ors:AcceptedIndicator>1</ors:AcceptedIndicator>
          <ors:RejectionReasonCode/>
          <cat:Note/>
          <cat:ReferencedOrder>
            <cat:BuyersOrderID>20031234-1</cat:BuyersOrderID>
            <cat:SellersOrderID>154135798</cat:SellersOrderID>
            <cat:IssueDate>2003-02-03</cat:IssueDate>
          </cat:ReferencedOrder>
        </ors:OrderResponseSimple>
      </env:Body>
    </env:Envelope>
                


The corresponding representation with indexed strings and indexed qualified names is presented below in a symbolic form:

    {0}<{0}env:{0}Envelope
        xmlns:[0]={0}"http://www.w3.org/2003/05/soap-envelope">
      {1}<[0]:{1}Header>
        {2}<{1}xa:{2}transaction
            xmlns:[1]={1}"http://example.org/transaction">
          {3}<[1]:{3}id>{0}3781</xa:id>
        </xa:transaction>
      </env:Header>
      {4}<[0]:{4}Body>
        {5}<{2}ors:{5}OrderResponseSimple
            xmlns:[2]={2}"urn:oasis:names:tc:ubl:OrderResponseSimple:1.0:0.70"
            xmlns:{3}cat={3}"urn:oasis:names:tc:ubl:CommonAggregateTypes:1.0:0.70">
          {6}<[3]:{6}ID>{1}1</cat:ID>
          {7}<[3]:{7}IssueDate>{2}2003-02-03</cat:IssueDate>
          {8}<[2]:{8}AcceptedIndicator>{3}1</ors:AcceptedIndicator>
          {9}<[2]:{9}RejectionReasonCode/>
          {10}<[3]:{10}Note/>
          {11}<[3]:{11}ReferencedOrder>
            {12}<[3]:{12}BuyersOrderID>{4}20031234-1</cat:BuyersOrderID>
            {13}<[3]:{13}SellersOrderID>{5}154135798</cat:SellersOrderID>
            [7]<>[2]</cat:IssueDate>
          </cat:ReferencedOrder>
        </ors:OrderResponseSimple>
      </env:Body>
    </env:Envelope>
                


Note that the attribute information items in the [namespace attributes] property of an element information item are treated differently from the attribute information items in its [attributes] property. It is not necessary to index the "xmlns" prefix and it is only necessary to index the [local name] that is the [prefix] used for the namespace.

The statement " {0}< {0} env: {0}Envelope " means that an index of 0 will be used for the "env" prefix, for the "Envelope" [local name], and for the qualified name that consists of ("http://www.w3.org/2003/05/soap-envelope", "env", "Envelope"). The same index value can be used because [prefix]s, [local name]s, and qualified names are stored in different tables. The statement " [7]<> [2] </cat:IssueDate> " means that the element information item has a qualified name that corresponds to index 7 in the qualified name table and a chunk of character information items that correspond to index 2 in the generic string table. Table 2 presents the corresponding string tables for the SOAP message.

Table 2. String tables for example SOAP message

Index  Namespace name                                   Prefix  Local name           Generic string
-----  -----------------------------------------------  ------  -------------------  --------------
0      http://www.w3.org/2003/05/soap-envelope          env     Envelope             3781
1      http://example.org/transaction                   xa      Header               1
2      urn:oasis:names:tc:ubl:OrderResponseSimple:...   ors     transaction          2003-02-03
3      urn:oasis:names:tc:ubl:CommonAggregateTypes:...  cat     id                   1
4                                                               Body                 20031234-1
5                                                               OrderResponseSimple  154135798
6                                                               ID
7                                                               IssueDate
8                                                               AcceptedIndicator
9                                                               RejectionReasonCode
10                                                              Note
11                                                              ReferencedOrder
12                                                              BuyersOrderID
13                                                              SellersOrderID

 
Character Encoding

It has not yet been determined if the Fast Infoset specification should support just one character encoding or allow an open-ended set of character encodings, just as XML allows.

Support for at least UTF-8 and UTF-16 character encoding schemes would ensure that both Western and Asian documents can be efficiently supported (for example, Asian documents encoded in UTF-8 can result in larger sizes than equivalent documents encoded in UTF-16).
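The size difference can be checked with a few lines of Java. The sample strings are arbitrary choices for this sketch: two CJK characters (written as Unicode escapes) and a five-letter ASCII word.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizeSketch {
    // Number of octets needed to encode the string in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of octets needed to encode the string in UTF-16 (big-endian, no byte-order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String cjk = "\u6ce8\u6587";   // two CJK characters
        String ascii = "order";        // five ASCII characters
        System.out.println(utf8Length(cjk) + " vs " + utf16Length(cjk));     // 6 vs 4
        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii)); // 5 vs 10
    }
}
```

Each of these CJK characters takes three octets in UTF-8 but two in UTF-16, while ASCII text shows the opposite trade-off.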

The [character encoding scheme] property of the document information item declares the character encoding scheme that is used. The Fast Infoset uses the ASN.1 type UTF8String for this property.

To support multiple character encodings, the existing ASN.1 string types that support particular character encodings (for example UTF8String) cannot be used, and instead the ASN.1 type OCTET STRING is required. A value of the OCTET STRING will be the sequence of bytes obtained from the characters encoded using the declared character encoding scheme.

Although in general it is not regarded as good ASN.1 practice to represent character strings with an OCTET STRING type, this is considered appropriate in this case, and it is not expected that it will increase the complexity of the implementations of the Fast Infoset standard.

Java implementations on J2SE 1.4.x and later may use the NIO library for modular character encoding support. By default, the following character encoding schemes are supported:

  • US-ASCII: Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
  • ISO-8859-1: ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
  • UTF-8: Eight-bit UCS Transformation Format
  • UTF-16BE: Sixteen-bit UCS Transformation Format, big-endian byte order
  • UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order
  • UTF-16: Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
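A decoder that receives character content as octets, together with the name of the declared character encoding scheme, could use `java.nio.charset.Charset` roughly as sketched below. The class and method names are hypothetical; only `Charset.isSupported` and `Charset.forName` are standard library calls.

```java
import java.nio.charset.Charset;

final class CharsetSupportSketch {
    // The six charset names listed above.
    static final String[] STANDARD = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // Checks that the running Java platform supports all of the listed charsets.
    static boolean allSupported() {
        for (String name : STANDARD) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    // Decodes octets using the charset named by the declared encoding scheme.
    static String decode(byte[] octets, String declaredScheme) {
        return new String(octets, Charset.forName(declaredScheme));
    }
}
```

For example, `decode(new byte[] {0, 0x41}, "UTF-16BE")` yields the string "A".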

Encoder/Decoder Processing Instructions

Aggressive indexing of character information items may result in an encoder having no more resources to index the mandatory properties of element information items and/or attribute information items. Alternatively, an encoder may decide that indexing is not appropriate for certain information items.

A suggestion has been made that such resource control of indexing could be provided by way of specific processing instruction information items that an encoder generates and a decoder processes and discards, thus such items will not be members of the constructed XML infoset. Three such instructions could be provided:

  1. Stop indexing
  2. Start indexing
  3. Reset indexing

An encoder might insert a stop indexing instruction, after which the encoder would not perform mandatory and optional indexing until the encoder inserts a start indexing instruction.

An encoder might insert a reset indexing instruction, after which the encoder would discard all previous indexes, thus all tables would be reset (all indexes removed) and indexing would start at 0.
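How an encoder-side table might react to these three instructions is sketched below in Java. The article leaves the exact semantics open, so this sketch assumes that "stop indexing" only prevents new entries from being added (existing indexes can still be referred to); the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class IndexingControlSketch {
    private final Map<String, Integer> indexes = new HashMap<>();
    private boolean indexing = true;

    void stop()  { indexing = false; }   // stop indexing: no new entries are added
    void start() { indexing = true; }    // start indexing: additions resume
    void reset() { indexes.clear(); }    // reset indexing: all tables emptied, next index is 0

    // Returns the index of a known string, or -1 when the string
    // must be written literally (whether or not it was added).
    int lookupOrAdd(String value) {
        Integer index = indexes.get(value);
        if (index != null) {
            return index;
        }
        if (indexing) {
            indexes.put(value, indexes.size());
        }
        return -1;
    }
}
```

A decoder would mirror the same state changes when it meets the corresponding instructions, so that both sides assign the same indexes.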

Given that these types of processing instruction information items would likely be far less common than other items (generally processing instruction information items are less common than other information items) it seems appropriate that the existing set of properties for a processing instruction information item be used rather than specifying an optimized form.

Canonical Fast Infoset

Canonicalization is an important feature for security, namely the signing of information. It has not yet been determined if the first revision of the Fast Infoset specification will specify a canonical form of fast infoset documents. Canonicalization can be difficult to specify correctly, and thus it may require more time. If so, it may be specified as a future amendment or as an additional specification.

A canonical form of a fast infoset document can be specified by:

  • Constraints on an XML information set
  • Constraints on fast infoset document features
  • Use of a canonical set of ASN.1 encoding rules


If two fast infoset documents have the same canonical form, then the two documents are logically equivalent within the given application context.

The constraints on the XML information set supported by a fast infoset document consist of the following:

  • UTF-8 character encoding scheme.
  • Normalization of line breaks occurring in a contiguous sequence of character information items.
  • Absent document type declaration information item.
  • Default attribute information items among the [namespace attributes] and [attributes].
  • Lexicographic order of the attribute information items among the [namespace attributes] and [attributes].


The constraints on fast infoset document features consist of the following:

  • A chunk of character information items must be a contiguous sequence and not split up into multiple chunks.
  • No pre-computed table.
  • No encoder/decoder processing instructions.
  • Fixed constraints for optional constrained indexing. A fixed value for the maximum length of a chunk of character information items or a [normalized value] to be indexed can be specified.


The Distinguished Encoding Rules (DER) may be used; however, this will not result in an efficient encoding of fast infoset documents, since they will be larger than those produced using the Packed Encoding Rules with extensions. The latter can be used because the features of the ASN.1 Schema for a subset of the XML Information Set ensure that the PER encoding with extensions is canonical.

Exclusive Canonicalization

Additional specification is required if canonicalization is applied to a fast infoset sub-document such that the sub-document is not dependent on information in the whole document, so that the sub-document can be removed from the document without requiring re-canonicalization. Such canonicalization is referred to as exclusive canonicalization. The mechanisms specified for XML documents by the Exclusive XML Canonicalization specification may apply to fast infoset documents if specified in terms of the XML Information Set.

Since indexing may result in indexes in the sub-document that were the result of indexing before the sub-document, further constraints are required to isolate the indexing for the sub-document. A reset processing instruction could be used, but this can result in unnecessary re-indexing after the sub-document. Alternatively, one more encoder/decoder processing instruction can be used:

  • push-pop indexing


The first occurrence of the push-pop instruction pushes the current table (calculated up to that point, including a pre-computed and base table) onto a stack (the same effect as a reset), and the second occurrence pops the original table off the stack (thus there is no nesting and the stack will contain, at most, one table). Two push-pop indexing instructions will surround the sub-document, and the constraints presented previously (where appropriate) apply to the items of the sub-document.
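The push-pop behaviour can be sketched in Java with a stack that holds at most one saved table. The class and method names are hypothetical, and a single string table stands in for the full set of tables.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class PushPopIndexingSketch {
    private Map<String, Integer> current = new HashMap<>();
    private final Deque<Map<String, Integer>> saved = new ArrayDeque<>();

    // First occurrence: save the current table and start with an empty one.
    // Second occurrence: discard the sub-document table and restore the saved one.
    void pushPop() {
        if (saved.isEmpty()) {
            saved.push(current);
            current = new HashMap<>();
        } else {
            current = saved.pop();
        }
    }

    // Returns the index of a known string, or -1 if it was just added.
    int lookupOrAdd(String value) {
        Integer index = current.get(value);
        if (index != null) {
            return index;
        }
        current.put(value, current.size());
        return -1;
    }
}
```

Indexes assigned inside the sub-document start again at 0 and never leak into the surrounding document, while the surrounding document continues with its original table afterwards.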

Features such as byte-alignment on well-defined boundaries will ensure efficient implementation when providing a sequence of octets (that corresponds to the sub-document) to be signed.

Java API for XML Processing (JAXP) Integration

Integration into JAXP using the SAX API requires that a Fast Infoset reader (for parsing fast infoset documents) implement the SAX XMLReader interface, and that a Fast Infoset handler (for serializing fast infoset documents) implement the SAX ContentHandler interface (and possibly additional interfaces such as the SAX LexicalHandler). Fast Infoset readers can only parse SAX InputSources with a byte stream or system identifier.

The code extract below creates a fast infoset document from an XML document:

// Create a SAX parser
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();

// Create a Fast Infoset handler to write a fast infoset document
// to the file fi_document
FastInfosetEncoderHandler fieh =
     new FastInfosetEncoderHandler(new File(fi_document));

// Parse an XML document from the file xml_document using
// the Fast Infoset handler
sp.parse(new File(xml_document), fieh);

The code extract below performs an identity transform to parse the fast infoset document created in the previous code extract and creates a copy, which is identical to the source:

// Create a transformer
Transformer tx = TransformerFactory.newInstance().newTransformer();

// Create a Fast Infoset reader
FastInfosetDecoderReader fidr = new FastInfosetDecoderReader();

// Create a Fast Infoset handler to write a fast infoset document
// to the file fi_document_cpy
fieh = new FastInfosetEncoderHandler(new File(fi_document_cpy));

// Create an input source, backed by a byte stream, for the
// previously created fast infoset document
InputSource is = new InputSource(new FileInputStream(fi_document));

// Perform the identity transform
tx.transform(new SAXSource(fidr, is), new SAXResult(fieh));

Finally, the code extract below parses the copy of the fast infoset document and creates an XML document, which should be equivalent to the original XML document used to create the first fast infoset document:

// Create an input source, backed by a byte stream
is = new InputSource(new FileInputStream(fi_document_cpy));

// Create a stream result
StreamResult sr = new StreamResult(new File(xml_document_cpy));

// Perform the identity transform
tx.transform(new SAXSource(fidr, is), sr);

The same integration mechanisms applied to SAX may also apply to the Streaming API for XML (StAX), although it is not yet integrated into JAXP. To utilize the StAX API, it is required that a Fast Infoset stream reader and a Fast Infoset stream writer implement the StAX XMLStreamReader and StAX XMLStreamWriter respectively.
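
As a rough illustration of what StAX-based integration would look like to application code, the sketch below copies a document from an XMLStreamReader to an XMLStreamWriter. It uses the JDK's ordinary text XML implementations as stand-ins. A Fast Infoset stream reader or writer is assumed to be usable in their place because it would implement the same interfaces. The class StaxCopy is a hypothetical name, and namespaces, comments and processing instructions are omitted for brevity.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: code written against the StAX interfaces only.
class StaxCopy {
    // Copies elements, attributes and text from the reader to the writer.
    static void copy(XMLStreamReader in, XMLStreamWriter out)
            throws XMLStreamException {
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    out.writeStartElement(in.getLocalName());
                    for (int i = 0; i < in.getAttributeCount(); i++) {
                        out.writeAttribute(in.getAttributeLocalName(i),
                                           in.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    out.writeCharacters(in.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    out.writeEndElement();
                    break;
                default:
                    break; // other event types are ignored in this sketch
            }
        }
        out.flush();
    }

    // Stand-in round trip using the JDK's text XML reader and writer.
    static String roundTrip(String xml) throws XMLStreamException {
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringWriter sink = new StringWriter();
        XMLStreamWriter out = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(sink);
        copy(in, out);
        return sink.toString();
    }
}
```

Because copy depends only on the two StAX interfaces, the same method could in principle convert between XML and fast infoset documents by pairing a reader for one format with a writer for the other.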

SAX Extension for Binary Content

Octet information items are not supported by existing XML-based APIs. Reading and writing octet information items requires either an extended API or a conversion between them and character information items.

The SAX API supports the concept of optional extensions through its feature and property mechanisms, which are put to good effect to support the processing of DTD declarations and lexical information (such as comment information items). In the same manner, the SAX API can be extended to support octet information items.

An instance of an interface:

public interface BinaryContentHandler {
        public void octets(byte[] buf, int start, int length)
               throws SAXException;
}

can be set as a property of the SAX XMLReader with, for example, the propertyId "http://xml.org/sax/properties/binary-content-handler".

If this property is not set, then a SAX reader will convert a sequence of octets to a sequence of character information items (Base64 encoding) and invoke the ContentHandler.characters method.
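
The fallback just described might be implemented inside a reader along the following lines. This is a sketch under assumptions: OctetDispatcher is a hypothetical helper, a null binary handler stands for "property not set", and java.util.Base64 is used for the conversion.

```java
import java.util.Arrays;
import java.util.Base64;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// The extension interface as given in the article.
interface BinaryContentHandler {
    void octets(byte[] buf, int start, int length) throws SAXException;
}

// Hypothetical helper a reader could call for each run of octets.
class OctetDispatcher {
    private final ContentHandler contentHandler;
    private final BinaryContentHandler binaryHandler; // null if not set

    OctetDispatcher(ContentHandler contentHandler,
                    BinaryContentHandler binaryHandler) {
        this.contentHandler = contentHandler;
        this.binaryHandler = binaryHandler;
    }

    void dispatch(byte[] buf, int start, int length) throws SAXException {
        if (binaryHandler != null) {
            // The application asked for raw octets: pass them through.
            binaryHandler.octets(buf, start, length);
            return;
        }
        // Otherwise convert to Base64 characters and report them as text.
        byte[] slice = Arrays.copyOfRange(buf, start, start + length);
        char[] chars = Base64.getEncoder().encodeToString(slice).toCharArray();
        contentHandler.characters(chars, 0, chars.length);
    }
}
```

An application unaware of the extension therefore still receives the binary content, only in the larger Base64 character form.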

Performance Results for a SAX-based Prototype

A SAX-based prototype has been implemented based on Fast Infoset ideas, some of which differ from the ideas presented in this article. However, it is anticipated that the results of an updated prototype will be similar to, or better than, the results presented here.

The prototype differs mainly in the approach to indexing of qualified names and namespaces: a single string using the [prefix], a colon and [local name] or [uri] is used, thus there is no separate [prefix] string table or [local name] table. This is potentially slower for first occurrences of qualified names and less compact than what is now proposed, since more work is required to obtain a prefix (substring operations), and a prefix is repeated multiple times.

Parsing, Serializing and Size Results

The first set of results uses the XBIS SAX-based performance test suite to measure parsing, serializing and the resulting size of documents.

Measurements were performed using J2SE 1.4.2 with Linux on a Sony Vaio PCG-GRX700P. Default JVM options were used. The Xerces SAX parser was used, as supplied by XBIS.

Two data sets of XML documents were chosen. The first data set, classified as small documents, consisted of the following XML documents:

  • A set of Universal Business Language (UBL) documents provided as samples from the UBL package version 0.7 (the UBL package is currently at version 1.0 Beta, and similar documents may be obtained). The set consists of 15 UBL documents ranging from 169 bytes to 10K with an average size of 5K.
  • A set of SOAP messages provided by the XBIS package. The set consists of 42 SOAP messages ranging from 373 bytes to 4K with an average size of 730 bytes.
  • A set of ANT scripts provided by the XBIS package. The set consists of 18 scripts ranging from 540 bytes to 10K with an average size of 5.5K.
  • A set of web-related Java Server Pages web application descriptions and tag lib descriptions. The set consists of 70 descriptions ranging from 173 bytes to 36K with an average size of 1.9K.


The second data set, classified as large documents, consisted of the following XML documents:

  • A set of Universal Business Language (UBL) documents provided as samples from the UBL package version 0.7 (the UBL package is currently at version 1.0 Beta, and similar documents may be obtained). The set consists of 9 UBL documents ranging from 11K to 70K with an average size of 39K.
  • The documents factbook.xml (size 4M), periodic.xml (size 114K), soap2.xml (size 131K) and weblog.xml (size 2.9M) provided by the XBIS package.


The fast infoset serializer was configured to optionally index strings less than 7 in length.
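
A length limit of this kind could be applied in a serializer's string table roughly as shown below. The sketch is illustrative only: LengthLimitedStringTable is a hypothetical class, the limit is counted in Java chars, and it assumes that the first occurrence of a string is written literally while later occurrences are replaced by an index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of optional indexing constrained by string length.
class LengthLimitedStringTable {
    static final int NOT_INDEXED = -1;

    private final int limit; // only strings shorter than this are indexed
    private final Map<String, Integer> indexes = new HashMap<>();

    LengthLimitedStringTable(int limit) {
        this.limit = limit;
    }

    // Returns the index to write in place of the string, or NOT_INDEXED
    // when the string must be written literally (too long, or first seen).
    int lookupOrAdd(String value) {
        if (value.length() >= limit) {
            return NOT_INDEXED; // never indexed, never stored
        }
        Integer existing = indexes.get(value);
        if (existing != null) {
            return existing;
        }
        indexes.put(value, indexes.size());
        return NOT_INDEXED; // first occurrence is written literally
    }

    int size() {
        return indexes.size();
    }
}
```

With new LengthLimitedStringTable(7), short repeated values such as "USD" are indexed from their second occurrence onwards, while longer text content never enters the table, which bounds the table's memory use.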

Results are presented as percentages, calculated by dividing the fast infoset measurement by the XML measurement and multiplying by 100. A value below 100% therefore means that the fast infoset document took less time to process, or was smaller, than the XML document.
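
The percentages and the "times faster" or "times smaller" factors quoted with the figures are related by simple arithmetic, sketched below. The class and method names are illustrative only.

```java
// Hypothetical helper relating the two ways the results are quoted.
class RelativeResult {
    // Fast infoset measurement as a percentage of the XML measurement.
    static double percent(double fastInfoset, double xml) {
        return fastInfoset / xml * 100.0;
    }

    // How many times faster (or smaller) a given percentage represents.
    static double factor(double percent) {
        return 100.0 / percent;
    }
}
```

For example, a parse % of 20% corresponds to a factor of 100 / 20 = 5, that is, parsing that is five times faster.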

Figure 2 presents the results for the small documents.

Figure 2. Parsing, serializing and document sizes results for small documents

The parse % is between 20% and 32% or approximately 5 to 3 times faster. The serialize % is between 10% and 18% or approximately 10 to 5 times faster. The size % is between 30% and 74% or approximately 3.3 to 1.3 times smaller.

Figure 3 presents the results for the large documents.

Figure 3. Parsing, serializing and document sizes results for large documents

The parse % is between 23% and 33% or approximately 4.3 to 3 times faster. The serialize % is between 9% and 17% or approximately 11 to 5 times faster. The size % is between about 20% and 60% or approximately 5 to 1.7 times smaller.

The XBIS measurement framework utilizes the XMLWriter, from David Megginson, for serializing XML documents. Further investigation is required to ascertain whether the XMLWriter performance can be improved, or whether alternative approaches (for example, JAXP and an identity transform) would perform better. Serializing XML documents using binding frameworks, or in application-specific scenarios, may result in much faster serialization; for example, the application may know that certain text content will not contain characters that require escaping.

The size of fast infoset documents will be related to the quantity of repeating information in the XML document. Small XML documents are likely to result in moderately smaller fast infoset documents; for example, the small SOAP messages reduce by 25% since there is not much repeating information. Large XML documents may result in better reduction because there is more chance of repeating information; for example, the soap2.xml XML document reduces by approximately 80%. However, large XML documents containing a large amount of text content, for example the factbook.xml XML document, are unlikely to result in a large reduction in size because the text content is not indexed, and it is unlikely that the text (without deconstructing the text into smaller parts) will repeat.

For large XML documents, an approximation of fast infoset document size may be obtained by using the heuristic measurement of the total size of encoded characters (attribute values, text content and comments). Figure 4 presents the % of encoded characters for the large XML documents. This correlates closely with the Fast Infoset size % in Figure 3.
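
One way to compute such a heuristic is to sum the lengths of attribute values, text content and comments reported by a SAX parser and divide by the document length. The sketch below does this with the JDK's SAX parser. EncodedCharacterCounter is a hypothetical name, and the measure counts Java chars rather than encoded bytes, which is an assumption that is exact only for single-byte content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical sketch: counts attribute value, text and comment characters.
class EncodedCharacterCounter extends DefaultHandler2 {
    long count;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            count += atts.getValue(i).length();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        count += length;
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        count += length;
    }

    // Returns counted characters as a percentage of the document length.
    static double percentOfDocument(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // xmlns declarations are not counted
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EncodedCharacterCounter counter = new EncodedCharacterCounter();
        reader.setContentHandler(counter);
        // Comments are only reported through the lexical handler property.
        reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", counter);
        reader.parse(new InputSource(new StringReader(xml)));
        return 100.0 * counter.count / xml.length();
    }
}
```

A document dominated by markup yields a low percentage and can be expected to shrink considerably, whereas a text-heavy document yields a high percentage and leaves less room for reduction.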

Figure 4. % of characters for large XML documents

XML Test Results

The second set of results uses a modified version of the XML Test performance suite (to support the Fast Infoset), developed to compare the performance of XML parsers in Java and .Net. XML Test is designed to mimic the processing that takes place in the lifecycle of an XML document, and measures the throughput of a system processing XML documents.

Measurements were performed using J2SE 1.4.2 with Solaris 9 on a Sun Fire 280Ra. The JVM options specified by the XML Test performance suite were used. The Xerces SAX parser shipped with JAXP was used.

The data used was various sizes of documents based on a simplified version of the UBL schema.

Further details about the measurement process and data used may be found in XML Test performance suite (Note: the modify and serialize stages do not apply for the SAX-based measurements, thus the tests only measure parsing and access stages).

Figure 5 presents the throughput results for the UBL documents. Results are presented in the form of ratios calculated by dividing the XML measurement by the fast infoset measurement.

Figure 5. Ratio of throughput for documents of varying size

Acknowledgments

The authors would like to acknowledge the XBIS and XMLS (a previous incarnation of XBIS) work of Dennis Sosnoski. The ideas of XBIS and XMLS have proved valuable for specifying encoded features of fast infoset documents. In addition, the XBIS performance test suite has proved very convenient for performance testing.

The authors would like to thank Binu John, of Sun Microsystems, for performing the XML Test measurements.

Authors

Paul Sandoz
Paul is a Staff Engineer at Sun Microsystems.

Alessandro Triglia
Alessandro is a Member of the Technical Staff at OSS Nokalva, Inc.

Santiago Pericas-Geertsen
Santiago is a Staff Engineer at Sun Microsystems.

References

 

  1. XML Information Set
    W3C Recommendation "XML Information Set (Second Edition)", John Cowan, Richard Tobin, 4 February 2004.
  2. ASN.1 Information Site
  3. Fast Web Services
    See also Fast Web Services and Fast Infoset and Fast Web Services
  4. ASN.1: A Powerful Schema Notation for XML and Fast Web Services
  5. Mapping W3C XML Schema Definitions into ASN.1
    ITU-T Rec. X.694 (2004) | ISO/IEC 8825-5:2004, Mapping W3C XML Schema Definitions into ASN.1
  6. Canonical XML
    W3C Recommendation "Canonical XML Version 1.0", John Boyer, 15 March 2001.
  7. Exclusive XML Canonicalization
    W3C Recommendation "Exclusive XML Canonicalization Version 1.0", John Boyer, Donald E. Eastlake 3rd, Joseph Reagle, 18 July 2002.
  8. XBIS XML Infoset Encoding
  9. XMill
    See also XMill postscript.
  10. Performance Considerations for Mobile Web Services
    Min Tian, Thiemo Voigt, Tomasz Naumowicz, Hartmut Ritter and Jochen Schiller. Performance Considerations for Mobile Web Services. Workshop on Applications and Services in Wireless Networks, Bern, Switzerland, July 2003.
  11. Universal Business Language
  12. XML Test
    See also Performance.

