By Paul Sandoz, Alessandro Triglia and Santiago Pericas-Geertsen, June 2004
This article presents the concepts and ideas of the Fast Infoset specification and fast infoset documents.
The Fast Infoset standard draft (currently being developed as joint work by ISO/IEC JTC 1 and ITU-T) specifies a binary format for XML infosets that is an efficient alternative to XML. An instance of this binary format is called a fast infoset document. Fast infoset documents are analogous to XML documents. Each has a physical form and an XML infoset. Fast infoset documents are, given the results presented, faster to serialize and parse, and smaller in size, than the equivalent XML documents. Thus, fast infoset documents may be used whenever the size and processing time of XML documents is an issue.
The binary format is optimized to balance the needs of both document size and processing time. Fast infoset documents are useful in a number of domains, from bandwidth- and resource-constrained mobile devices to high-bandwidth, high-throughput systems. In general, smaller documents are possible at the expense of either increased processing or loss of self-description and dependence on a schema. Faster processing is possible at the expense of loss of self-description and dependence on a schema. For example, standard compression (LZH) or XML-specific compression techniques (XMill) may be applied to XML documents to obtain smaller document sizes, but this adds to the processing time, especially in the compression phase, and can affect server-side performance.
To facilitate interoperability, the Fast Infoset specification uses the existing and proven ASN.1 standards. The specification is being standardized as an ITU-T Recommendation | International Standard within ITU-T SG 17 and ISO/IEC JTC1 SC6.
ASN.1 is a formal language for abstractly describing messages that may be encoded using one of the set of ASN.1 Encoding Rules.
This article is consistent with terms specified in, and the general direction of, the latest draft of the Fast Infoset standard (as of publication of this article). Since the standard is still under development, this article should be considered a work in progress.
Additional material and ideas are also presented. Whenever such material relates to known issues or possible future enhancements of the standard, this will be stated. Further material, not directly related to the standard, is also presented.
The Fast Infoset specification specifies an ASN.1 Schema supporting the XML Information Set (an ASN.1 module). The ASN.1 types and components in the ASN.1 module describe information items and properties of those items.
A fast infoset document is an encoding of a fast infoset value (an ASN.1 value) whose ASN.1 type, defined in the ASN.1 module, corresponds to the document information item. The default encoding of a fast infoset document uses the Packed Encoding Rules with extensions.
Fast infoset documents may be serialized and parsed, just like XML documents. Figure 1 highlights the serialization and parsing process as follows:
The construction of a fast infoset value from an XML infoset or vice versa will result in the use of tables and indexing to compress common string information. This will reduce document size while still ensuring that the documents may be serialized and parsed efficiently.
The conceptual steps presented in Figure 1 can be optimized so that it is not necessary for a complete fast infoset value to exist in memory at any time during the process, so long as the resulting XML infoset (or the resulting fast infoset document) is the same as if the complete fast infoset value was constructed. This allows for efficient implementation where, for example, the XML infoset corresponds to a DOM document or a set of SAX events.
The Fast Infoset standard is one of a consistent set of ASN.1-based standards for improving the performance of XML processing.
Fast Web Services (also being standardized as an ITU-T Recommendation | ISO/IEC International Standard in parallel with Fast Infoset) makes use of fast infoset documents for carrying the content of a SOAP message (a SOAP header block, a child of a SOAP body, or a child of a SOAP fault detail) when:
In addition, Fast Web Services defines a specific MIME media type, "application/soap+fastinfoset", for SOAP messages that are serialized as fast infoset documents. This is compatible with the SOAP HTTP binding.
SOAP 1.2 states:
"[a conforming implementation of the SOAP HTTP binding] MAY send requests and responses using other media types providing that such media types provide for at least the transfer of SOAP XML Infoset".
It is expected that a future amendment to ITU-T Rec. X.694 | ISO/IEC 8825-5, Mapping W3C XML Schema Definitions into ASN.1 (sometimes referred to as Fast Schema), will specify the use of fast infoset documents in the following cases:
Currently X.694 specifies the use of a character string in these cases.
Using X.694 in conjunction with the Fast Infoset specification will result in encodings that can be processed more efficiently and be smaller in size whenever the original schema contains wildcards or element declarations of type "xsd:anyType".
There are a number of generic advantages of fast infoset documents, and specific properties of the binary format, that contribute to faster parsing, faster serializing, and smaller document sizes when compared with equivalent XML documents.
Such generic advantages are:
A number of specific, and advantageous, properties of the binary format for fast infoset documents are:
Fast infoset documents support all of the specified information items, and a subset of the properties of information items. This results in two minor issues:
Table 1 presents a subset of information items (common items) with the corresponding supported properties highlighted in bold. Properties such as [parent], [owner element] and [in-scope namespaces] can be computed from the supported properties of the same item or other items. Other properties such as [specified], [attribute type] and [references] can be determined from the DTD. Further properties such as [document element] and [all declarations processed] have fixed values.
A fast infoset document will also chunk character information items into a single ASN.1 type that represents a sequence of such items.
In addition to the information items specified in the XML Infoset recommendation, a further item, the octet information item, is supported. A fast infoset document will chunk octet information items into a single ASN.1 type that represents a sequence of such items.
Table 1. Information Items and supported properties (in bold)

| Document Information Item | Element Information Item | Attribute Information Item |
|---|---|---|
| [children] | [namespace name] | [namespace name] |
| [document element] | [local name] | [local name] |
| [notations] | [prefix] | [prefix] |
| [unparsed entities] | [children] | [normalized value] |
| [base URI] | [attributes] | [specified] |
| [character encoding scheme] | [namespace attributes] | [attribute type] |
| [standalone] | [in-scope namespaces] | [references] |
| [version] | [base URI] | [owner element] |
| [all declarations processed] | [parent] | |
Since a fast infoset document is a binary encoding, it is possible to embed binary content in it directly as a sequence of octets. It is not necessary to represent the binary content as a Base64 string (as is usually done in the case of XML documents) and include the corresponding sequence of character information items in the infoset.
The Fast Infoset standard extends the XML Information Set to support the concept of an octet information item. According to this conceptual extension of the XML Information Set, there is an octet information item for each octet that appears in a fast infoset document. Each octet is a logically separate information item, but applications are free to chunk octets into larger groups (sequences of octets) as necessary or desirable.
An octet information item has the following properties:
A sequence of character information items can be constructed, using the Base64 encoding, from a sequence of octet information items.
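The conversion between a chunk of octets and the equivalent characters can be sketched as follows. This is only an illustration using the standard `java.util.Base64` class (available in later Java releases than the J2SE 1.4 used elsewhere in this article); the class name `OctetsToCharacters` and its methods are hypothetical and are not part of the Fast Infoset specification.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class OctetsToCharacters {
    /** Encodes a chunk of octets as the Base64 characters an XML document would carry. */
    static String toCharacters(byte[] octets) {
        return Base64.getEncoder().encodeToString(octets);
    }

    /** Recovers the original octets from their Base64 character form. */
    static byte[] toOctets(String characters) {
        return Base64.getDecoder().decode(characters);
    }

    public static void main(String[] args) {
        byte[] octets = "Man".getBytes(StandardCharsets.US_ASCII);
        String characters = toCharacters(octets);
        // Three octets become four characters in the Base64 form.
        System.out.println(octets.length + " octets -> " + characters.length()
                + " characters: " + characters);
    }
}
```

The one-third growth of the character form is the overhead that embedding the octets directly avoids.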
The use of tables and indexing is the primary mechanism by which the Fast Infoset compresses many of the strings present in an infoset. Recurring strings may be replaced with an index (an integer value) which points to a string in a table.
A serializer will add the first occurrence of a common string to the string table, and then, on the next occurrence of that string, refer to it using an index into the table. A hash table can be used for efficient checking of strings (the string being the key to obtaining the index; every time a unique string is added to the hash table, the index of the table is incremented).
A parser will add the first occurrence of a common string to the string table, and then, on the next occurrence of that string, obtain the string by using the index into the table. A simple array (which usually grows but may possibly shrink under certain conditions) can be used.
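The serializer and parser sides of a string table can be sketched in Java as follows. This is a minimal illustration of the hash table and array approach described above, not the encoding defined by the standard; the names `StringTables`, `Serializer`, `Parser` and `lookupOrAdd` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringTables {
    /** Serializer side: the string is the key used to obtain the index. */
    static final class Serializer {
        private final Map<String, Integer> indexes = new HashMap<>();

        /**
         * Returns the index of a string seen before, or -1 on first
         * occurrence, in which case the string receives the next free index.
         */
        int lookupOrAdd(String s) {
            Integer index = indexes.get(s);
            if (index != null) {
                return index;
            }
            indexes.put(s, indexes.size());
            return -1;
        }
    }

    /** Parser side: a growable array in which the position is the index. */
    static final class Parser {
        private final List<String> strings = new ArrayList<>();

        /** Records a string that appeared literally in the document. */
        void add(String s) {
            strings.add(s);
        }

        /** Resolves an index read from the document back to its string. */
        String get(int index) {
            return strings.get(index);
        }
    }

    public static void main(String[] args) {
        Serializer out = new Serializer();
        Parser in = new Parser();
        for (String s : new String[] {"one", "two", "one"}) {
            int index = out.lookupOrAdd(s);
            if (index < 0) {
                in.add(s); // first occurrence: the literal string is sent
                System.out.println("literal " + s);
            } else {
                System.out.println("index " + index + " resolves to " + in.get(index));
            }
        }
    }
}
```

Because both sides add strings in the same order, the parser needs no lookup structure beyond the array.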
The indexing of qualified names (a tuple of the [namespace name] and [prefix] properties of a namespace information item and the [local name] property of an element information item or an attribute information item) is a second level of indexing. A serializer will add the first occurrence of a qualified name tuple to the table (after adding each string to the string table if it has not occurred before) and then, on the next occurrence of that qualified name tuple, refer to it using an index into the table.
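A serializer-side sketch of this second level of indexing is shown below. It assumes one hash table per kind of string plus one table keyed by the whole tuple; the class `QualifiedNameTable` and its method names are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QualifiedNameTable {
    // First level: one table per kind of string.
    private final Map<String, Integer> namespaceNames = new HashMap<>();
    private final Map<String, Integer> prefixes = new HashMap<>();
    private final Map<String, Integer> localNames = new HashMap<>();
    // Second level: the (namespace name, prefix, local name) tuple as a whole.
    private final Map<List<String>, Integer> qualifiedNames = new HashMap<>();

    /** Adds s to the given string table if it has not occurred before. */
    private static void intern(Map<String, Integer> table, String s) {
        if (!table.containsKey(s)) {
            table.put(s, table.size());
        }
    }

    /**
     * Returns the index of a qualified name seen before, or -1 on first
     * occurrence, in which case its strings and the tuple are added.
     */
    int lookupOrAdd(String namespaceName, String prefix, String localName) {
        List<String> key = Arrays.asList(namespaceName, prefix, localName);
        Integer index = qualifiedNames.get(key);
        if (index != null) {
            return index;
        }
        intern(namespaceNames, namespaceName);
        intern(prefixes, prefix);
        intern(localNames, localName);
        qualifiedNames.put(key, qualifiedNames.size());
        return -1;
    }

    public static void main(String[] args) {
        String soap = "http://www.w3.org/2003/05/soap-envelope";
        QualifiedNameTable table = new QualifiedNameTable();
        System.out.println(table.lookupOrAdd(soap, "env", "Envelope"));
        System.out.println(table.lookupOrAdd(soap, "env", "Header"));
        System.out.println(table.lookupOrAdd(soap, "env", "Envelope"));
    }
}
```

A repeated qualified name is then written as a single index rather than as three separate string references.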
There are distinct string tables for the following strings:
The generic string table will be used for chunks of character information items.
Distinct string tables produce smaller indexes and therefore reduce the overall size of fast infoset documents. Besides, they allow the serializer to 'tune' the indexing according to the common properties of the strings in question. For example, prefixes will be short (often less than 5 characters) and namespace names will be URIs (often URLs with a common sequence of characters at the start of the string: "http://").
It is considered appropriate to make the indexing of certain strings mandatory, while allowing other types of strings to be optionally indexed with encoder constraints.
Strings corresponding to [namespace name], [local name], and [prefix] properties are very likely to repeat, thus the overhead (size and processing time) of encoding the boolean, indicating the addition of a string to the table, may be undesirable, especially when byte alignment and packing of index and length prefixes are important (an extra bit would result in less packing).
Strings corresponding to the [normalized value] property and to a chunk of character information items are less likely to repeat, hence it is considered appropriate to let the encoder choose whether to index such strings or not. The serializer can apply certain policies such as only indexing strings shorter than a given length or using contextual information to index common words in a language (potentially splitting strings into substrings such that indexing can be applied to the substrings).
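A length-based policy of the kind mentioned above could be expressed as a simple predicate. The sketch below is an assumption about how a serializer might expose such a choice; `IndexingPolicies` and `shorterThan` are hypothetical names, and the threshold of 7 is only an example value.

```java
import java.util.function.Predicate;

final class IndexingPolicies {
    /** Policy that indexes only strings strictly shorter than maxLength characters. */
    static Predicate<String> shorterThan(int maxLength) {
        return s -> s.length() < maxLength;
    }

    public static void main(String[] args) {
        Predicate<String> policy = shorterThan(7);
        for (String s : new String[] {"one", "2003-02-03"}) {
            // Strings rejected by the policy are always written literally.
            System.out.println(s + ": " + (policy.test(s) ? "index" : "do not index"));
        }
    }
}
```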
Under certain circumstances, it may not be possible for a serializer to perform mandatory indexing.
A pre-computed table, which may be considered an additional property of the document information item, can improve the performance of the parsing process and of the serializing process (for example, a serializer can refer to indexes without having to search for strings to create indexes, and a decoder may not need to dynamically grow arrays) but is unlikely to reduce the size of fast infoset documents (unless there is specific knowledge about what strings may occur as character information items).
A pre-computed table may be calculated from a schema (a W3C XML Schema or a DTD). Since this is a serializer's option, it is not necessary to specify what strings are indexed and the order of indexing.
A base table, which is a property of the pre-computed table, is a URI that identifies either an XML document or a fast infoset document, which may itself contain a pre-computed table (but no further base table). Either can provide a well-known vocabulary.
A caching mechanism may be used (with appropriate techniques to ensure that the cache is updated if such documents change), such that a fast infoset document may be reduced in size (because all or most strings are referred to by index and are not included in the document).
Two examples are presented below.
The first example shows an XML document and the indexes that will be associated with its items (strings and qualified names). The index that will be used for each item (either a string or a qualified name) is shown as a number surrounded by curly brackets before the item. An indexed item is shown as a number surrounded by square brackets replacing the item. Every string that is a chunk of character information items will be indexed in the generic string table.
The XML document in this example is:
<root>
  <tag>one</tag>
  <tag>two</tag>
  <anotherTag>one</anotherTag>
</root>
The corresponding representation with indexed strings and indexed qualified names is presented below in a symbolic form. End tags are shown only for readability, since they are not required in fast infoset documents.
{0}<root>
{1}<tag>
{0}one
</tag>
[1]<>
{1}two
</tag>
{2}<anotherTag>
[0]
</anotherTag>
</root>
The statement " [1]<> {1}two </tag> " means that the element information item has a qualified name that corresponds to index 1 in the qualified name table and that an index of 1 will be used for the chunk of character information items that is "two" in the generic string table.
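The symbolic form of the first example can be reproduced with a small SAX handler. The sketch below is an illustration only: the class `SymbolicIndexer` is hypothetical, it keys the qualified name table on the SAX qName rather than on a full tuple, it trims and ignores whitespace-only text, and it prints a bare `[n]` for a repeated item.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SymbolicIndexer extends DefaultHandler {
    private final Map<String, Integer> qualifiedNames = new HashMap<>();
    private final Map<String, Integer> genericStrings = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private final List<String> tokens = new ArrayList<>();

    /** Returns "{n}item" on first occurrence and "[n]" on every later one. */
    private static String token(Map<String, Integer> table, String item) {
        Integer index = table.get(item);
        if (index != null) {
            return "[" + index + "]";
        }
        index = table.size();
        table.put(item, index);
        return "{" + index + "}" + item;
    }

    /** Emits the buffered character chunk, skipping whitespace-only chunks. */
    private void flushText() {
        String chunk = text.toString().trim();
        text.setLength(0);
        if (!chunk.isEmpty()) {
            tokens.add(token(genericStrings, chunk));
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        tokens.add(token(qualifiedNames, "<" + qName + ">"));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText(); // end tags themselves produce no token
    }

    /** Parses the XML text and returns the symbolic tokens in document order. */
    static List<String> index(String xml) {
        SymbolicIndexer handler = new SymbolicIndexer();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.tokens;
    }

    public static void main(String[] args) {
        String xml = "<root><tag>one</tag><tag>two</tag>"
                + "<anotherTag>one</anotherTag></root>";
        System.out.println(index(xml));
    }
}
```

Run on the example document, this prints `[{0}<root>, {1}<tag>, {0}one, [1], {1}two, {2}<anotherTag>, [0]]`, which matches the indexes in the symbolic form above.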
The second example shows a SOAP 1.2 message, with one SOAP header block and a SOAP body containing a UBL document.
<env:Envelope
xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<env:Header>
<xa:transaction
xmlns:xa="http://example.org/transaction">
<xa:id>3781</xa:id>
</xa:transaction>
</env:Header>
<env:Body>
<ors:OrderResponseSimple
xmlns:ors="urn:oasis:names:tc:ubl:OrderResponseSimple:1.0:0.70"
xmlns:cat="urn:oasis:names:tc:ubl:CommonAggregateTypes:1.0:0.70">
<cat:ID>1</cat:ID>
<cat:IssueDate>2003-02-03</cat:IssueDate>
<ors:AcceptedIndicator>1</ors:AcceptedIndicator>
<ors:RejectionReasonCode/>
<cat:Note/>
<cat:ReferencedOrder>
<cat:BuyersOrderID>20031234-1</cat:BuyersOrderID>
<cat:SellersOrderID>154135798</cat:SellersOrderID>
<cat:IssueDate>2003-02-03</cat:IssueDate>
</cat:ReferencedOrder>
</ors:OrderResponseSimple>
</env:Body>
</env:Envelope>
The corresponding representation with indexed strings and indexed qualified names is presented below in a symbolic form:
{0}<
{0}env:
{0}Envelope
[0]=
{0}"http://www.w3.org/2003/05/soap-envelope">
{1}<
[0]:
{1}Header>
{2}<
{1}xa:{2}transaction
[1]=
{1}"http://example.org/transaction">
{3}<
[1]:
{3}id>
{0}3781
</xa:id>
</xa:transaction>
</env:Header>
{4}<
[0]:
{4}Body>
{5}<
{2}ors:
{5}OrderResponseSimple
[2]=
{2}"urn:oasis:names:tc:ubl:OrderResponseSimple:1.0:0.70"
xmlns:
{3}cat=
{3}"urn:oasis:names:tc:ubl:CommonAggregateTypes:1.0:0.70">
{6}<
[3]:
{6}ID>
{1}1
</cat:ID>
{7}<
[3]:
{7}IssueDate>
{2}2003-02-03
</cat:IssueDate>
{8}<
[2]:
{8}AcceptedIndicator>
{3}1
</ors:AcceptedIndicator>
{9}<
[2]:
{9}RejectionReasonCode/>
{10}<
[3]:
{10}Note/>
{11}<
[3]:
{11}ReferencedOrder>
{12}<
[3]:
{12}BuyersOrderID>
{4}20031234-1
</cat:BuyersOrderID>
{13}<
[3]:
{13}SellersOrderID>
{5}154135798
</cat:SellersOrderID>
[7]<>
[2]
</cat:IssueDate>
</cat:ReferencedOrder>
</ors:OrderResponseSimple>
</env:Body>
</env:Envelope>
Note that the attribute information items in the [namespace attributes] property of an element information item are treated differently from the attribute information items in its [attributes] property. It is not necessary to index the "xmlns" prefix and it is only necessary to index the [local name] that is the [prefix] used for the namespace.
The statement " {0}< {0} env: {0}Envelope " means that an index of 0 will be used for the "env" prefix, for the "Envelope" [local name], and for the qualified name that consists of ("http://www.w3.org/2003/05/soap-envelope", "env", "Envelope"). The same index value can be used because [prefix]s, [local name]s, and qualified names are stored in different tables. The statement " [7]<> [2] </cat:IssueDate> " means that the element information item has a qualified name that corresponds to index 7 in the qualified name table and a chunk of character information items that correspond to index 2 in the generic string table. Table 2 presents the corresponding string tables for the SOAP message.
Table 2. String tables for example SOAP message

| Index | Namespace name | Prefix | Local name | Generic string |
|---|---|---|---|---|
| 0 | http://www.w3.org/2003/05/soap-envelope | env | Envelope | 3781 |
| 1 | http://example.org/transaction | xa | Header | 1 |
| 2 | urn:oasis:names:tc:ubl:OrderResponseSimple:... | ors | transaction | 2003-02-03 |
| 3 | urn:oasis:names:tc:ubl:CommonAggregateTypes:... | cat | id | 1 |
| 4 | | | Body | 20031234-1 |
| 5 | | | OrderResponseSimple | 154135798 |
| 6 | | | ID | |
| 7 | | | IssueDate | |
| 8 | | | AcceptedIndicator | |
| 9 | | | RejectionReasonCode | |
| 10 | | | Note | |
| 11 | | | ReferencedOrder | |
| 12 | | | BuyersOrderID | |
| 13 | | | SellersOrderID | |
It has not yet been determined whether the Fast Infoset specification should support just one character encoding or allow an open-ended set of character encodings, as XML does.
Support for at least UTF-8 and UTF-16 character encoding schemes would ensure that both Western and Asian documents can be efficiently supported (for example, Asian documents encoded in UTF-8 can result in larger sizes than equivalent documents encoded in UTF-16).
The [character encoding scheme] property of the document information item declares the character encoding scheme that is used. The Fast Infoset uses the ASN.1 type UTF8String for this property.
To support multiple character encodings, the existing ASN.1 string types that support particular character encodings (for example UTF8String ) cannot be used, and instead the ASN.1 type OCTET STRING is required. A value of the OCTET STRING will be the sequence of bytes obtained from the characters encoded using the declared character encoding scheme.
Although in general it is not regarded as good ASN.1 practice to represent character strings with an OCTET STRING type, this is considered appropriate in this case, and it is not expected that it will increase the complexity of the implementations of the Fast Infoset standard.
Java implementations on J2SE 1.4.x and later can use the NIO library for modular character encoding support. By default, the following character encoding schemes are supported:
| US-ASCII | Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set |
| ISO-8859-1 | ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1 |
| UTF-8 | Eight-bit UCS Transformation Format |
| UTF-16BE | Sixteen-bit UCS Transformation Format, big-endian byte order |
| UTF-16LE | Sixteen-bit UCS Transformation Format, little-endian byte order |
| UTF-16 | Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark |
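The effect of the declared character encoding scheme on the size of the resulting octets can be illustrated with `java.nio.charset.Charset`. This is a sketch only; the class `OctetStringEncoding` and its `encode` method are hypothetical, and the sample strings are arbitrary.

```java
import java.nio.charset.Charset;

final class OctetStringEncoding {
    /** Encodes text into the octets that an OCTET STRING value would carry. */
    static byte[] encode(String text, String characterEncodingScheme) {
        return text.getBytes(Charset.forName(characterEncodingScheme));
    }

    public static void main(String[] args) {
        String latin = "order";
        String japanese = "\u6ce8\u6587\u66f8"; // three CJK characters
        for (String scheme : new String[] {"UTF-8", "UTF-16BE"}) {
            System.out.println(scheme + ": latin=" + encode(latin, scheme).length
                    + " octets, japanese=" + encode(japanese, scheme).length + " octets");
        }
    }
}
```

For these samples the Latin text is smaller in UTF-8 (5 octets against 10), while the Japanese text is smaller in UTF-16BE (6 octets against 9).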
Aggressive indexing of character information items may result in an encoder having no more resources to index the mandatory properties of element information items and/or attribute information items. Alternatively, an encoder may decide that indexing is not appropriate for certain information items.
A suggestion has been made that such resource control of indexing could be provided by way of specific processing instruction information items that an encoder generates and a decoder processes and discards, thus such items will not be members of the constructed XML infoset. Three such instructions could be provided:
An encoder might insert a stop indexing instruction, after which the encoder would not perform mandatory and optional indexing until the encoder inserts a start indexing instruction.
An encoder might insert a reset indexing instruction, after which the encoder would discard all previous indexes, thus all tables would be reset (all indexes removed) and indexing would start at 0.
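The decoder-side effect of such instructions might look like the sketch below. Since the instructions are only a suggestion, everything here is an assumption: the class `IndexingControl` is hypothetical, and it assumes that while indexing is stopped no new entries are added but existing entries remain resolvable.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexingControl {
    private final List<String> table = new ArrayList<>();
    private boolean indexing = true;

    /** Stop indexing instruction: later literals are not added to the table. */
    void stop() {
        indexing = false;
    }

    /** Start indexing instruction: later literals are added again. */
    void start() {
        indexing = true;
    }

    /** Reset indexing instruction: all indexes are discarded; the next index is 0. */
    void reset() {
        table.clear();
    }

    /** A literal string read from the document; indexed only while indexing is on. */
    void literal(String s) {
        if (indexing) {
            table.add(s);
        }
    }

    String resolve(int index) {
        return table.get(index);
    }

    int size() {
        return table.size();
    }

    public static void main(String[] args) {
        IndexingControl control = new IndexingControl();
        control.literal("a");
        control.stop();
        control.literal("b"); // not indexed
        control.start();
        control.literal("c");
        System.out.println(control.size() + " entries, index 1 = " + control.resolve(1));
        control.reset();
        System.out.println(control.size() + " entries after reset");
    }
}
```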
Given that these types of processing instruction information items would likely be far less common than other items (generally processing instruction information items are less common than other information items) it seems appropriate that the existing set of properties for a processing instruction information item be used rather than specifying an optimized form.
Canonicalization is an important feature for security, namely the signing of information. It has not yet been determined if the first revision of the Fast Infoset specification will specify a canonical form of fast infoset documents. Canonicalization can be difficult to specify correctly, and thus it may require more time. If so, it may be specified as a future amendment or as an additional specification.
A canonical form of a fast infoset document can be specified by:
If two fast infoset documents have the same canonical form, then the two documents are logically equivalent within the given application context.
The constraints on the XML information set supported by a fast infoset document consist of the following:
The constraints on fast infoset document features consist of the following:
The Distinguished Encoding Rules (DER) may be used; however, this will not result in an efficient encoding of fast infoset documents, since they will be larger than with the Packed Encoding Rules with extensions. The latter can be used since the features of the ASN.1 Schema for a subset of the XML Information Set ensure that the PER encoding with extensions is canonical.
Additional specification is required if canonicalization is applied to a fast infoset sub-document such that the sub-document is not dependent on information in the whole document, so that the sub-document can be removed from the document without requiring re-canonicalization. Such canonicalization is referred to as exclusive canonicalization. The mechanisms specified for XML documents by the Exclusive XML Canonicalization specification may apply to fast infoset documents if specified in terms of the XML Information Set.
Since indexing may result in indexes in the sub-document that were the result of indexing before the sub-document, further constraints are required to isolate the indexing for the sub-document. A reset processing instruction could be used, but this can result in unnecessary re-indexing after the sub-document. Alternatively, one more encoder/decoder processing instruction can be used:
The first occurrence of the push-pop instruction pushes the current table (calculated up to that point, including a pre-computed and base table) onto a stack (the same effect as a reset), and the second occurrence pops the original table off the stack (thus there is no nesting and the stack will only contain, at most, one table). Two push-pop indexing instructions will surround the sub-document, and the constraints presented previously (where appropriate) apply to the items of the sub-document.
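The push-pop behaviour can be sketched with a table that keeps at most one saved copy. This is an assumption about how the suggested instruction might be implemented; the class `PushPopTable` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PushPopTable {
    private List<String> current = new ArrayList<>();
    private List<String> saved; // at most one saved table, so there is no nesting

    /**
     * First occurrence: saves the current table and continues with an empty
     * one. Second occurrence: restores the saved table.
     */
    void pushPop() {
        if (saved == null) {
            saved = current;
            current = new ArrayList<>();
        } else {
            current = saved;
            saved = null;
        }
    }

    /** Adds a string to the current table and returns its index. */
    int add(String s) {
        current.add(s);
        return current.size() - 1;
    }

    String get(int index) {
        return current.get(index);
    }

    public static void main(String[] args) {
        PushPopTable table = new PushPopTable();
        table.add("env");
        table.add("Body");
        table.pushPop(); // the sub-document starts with an empty table
        System.out.println("inside sub-document: index " + table.add("ors"));
        table.pushPop(); // the original table is restored
        System.out.println("after sub-document: index 1 = " + table.get(1));
    }
}
```

Indexes inside the sub-document start again at 0, and indexing after it continues from the restored table without re-indexing.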
Features such as byte alignment on well-defined boundaries will ensure efficient implementation when providing a sequence of octets (that corresponds to the sub-document) to be signed.
Integration into JAXP using the SAX API requires that a Fast Infoset reader (for parsing fast infoset documents) and a Fast Infoset handler (for serializing fast infoset documents) implement the SAX XMLReader and the SAX ContentHandler (and possibly additional interfaces such as the SAX LexicalHandler) respectively. Fast Infoset readers can only parse SAX InputSources with a byte stream or system identifier.
The code extract below creates a fast infoset document from an XML document:
// Create a SAX parser
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
// Create a Fast Infoset handler to write a fast infoset document
// to the file fi_document
FastInfosetEncoderHandler fieh =
    new FastInfosetEncoderHandler(new File(fi_document));
// Parse an XML document from the file xml_document using
// the Fast Infoset handler
sp.parse(new File(xml_document), fieh);
The code extract below performs an identity transform to parse the fast infoset document created in the previous code extract and creates a copy, which is identical to the source:
// Create a transformer
Transformer tx = TransformerFactory.newInstance().newTransformer();
// Create a Fast Infoset reader
FastInfosetDecoderReader fidr = new FastInfosetDecoderReader();
// Create a Fast Infoset handler to write a fast infoset document
// to the file fi_document_cpy
fieh = new FastInfosetEncoderHandler(new File(fi_document_cpy));
// Create an input source for the previously created
// fast infoset document (InputSource has no File constructor,
// so a byte stream from java.io.FileInputStream is used)
InputSource is = new InputSource(new FileInputStream(fi_document));
// Perform the identity transform
tx.transform(new SAXSource(fidr, is), new SAXResult(fieh));
Finally, the code extract below parses the copy of the fast infoset document and creates an XML document, which should be equivalent to the original XML document used to create the first fast infoset document:
// Create an input source with a byte stream for the copy
is = new InputSource(new FileInputStream(fi_document_cpy));
// Create a stream result
StreamResult sr = new StreamResult(new File(xml_document_cpy));
// Perform the identity transform
tx.transform(new SAXSource(fidr, is), sr);
The same integration mechanisms applied to SAX may also apply to the Streaming API for XML (StAX), although it is not yet integrated into JAXP. To utilize the StAX API, it is required that a Fast Infoset stream reader and a Fast Infoset stream writer implement the StAX XMLStreamReader and StAX XMLStreamWriter respectively.
Octet information items are not supported by existing XML-based APIs. Reading and writing octet information items requires either an extended API or a conversion between them and character information items.
The SAX API supports the concept of optional extensions through its feature and property mechanisms, which are put to good effect for the support of processing DTD declarations and lexical information (such as comment information items). In the same manner, the SAX API can be extended to support octet information items.
An instance of an interface:
public interface BinaryContentHandler {
public void octets(byte[] buf, int start, int length)
throws SAXException;
}
can be set as a property of the SAX XMLReader with, for example, the propertyId "http://xml.org/sax/properties/binary-content-handler".
If this property is not set, then a SAX reader will convert a sequence of octets to a sequence of character information items (Base64 encoding) and invoke the ContentHandler.characters method.
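The dispatch between the two cases can be sketched as follows. The nested `BinaryContentHandler` mirrors the interface shown above; the class `OctetDispatch` and the methods `deliver` and `deliverAsText` are hypothetical, and `java.util.Base64` is used for the fallback conversion.

```java
import java.util.Arrays;
import java.util.Base64;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class OctetDispatch {
    /** Mirrors the extension interface proposed in the article. */
    interface BinaryContentHandler {
        void octets(byte[] buf, int start, int length) throws SAXException;
    }

    /**
     * Delivers a chunk of octets to the binary handler if one is set,
     * otherwise as Base64 characters to the content handler.
     */
    static void deliver(byte[] buf, int start, int length,
                        ContentHandler content, BinaryContentHandler binary)
            throws SAXException {
        if (binary != null) {
            binary.octets(buf, start, length);
            return;
        }
        byte[] chunk = Arrays.copyOfRange(buf, start, start + length);
        char[] chars = Base64.getEncoder().encodeToString(chunk).toCharArray();
        content.characters(chars, 0, chars.length);
    }

    /** Runs the fallback path and returns the characters the handler received. */
    static String deliverAsText(byte[] octets) {
        StringBuilder seen = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        try {
            deliver(octets, 0, octets.length, collector, null);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) throws SAXException {
        byte[] octets = {0x4d, 0x61, 0x6e};
        // With a binary handler set, the octets are passed through unchanged.
        deliver(octets, 0, octets.length, new DefaultHandler(),
                (buf, start, length) -> System.out.println(length + " octets received"));
        // Without one, the content handler sees Base64 characters.
        System.out.println("characters received: " + deliverAsText(octets));
    }
}
```

An application that does not know about the extension therefore still sees well-formed character content.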
A SAX-based prototype has been implemented based on Fast Infoset ideas, some of which differ from the ideas presented in this article. However, it is anticipated that the results of an updated prototype will be similar to, or better than, the results presented here.
The prototype differs mainly in the approach to indexing of qualified names and namespaces: a single string using the [prefix], a colon and [local name] or [uri] is used, thus there is no separate [prefix] string table or [local name] table. This is potentially slower for first occurrences of qualified names and less compact than what is now proposed, since more work is required to obtain a prefix (substring operations), and a prefix is repeated multiple times.
The first set of results uses the XBIS SAX-based performance test suite to measure parsing, serializing and the resulting size of documents.
Measurements were performed using J2SE 1.4.2 with Linux on a Sony Vaio PCG-GRX700P. Default JVM options were used. The Xerces SAX parser was used, as supplied by XBIS.
Two data sets of XML documents were chosen. The first data set, classified as small documents, consisted of the following XML documents:
The second data set, classified as large documents, consisted of the following XML documents:
The fast infoset serializer was configured to apply optional indexing only to strings of length less than 7.
Results are presented in the form of percentages calculated by dividing the fast infoset measurement by the XML measurement and multiplying by 100 to obtain a percentage.
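The percentage, and the "times faster" or "times smaller" factor derived from it, can be computed as in the sketch below. The class `ResultPercentages` is hypothetical and the input values in `main` are made-up examples, not measurements from this article.

```java
final class ResultPercentages {
    /** Fast infoset measurement as a percentage of the XML measurement. */
    static double percent(double fastInfoset, double xml) {
        return fastInfoset / xml * 100.0;
    }

    /** Converts a percentage into a factor, for example 20% into 5 times. */
    static double factor(double percent) {
        return 100.0 / percent;
    }

    public static void main(String[] args) {
        // Hypothetical timings in milliseconds, for illustration only.
        double p = percent(1.0, 4.0);
        System.out.println(p + "% of the XML measurement, a factor of " + factor(p));
    }
}
```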
Figure 2 presents the results for the small documents.
The parse % is between 20% and 32% or approximately 5 to 3 times faster. The serialize % is between 10% and 18% or approximately 10 to 5 times faster. The size % is between 30% and 74% or approximately 3.3 to 1.3 times smaller.
Figure 3 presents the results for the large documents.
The parse % is between 23% and 33% or approximately 4.3 to 3 times faster. The serialize % is between 9% and 17% or approximately 11 to 5 times faster. The size % is between 30% and 74% or approximately 5 to 1.7 times smaller.
The XBIS measurement framework utilizes the XMLWriter, from David Megginson, for serializing XML documents. Further investigation is required to ascertain if the XMLWriter performance can be improved, or alternative approaches will be more performant (for example, JAXP and an identity transform). The serializing of XML documents utilizing binding frameworks or in application specific scenarios may result in much faster serialization, for example the application may know that certain text content will not contain characters that require escaping.
The size of fast infoset documents will be related to the quantity of repeating information in the XML document. Small XML documents are likely to result in moderately smaller fast infoset documents; for example, the small SOAP messages reduce by 25% since there is not much repeating information. Large XML documents may result in better reduction because there is more chance of repeating information; for example, the soap2.xml XML document reduces by approximately 80%. However, large XML documents containing a large amount of text content, for example the factbook.xml XML document, are unlikely to result in a large reduction in size because the text content is not indexed, and it is unlikely that the text (without deconstructing the text into smaller parts) will repeat.
For large XML documents, an approximation of fast infoset document size may be obtained by using the heuristic measurement of the total size of encoded characters (attribute values, text content and comments). Figure 4 presents the % of encoded characters for the large XML documents. This correlates closely with the Fast Infoset size % in Figure 3.
The second set of results uses a modified version of the XML Test performance suite (to support the Fast Infoset), developed to compare the performance of XML parsers in Java and .Net. XML Test is designed to mimic the processing that takes place in the lifecycle of an XML document, and measures the throughput of a system processing XML documents.
Measurements were performed using J2SE 1.4.2 with Solaris 9 on a Sun Fire 280Ra. The JVM options specified by XML Test performance suite were used. The Xerces SAX parser shipped with JAXP was used.
The data used consisted of documents of various sizes based on a simplified version of the UBL schema.
Further details about the measurement process and data used may be found in XML Test performance suite (Note: the modify and serialize stages do not apply for the SAX-based measurements, thus the tests only measure parsing and access stages).
Figure 5 presents the throughput results for the UBL documents. Results are presented in the form of ratios calculated by dividing the XML measurement by the fast infoset measurement.
The authors would like to acknowledge the XBIS and XMLS (a previous incarnation of XBIS) work of Dennis Sosnoski. The ideas of XBIS and XMLS have proved valuable for specifying encoded features of fast infoset documents. In addition, the XBIS performance test suite has proved very convenient for performance testing.
The authors would like to thank Binu John, of Sun Microsystems, for performing the XML Test measurements.
Paul Sandoz
Paul is a Staff Engineer at Sun Microsystems.
Alessandro Triglia
Alessandro is a Member of the Technical Staff at OSS Nokalva, Inc.
Santiago Pericas-Geertsen
Santiago is a Staff Engineer at Sun Microsystems.
