XML and XSLT Overview



Section 1. Introduction
Section 1.1. Purpose

Extensible Markup Language [1] (XML): what is it and what is it good for? This paper will cover the basic concepts of markup languages and describe the syntax of XML [1] as well as some of its uses and potential benefits to computing. Additionally it will discuss the basics of Extensible Stylesheet Language Transformations [2] (XSLT) and how they can be used to transform XML documents.

The intended audience of this paper is people who already have a basic understanding of computers and how they operate. Section 1. Introduction and Section 2. XML Introduction are largely conceptual and won't require much technical knowledge. Section 3. XSLT Introduction will deal with creating an extensible stylesheet and will be more complex.

This document is written in an XML-based language called DocBook [7]. The first sections of this paper will give the reader enough understanding of XML to read the DocBook source. The XSLT section will then describe creating a set of templates to transform this document from DocBook into XHTML [3] for viewing in a web browser.

Section 1.2. Contact Information

Any questions or comments should be directed to Will Holcomb (will@himinbi.org). The latest and authoritative version is available at http://odin.himinbi.org/xml_and_xslt_overview/. Specifically the files used as examples in this program are:


Section 2. XML Introduction
Section 2.1. Markup Basics

XML [1] is not a programming language like C or Java; rather it is a "markup" language. It is used to contextualize other data. Most people will find it easiest to understand XML as a variation on the concepts used in Hypertext Markup Language [3] (HTML).

HTML documents are the lingua franca of the world wide web. They are simply plain text documents with special "tags" embedded in them to give information to the web browser about the structure of the document.

Changes to HTML: People familiar with generations of HTML prior to 4.0 will expect it to format appearance with tags such as underline (<u>) and bold (<b>). As the purposes of HTML have matured, the appearance formatting elements have been deprecated and HTML is only used to describe the structure of the document. [4] Cascading stylesheets [5] (CSS) are used to describe the visual appearance of a document. CSS may be used in conjunction with XML-based documents as well, though exploration of that topic is beyond the scope of this paper.

A simple example of HTML could be if I was writing about XML and wanted to place emphasis on how useful it could be. I would add two tags to my document: the open tag <em>, and the close tag </em>. I would enclose the text I wanted to emphasize with those tags like this:

XML has the potential to be <em>very</em> useful.
      

When this HTML was being rendered in a web browser the text inside the <em></em> would be drawn differently. By convention the content of emphasis tags is drawn in italics, however using CSS the author could choose to set it out in any way she saw fit. Also the HTML could be rendered into sounds using a page reader; there the content of the emphasis tags might be recognized by a change in tone.

The <em> tags are not rendered in either case because they are not content, rather they contextualize ("mark up") the content.

Sometimes tags require additional information in order to be useful. The image tag (<img>), for instance, describes the placement of a graphic file within a document. In order for it to be used however the name of the graphic file has to be provided. This is done using the source (src) attribute:

<img src="http://www.himinbi.org/images/nosferatu.png">
      

An attribute is simply a name/value pair where the name and value are separated by an equal sign (=). A tag can have more than one attribute; they are simply separated by whitespace.
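As a minimal sketch of the name/value idea, the following Python snippet uses the standard library's xml.etree.ElementTree to read the attributes of a tag. The alt attribute and its text are made up for illustration, and the tag is written in the self-closing XML form so that an XML parser will accept it.

```python
import xml.etree.ElementTree as ET

# The "alt" attribute is an illustrative addition; the trailing "/>" is the
# XML way of closing an empty tag.
tag = '<img src="http://www.himinbi.org/images/nosferatu.png" alt="Nosferatu"/>'

element = ET.fromstring(tag)

# Attributes are exposed as a plain dictionary of name/value pairs.
attributes = dict(element.attrib)
for name, value in attributes.items():
    print(name, "=", value)
```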

Section 2.2. Differences between HTML and XML

HTML and XML are very similar in syntax because both descend from Standard Generalized Markup Language (SGML): HTML is defined as an application of SGML and XML as a simplified subset of it. Both are used for marking up documents using tags. XML however has several important restrictions placed on it.

There are two primary reasons for the number of restrictions that have been placed on XML in comparison to HTML. The main one is simplicity; a program to deal with XML can be much simpler than a program that has to be prepared for the intricacies and exceptions of HTML. The program can use less memory and can be less processor intensive, which is ideal for smaller devices like wireless phones and PDAs. Another consideration has to do with the way that the world wide web is developing. There are many people producing HTML for publishing who have little or no experience with HTML. This, along with poorly written editors and browsers that include proprietary extensions to HTML, is producing a huge variety in the types of documents that can be found on the web. As the variety of documents gets wider, browsers have to become larger and more intelligent if they are going to render the HTML as the author intended. The simplicity of XML and the restrictions placed on it do not restrict any of its expressiveness; they do however force people to write correct documents.

To speak of HTML and XML like this gives a false impression that they are more related than they are in reality. HTML is a specific set of tags for describing the structure of hypertext documents. XML is not a specific set of tags at all, rather it is a set of rules for describing what properties XML based languages must have.

Section 2.3. Applications of XML

There could be hundreds or thousands of different languages that conform to the rules set out in XML. XML as a format is still growing in popularity, but many interesting languages already exist:

Section 2.4. Using XML

The basic benefit of XML is the same as that of HTML: there exists a specification somewhere saying how a document may be marked up and so long as I conform to that specification my document should render correctly in any web browser that someone wants to use. An XML document will not necessarily describe a webpage, but the contract specifying what tags are allowed where still exists and I can write my documents according to the specification and programs supporting that type of markup should deal with my document with no problem.

There are two main ways that the contract (what tags are allowed where) is specified. The original way is a Document Type Definition (DTD), written in a syntax inherited from SGML, which is the same way that HTML is specified. A DTD simply sets out a grammar for the document, much like Backus-Naur Form (BNF). A more modern way of laying out the document is an XML-based language called XML Schema. XML Schema is more complex and allows a tighter specification of the tags as well as allowing templates for different types to be created. Another advantage is that with XML Schema you don't need a separate parser for the DTD syntax; the same parser can parse both the schema and the document. DTDs and XML Schema are both languages in their own right and sufficiently complex not to cover here.

Another big advantage of XML has to do with parsing. Because of the restrictions placed on the language it is possible to create a standard parser that will deal with any XML document. Creating configuration files in XML allows the author to specify what values are allowed in the DTD and also saves the author from having to parse the configuration file herself. There are two main classes of parsers that can be used depending on the needs of the application:
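The two classes usually meant by this distinction are tree-based parsers, which load the whole document into memory (the Document Object Model, DOM, is the best known interface), and event-based parsers, which report each tag as it is read (the Simple API for XML, SAX). The sketch below runs one of each from Python's standard library on a made-up configuration document; the element and attribute names are assumptions for illustration only.

```python
import xml.dom.minidom
import xml.sax

# A made-up configuration file; none of these names come from a real program.
CONFIG = (
    '<config>'
    '<option name="color" value="blue"/>'
    '<option name="size" value="10"/>'
    '</config>'
)

# Tree-based: the whole document is parsed into a tree that can be queried
# in any order afterwards.
document = xml.dom.minidom.parseString(CONFIG)
tree_options = {
    node.getAttribute("name"): node.getAttribute("value")
    for node in document.getElementsByTagName("option")
}


# Event-based: the parser calls back as each tag is encountered, so the
# program keeps only what the handler chooses to remember.
class OptionHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.options = {}

    def startElement(self, name, attrs):
        if name == "option":
            self.options[attrs.getValue("name")] = attrs.getValue("value")


handler = OptionHandler()
xml.sax.parseString(CONFIG.encode("utf-8"), handler)
event_options = handler.options

print(tree_options)
print(event_options)
```

Both approaches recover the same settings; the tree-based one is more convenient for small files, while the event-based one suits documents too large to hold in memory.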


Section 3. XSLT Introduction
Section 3.1. XSLT Overview

One way to view an XML document is as a tree. For instance given this simple XHTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
  <html>
    <head>
      <title>Will's Webpage</title>
      <link rel="stylesheet" type="text/css" href="main.css" />
    </head>
    <body>
      <h1>Welcome!</h1>
      <p>
        Hello, everyone. Welcome to my little spot on the world wide web. 
        Hope you are having a nice day.
      </p>
    </body>
  </html>
      

You will notice that since this is an XML document as well as being HTML, the empty link tag ends with a />. Also every <p> tag ends with a </p>. This is not just an author's preference; it is required for this to be well-formed XML.
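A quick way to see this requirement in action is to hand both styles to an XML parser. This sketch uses Python's xml.etree.ElementTree; the two fragments and the helper function name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML style: the <link> and <p> tags are never closed.
html_style = '<body><link rel="stylesheet" href="main.css"><p>Hello</body>'

# XML style: the empty tag ends with "/>" and <p> has a matching </p>.
xml_style = '<body><link rel="stylesheet" href="main.css"/><p>Hello</p></body>'

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```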

This document can be represented as a tree like this:

As you can see each tag is a child of the tags that it is nested inside of. If you use a DOM parser in dealing with your XML files then this is the basic layout that you will deal with. In XML there is only allowed to be one element at the very top (called the "root" element; it is <html> in this case.) This tree layout is also what you deal with in XSLT.
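Since the tree is easier to see than to describe, here is a small sketch that parses a shortened copy of the document above with Python's xml.dom.minidom (one DOM implementation among many) and lists each element indented by its depth. The DOCTYPE line and most of the text are left out to keep the example short, and the function name is invented for this sketch.

```python
import xml.dom.minidom

PAGE = """<html>
  <head>
    <title>Will's Webpage</title>
    <link rel="stylesheet" type="text/css" href="main.css" />
  </head>
  <body>
    <h1>Welcome!</h1>
    <p>Hello, everyone.</p>
  </body>
</html>"""


def outline(node, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + node.tagName]
    for child in node.childNodes:
        # Skip text nodes (including the whitespace between tags).
        if child.nodeType == child.ELEMENT_NODE:
            lines.extend(outline(child, depth + 1))
    return lines


root = xml.dom.minidom.parseString(PAGE).documentElement
tree_lines = outline(root)
print("\n".join(tree_lines))
```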

XSLT is a functional language. This means that unlike an imperative language like C you will not write your program as a set of procedures for accomplishing a task. One analogy for functional languages is working in a spreadsheet; instead of specifying how to compute the value for an individual cell you express the value of that cell as a relationship of other cells and then let the computer figure out how to get the answer. In XSLT you will write a set of templates to "match" different nodes and then the program will run by matching your input document with the templates.

Section 3.2. DocBook subset used in this tutorial

For our example we are going to step through a simple set of templates to transform this document from the DocBook that it is written in into XHTML for presentation on the web. DocBook is a fairly complex language and by no means will these templates convert all of DocBook to XHTML. The most commonly used program to deal with DocBook is called Jade. It works with stylesheets written in another transformation language, Document Style Semantics and Specification Language (DSSSL), and will handle the source for this document. It may be, though, that we have a special need that Jade does not address, so writing a new set of transformations is not just an exercise. For this, though, we need only deal with the few tags used in this document, which are:

DocBook subset used in XML tutorial

Tag Name          Description
<section>         Marks a section of the document.
<para>            Marks a paragraph.
<article>         Root element.
<mediaobject><imageobject><imagedata/></imageobject></mediaobject>
                  Marks an image in the document. The structure is designed to allow a parser to choose among several formats, but this document uses only images.
<programlisting>  Marks a listing of some sort. The text is assumed to be preformatted and whitespace cannot be ignored as it is in regular HTML.
<table><tgroup><thead/><tbody><row><entry/></row></tbody></tgroup></table>
                  Marks the layout of a table.
<itemizedlist><listitem/></itemizedlist>
                  Marks a list of items.
<title/>          Used in several tags like section, article and table to mark respective titles.

Section 3.3. Templates
Section 3.3.1. Basic Structure

Because XSLT is an XML-based language we will have to create a single root element to contain all the templates. A couple of other things that the basic layout will do are specify a doctype and a namespace. I haven't talked about these in detail, but namespaces are a way to have multiple XML languages in the same document. Say you have a report with some equations and graphs in it. The base report could be written in XHTML, but for the equations you could mix in some MathML and for the graphs some SVG. A namespace prefix simply precedes the tag name and is separated from it by a colon. The prefix can be any name, as long as an xmlns attribute ties it to the namespace it stands for; for this example I am going to use xsl: to represent the XSLT namespace.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output
   method="xml"
   standalone="yes"
   indent="yes"
   doctype-public="-//W3C//DTD XHTML 1.1//EN"
   doctype-system="http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"/>
  <!-- This is a comment in SGML (thus HTML and XML).
     - All of the templates will go in this space and any tag that
     - is not prefaced with <xsl: will go to the output document.
    -->
</xsl:stylesheet>
        

Each template now will just be a <xsl:template> child of the main <xsl:stylesheet> root element.

Section 3.3.2. <para> Template

Because XSLT is a functional language we don't have to worry about going in any certain progression. We can start with the easier templates and work our way to the more complicated ones. The easiest is going to be <para> because there is a one-to-one correspondence between DocBook and XHTML. Both have a paragraph tag and we are not adding any formatting. So, we simply have to create a template that will match the <para> tag and will place the contents of that tag inside XHTML <p></p> tags. Such a template would look like this:

<xsl:template match="para">
  <p><xsl:apply-templates/></p>
</xsl:template>
        

Seems simple enough. In addition to the <template> tag you will also notice an <apply-templates> tag. <apply-templates> just instructs the processor to do another set of matches. It can take an optional "select" attribute to choose a set of nodes to match, but by default it simply matches the children of the current node.
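To make the matching process concrete, here is a toy imitation of it in Python. It is not how a real XSLT engine is built, and the function and variable names are invented for this sketch; it only mirrors the idea that each node is handed to the template that matches it, and that apply-templates repeats the process for the children.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def apply_templates(node):
    """Process the text and child elements of a node, in document order."""
    output = escape(node.text or "")
    for child in node:
        # Look up a template by tag name; unknown tags just pass their
        # content through, loosely like XSLT's built-in rules.
        template = TEMPLATES.get(child.tag, apply_templates)
        output += template(child)
        output += escape(child.tail or "")
    return output


def para_template(node):
    """Counterpart of the template that matches para in the stylesheet."""
    return "<p>" + apply_templates(node) + "</p>"


# The table of templates, keyed by the tag name each one matches.
TEMPLATES = {"para": para_template}

source = ET.fromstring(
    "<section><para>First.</para><para>Second.</para></section>"
)
result = apply_templates(source)
print(result)
```

Adding support for another tag means adding one more entry to the table, which is the same property that lets XSLT templates be written in any order.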

Section 3.3.3. <programlisting> Template

Another tag which is relatively simple is <programlisting>. Let's say that in our output we want for a program listing to be preformatted and enclosed in a gray box. In XHTML we would do this with a <pre> tag and a special style. To make it easy to change the appearance of the tag instead of placing style information in all the <pre> tags we will use the "class" attribute to designate that they are all of a similar type. So the template would look like:

<xsl:template match="programlisting">
  <pre class="listing">
    <xsl:apply-templates/>
  </pre>
</xsl:template>
        
Section 3.3.4. <mediaobject> Template

The <mediaobject> tag for DocBook is fairly sophisticated. We are going to only allow it to include a single image in a document. The structure of the tag will be assumed to be:

<mediaobject>
  <imageobject>
    <objectinfo>
      <title>Description</title>
    </objectinfo>
    <imagedata fileref="path/to/file"/>
  </imageobject>
</mediaobject>
        

This is obviously a much more sophisticated set of tags. As I mentioned before we are only using about 3% of the functionality of <mediaobject>. All of this information is going to map to a single <img/> tag. Specifically we are going to select the values of children of <mediaobject> using a special syntax where values inside of {} are replaced with the value of selecting those children. The template would look like:

<xsl:template match="mediaobject">
  <img
    src="{imageobject/imagedata/@fileref}"
    alt="{imageobject/objectinfo/title}"/>
</xsl:template>
        

There is also a <value-of> tag in XSLT that does the same thing as the {} notation. It can't be placed inside of an attribute value the way the {} notation can, because a tag inside an attribute is not well-formed XML. If you wanted to use the <value-of> tags instead you could build the output tag with the <element> and <attribute> tags. <element> is replaced in the output with a tag named by its "name" attribute. For instance this image tag would look like:

<xsl:template match="mediaobject">
  <xsl:element name="img">
    <xsl:attribute name="src">
      <xsl:value-of select="imageobject/imagedata/@fileref"/>
    </xsl:attribute>
    <xsl:attribute name="alt">
      <xsl:value-of select="imageobject/objectinfo/title"/>
    </xsl:attribute>
  </xsl:element>
</xsl:template>
        

One advantage of this syntax is you can use the decision structures that XSLT has to decide the value of attributes. Say for instance you decided that if they didn't include the <objectinfo> then you wanted to leave the "alt" attribute out. The template could be written as:

<xsl:template match="mediaobject">
  <xsl:element name="img">
    <xsl:attribute name="src">
      <xsl:value-of select="imageobject/imagedata/@fileref"/>
    </xsl:attribute>
    <xsl:if test="count(imageobject/objectinfo) > 0">
      <xsl:attribute name="alt">
        <xsl:value-of select="imageobject/objectinfo/title"/>
      </xsl:attribute>
    </xsl:if>
  </xsl:element>
</xsl:template>
        

The language that is being used to specify the child nodes to select is called XPath and it is a fairly robust language for selecting nodesets from trees. The default is to select children, but it can do more complex sets like ancestors, preceding siblings, nodes with certain attributes, and nodes a certain depth in the tree.
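The same path expressions can be tried outside of XSLT. Python's xml.etree.ElementTree supports only a small subset of XPath, but it is enough to sketch the two selections used in the template above; the file name and description are the placeholder values from the earlier listing.

```python
import xml.etree.ElementTree as ET

mediaobject = ET.fromstring("""
<mediaobject>
  <imageobject>
    <objectinfo>
      <title>Description</title>
    </objectinfo>
    <imagedata fileref="path/to/file"/>
  </imageobject>
</mediaobject>
""")

# Equivalent of imageobject/imagedata/@fileref: walk down by child name,
# then read the attribute from the element that was found.
src = mediaobject.find("imageobject/imagedata").get("fileref")

# Equivalent of imageobject/objectinfo/title: the text of the first match.
alt = mediaobject.findtext("imageobject/objectinfo/title")

# A predicate in square brackets narrows a selection, here to elements
# anywhere below the current node that carry a fileref attribute.
with_fileref = [e.tag for e in mediaobject.findall(".//*[@fileref]")]

print(src, alt, with_fileref)
```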

Section 3.3.5. <table> Template

<table> is also fairly straightforward in how it translates to XHTML. Both DocBook and HTML 4.0 are based off of the Continuous Acquisition and Life-Cycle Support (CALS) table model. This is slightly different than the HTML 3.0 table model, but is fairly simple in general. From the HTML 4 SGML DTD the model is:

<!ELEMENT TABLE    - - (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)>
<!ELEMENT CAPTION  - - (%inline;)*  -- table caption -->
<!ELEMENT COL      - O EMPTY        -- table column -->
<!ELEMENT COLGROUP - - (COL)*       -- table column group -->
<!ELEMENT THEAD    - O (TR)+        -- table header -->
<!ELEMENT TFOOT    - O (TR)+        -- table footer -->
<!ELEMENT TBODY    O O (TR)+        -- table body -->
<!ELEMENT TR       - O (TH|TD)+     -- table row -->
<!ELEMENT (TH|TD)  - O (%flow;)*    -- table header cell, table data cell-->
        

If you are familiar with BNF then this ought not be too difficult to read. The model for DocBook is about the same except instead of having <tr> there is <row> and instead of <th> or <td> there is <entry>. Rather than having a single template to create a table however we can break it up into several distinct templates:

<xsl:template match="table">
  <table>
    <xsl:if test="count(title) > 0">
      <caption><xsl:value-of select="title"/></caption>
    </xsl:if>
    <xsl:apply-templates select="child::*[name() != 'title']"/>
  </table>
</xsl:template>

<xsl:template match="thead">
  <thead><xsl:apply-templates mode="headed"/></thead>
</xsl:template>

<xsl:template match="tbody">
  <tbody><xsl:apply-templates/></tbody>
</xsl:template>

<xsl:template match="row">
  <tr><xsl:apply-templates/></tr>
</xsl:template>

<xsl:template match="row" mode="headed">
  <tr><xsl:apply-templates mode="headed"/></tr>
</xsl:template>

<xsl:template match="entry">
  <td><xsl:apply-templates/></td>
</xsl:template>

<xsl:template match="entry" mode="headed">
  <th><xsl:apply-templates/></th>
</xsl:template>
        

Something new you see here is the "mode" attribute of <apply-templates> and <template>. For any given node only one template is applied, and when several match the most specific one is used. For instance you could have a match="tr" and a match="*"; both will match a tr tag, but the * is less specific than the explicit tr. The "mode" attribute lets you restrict the candidates further, so the elements of <thead> only match templates with mode="headed". It is necessary to propagate the header information down because the <entry> that becomes a <th> is nested inside a <row>. Another option is to simply not use the <th> at all and use CSS [5] to make the text in the <thead> look like you want it to.

This template could be written another way:

<xsl:template match="thead|tbody|tfoot">
  <xsl:apply-templates>
    <xsl:with-param name="tag">
      <xsl:choose>
        <xsl:when test="name() = 'thead'">
          <xsl:text>th</xsl:text>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text>td</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:with-param>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="row">
  <xsl:param name="tag"/>
  <tr>
    <xsl:for-each select="entry">
      <xsl:element name="{$tag}">
        <xsl:apply-templates/>
      </xsl:element>
    </xsl:for-each>
  </tr>
</xsl:template>
        

This is a little more complicated and uses a <choose> inside of a <with-param> to pass the desired tag to the template. A <for-each> then loops through the entries and creates the cells of each row. This is a more advanced type of operation and better left until you have a grasp of the basics.

Section 3.3.6. <itemizedlist> Template

This is another tag that has a 1:1 correlation between DocBook and XHTML so I will leave it as an exercise to the reader to figure it out. The template is included in the full stylesheet at the end.

Section 3.3.7. <section> Template

The <section> template is going to be perhaps the most complicated. Sections are going to be numbered according to their depth, so the third section inside of section one would be Section 1.3. This number is handled automatically by XSLT which is fortunate because counting in a functional language is difficult. Also we are going to place an anchor at the beginning of each section so that later we can link between them. The anchoring for this document is based on each <section> in the DocBook source having an "id" attribute with a value that is unique for the document. This is not an uncommon property for DocBook documents because that unique id is used sometimes in cross referencing.

We could put the sections in a list to indent them on the screen, but instead we will signify the depth with both a change in the number and a change in the font size.

<xsl:template match="section">
  <a name="{@id}"/>
  <xsl:element name="div">
    <xsl:attribute name="class">
      <xsl:text>section-</xsl:text>
      <xsl:value-of select="count(ancestor::*)"/>
    </xsl:attribute>
    <xsl:text>Section </xsl:text>
    <xsl:number level="multiple" format="1.1."/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="title"/>
  </xsl:element>
  <xsl:apply-templates select="child::*[name() != 'title']"/>
</xsl:template>
        

As you can hopefully tell, I create a <div> in the output HTML with a class attribute of section- followed by the depth of the section in the tree. Then I use <number> to create my numbering and put the contents of the title tag after that prefix. You might have noticed earlier that I am using <text> tags to surround output text that is not inside of a tag. This is not strictly necessary. If you do not do it this way any whitespace before or after your text may get included in the output document, which is not usually a problem in HTML, but can cause issues occasionally.
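For readers who want to see what the numbering amounts to, this Python sketch computes the same kind of multi-level label by counting, at each level, how many sections have been seen so far. The function name and the sample document are invented for illustration, and the sketch only approximates what <number> with level="multiple" does.

```python
import xml.etree.ElementTree as ET


def number_sections(parent, prefix=""):
    """Return (label, title) pairs for every nested <section>."""
    numbered = []
    position = 0
    for child in parent:
        if child.tag != "section":
            continue
        position += 1
        label = f"{prefix}{position}."
        numbered.append((label, child.findtext("title")))
        # Subsections extend the label of the section that contains them.
        numbered.extend(number_sections(child, label))
    return numbered


article = ET.fromstring(
    "<article>"
    "<section><title>Introduction</title>"
    "<section><title>Purpose</title></section>"
    "<section><title>Contact Information</title></section>"
    "</section>"
    "<section><title>XML Introduction</title></section>"
    "</article>"
)

labels = number_sections(article)
for label, title in labels:
    print("Section", label, title)
```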

Section 3.3.8. <article> Template

We are pretty much done except for one obvious issue. There is no base for the document. Nowhere in the templates we have defined are the <html> tags placed in the output. When the XSLT engine starts processing it will try to match the root of the document first. In this case the root is <article>. To put the groundwork down for the document we will write a template to match <article>. Something that might be nice in a document like this would be to have a listing of the different sections so we can jump around, so we will make one of those too. The templates would look something like:

<xsl:template match="article">
  <html>
    <head>
      <title><xsl:value-of select="title"/></title>
      <link rel="stylesheet" type="text/css" href="document.css"/>
    </head>
    <body>
      <h1><xsl:value-of select="title"/></h1>
      <hr style="width: 45%;"/>
      <xsl:call-template name="toc-outline"/>
      <hr style="width: 45%;"/>
      <xsl:for-each select="section">
        <xsl:apply-templates select="."/>
        <xsl:if test="count(following-sibling::section) > 0">
          <hr style="width: 25%;"/>
        </xsl:if>
      </xsl:for-each>
      <xsl:if test="count(articleinfo/legalnotice) > 0">
        <hr style="width: 85%;"/>
        <xsl:apply-templates select="articleinfo/legalnotice"/>
      </xsl:if>
    </body>
  </html>
</xsl:template>

<xsl:template name="toc-outline">
  <xsl:if test="count(section) > 0">
    <ul>
      <xsl:apply-templates select="section" mode="outline"/>
    </ul>
  </xsl:if>
</xsl:template>

<xsl:template match="section" mode="outline">
  <li>
    <xsl:element name="a">
      <xsl:attribute name="href">
        <xsl:text>#</xsl:text><xsl:value-of select="@id"/>
      </xsl:attribute>
      <xsl:call-template name="section-name"/>
    </xsl:element>
    <xsl:call-template name="toc-outline"/>
  </li>
</xsl:template>

<xsl:template name="section-name">
  <xsl:text>Section </xsl:text>
  <xsl:number level="multiple" format="1.1."/>
  <xsl:text> </xsl:text>
  <xsl:value-of select="title"/>
</xsl:template>
        
Section 3.4. Processing using XSLT

To generate this correctly you will need several things:

There are several XSLT engines that you can choose from. The one that I am using is Xalan, written by the Apache project's XML group. To use it you would type:

java org.apache.xalan.xslt.Process -IN xml_and_xslt_overview.xml \
                                   -XSL docbook_subset-xhtml.xsl \
                                   -OUT xml_and_xslt_overview.html
        


1: XML is defined by the World Wide Web Consortium (W3C) [http://www.w3c.org] and the specification is available at [http://www.w3.org/XML/]

2: XSLT is also defined by the W3C and its specification is at [http://www.w3.org/TR/xslt]

3: HTML, like XML and XSLT, is defined by the W3C. It is one of the oldest W3C specifications and the generations are at [http://www.w3.org/MarkUp/]. We will deal specifically with HTML 4, [http://www.w3.org/TR/html4/], and XHTML, [http://www.w3.org/TR/xhtml11].

4: For authors wishing to create a website and wanting to make sure that their HTML conforms to the standards there is a validation service available online at the W3C, [http://validator.w3.org].

5: CSS is a W3C specification as well. Information is collected at [http://www.w3.org/Style/CSS/]. There are two main versions: CSS1 [http://www.w3.org/TR/REC-CSS1] and CSS2 [http://www.w3.org/TR/REC-CSS2]. There is also a CSS validator available online, [http://jigsaw.w3.org/css-validator/].

6: Like most other web standards SVG is defined by the W3C. The spec is at [http://www.w3.org/TR/SVG/]. There are several SVG viewers. Adobe is pushing SVG hard as the export format for Illustrator and in its new web tools suite. A browser plug-in for SVG is available from Adobe, [http://www.adobe.com/svg/viewer/install/]. Also an open source SVG viewer called Batik [http://xml.apache.org/batik/] has been developed by the Apache project.

7: The standard for DocBook is maintained as a book from O'Reilly Publishing [http://oreilly.com/] and the latest version is available online at [http://www.docbook.org].

8: MathML is another W3C recommendation [http://www.w3.org/Math/]. There are several implementations including one integrated into the Mozilla web browser [http://www.mozilla.org].


Copyright (c) 2001 Will Holcomb (will@himinbi.org)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation.