XML-Schemas without Namespaces

Jean-François Perrot

  1. General ideas
    1. Shortcomings of DTDs
    2. XML-Schemas are the W3C's answer

  2. Types for XML trees
    1. Datatypes
    2. Simple types
    3. Complex types

  3. Three ways of giving a type to an element or attribute
    1. Declaring types as separate entities (with their own names)
    2. Associating an anonymous type directly to an element (attribute)
    3. Reference
    4. Extending a schema : xsd:include / xsd:import

  4. Schema-based Validation
    1. Validation strategy
    2. Schema-based validation with strategy A
    3. Schema-based validation with strategy B

  5. Grammar vs. Type System


  1. General ideas

    1. Shortcomings of DTDs

      • DTDs do not accommodate namespaces, i.e. namespace prefixes are seen as integral parts of names.
      • DTDs offer only a limited capacity to specify constraints on strings (no regular expressions).

    2. XML-Schemas are the W3C's answer

      to the need to improve on DTDs for specifying the structure of XML documents.
      The complexity of the system prompted the OASIS consortium to come up with another proposal (Relax-NG), which is easier to use.
      However, XML-Schemas remain the standard in a number of protocols, notably Web Services.

  2. Types for XML trees

    Types are written in XML syntax, using the namespace declaration xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    (see the examples below).
    1. Datatypes

      for elementary data (strings, numbers, dates, etc.), e.g. xsd:string, xsd:int, xsd:date.
      A (very successful) collection : http://www.w3.org/TR/xmlschema-2/

    2. Simple types

      (<xsd:simpleType>) for elements with only text content.
      The type describes the text content (string),
      mainly by restrictions on a datatype.
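
      As a minimal sketch of restriction on a datatype, the schema below defines two simple types. The names MarkType, StudentIdType, mark and studentId, the bounds and the pattern are hypothetical, chosen only for illustration. The second type uses a regular expression, which a DTD cannot express.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: a decimal number restricted to the range 0..20 -->
  <xsd:simpleType name="MarkType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="20"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Hypothetical type: a string constrained by a regular expression
       (two capital letters followed by four digits) -->
  <xsd:simpleType name="StudentIdType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Z]{2}[0-9]{4}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Elements with only text content, typed by the simple types above -->
  <xsd:element name="mark" type="MarkType"/>
  <xsd:element name="studentId" type="StudentIdType"/>

</xsd:schema>
```

      With such a schema, <mark>12.5</mark> would be accepted and <mark>25</mark> rejected.
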
    3. Complex types

      (<xsd:complexType>) for everything else.
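      The structure of child-nodes is mainly either sequence (xsd:sequence) or choice (xsd:choice); a third compositor, xsd:all, allows the children to appear in any order.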
      N.B. Attributes are declared after the child-nodes.
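
      The following sketch shows one possible complex type combining a sequence and a choice, with the attribute declared after the child-nodes. All names (StudentType, student, name, mark, absent, id) are assumptions made for the example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="StudentType">
    <!-- Child-nodes: a name, followed by either a mark or an absence note -->
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:choice>
        <xsd:element name="mark" type="xsd:decimal"/>
        <xsd:element name="absent" type="xsd:string"/>
      </xsd:choice>
    </xsd:sequence>
    <!-- Attributes come after the description of the child-nodes -->
    <xsd:attribute name="id" type="xsd:string" use="required"/>
  </xsd:complexType>

  <xsd:element name="student" type="StudentType"/>

</xsd:schema>
```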

  3. Three ways of giving a type to an element or attribute

    1. Declaring types as separate entities (with their own names)

      and explicitly associating the element or attribute with the type.

      • <xsd:complexType name = "myType">...</xsd:complexType>

        <xsd:element name= "myElement" type = "myType" />


      • <xsd:simpleType name = "myAttrType">...</xsd:simpleType>

        <xsd:attribute name= "myAttr" type = "myAttrType" />



      In this way elements (attributes) with different names can share the same type.
      In other words, types may be reused.
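
      A minimal sketch of such reuse, assuming a hypothetical type PersonType shared by two hypothetical elements student and teacher :

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- The type is declared once, as a separate named entity -->
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="firstName" type="xsd:string"/>
      <xsd:element name="lastName" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Two elements with different names share the same type -->
  <xsd:element name="student" type="PersonType"/>
  <xsd:element name="teacher" type="PersonType"/>

</xsd:schema>
```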

    2. Associating an anonymous type directly to an element (attribute)

      • <xsd:element name= "myElement">
           <xsd:complexType>
          
        ...
           </xsd:complexType>
        </xsd:element>


      • <xsd:attribute name= "myAttr">
           <xsd:simpleType>
           ...
           </xsd:simpleType>

        </xsd:attribute>

      Systematic use of this technique leads to the so-called Russian doll design.
      No type sharing, no reuse.

      Examples : Names & Marks #1 , Names & Marks #2, [candidate XML files #1, #2]
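
      For illustration only, here is what a small Russian doll schema might look like. The element names classroom, student, name and mark are hypothetical and are not taken from the examples cited above. Each anonymous type is nested inside the element it describes, so none of them can be referred to from elsewhere.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Only one global declaration: the root element -->
  <xsd:element name="classroom">
    <xsd:complexType>
      <xsd:sequence>
        <!-- The type of student is anonymous and local to classroom -->
        <xsd:element name="student" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="name" type="xsd:string"/>
              <xsd:element name="mark" type="xsd:decimal"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```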

    3. Reference

      Instead of defining a type, simply say that an element (attribute) has the same name and type as another one, declared at the top level of the schema,
      by using a reference to it (attribute ref).
      Examples :
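
      A minimal sketch of the reference mechanism, assuming hypothetical global declarations for an element name and an attribute id, which are then referenced from inside the element student :

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Global (top-level) declarations, which can be referenced -->
  <xsd:element name="name" type="xsd:string"/>
  <xsd:attribute name="id" type="xsd:string"/>

  <xsd:element name="student">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Same name and same type as the global element "name" -->
        <xsd:element ref="name"/>
      </xsd:sequence>
      <!-- Same name and same type as the global attribute "id" -->
      <xsd:attribute ref="id" use="required"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```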

    4. Extending a schema : xsd:include / xsd:import


      N.B.
      • Use xsd:include to bring in a schema from the same or no namespace.
      • Use xsd:import to bring in a schema from a different namespace (see later).
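
      As a sketch of xsd:include in the no-namespace case, suppose that a hypothetical file common-types.xsd, located in the same directory, defines a type called PersonType. Both the file name and the type name are assumptions. A second schema could then reuse that type as follows :

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Brings in every declaration of common-types.xsd (hypothetical file),
       which must have the same target namespace or, as here, none -->
  <xsd:include schemaLocation="common-types.xsd"/>

  <!-- PersonType is assumed to be defined in the included schema -->
  <xsd:element name="student" type="PersonType"/>

</xsd:schema>
```
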
  4. Schema-based Validation

    1. Validation strategy

      There are two main approaches to validation. Given a validator (the operator) and an object (the XML document to be validated),
      who chooses the reference (or norm, or standard) to be used for validation ?
      A. Either the reference is chosen by the validator,
        i.e. the validation process gets two arguments : validate the object against the (explicitly chosen) reference.

      B. Or the reference is provided by the object,
        and the validation process gets only one argument : validate the object (against the implicitly given reference).
        In particular, validation may be performed by the parser.

      Clearly, the two strategies reflect different attitudes towards validation. Additionally :

      • DTDs are meant to be used with strategy B
        • An indication of the DTD is an integral part of the specified XML document.
          • For XML files, there is a specific syntactic device <!DOCTYPE...>
          • For org.w3c.dom.Document objects, the DOM includes an interface for DTD objects, called org.w3c.dom.DocumentType,
            together with a method createDocumentType(...) in the org.w3c.dom.DOMImplementation interface.

          Accordingly : xmllint --valid myFile.xml
          See here for a Java implementation of DTD-based validation by the parser.

        • You need a special tool like xmllint to validate against a "foreign" DTD.
          "xmllint myFile.xml --dtdvalid myGrammar.dtd" will use myGrammar.dtd as reference,
          even if myFile.xml does contain (a pointer to) a DTD.
          Actually, the "illogical" invocation "xmllint --valid myFile.xml --dtdvalid myGrammar.dtd" will also use myGrammar.dtd.

      • On the contrary, XML Schemas are meant to be used with strategy A (see below).
        From a historical perspective, there is clearly a shift in technology from B to A.

      • However, there is a specific syntax for attaching a schema to a file, so as to use strategy B as well.
        This feature is used by Eclipse.
        Note that this is no longer possible for the third validation framework : RelaxNG.
    2. Schema-based validation with strategy A

      • Validation with xmllint
        % xmllint --noout myFile.xml --schema mySchema.xsd

      • Validation with javax.xml.validation

        • The basic mechanism is set up in class SchemaValidate.
          Note that the Validator object does nothing if validation succeeds, and raises an exception if it fails.
          As a consequence, a failed validation stops the process (unless the exception is caught).
          To fully appreciate this feature, compare with PHP's handling of the same problem, where DOMDocument::schemaValidate returns a boolean value, does not stop the process, and leaves error-handling to the programmer : Average_1V.php (see below)

        • Here is a simple illustration of the "natural" use of this technique for checking the validity of a document prior to using it.
          We apply it to our first example Average_1, computing the average mark of a Names & Marks, model 1 document :
          see Average_1V.
          Note that we check on the parsed Document (in order to avoid duplication of disk access), by means of a DOMSource.
    3. Schema-based validation with strategy B

      • Attaching a schema
        by means of a specific attribute of the root tag, belonging to another namespace.

        <?xml version="1.0" ?>
        <myRootTag xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="URL-or-Path-pointing-to/mySchema.xsd" >


      • Java implementation

      • Eclipse

  5. Grammar vs. Type System

    Two points of view on essentially the same contents.

    1. A Grammar specifies how entities are composed, in a top-down fashion.
      • In Computer Science, the word grammar usually refers to context-free grammars, used to describe the concrete syntax of programming languages.
        Such grammars describe the structure of character strings (programs).

      • DTDs are tree-grammars, describing the structure of XML trees.
        A tree like this one will be deemed correct with respect to this DTD if

        1. its root tag conforms to the grammar rule
          <!ELEMENT Car (Body, Engine, Transmission)>
          <!ATTLIST Car make CDATA #REQUIRED>
          <!ATTLIST Car model CDATA #REQUIRED>


        2. the 3 child nodes of the root tag all conform to the respective rules
          <!ELEMENT Body (Hood)>
          <!ATTLIST Body color CDATA #REQUIRED>

          <!ELEMENT Engine (Cylinders, Ignition)>

          <!ELEMENT Transmission (GearBox, FrontAxle, RearAxle)>
          <!ATTLIST Transmission type (automatic | manual) #REQUIRED>
          <!ATTLIST Transmission gear_nb (3 | 4 | 5) #REQUIRED>

        3. the 6 grandchild nodes of the root also conform to the respective rules
          <!ELEMENT Hood (#PCDATA)>
          <!ELEMENT Cylinders EMPTY>
          <!ELEMENT Ignition (#PCDATA)>
          <!ELEMENT GearBox EMPTY>
          <!ELEMENT FrontAxle EMPTY>
          <!ELEMENT RearAxle EMPTY>

    2. The idea of a Type System comes from programming languages :
      given a program, the aim is to attach to each construct of the program a qualification (called a type) in a bottom-up fashion.
      The type-checking process is conducted according to typing rules.
      The program will be correct if the type-checking process succeeds in assigning a type to the whole program.

      An XML Schema is a type system.
      For instance the same XML tree will be checked by defining a type for each subtree and for each attribute :
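
      The schema below is a sketch of what such a type system might look like for the Car tree described by the DTD above. It is an illustration under stated assumptions, not the author's own schema : all type names (EmptyType, TransmissionKind, GearNumber, BodyType, EngineType, TransmissionType, CarType) are hypothetical, and the number of gears is expressed as a numeric range rather than as an enumeration. Types are given bottom-up : first the leaves and the attributes, then the subtrees, finally the whole tree.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Leaves: elements with no content (EMPTY in the DTD) -->
  <xsd:complexType name="EmptyType"/>

  <!-- Attribute types -->
  <xsd:simpleType name="TransmissionKind">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="automatic"/>
      <xsd:enumeration value="manual"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Assumption: a numeric range 3..5 instead of the DTD enumeration -->
  <xsd:simpleType name="GearNumber">
    <xsd:restriction base="xsd:int">
      <xsd:minInclusive value="3"/>
      <xsd:maxInclusive value="5"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Subtrees: one type for each child of the root -->
  <xsd:complexType name="BodyType">
    <xsd:sequence>
      <xsd:element name="Hood" type="xsd:string"/>
    </xsd:sequence>
    <xsd:attribute name="color" type="xsd:string" use="required"/>
  </xsd:complexType>

  <xsd:complexType name="EngineType">
    <xsd:sequence>
      <xsd:element name="Cylinders" type="EmptyType"/>
      <xsd:element name="Ignition" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="TransmissionType">
    <xsd:sequence>
      <xsd:element name="GearBox" type="EmptyType"/>
      <xsd:element name="FrontAxle" type="EmptyType"/>
      <xsd:element name="RearAxle" type="EmptyType"/>
    </xsd:sequence>
    <xsd:attribute name="type" type="TransmissionKind" use="required"/>
    <xsd:attribute name="gear_nb" type="GearNumber" use="required"/>
  </xsd:complexType>

  <!-- The whole tree: the type of the root element -->
  <xsd:complexType name="CarType">
    <xsd:sequence>
      <xsd:element name="Body" type="BodyType"/>
      <xsd:element name="Engine" type="EngineType"/>
      <xsd:element name="Transmission" type="TransmissionType"/>
    </xsd:sequence>
    <xsd:attribute name="make" type="xsd:string" use="required"/>
    <xsd:attribute name="model" type="xsd:string" use="required"/>
  </xsd:complexType>

  <xsd:element name="Car" type="CarType"/>

</xsd:schema>
```

      Validation succeeds when the type CarType can be assigned to the root element, which in turn requires that each subtree and each attribute receives its own type.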