Validation by parser

trying to answer a question by Niels Dawartz


  1. Validation Strategy
    1. Two main strategies
    2. Evolution

  2.  Java technique for DTD-based validation
    1. DTD-based validation by the parser is provided by the SAX API.
    2. DTD-based validation by the parser is also possible with the DOM API.

  3. Java technique for schema-based validation by the parser
    1. Attaching an XML schema to an XML document
    2. Schema-based validation by the parser is also possible with the DOM API.

  3. Java technique for schema-based validation by the parser

    Schema-based validation by strategy #1 in Java is performed with the javax.xml.validation API.
    This section is devoted to the implementation of strategy #2.
    As a note of caution, observe that the XML Schema Recommendation explicitly states that the inclusion of schemaLocation/noNamespaceSchemaLocation attributes is only a hint; it does not mandate that these attributes must be used to locate schemas.

    1. Attaching an XML schema to an XML document

      • Without namespaces in the document
        the path (or the URL) to the schema file is given by an "xsi:noNamespaceSchemaLocation" attribute on the root tag,
        where the xsi prefix is declared by xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance".

        Example :
        <?xml version="1.0" encoding="UTF-8" standalone="no"?>
        <list xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation = "../../NamesMarks/NM1-1.xsd" >

          <student name="Luc" mark="12"/>
          <student name="Hélène" mark="17"/>
          ......
        </list>


      • With namespaces
        Please review Namespaces in XMLS for the notion of targetNamespace of an XML schema.
        In principle, each namespace ns used in the document should be associated with a schema having ns as its targetNamespace,
        by means of an xsi:schemaLocation attribute whose value is an alternating list of namespace URIs and paths to schema files (separated by white space).
        Example given in the W3C Recommendation:

         <stylesheet xmlns="http://www.w3.org/1999/XSL/Transform"
                    xmlns:html="http://www.w3.org/1999/xhtml"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="http://www.w3.org/1999/XSL/Transform
                                        http://www.w3.org/1999/XSL/Transform.xsd

                                        http://www.w3.org/1999/xhtml
                                        http://www.w3.org/1999/xhtml.xsd
        ">
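
        The alternating layout of the attribute value (namespace URI, schema path, namespace URI, schema path, ...) can be made concrete with a small sketch. The class below is not part of any XML API: SchemaLocationPairs and parse are hypothetical names introduced here, and the code only splits the value on white space and pairs up the tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaLocationPairs {

    /** Splits an xsi:schemaLocation value into (namespace URI -> schema location) pairs. */
    public static Map<String, String> parse(String schemaLocation) {
        Map<String, String> pairs = new LinkedHashMap<>();
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        // Tokens are separated by any run of white space (blanks, tabs, line breaks).
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException(
                "schemaLocation needs an even number of tokens, got " + tokens.length);
        }
        for (int i = 0; i < tokens.length; i += 2) {
            pairs.put(tokens[i], tokens[i + 1]);   // namespace URI, then schema location
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The attribute value of the W3C example, line breaks included.
        String value = "http://www.w3.org/1999/XSL/Transform\n"
                     + "   http://www.w3.org/1999/XSL/Transform.xsd\n\n"
                     + "   http://www.w3.org/1999/xhtml\n"
                     + "   http://www.w3.org/1999/xhtml.xsd\n";
        parse(value).forEach((ns, loc) -> System.out.println(ns + " -> " + loc));
    }
}
```

        An odd number of tokens is rejected here because a namespace URI without a schema location cannot form a pair.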



        However,
        • I did not find any working example anywhere with more than one (namespace URI / path-to-schema-file) pair.
        • Reliable tutorials such as Liquid Technologies XML Schema Tutorial, Part 4 : Using XML Schema Namespaces
          use only the (namespace URI / path-to-schema-file) pair for the namespace of the root tag of the document.
        • In my tests, using more than one pair, as suggested by the W3C Recommendation, failed.

        Therefore, I suggest disregarding the official recommendation and adopting the following rule:
        the value of the xsi:schemaLocation attribute is a string made up of the namespace URI of the root tag, followed by white space and the path to the schema having that namespace as its targetNamespace.

        Our favorite example :

        <?xml version="1.0" encoding="utf-8"?>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:epita="http://epita/masters/international/"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation=
                 "http://www.w3.org/1999/02/22-rdf-syntax-ns# ../../../XSD-NS/RDF/NNrdf.xsd"

                 xml:base="http://epita/masters/international/perso">
          <rdf:Description rdf:about="#Elisabeth"><epita:mark>07</epita:mark></rdf:Description>
          <rdf:Description rdf:about="#Luc"><epita:mark>12</epita:mark></rdf:Description>
          <rdf:Description rdf:about="#Maurice"><epita:mark>18</epita:mark></rdf:Description>
          <rdf:Description rdf:about="#Juliette"><epita:mark>07</epita:mark></rdf:Description>
        </rdf:RDF>



      • Cautionary note
        The XML Schema Recommendation explicitly states that the inclusion of schemaLocation/noNamespaceSchemaLocation attributes is only a hint; it does not mandate that these attributes must be used to locate schemas.

        This is one more indication of the evolution from strategy #2 to strategy #1.
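
        The "only a hint" point can be illustrated with the javax.xml.validation API mentioned at the start of this section: the application hands the schema to the validator itself, and the document carries no xsi attribute at all. This is a minimal sketch, not the course's own code; the class name HintFreeValidation, the method isValid and the inline schema for the list/student documents are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class HintFreeValidation {

    // Hypothetical schema for the <list>/<student> documents (an assumption of this sketch).
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='list'><xs:complexType><xs:sequence>"
      + "<xs:element name='student' maxOccurs='unbounded'><xs:complexType>"
      + "<xs:attribute name='name' type='xs:string' use='required'/>"
      + "<xs:attribute name='mark' type='xs:integer' use='required'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML text is well-formed and valid against XSD. */
    public static boolean isValid(String xml) throws IOException, SAXException {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // The schema is chosen by the application, not by an attribute in the document.
        Schema schema = sf.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // With no ErrorHandler set, errors and fatal errors are thrown.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<list><student name='Luc' mark='12'/></list>"));
        System.out.println(isValid("<list><student name='Luc' mark='twelve'/></list>"));
    }
}
```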

    2. Schema-based validation by the parser is also possible with the DOM API.

      The technique replicates the one used for DTD-based validation, with two additions :

      1. The DocumentBuilderFactory that generates the parser must additionally receive the setNamespaceAware(true) call.
        This is mandatory since the document uses the http://www.w3.org/2001/XMLSchema-instance namespace.
        On namespace awareness, see Namespaces in DOM-Java.

      2. The same DocumentBuilderFactory object must also be given an attribute "http://java.sun.com/xml/jaxp/properties/schemaLanguage"
        with a value indicating that the schema language is XML Schema (as opposed to RelaxNG) :
        dbf.setAttribute(
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage",  XMLConstants.W3C_XML_SCHEMA_NS_URI
        );


      With the same conventions as earlier, a typical sequence looks like this:

              DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
              dbf.setNamespaceAware(true);   // needed for the xsi: attributes
              dbf.setValidating(true);
              dbf.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage", XMLConstants.W3C_XML_SCHEMA_NS_URI);
              DocumentBuilder parser = dbf.newDocumentBuilder();
              MyHandler handler = new MyHandler();
              parser.setErrorHandler(handler);

              Document doc = null;
              try {
                  doc = parser.parse(fileIn);
                  handler.showWarnings(...);
                  // ...deal with your doc, knowing it is valid...
              }
              catch (ValidationException e) {
                  System.out.println("Invalid document ..." + e.getMessage());
              }
              catch (SAXException e) {
                  System.out.println("XML syntax error ..." + e.getMessage());
              }



      Note that the same implementation works equally well for documents with or without namespaces.
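
      To try the sequence above end to end, here is a self-contained sketch. MyHandler and ValidationException come from the earlier sections and are not reproduced here, so the sketch substitutes an anonymous ErrorHandler that collects error messages. The class ParserSideValidation, its methods, the file names marks.xsd / marks.xml and the namespace urn:example:marks are all assumptions introduced for illustration. The same validate method is applied to a document without namespaces and to one with a namespace.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class ParserSideValidation {

    /** Parses the file with schema validation switched on; returns the validation errors. */
    public static List<String> validate(File xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);
        dbf.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                         XMLConstants.W3C_XML_SCHEMA_NS_URI);
        DocumentBuilder parser = dbf.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        parser.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        parser.parse(xml);   // the schema is located through the xsi hint in the document
        return errors;
    }

    // Hypothetical schema; a null targetNs yields a schema without target namespace.
    static String schema(String targetNs) {
        String tns = targetNs == null ? ""
            : " targetNamespace='" + targetNs + "' elementFormDefault='qualified'";
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'" + tns + ">"
             + "<xs:element name='list'><xs:complexType><xs:sequence>"
             + "<xs:element name='student' maxOccurs='unbounded'><xs:complexType>"
             + "<xs:attribute name='name' type='xs:string' use='required'/>"
             + "<xs:attribute name='mark' type='xs:integer' use='required'/>"
             + "</xs:complexType></xs:element>"
             + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
    }

    /** Writes a schema and a one-student document to a temp directory, then validates. */
    public static List<String> check(String targetNs, String mark) throws Exception {
        Path dir = Files.createTempDirectory("xsd-demo");
        Files.write(dir.resolve("marks.xsd"), schema(targetNs).getBytes(StandardCharsets.UTF_8));
        String hint = targetNs == null
            ? "xsi:noNamespaceSchemaLocation='marks.xsd'"
            : "xmlns='" + targetNs + "' xsi:schemaLocation='" + targetNs + " marks.xsd'";
        String doc = "<list xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' " + hint + ">"
                   + "<student name='Luc' mark='" + mark + "'/></list>";
        Path xml = dir.resolve("marks.xml");
        Files.write(xml, doc.getBytes(StandardCharsets.UTF_8));
        return validate(xml.toFile());
    }

    public static void main(String[] args) throws Exception {
        System.out.println("no namespace, mark=12:  " + check(null, "12"));
        System.out.println("no namespace, mark=abc: " + check(null, "abc"));
        System.out.println("namespace, mark=12:     " + check("urn:example:marks", "12"));
        System.out.println("namespace, mark=abc:    " + check("urn:example:marks", "abc"));
    }
}
```

      Parsing from a File gives the parser a base location, so the relative path marks.xsd in the hint is resolved against the directory of the document.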

      See two complete examples: