Validation by parser

trying to answer a question by Niels Dawartz

Validation Strategy
1. Two main strategies
2. Evolution
Java technique for DTD-based validation
1. DTD-based validation by the parser is provided by the SAX API,
2. DTD-based validation by the parser is also possible with the DOM API.
Java technique for schema-based validation by the parser
1. Attaching an XML schema to an XML document
2. Schema-based validation by the parser is also possible with the DOM API.

Validation Strategy
- Two main strategies
  given a validator (operator), an object (XML document to be validated) and a reference (DTD, XMLS or RNG):
  1. The reference is chosen by the validator,
    i.e. the validation gets two arguments : validate the object against the reference
    typically :
    - xmllint myFile.xml --dtdvalid myGrammar.dtd
    - xmllint myFile.xml --schema mySchema.xsd
    - xmllint myFile.xml --relaxng myGrammar.rng
  2. The reference is offered by the object
    (e.g. by a <!DOCTYPE...> declaration, or by a suitable attribute for XML Schemas, see later),
    i.e. the validation gets only one argument : validate the object (against the reference to which it points)
    typically :
    xmllint --valid myFile.xml
    
    This strategy opens the possibility of having the validation operated by the parser, as we shall see in detail later.
  Note that even if the object does offer a reference, two-argument validation will automatically override the offered reference :
  if myFile.xml contains a <!DOCTYPE...> declaration,
  - "xmllint myFile.xml --dtdvalid myGrammar.dtd" will use myGrammar.dtd as reference;
  - and the "illogical" invocation "xmllint --valid myFile.xml --dtdvalid myGrammar.dtd" will also use myGrammar.dtd.
- Evolution
  Recall that, as inherited from SGML, a DTD - or a reference to a DTD - is an integral part of the XML document it describes.
  - For XML files, there is a specific syntactic device <!DOCTYPE...>
  - For org.w3c.dom.Document objects, the DOM includes an interface for DTD objects, called org.w3c.dom.DocumentType,
    together with a method createDocumentType(...) in the org.w3c.dom.DOMImplementation interface.
  This is no longer the case for XML Schemas or RelaxNG grammars.
  
  Evolution is clearly in favor of strategy #1 (reference chosen by the validator).
  Strategy #2 (validation by the parser) is well documented for DTD-based validation; it also works for XML schemas (as we shall see here), but no longer for RelaxNG.
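  To make the DOM side of this remark concrete, here is a minimal sketch of how a DocumentType object can be attached to a new Document through DOMImplementation.createDocumentType(...). The class name DoctypeSketch and the helper newDocumentWithDoctype are illustrative assumptions; the file name myGrammar.dtd is simply reused from the xmllint examples above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Illustrative sketch only: class and method names are assumptions, not part of the course code.
final class DoctypeSketch {

    /** Builds an empty document whose doctype points at the given DTD file. */
    static Document newDocumentWithDoctype(String rootName, String dtdSystemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            // Arguments: qualified name of the root, public id (none here), system id.
            DocumentType doctype = impl.createDocumentType(rootName, null, dtdSystemId);
            // Arguments: namespace URI (none here), root element name, doctype.
            return impl.createDocument(null, rootName, doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DOM implementation", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocumentWithDoctype("list", "myGrammar.dtd");
        DocumentType doctype = doc.getDoctype();
        System.out.println(doctype.getName() + " -> " + doctype.getSystemId());
    }
}
```

  No comparable interface exists in the DOM for attaching an XML Schema or a RelaxNG grammar, which is the asymmetry pointed out above.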
Java technique for DTD-based validation
- DTD-based validation by the parser is provided by the SAX API,
  which appeared as early as 1998.
  
  This involves a small hierarchy of exceptions to deal with
  - XML syntax (well-formedness) errors, which make the file unreadable and are therefore fatal : the parse is aborted with an org.xml.sax.SAXException;
  - DTD-related validity errors. These are not fatal, so they can be collected as exhaustively as possible : each one is described by an org.xml.sax.SAXParseException (a subclass of SAXException).
  There is also an org.xml.sax.ErrorHandler interface, with the callbacks warning(), error() and fatalError(), for collecting and displaying validity issues: to perform properly, the SAX parser must be equipped with a handler object of type ErrorHandler.
  
  The whole error-handling mechanism of the SAX API is reused in DOM parser validation, for DTDs as well as for XML Schemas (see later).
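  The mechanism just described can be sketched as follows. This is a minimal illustration, not the course's own code: the class SaxValidationSketch and its nested CollectingHandler are hypothetical names, and the documents are parsed from in-memory strings with an internal DTD subset so that the example does not depend on any file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Illustrative sketch only: all names below are assumptions.
final class SaxValidationSketch {

    /** Hypothetical handler: stores validity problems instead of stopping at the first one. */
    static final class CollectingHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override public void warning(SAXParseException e) { problems.add("warning: " + e.getMessage()); }
        @Override public void error(SAXParseException e) { problems.add("error: " + e.getMessage()); }
        // A syntax error cannot be recovered from: rethrow it so that the parse stops.
        @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
    }

    /** Returns the problems found; a syntax error shows up as an entry starting with "fatal:". */
    static List<String> validate(String xml) {
        CollectingHandler handler = new CollectingHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);              // ask for DTD-based validation
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);          // without a handler, problems are not collected
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            handler.problems.add("fatal: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE list [<!ELEMENT list (student*)> <!ELEMENT student EMPTY>"
                   + " <!ATTLIST student name CDATA #REQUIRED mark CDATA #REQUIRED>]>";
        System.out.println(validate(dtd + "<list><student name='Luc' mark='12'/></list>"));
        System.out.println(validate(dtd + "<list><student name='Luc'/><teacher/></list>"));
        System.out.println(validate(dtd + "<list><student></list>"));
    }
}
```

  Collecting problems in a list, rather than throwing at the first one, is what lets a single run report as many validity issues as possible.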
- DTD-based validation by the parser is also possible with the DOM API.
  1. Note that the parse() method of the javax.xml.parsers.DocumentBuilder parser does raise a SAXException when it encounters an XML syntax error.
    A more correct realization of the main() method of our first DOM example would be
    
    public static void main(String[] args) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = null;
        try {
            doc = parser.parse(args[0]);
        } catch (org.xml.sax.SAXException e) {
            System.out.println("Syntax error : " + e.getMessage());
        }
        if (doc != null) {
            float avr = average_1(doc);
            System.out.println("\nAverage is : " + avr);
        }
        System.out.println("\nDone");
    }//main
  2. To endow your DOM parser with a (DTD-based) validating capacity, proceed in two steps:
    1. The javax.xml.parsers.DocumentBuilderFactory that generates the parser must receive the setValidating(true) command.
    2. The parser itself must be equipped with a handler object of type ErrorHandler.
    Suppose that
    - your favorite implementation of the ErrorHandler interface is class MyHandler,
    - and that it provides a method showWarnings() which will raise a special ValidationException if there is any warning.
    A typical sequence looks like this:
    
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder parser = dbf.newDocumentBuilder();
        MyHandler handler = new MyHandler();
        parser.setErrorHandler(handler);
        Document doc = null;
        try {
            // parse() throws a SAXException on an XML syntax error
            doc = parser.parse(fileIn);
            // raises a ValidationException if the handler has recorded anything
            handler.showWarnings(...);
            ...deal with your doc, knowing it is valid...
        }
        catch (ValidationException e) {
            System.out.println("Invalid document ..." + e.getMessage());
        }
        catch (SAXException e) {
            System.out.println("XML syntax error ..." + e.getMessage());
        }
    
    See a complete realization here : MyHandler, ValidationException, Average_1D,
    and 3 test files NM1-1.xml, NM1-2.xml, NM1-3.xml, as well as the DTD.
    
    Important note : It is of course good practice to also catch java.io.FileNotFoundException (or, more generally, IOException).
    If the DTD text itself happens to be incorrect, a SAXException will be raised.
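  Since the linked classes are not reproduced on this page, here is a minimal self-contained sketch of the two steps above. RecordingHandler and InvalidDocumentException are hypothetical stand-ins for MyHandler and ValidationException (their real code may differ), throwIfInvalid() is an assumed method name, and the documents are read from strings carrying an internal DTD subset.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Illustrative sketch only: all names below are assumptions.
final class DomDtdValidationSketch {

    /** Hypothetical stand-in for ValidationException (unchecked, to keep the sketch short). */
    static final class InvalidDocumentException extends RuntimeException {
        InvalidDocumentException(String message) { super(message); }
    }

    /** Hypothetical stand-in for MyHandler: records every non-fatal problem. */
    static final class RecordingHandler implements ErrorHandler {
        private final List<String> problems = new ArrayList<>();

        @Override public void warning(SAXParseException e) { record(e); }
        @Override public void error(SAXParseException e) { record(e); }
        @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }

        private void record(SAXParseException e) {
            problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
        }

        /** Throws if anything was recorded during the parse. */
        void throwIfInvalid() {
            if (!problems.isEmpty()) {
                throw new InvalidDocumentException(String.join("; ", problems));
            }
        }
    }

    /** Parses and validates; the result starts with "valid", "invalid" or "syntax error". */
    static String check(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);                         // step 1: validating factory
            DocumentBuilder parser = dbf.newDocumentBuilder();
            RecordingHandler handler = new RecordingHandler();
            parser.setErrorHandler(handler);                 // step 2: error handler
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            handler.throwIfInvalid();
            return "valid, root element is " + doc.getDocumentElement().getTagName();
        } catch (InvalidDocumentException e) {
            return "invalid: " + e.getMessage();
        } catch (SAXException e) {
            return "syntax error: " + e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE list [<!ELEMENT list (student*)> <!ELEMENT student EMPTY>"
                   + " <!ATTLIST student name CDATA #REQUIRED mark CDATA #REQUIRED>]>";
        System.out.println(check(dtd + "<list><student name='Luc' mark='12'/></list>"));
        System.out.println(check(dtd + "<list><student name='Luc'/></list>"));
        System.out.println(check(dtd + "<list><student></list>"));
    }
}
```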
Java technique for schema-based validation by the parser
Schema-based validation by strategy #1 in Java is operated with the javax.xml.validation API. See here.
This section is devoted to the implementation of strategy #2.
As a note of caution, observe that the XML Schema Recommendation explicitly states that the inclusion of schemaLocation/noNamespaceSchemaLocation attributes is only a hint; it does not mandate that these attributes must be used to locate schemas.
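For comparison with what follows, strategy #1 with the javax.xml.validation API can be sketched as below: the schema is chosen by the caller and the document need not point to it. This is a hedged illustration only; the class name, the helper methods compile and isValid, and the small names-and-marks schema held in the XSD constant are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Illustrative sketch only: all names and the schema below are assumptions.
final class ExternalSchemaValidationSketch {

    /** Hypothetical schema: a list of students, each with a name and an integer mark. */
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + " <xs:element name='list'><xs:complexType><xs:sequence>"
        + "  <xs:element name='student' maxOccurs='unbounded'><xs:complexType>"
        + "   <xs:attribute name='name' type='xs:string' use='required'/>"
        + "   <xs:attribute name='mark' type='xs:integer' use='required'/>"
        + "  </xs:complexType></xs:element>"
        + " </xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Compiles the reference (second argument of the validation). */
    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("The schema itself is incorrect", e);
        }
    }

    /** Validates the object against the reference; false also covers ill-formed XML. */
    static boolean isValid(Schema schema, String xml) {
        try {
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, the first error is thrown as a SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = compile(XSD);
        System.out.println(isValid(schema, "<list><student name='Luc' mark='12'/></list>"));
        System.out.println(isValid(schema, "<list><student name='Luc' mark='twelve'/></list>"));
    }
}
```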
1. Attaching an XML schema to an XML document
  - Without namespaces in the document
    the path (or the URL) toward the schema file is indicated by an attribute "xsi:noNamespaceSchemaLocation" in the root tag,
    with the namespace xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance".
    
    Example :
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <list xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="../../NamesMarks/NM1-1.xsd">
      <student name="Luc" mark="12"/>
      <student name="Hélène" mark="17"/>
      ......
    </list>
  - With namespaces
    Please review Namespaces in XMLS for the notion of targetNamespace of an XML schema.
    In principle, each namespace ns used in the document should be assigned to a schema having ns as its targetNamespace,
    by means of an xsi:schemaLocation attribute whose value is an alternating list of namespace URIs and paths to schema files (separated by blank spaces).
    Example given in the W3C Recommendation :
    
    <stylesheet xmlns="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/1999/XSL/Transform
                                    http://www.w3.org/1999/XSL/Transform.xsd
                                    http://www.w3.org/1999/xhtml
                                    http://www.w3.org/1999/xhtml.xsd">
    
    However,
    - I did not find any working example anywhere with more than one (namespace URI / path-to-schema-file) pair.
    - Reliable tutorials such as Liquid Technologies XML Schema Tutorial, Part 4 : Using XML Schema Namespaces
      use only the (namespace URI / path-to-schema-file) pair relative to the root tag of the document.
    - Trying to use more than one pair, as suggested by the W3C Recommendation, failed in my own tests.
    Therefore, I suggest disregarding the official recommendation on this point, and adopting the following rule :
    the value of the xsi:schemaLocation attribute is a string made up of the namespace URI of the root tag, followed by the path to the schema having that namespace as its targetNamespace.
    
    Our favorite example :
    
    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:epita="http://epita/masters/international/"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/1999/02/22-rdf-syntax-ns# ../../../XSD-NS/RDF/NNrdf.xsd"
             xml:base="http://epita/masters/international/perso">
      <rdf:Description rdf:about="#Elisabeth"><epita:mark>07</epita:mark></rdf:Description>
      <rdf:Description rdf:about="#Luc"><epita:mark>12</epita:mark></rdf:Description>
      <rdf:Description rdf:about="#Maurice"><epita:mark>18</epita:mark></rdf:Description>
      <rdf:Description rdf:about="#Juliette"><epita:mark>07</epita:mark></rdf:Description>
    </rdf:RDF>
  - Cautionary note
    Recall that the XML Schema Recommendation treats the schemaLocation and noNamespaceSchemaLocation attributes as hints only: a processor is not required to use them to locate schemas.
    
    This is one more indication of the evolution from strategy #2 to strategy #1.
2. Schema-based validation by the parser is also possible with the DOM API.
  The technique replicates the one used for DTD-based validation, with two additions :
  1. The DocumentBuilderFactory that generates the parser must additionally receive the setNamespaceAware(true) command.
    This is mandatory since the document uses the http://www.w3.org/2001/XMLSchema-instance namespace.
    On namespace awareness, see Namespaces in DOM-Java.
  2. The same DocumentBuilderFactory object must also be provided with an attribute "http://java.sun.com/xml/jaxp/properties/schemaLanguage"
    with a value indicating that the schema language is XML Schema (as opposed to RelaxNG) :
    dbf.setAttribute( "http://java.sun.com/xml/jaxp/properties/schemaLanguage", XMLConstants.W3C_XML_SCHEMA_NS_URI );
  With the same conventions as earlier, a typical sequence looks like this:
  
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);
        dbf.setAttribute( "http://java.sun.com/xml/jaxp/properties/schemaLanguage", XMLConstants.W3C_XML_SCHEMA_NS_URI);
        DocumentBuilder parser = dbf.newDocumentBuilder();
        MyHandler handler = new MyHandler();
        parser.setErrorHandler(handler);
        Document doc = null;
        try {
            // parse() throws a SAXException on an XML syntax error
            doc = parser.parse(fileIn);
            // raises a ValidationException if the handler has recorded anything
            handler.showWarnings(...);
            ...deal with your doc, knowing it is valid...
        }
        catch (ValidationException e) {
            System.out.println("Invalid document ..." + e.getMessage());
        }
        catch (SAXException e) {
            System.out.println("XML syntax error ..." + e.getMessage());
        }
  
  Note that the same implementation works equally well for documents with or without namespaces.
  
  See two complete examples:
  - without namespace : MyHandler, ValidationException, Average_1XP,
    and 3 test files NM1-1.xml, NM1-2.xml, NM1-3.xml, as well as the schema.
  - with namespaces : Average_RVP, test file, schema.
    Note that this example requires the use of the whole set of DOM methods "with namespace".
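  As the linked examples are not reproduced on this page, here is a minimal self-contained sketch of the no-namespace case. It assumes the JDK's built-in parser, which honours the schemaLanguage attribute. The class name, the inline schema, the anonymous handler and the use of a temporary file as the target of xsi:noNamespaceSchemaLocation are all assumptions made so that the example runs without external files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Illustrative sketch only: all names and the schema below are assumptions.
final class DomSchemaValidationSketch {

    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + " <xs:element name='list'><xs:complexType><xs:sequence>"
        + "  <xs:element name='student' maxOccurs='unbounded'><xs:complexType>"
        + "   <xs:attribute name='name' type='xs:string' use='required'/>"
        + "   <xs:attribute name='mark' type='xs:integer' use='required'/>"
        + "  </xs:complexType></xs:element>"
        + " </xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Writes the schema to a temporary file and returns a document that points at it. */
    static String documentWithMark(String mark) {
        try {
            Path xsdFile = Files.createTempFile("sketch", ".xsd");
            xsdFile.toFile().deleteOnExit();
            Files.write(xsdFile, XSD.getBytes(StandardCharsets.UTF_8));
            return "<list xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
                 + " xsi:noNamespaceSchemaLocation='" + xsdFile.toUri() + "'>"
                 + "<student name='Luc' mark='" + mark + "'/></list>";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the problems reported while the parser validates against the offered schema. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);   // needed for the xsi: attributes
            dbf.setValidating(true);
            dbf.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                             XMLConstants.W3C_XML_SCHEMA_NS_URI);
            DocumentBuilder parser = dbf.newDocumentBuilder();
            parser.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { problems.add(e.getMessage()); }
                @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            problems.add("fatal: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate(documentWithMark("12")));
        System.out.println(validate(documentWithMark("twelve")));
    }
}
```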
