Validation by parser
trying to answer a question by Niels Dawartz
- Validation Strategy
- Two main strategies
- Evolution
- Java technique for DTD-based validation
- DTD-based validation by the parser is provided by the SAX API
- DTD-based validation by the parser is also possible with the DOM API
- Java technique for schema-based validation by the parser
- Attaching an XML schema to an XML document
- Schema-based validation by the parser is also possible with the DOM API
- Validation Strategy
- Two main strategies
Given a validator (the operator), an object (the XML document to be validated) and a reference (DTD, XMLS or RNG), there are two main strategies:
- Strategy #1: the reference is chosen by the validator,
i.e. the validation takes two arguments: validate the object against the reference.
Typically:
xmllint myFile.xml --dtdvalid myGrammar.dtd
xmllint myFile.xml --schema mySchema.xsd
xmllint myFile.xml --relaxng myGrammar.rng
- Strategy #2: the reference is offered by the object
(e.g. by a <!DOCTYPE...> declaration, or by a suitable attribute for XML Schemas, see later),
i.e. the validation takes only one argument: validate the object (against the reference to which it points).
Typically:
xmllint --valid myFile.xml
This strategy opens the possibility of having the validation operated
by the parser, as we shall see in detail later.
Note that even if the object does offer a reference, two-argument validation will automatically override the offered reference:
- if myFile.xml contains a <!DOCTYPE...> declaration, "xmllint myFile.xml --dtdvalid myGrammar.dtd" will use myGrammar.dtd as reference;
- and the "illogical" invocation "xmllint --valid myFile.xml --dtdvalid myGrammar.dtd" will also use myGrammar.dtd.
- Evolution
Recall that, as inherited from SGML, a DTD (or a reference to a DTD) is an integral part of the XML document it describes.
- For XML files, there is a specific syntactic device: <!DOCTYPE...>.
- For org.w3c.dom.Document objects, the DOM includes an interface for DTD objects, called org.w3c.dom.DocumentType, together with a method createDocumentType(...) in the org.w3c.dom.DOMImplementation interface.
This is no longer the case for XML Schemas or RelaxNG grammars.
Evolution is clearly in favor of strategy #1 (reference chosen by the validator).
Strategy #2 (validation by the parser) is well documented for DTD-based validation; it also works for XML Schemas (as we shall see here), but no longer for RelaxNG.
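As an illustration of the DOM device mentioned above, here is a minimal sketch that attaches a DTD reference to a freshly created document. The class name DoctypeSketch, the helper newListDocument() and the file name NM1.dtd are assumptions made for the example; only createDocumentType(...) and createDocument(...) come from the DOM API.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class DoctypeSketch {

    // Builds an empty <list/> document whose DOCTYPE points to the given DTD.
    static Document newListDocument(String dtdSystemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            // Arguments: root element name, public identifier (none), system identifier.
            DocumentType doctype = impl.createDocumentType("list", null, dtdSystemId);
            // Arguments: namespace URI (none), root element name, DTD reference.
            return impl.createDocument(null, "list", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newListDocument("NM1.dtd"); // hypothetical DTD file name
        System.out.println(doc.getDoctype().getName());
        System.out.println(doc.getDoctype().getSystemId());
    }
}
```

The DOM offers no comparable interface for attaching an XML Schema or a RelaxNG grammar to a Document object.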
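- Java technique for DTD-based validation
- DTD-based validation by the parser is provided by the SAX API,
which appeared as early as 1998.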
This involves a whole hierarchy of exceptions to deal with
- XML syntax errors, which make the file unreadable and are therefore fatal: org.xml.sax.SAXException;
- DTD-related validity errors. These have a warning-like status: the parser reports them to the error handler as org.xml.sax.SAXParseException objects (a subclass of SAXException) and keeps parsing, so that they can be collected as exhaustively as possible.
There is also an org.xml.sax.ErrorHandler interface for collecting and displaying warnings about validity issues: to perform properly, the SAX parser must be equipped with a handler object of type ErrorHandler.
The whole error-handling mechanism of the SAX API is reused in DOM parser validation, for DTDs as well as for XML Schemas (see later).
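A minimal sketch of this mechanism with a SAX parser, assuming a DTD given as an internal subset so that no external file is needed. The names SaxDtdValidation, CollectingHandler and validate() are hypothetical; DefaultHandler is the standard SAX helper class that already implements ErrorHandler.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDtdValidation {

    // Internal DTD subset used by the examples in main().
    static final String DOCTYPE =
          "<!DOCTYPE list ["
        + " <!ELEMENT list (student*)>"
        + " <!ELEMENT student EMPTY>"
        + " <!ATTLIST student name CDATA #REQUIRED mark CDATA #REQUIRED>"
        + "]>";

    // Hypothetical handler: records validity issues instead of stopping.
    static class CollectingHandler extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            // Validity violations arrive here; parsing goes on afterwards.
            problems.add("invalid: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // XML syntax error: the parser cannot continue
        }
    }

    // Returns the list of problems found; an empty list means "valid".
    static List<String> validate(String xml) {
        CollectingHandler handler = new CollectingHandler();
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(true);
            spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            handler.problems.add("syntax: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        System.out.println(validate(DOCTYPE + "<list><student name='Luc' mark='12'/></list>"));
        System.out.println(validate(DOCTYPE + "<list><student name='Luc'/></list>"));
        System.out.println(validate(DOCTYPE + "<list><student name='Luc' mark='12'></list>"));
    }
}
```

The first document in main() is valid, the second lacks a required attribute, and the third is not well-formed.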
- DTD-based validation by the parser is also possible with the DOM API.
- Note that the parse() method of the javax.xml.parsers.DocumentBuilder parser does raise a SAXException when it encounters an XML syntax error.
A more correct realization of the main() method of our first DOM example would be:
public static void main(String[] args) throws Exception {
    DocumentBuilder parser =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = null;
    try {
        doc = parser.parse(args[0]);
    } catch (org.xml.sax.SAXException e) {
        // XML syntax error: report it and leave doc at null
        System.out.println("Syntax error : " + e.getMessage());
    }
    if (doc != null) {
        float avr = average_1(doc);
        System.out.println("\nAverage is : " + avr);
    }
    System.out.println("\nDone");
}//main
- To endow your DOM parser with a (DTD-based) validating capacity, proceed in two steps:
- The javax.xml.parsers.DocumentBuilderFactory that generates the parser must receive the setValidating(true) command.
- The parser itself must be equipped with a handler object of type ErrorHandler.
Suppose that
- your favorite implementation of the ErrorHandler interface is class MyHandler,
- and that it provides a method showWarnings() which will raise a special ValidationException if there is any warning.
A typical sequence looks like this:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(true);
DocumentBuilder parser = dbf.newDocumentBuilder();
MyHandler handler = new MyHandler();
parser.setErrorHandler(handler);
Document doc = null;
try {
    // parsing and validation happen in the same pass
    doc = parser.parse(fileIn);
    handler.showWarnings(...);
    ...deal with your doc, knowing it is valid...
}
catch (ValidationException e) {
    System.out.println("Invalid document ..." + e.getMessage());
}
catch (SAXException e) {
    System.out.println("XML syntax error ..." + e.getMessage());
}
See a complete realization here: MyHandler, ValidationException, Average_1D, and 3 test files NM1-1.xml, NM1-2.xml, NM1-3.xml, as well as the DTD.
Important note: It is of course good practice to check for java.io.FileNotFoundException, which parse() raises as an IOException when the input file is missing.
If the DTD text happens to be incorrect, a SAXException will be raised.
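The cases listed in this section can be told apart as in the following sketch. It is not the author's MyHandler or ValidationException: the classes DomDtdCheck, FirstIssueHandler and the nested ValidationException, as well as the methods check(), checkText() and checkFile(), are hypothetical stand-ins written for the example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DomDtdCheck {

    // Hypothetical stand-in for the ValidationException mentioned above.
    static class ValidationException extends Exception {
        ValidationException(String message) { super(message); }
    }

    // Hypothetical stand-in for MyHandler: remembers the first validity issue.
    static class FirstIssueHandler implements ErrorHandler {
        private String firstIssue;

        public void warning(SAXParseException e) { remember(e); }
        public void error(SAXParseException e) { remember(e); }
        public void fatalError(SAXParseException e) throws SAXException { throw e; }

        private void remember(SAXParseException e) {
            if (firstIssue == null) firstIssue = e.getMessage();
        }

        void showWarnings() throws ValidationException {
            if (firstIssue != null) throw new ValidationException(firstIssue);
        }
    }

    // Classifies the outcome of a validating DOM parse.
    static String check(InputSource source) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder parser = dbf.newDocumentBuilder();
            FirstIssueHandler handler = new FirstIssueHandler();
            parser.setErrorHandler(handler);
            parser.parse(source);
            handler.showWarnings();
            return "valid";
        } catch (ValidationException e) {
            return "invalid";
        } catch (SAXException e) {
            return "syntax error"; // also raised when the DTD text is incorrect
        } catch (FileNotFoundException e) {
            return "file not found";
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static String checkText(String xml) {
        return check(new InputSource(new StringReader(xml)));
    }

    static String checkFile(File file) {
        return check(new InputSource(file.toURI().toString()));
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE list [<!ELEMENT list (student*)>"
                   + "<!ELEMENT student EMPTY><!ATTLIST student name CDATA #REQUIRED>]>";
        System.out.println(checkText(dtd + "<list><student name='Luc'/></list>"));
        System.out.println(checkText(dtd + "<list><student/></list>"));
        System.out.println(checkText("<!DOCTYPE list [<!ELEMENT list (student* >]><list/>"));
        System.out.println(checkFile(new File("no-such-file.xml"))); // assumed not to exist
    }
}
```

Catching FileNotFoundException before the more general IOException is what makes the missing-file case distinguishable from other input problems.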
- Java technique for schema-based validation by the parser
Schema-based validation by strategy #1 in Java is performed with the javax.xml.validation API. See here.
This section is devoted to the implementation of strategy #2.
As a note of caution, observe that the XML Schema Recommendation
explicitly states that the inclusion of schemaLocation/noNamespaceSchemaLocation
attributes is only a hint; it does not mandate that these attributes
must be used to locate schemas.
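For contrast with strategy #2, here is a minimal sketch of strategy #1 with the javax.xml.validation API, where the validator receives both the object and the reference and ignores any location hint in the document. The class name TwoArgumentValidation, the helper isValid() and the small schema held in XSD are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class TwoArgumentValidation {

    // Hypothetical schema for a list of students with integer marks.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + " <xs:element name='list'>"
        + "  <xs:complexType><xs:sequence>"
        + "   <xs:element name='student' minOccurs='0' maxOccurs='unbounded'>"
        + "    <xs:complexType>"
        + "     <xs:attribute name='name' type='xs:string' use='required'/>"
        + "     <xs:attribute name='mark' type='xs:integer' use='required'/>"
        + "    </xs:complexType>"
        + "   </xs:element>"
        + "  </xs:sequence></xs:complexType>"
        + " </xs:element>"
        + "</xs:schema>";

    // Validates the object (xml) against the reference (xsd), both given as text.
    static boolean isValid(String xml, String xsd) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // With no error handler set, the first error is thrown as a SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // invalid or ill-formed document, or incorrect schema
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<list><student name='Luc' mark='12'/></list>", XSD));
        System.out.println(isValid("<list><student name='Luc' mark='twelve'/></list>", XSD));
    }
}
```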
- Attaching an XML schema to an XML document
- Without namespaces in the document
The path (or the URL) to the schema file is indicated by an attribute "xsi:noNamespaceSchemaLocation" in the root tag, the xsi prefix being bound by the declaration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance".
Example:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<list xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="../../NamesMarks/NM1-1.xsd">
    <student name="Luc" mark="12"/>
    <student name="Hélène" mark="17"/>
    ......
</list>
- With namespaces
Please review Namespaces in XMLS for the notion of targetNamespace of an XML schema.
In principle, each of the various namespaces ns used in the document should be assigned to a schema having ns as its targetNamespace, by means of an xsi:schemaLocation attribute whose value is an alternating list of namespace URIs and paths to schema files (separated by blank spaces).
Example given in the W3C Recommendation:
<stylesheet
xmlns="http://www.w3.org/1999/XSL/Transform"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/1999/XSL/Transform
http://www.w3.org/1999/XSL/Transform.xsd
http://www.w3.org/1999/xhtml
http://www.w3.org/1999/xhtml.xsd">
However,
- I did not find any working example anywhere with more than one (namespace URI / path-to-schema-file) pair.
- Reliable tutorials such as Liquid Technologies XML Schema Tutorial, Part 4: Using XML Schema Namespaces use only the (namespace URI / path-to-schema-file) pair relative to the root tag of the document.
- Trying to use more than one pair, as suggested by the W3C Recommendation, fails.
Therefore, I suggest disregarding the official recommendation and adopting the following rule:
the xsi:schemaLocation attribute is valued with a string made up of the namespace URI of the root tag namespace followed by the path to the schema having the said namespace as its targetNamespace.
Our favorite example:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:epita="http://epita/masters/international/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation=
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#
         ../../../XSD-NS/RDF/NNrdf.xsd"
    xml:base="http://epita/masters/international/perso">
    <rdf:Description rdf:about="#Elisabeth"><epita:mark>07</epita:mark></rdf:Description>
    <rdf:Description rdf:about="#Luc"><epita:mark>12</epita:mark></rdf:Description>
    <rdf:Description rdf:about="#Maurice"><epita:mark>18</epita:mark></rdf:Description>
    <rdf:Description rdf:about="#Juliette"><epita:mark>07</epita:mark></rdf:Description>
</rdf:RDF>
- Cautionary note
The XML Schema Recommendation explicitly states that the inclusion of schemaLocation/noNamespaceSchemaLocation
attributes is only a hint; it does not mandate that these attributes
must be used to locate schemas.
This is one more indication of the evolution from strategy #2 to
strategy #1.
- Schema-based validation by the parser is also possible with the DOM API.
The technique replicates the one used for DTD-based validation, with two additions:
- The DocumentBuilderFactory that generates the parser must additionally receive the setNamespaceAware(true) command.
This is mandatory since the document uses the http://www.w3.org/2001/XMLSchema-instance namespace.
On namespace awareness, see Namespaces in DOM-Java.
- The same DocumentBuilderFactory object must also be provided with an attribute "http://java.sun.com/xml/jaxp/properties/schemaLanguage" with a value indicating that the schema language is XML Schema (as opposed to RelaxNG):
dbf.setAttribute(
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
    XMLConstants.W3C_XML_SCHEMA_NS_URI
);
With the same conventions as earlier, a typical sequence looks
like this:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
dbf.setAttribute(
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
    XMLConstants.W3C_XML_SCHEMA_NS_URI);
DocumentBuilder parser = dbf.newDocumentBuilder();
MyHandler handler = new MyHandler();
parser.setErrorHandler(handler);
Document doc = null;
try {
    // parsing and validation happen in the same pass
    doc = parser.parse(fileIn);
    handler.showWarnings(...);
    ...deal with your doc, knowing it is valid...
}
catch (ValidationException e) {
    System.out.println("Invalid document ..." + e.getMessage());
}
catch (SAXException e) {
    System.out.println("XML syntax error ..." + e.getMessage());
}
Note that the same implementation will work equally for documents with
or without namespace.
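A self-contained sketch of the sequence above, for the case without namespaces. It writes a small schema and the document into a temporary directory so that the relative xsi:noNamespaceSchemaLocation hint can be resolved against the document's location. The class name DomSchemaValidation, the helpers listDocument() and validate(), the schema text and the file names NM.xsd and NM.xml are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DomSchemaValidation {

    // Hypothetical schema: a list of students with integer marks.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='list'><xs:complexType><xs:sequence>"
        + "<xs:element name='student' minOccurs='0' maxOccurs='unbounded'>"
        + "<xs:complexType>"
        + "<xs:attribute name='name' type='xs:string' use='required'/>"
        + "<xs:attribute name='mark' type='xs:integer' use='required'/>"
        + "</xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Wraps the given elements in a root tag that points to NM.xsd.
    static String listDocument(String students) {
        return "<list xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
             + " xsi:noNamespaceSchemaLocation='NM.xsd'>" + students + "</list>";
    }

    // Returns the problems reported by the parser; an empty list means "valid".
    static List<String> validate(String xml) {
        List<String> issues = new ArrayList<>();
        try {
            // Schema and document are stored side by side (files are left behind).
            Path dir = Files.createTempDirectory("nm");
            Files.write(dir.resolve("NM.xsd"), XSD.getBytes(StandardCharsets.UTF_8));
            Path doc = Files.write(dir.resolve("NM.xml"), xml.getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setValidating(true);
            dbf.setAttribute(
                "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                XMLConstants.W3C_XML_SCHEMA_NS_URI);
            DocumentBuilder parser = dbf.newDocumentBuilder();
            parser.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { issues.add("warning: " + e.getMessage()); }
                public void error(SAXParseException e) { issues.add("invalid: " + e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            parser.parse(doc.toFile());
        } catch (SAXException e) {
            issues.add("syntax: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return issues;
    }

    public static void main(String[] args) {
        System.out.println(validate(listDocument("<student name='Luc' mark='12'/>")));
        System.out.println(validate(listDocument("<student name='Luc' mark='twelve'/>")));
    }
}
```

Parsing from a File rather than from a string matters here: a relative schema location can only be resolved when the parser knows where the document itself is located.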
See two complete examples: