Re: [jena-dev] Embedding of well-formed XML literals in SPARQL results WITHOUT tag escaping
Stu de Tejas wrote:
> Howdy folks,
Stu,
Thanks for the comment on XML literals in the SPARQL results format. How
about sending a comment to the DAWG comments list:
public-rdf-dawg-comments@...?
Discussion inline ...
>
> I have a short java patch to offer those who face a problem similar
> to something we encountered recently. We are using ARQ to perform
> SPARQL queries on a model that contains literals holding blocks of
> well-formed XML content. (I wish we didn't HAVE to do that, but
> that's another story).
Can you post an example? What do you do about namespaces (and language tags)
between the results format and the embedded XML? The results use a
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
> As recently discussed here on the list,
> Jena RDF literals may be marked as well-formed XML using the datatype
> rdf:XMLLiteral (which is also reflected in the "isWellFormedXML()"
> method of the Java Literal object). This works fine; the
> hangup comes in when we do a SPARQL query that produces results of
> this type (in the SPARQL "XML-results" format,
> http://www.w3.org/TR/rdf-sparql-XMLres/),
> and then pass those results to an XSL stylesheet.
>
> The problem is that (using ARQ 1.4) the literal XML blocks are
> tag-escaped when they are written into the SPARQL results XML
> (i.e. "<tag>" becomes "&lt;tag&gt;"). The negative impact of
> this choice is that a downstream parser which is processing these
> results (e.g. an XSL processor) will treat the included block as
> text, not as XML nodes which are available for XPath selection and
> so on. In some cases (e.g. in a web-based editor, perhaps) this
> behavior may be what you want, but in our case it is not. Of course
> it is possible to work around the problem by forcing a parse of
> the XML literal before passing it to the stylesheet, but the
> fact remains that we would much rather have the
> XML in our literals be "at the same level of escaping" as
> the XML describing the rest of the SPARQL result set. So, I wrote
> a very short patch to
>
> com.hp.hpl.jena.query.resultset.XMLOutputResultSet
>
> to change the behavior. I was pleased with how easy the change was to
> make, and that's why I decided to post it here. (It would make
> sense to me if the Jena ResultSetFormatter allowed a flag called
> "escapeWellFormedLiterals" or somesuch to be passed in.
> But I didn't implement all that, I just changed the default behavior
> for our ARQ installation.) To install the change, you need
> to either recompile ARQ, or put your patched version of
> XMLOutputResultSet.class ahead of ARQ.jar on the classpath.
>
> The actual edit required is this: replace the single line at line
> 187 reading "out.print(xml_escape..." with the following:
>
> // BEGIN patch
> // BEFORE: ARQ 1-4 version had this single line
> // out.print(xml_escape(literal.getLexicalForm())) ;
> // AFTER: We check whether the contents are legit XML, and
> // avoid escaping if they are.
>
> String literalLexicalForm = literal.getLexicalForm();
> boolean wellFormed = literal.isWellFormedXML();
> String literalOutput = (wellFormed) ? literalLexicalForm
> : xml_escape(literalLexicalForm);
> out.print(literalOutput);
> // END patch
Firstly, note ARQ can do this very easily only because it does not use an XML
writer based on DOM or SAX or some such. As the SPARQL results format is so
simple, I just wrote the raw XML out (same for JSON) which guarantees streaming.
Reading results is based on StAX (or SAX). The ARQ result set reader will not
work on this result set - it takes the text from the <binding> as the literal
lexical form.
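The workaround mentioned above, re-parsing the escaped literal before handing the results to a stylesheet, might look roughly like the sketch below. It uses only the standard JAXP classes; the class and method names (ReparseLiteral, expand, demo) are made up for illustration and are not part of ARQ. It also assumes the literal holds a single root element, which an rdf:XMLLiteral is not required to do.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ReparseLiteral {
    // Parse a string into a namespace-aware DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: replace the escaped text inside a <literal>
    // element with the element tree obtained by parsing that text.
    static void expand(Element literal) throws Exception {
        // getTextContent() returns the text after entity replacement,
        // i.e. the lexical form of the literal.
        Document fragment = parse(literal.getTextContent());
        Node imported = literal.getOwnerDocument()
                .importNode(fragment.getDocumentElement(), true);
        while (literal.hasChildNodes()) {
            literal.removeChild(literal.getFirstChild());
        }
        literal.appendChild(imported);
    }

    // Trimmed-down stand-in for a results document: one escaped literal.
    static String demo() throws Exception {
        Document doc = parse("<literal>&lt;tag&gt;hello&lt;/tag&gt;</literal>");
        expand(doc.getDocumentElement());
        // After expansion the embedded element is reachable by XPath.
        return XPathFactory.newInstance().newXPath().evaluate("/literal/tag", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

This keeps the results document itself conformant (the literal is still text on the wire) and moves the re-parse to the consumer that actually wants nodes.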
> Do others agree that "unescaped well-formed XML literals" should be a
> legitimate output mode for ARQ?
The effect you want is like rdf:parseType="Literal" in RDF/XML. This is very
complicated for the reader in the general case of XML namespaces and language
tags.
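To make the namespace concern concrete, here is a small sketch using the standard JAXP DOM parser. The document is cut down to a bare <sparql><literal> pair rather than a full results document, and the class and method names are invented for the example. It suggests that an unescaped element with no namespace of its own picks up the results namespace from the enclosing <sparql> element unless the writer explicitly undeclares it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespaceLeak {
    static final String RESULTS_NS = "http://www.w3.org/2005/sparql-results#";

    // Embed the given XML unescaped inside a <literal> and report the
    // namespace URI the parser assigns to the embedded root element.
    static String namespaceOfEmbedded(String embeddedXml) throws Exception {
        String xml = "<sparql xmlns=\"" + RESULTS_NS + "\"><literal>"
                + embeddedXml + "</literal></sparql>";
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element literal = (Element) doc.getDocumentElement().getFirstChild();
        Element embedded = (Element) literal.getFirstChild();
        return embedded.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        // The embedded element inherits the default namespace of the results.
        System.out.println(namespaceOfEmbedded("<tag>x</tag>"));
        // Undeclaring the default namespace puts it back in no namespace.
        System.out.println(namespaceOfEmbedded("<tag xmlns=\"\">x</tag>"));
    }
}
```

A writer that emits literals unescaped would therefore have to rewrite namespace declarations (and handle xml:lang similarly) for the reader to recover the original literal.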
> Two very squishy datapoints:
>
> 1) My perusal of http://www.w3.org/TR/rdf-sparql-XMLres/
> leaves me with the impression that the spec is open on this point.
It says:
"""
RDF Typed Literal S with datatype URI D
<binding><literal datatype="D">S</literal></binding>
"""
S is the lexical form of the literal, and in the XML output it must be the
lexical form, not some XML that will turn into the lexical form.
With a plain string, any < needs to be turned into an entity to hide it from
the XML parser.
Hence the escaping: writing &lt; (escaping > is not necessary, but I prefer to)
puts uninterpreted characters into the output that, after entity replacement,
restore the characters of the lexical form in the literal on reading.
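A minimal sketch of that round trip, using the standard JAXP DOM parser: the xmlEscape method below is a hypothetical stand-in written for this example, not ARQ's actual xml_escape, and the bare <literal> wrapper stands in for a full results document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EscapeRoundTrip {
    // Hypothetical stand-in for the writer's escaping step.
    // The ampersand must be replaced first so that the entities
    // introduced for < and > are not escaped a second time.
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Write the lexical form as escaped text, parse the result, and
    // return the text content the reader sees inside <literal>.
    static String readLiteralText(String lexicalForm) throws Exception {
        String xml = "<literal>" + xmlEscape(lexicalForm) + "</literal>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String lexical = "<tag>a &amp; b</tag>";
        // Compare what the reader recovers with the original lexical form.
        System.out.println(lexical.equals(readLiteralText(lexical)));
    }
}
```

The point of the comparison in main is that the reader gets back exactly the characters of the lexical form, markup characters included, without interpreting them as elements.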
The XML schema for the results format:
<xs:element name="literal">
  <xs:complexType mixed="true">
    <xs:attribute name="datatype" type="res:URI-reference"/>
    <xs:attribute ref="xml:lang"/>
  </xs:complexType>
</xs:element>
It's a complexType to allow the attributes. There is no child <xs:element>, and
XML Schema content models are closed.
The RelaxNG is:
literal = element res:literal {
    datatypeAttr?, xmlLang?,
    text
}
It has a text body. (The RelaxNG was used to create the XML schema :-)
A design goal for the format was to support XML schema-driven processing. One
of the criticisms of RDF/XML is the lack of an XML Schema for it. Allowing
arbitrary XML content is part of the reason for that.
My reading is that it is not open (but I know the design criteria as well so
it might bias my reading :-). It could be clearer in the TR - so if you could
send a comment to
public-rdf-dawg-comments@... that would be great.
ARQ will follow whatever is decided by DAWG.
> 2) I believe that in Sesame currently you CAN
> switch back and forth between these behaviors (But...uh...now I
> can't find the discussion page that made me think this).
The only control I could find is SPARQLResultsWriter.setPrettyPrint. If you
do find that discussion, could you forward it?
>
> Stu Baurmann
Andy