metaphorical · The Metaphorical Web
SAXON XSLT2, non-XML input, event handlers
Message #419 of 439

As stated in the description of this Yahoo
Group, this site is about XML technologies, among
a number of other things. One XML technology
that is powerful and increasingly popular is XSLT:
Extensible Stylesheet Language Transformations.
Many of you are likely familiar with CSS, used for
styling web sites among other things, but may be
less familiar with XSL.

There are a number of processors that handle
XSLT, and there are two versions of the language,
1.0 and 2.0. In this article we'll look at
SAXON XSLT2.


I talked with Michael Kay at the GeoWeb 2008
conference about some ideas I had for using XSLT2
via his SAXON processor. I told him about an
error-correcting XML parser I had written back in
1999, and verified with him that SAXON can be
extended by writing Java code.


SAXON implements the XSLT2 spec, including the
unparsed-text( ) and tokenize( ) functions.
The unparsed-text( ) function returns the content of
an external file in the form of a string. The purpose
of this function is to allow users to read
non-XML material. We will visit an example of
this need in the article on inputting material into
BIM (Building Information Modelling) which is
neither BIM-schema conformant nor in XML
format. Most BIM data, in practice, appears
to be entered by way of the CAD / CAM
architecture programs in use rather than by
table / data form entry. We will see how
OWL and RDF(S) may be used in conjunction with 3D
architectural drawings in the article on that topic.


Included with this article is an XSLT program
taken from the XSLT2 documentation which uses
both unparsed-text( ) and tokenize( ). The input
is a simple comma-separated values (CSV) file
containing mailing addresses. CSV files are not
XML files and ordinarily an XML system cannot
read them, which is why unparsed-text( ) is of
interest. XSLT2 also has the document( ) function,
which is analogous to unparsed-text( ) but
is designed to input material that is treated as
XML rather than text.
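
To make the read-then-tokenize idea concrete, here is a
rough Java analogue (not the XSLT program itself). It is
only a sketch: the class name CsvSketch, the methods
readAsText and parse, and the two sample address lines
are all made up for illustration, and it assumes a plain
CSV file with no quoted fields or embedded commas.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the names here are hypothetical.
final class CsvSketch {

    // Rough counterpart of unparsed-text( ): the whole file as one string.
    static String readAsText(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Rough counterpart of two tokenize( ) passes:
    // first split into lines, then split each line on commas.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            List<String> fields = new ArrayList<>();
            // Assumption: no quoted fields, so every comma is a separator.
            for (String field : line.split("\\s*,\\s*", -1)) {
                fields.add(field.trim());
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the address file.
        String text = "Jane Doe, 12 Main St, Springfield, IL, 62701\n"
                    + "John Roe, 9 Elm Ave, Portland, OR, 97201\n";
        for (List<String> record : parse(text)) {
            System.out.println(record);
        }
    }
}
```

Note that nothing here checks whether a line has the
expected number of fields; like the XSLT example, it
trusts the input.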


The error-correcting XML parser that I wrote
a decade ago used Java to input character (text)
information using the SAX method. By using
call-backs / handlers it was possible to micro-manage
the parsing of input material. It was already
known that a large percentage (60+%) of the input
documents (corporate Securities and Exchange
Commission documents) were 'broken'
('non-conformant') in one or more ways, and
that simple-minded, straight-ahead XML parsing simply
returned a failure message soon after touching said
documents. The simple-minded systems
basically threw up their hands and quit upon the
first error of any kind. This is not useful or
acceptable in a business environment
where the SEC documents MUST be read no matter
how atrociously they are coded / formatted.


I took a trick from the C language, where you
can input one character at a time from the input
source and, if for some reason you don't like
the character you read, you can 'put
it back' into the input, effectively 'unreading' it.
Then you can read it again with a new
strategy. While the SEC documents were
supposed to be couched in angle-bracketed tags and
be 'valid documents' (according to the official
schema), often bits and pieces, such as a closing angle
bracket, were missing, causing a standard XML
system to fall over a cliff. What I did was to provide
an input micro-manager which drove the reading.
It was coded such that it 'knew the rules' of
what was supposed to be there in the input stream
and how to look for it (conditional testing).
Since it knew what was 'correct' and therefore
'supposed to be there', it could suggest or make
corrections to bad or missing material, like those
angle brackets and misplaced or mis-typed data.
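
The 'put it back' trick can be sketched in Java with the
standard java.io.PushbackReader, whose unread( ) method
plays the role of C's ungetc( ). This is NOT the 1999
parser, just a minimal illustration of the idea under a
toy rule set I am assuming here: tags carry no
attributes, so any character inside a tag that is not a
name character or '>' means the closing bracket was
dropped. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative sketch only; not the original error-correcting parser.
final class TagRepairSketch {

    // Assumed toy rule: tag names are letters, digits, '/', '_' or '-'.
    static boolean isNameChar(int c) {
        return Character.isLetterOrDigit(c) || c == '/' || c == '_' || c == '-';
    }

    // Copies the input, inserting any missing closing angle brackets.
    static String repair(String input) {
        StringBuilder out = new StringBuilder();
        try (PushbackReader in = new PushbackReader(new StringReader(input), 1)) {
            boolean inTag = false;
            int c;
            while ((c = in.read()) != -1) {
                if (!inTag) {
                    out.append((char) c);
                    inTag = (c == '<');      // a '<' opens a tag
                } else if (c == '>') {
                    out.append('>');
                    inTag = false;           // tag closed properly
                } else if (isNameChar(c)) {
                    out.append((char) c);    // still inside the tag name
                } else {
                    // Unexpected character: assume '>' is missing.
                    // Put the character back, supply the bracket, and
                    // re-read the character with the 'text' strategy.
                    in.unread(c);
                    out.append('>');
                    inTag = false;
                }
            }
            if (inTag) {
                out.append('>');             // input ended inside a tag
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up broken input: two closing brackets are missing.
        System.out.println(repair("<company>ACME</company<filed>2009</filed"));
    }
}
```

A real repairer would need far more rules than this
(attributes, mismatched tag names, stray characters),
but the read / unread / re-read loop is the core of it.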


SAXON allows one to write one's own handlers in
Java and have them operate with SAXON.
We will visit an example where a Java SAXON
extension is used to assist in the parsing of natural
language non-XML input. Looking at the included
code example, we see that it appears to have been
written solely to address the file that is input in the
example. I am not faulting the program; I'm
simply pointing out that the program handles
exactly the input file, and also that the input file is
not flawed or broken. The program is written by
a human who knows XSLT2 and the nature and
details of the input file, and hence can tune the
code he / she creates to handle it exactly,
without any extra code to handle errors
and contingencies. The point is simply that the
code is brief, purpose-built and written
with complete foreknowledge of the input file.
These points are discussed further in the
articles on aspect-oriented programming and on
metaprogramming, where programs write / generate
other programs (instead of being written by
human programmers).


The example program is designed to output
HTML (as opposed to XML or text). It produces a
table and uses xsl:for-each to cycle through each
'line' of input. Such lines are akin to 'records'
in legacy systems. Each line is tokenized on
the comma character, given explicitly. This code
is written to handle a COMMA-separated
values file. Blanks, hyphens, periods, semicolons,
colons, slashes and tabs are out. Purpose-built code.
Following the tokenization is a series of calls to
subsequence( ), which returns part of an input
sequence, where the start position and length of
the sub-sequence are given as integer arguments.
(Strictly, the arguments are of type xs:double.) The
substring( ) function exists to do the same kind
of thing with the string data type. We notice that
the subsequence( ) calls
'conveniently' have just the right hard-coded arguments
to exactly and correctly handle
the comma-separated address input.
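
For readers who have not met subsequence( ), here is a
small Java sketch of how I understand its three-argument
form to behave: positions count from 1, and both numeric
arguments are rounded before use. The class and method
names are made up, the sample tokens are invented, and
the sketch assumes finite arguments (NaN simply yields
an empty result).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; a Java stand-in for subsequence( ).
final class SubsequenceSketch {

    // Returns the items at 1-based positions p where
    // round(start) <= p < round(start) + round(length).
    static <T> List<T> subsequence(List<T> seq, double start, double length) {
        List<T> result = new ArrayList<>();
        if (Double.isNaN(start) || Double.isNaN(length)) {
            return result; // nothing selected
        }
        long first = Math.round(start);
        long end = first + Math.round(length); // exclusive upper bound
        for (int p = 1; p <= seq.size(); p++) {
            if (p >= first && p < end) {
                result.add(seq.get(p - 1)); // Java lists are 0-based
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up tokens from one address line.
        List<String> tokens = Arrays.asList(
                "Jane Doe", "12 Main St", "Springfield", "IL", "62701");
        System.out.println(subsequence(tokens, 1, 1)); // the name
        System.out.println(subsequence(tokens, 2, 3)); // street, city, state
    }
}
```

The hard-coded 1, 2 and 3 in main are exactly the kind of
'convenient' arguments discussed above: they only work
because the number and order of fields is known ahead of
time.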


What would happen if the input was different
from what the program was written to handle? What if
some of the commas were slashes or
semicolons? What would happen if the postal code
was from Canada or England, where there are
letters in the postal code as well as digits? What would
the XSLT2 code look like to handle these
non-simple, perhaps unexpected vagaries of input?
These questions are important in the situation
where a BIM worker needs to input a customer's
building data file but the client provides a file
which is partly or even completely
non-conformant with established BIM schemas.
He can't tell his customers that their data
must only look like so and so.
The guy would be selling pencils on the
sidewalk pretty quick if he did
that in the (real) business world.
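
As a first step toward more forgiving input handling,
here is a Java sketch that accepts several delimiters
and recognises more than one postal code shape. It is
only one possible approach, not what the XSLT2 example
does. The class name, method names, the choice of
delimiters (comma, semicolon, slash, tab) and the sample
line are my own assumptions, and the patterns cover only
US ZIP codes and the Canadian letter-digit-letter form;
British postcodes are deliberately left as 'UNKNOWN'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; names and rules are hypothetical.
final class LenientAddressSketch {

    // Assumed delimiter set: comma, semicolon, slash or tab,
    // with optional surrounding white space.
    private static final Pattern DELIMS = Pattern.compile("\\s*[,;/\\t]\\s*");

    // US ZIP (12345 or 12345-6789) and Canadian (A1A 1A1) shapes.
    private static final Pattern US_ZIP = Pattern.compile("\\d{5}(-\\d{4})?");
    private static final Pattern CA_POSTAL =
            Pattern.compile("[A-Za-z]\\d[A-Za-z] ?\\d[A-Za-z]\\d");

    // Split one input line on any of the accepted delimiters.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : DELIMS.split(line.trim())) {
            tokens.add(token.trim());
        }
        return tokens;
    }

    // Report which postal code shape a token has, if any.
    static String classifyPostalCode(String token) {
        String t = token.trim();
        if (US_ZIP.matcher(t).matches()) {
            return "US";
        }
        if (CA_POSTAL.matcher(t).matches()) {
            return "CA";
        }
        return "UNKNOWN"; // flag for review rather than fail
    }

    public static void main(String[] args) {
        // Made-up line mixing three different delimiters.
        List<String> tokens =
                tokenize("Jane Doe; 12 Main St / Toronto, ON; M5V 2T6");
        System.out.println(tokens);
        System.out.println(classifyPostalCode(tokens.get(tokens.size() - 1)));
    }
}
```

Even this small step shows the cost: a slash inside a
legitimate field would now be split, so each extra
freedom granted to the input needs its own rule for
telling data from delimiter.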




Mon May 11, 2009 9:27 am

david_dodds_2001