OK, of the two missing Stax 1.0 features, one is now fully implemented.
The next release (0.9.4) will contain a fully functioning
namespace-repairing mode for XMLStreamWriter; it passes all staxtest
and stax2test test cases.
The next immediate task will be implementing the last major feature,
coalescing mode. After this, 1.0 could be finalized.
However, I think it would also make sense to add one more task for
1.0: formalize the API used to feed input to the non-blocking variant
of XMLStreamReader. Unlike blocking readers, which can take an
InputStream or Reader (or references from which these can be created),
the non-blocking reader will not read any of its input itself. Rather,
the calling app has to feed it a new chunk of content once the parser
is done with the current chunk.
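
To make the "feeding" contract concrete, here is one possible shape for
it, purely as a sketch. The names AsyncInputFeeder, needMoreInput,
feedInput and endOfInput are hypothetical: they are not part of Stax 1.0
or of the prototype. TrackingFeeder is a toy that only tracks chunk
state and exists just to show the intended call order.

```java
/** Hypothetical feeding contract; none of these names exist in Stax 1.0. */
interface AsyncInputFeeder {
    /** True when the current chunk is fully consumed and another can be fed. */
    boolean needMoreInput();

    /** Hands over a new chunk; caller must not modify it until it is consumed. */
    void feedInput(byte[] buf, int offset, int len);

    /** Signals that no more input will arrive. */
    void endOfInput();
}

/** Toy implementation (an assumption for illustration) that only tracks chunk state. */
final class TrackingFeeder implements AsyncInputFeeder {
    private byte[] chunk = new byte[0];
    private int pos;
    private int end;
    private boolean closed;

    @Override
    public boolean needMoreInput() {
        return pos >= end && !closed;
    }

    @Override
    public void feedInput(byte[] buf, int offset, int len) {
        if (closed) {
            throw new IllegalStateException("input already ended");
        }
        if (pos < end) {
            throw new IllegalStateException("previous chunk not yet consumed");
        }
        chunk = buf;
        pos = offset;
        end = offset + len;
    }

    @Override
    public void endOfInput() {
        closed = true;
    }

    /** Returns the next unsigned byte, or -1 if the current chunk is exhausted. */
    int nextByte() {
        return (pos < end) ? (chunk[pos++] & 0xFF) : -1;
    }
}
```

The point of needMoreInput() is that the caller never has to guess
whether the parser still holds unconsumed bytes before reusing a buffer.
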
Currently the non-blocking parser prototype works as follows:
---
InputStream in = new FileInputStream(file); // just to generate the input, usually would be NIO-based
final byte[] buf = new byte[3000];
ReaderConfig cfg = new ReaderConfig();
cfg.setActualEncoding("UTF-8"); // no encoding auto-detect yet (will be added)
// will need a factory, can't use XMLInputFactory as is
AsyncUtfScanner asc = new AsyncUtfScanner(cfg);
StreamReaderImpl sr = new StreamReaderImpl(asc);
main_loop: // label needed so the error case can leave both loops
while (true) {
    int type;
    // We will feed chunked input 3 bytes at a time, for test/demo purposes
    // (even one byte would work)
    while ((type = sr.next()) == AsyncByteScanner.EVENT_INCOMPLETE) {
        int len = in.read(buf, 1, 3); // offset 1 is arbitrary
        if (len < 0) { // shouldn't happen in the middle of a partial token
            System.err.println("Error: Unexpected EOF");
            break main_loop;
        }
        asc.addInput(buf, 1, len);
    }
    if (type == XMLStreamConstants.END_DOCUMENT) { // to trigger this, caller must signal actual end of input
        break;
    }
    // otherwise, handle the token; all data is available without blocking
}
---
which clearly is not ready for production use: it is a rat's-nest kind
of box with wires sticking out. :-)
But the basic idea is simple: the caller needs to handle the
EVENT_INCOMPLETE return value, feed more data, and indicate end of input
when appropriate (which may throw an exception, for example if input
ends in the middle of a token), but otherwise work normally.
Once any event other than EVENT_INCOMPLETE is returned, all data
associated with it will be available without blocking.
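
The same contract can be shown with a self-contained toy, which may be
easier to play with than the prototype. Everything here is made up for
illustration: ToyAsyncScanner treats the input as nothing but "<...>"
tokens, and its endOfInput() method is an assumed name for "signal
actual end of input", not the prototype's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a non-blocking scanner: input is assumed to be "<...>" tokens only. */
final class ToyAsyncScanner {
    static final int EVENT_INCOMPLETE = 0;
    static final int TOKEN = 1;
    static final int END = 2;

    private final StringBuilder pending = new StringBuilder();
    private boolean endOfInput;
    private String text;

    void addInput(byte[] buf, int offset, int len) {
        // ASCII only; a real scanner must also cope with split multi-byte characters
        pending.append(new String(buf, offset, len, StandardCharsets.US_ASCII));
    }

    /** Assumed name for the "no more input" signal. */
    void endOfInput() {
        endOfInput = true;
    }

    /** Never blocks: returns a complete token, EVENT_INCOMPLETE, or END. */
    int next() {
        int close = pending.indexOf(">");
        if (close >= 0) {
            text = pending.substring(0, close + 1);
            pending.delete(0, close + 1);
            return TOKEN;
        }
        if (!endOfInput) {
            return EVENT_INCOMPLETE;
        }
        if (pending.length() > 0) {
            throw new IllegalStateException("Unexpected EOF inside token");
        }
        return END;
    }

    String getText() {
        return text;
    }
}

final class FeedLoopDemo {
    /** Feeds doc in chunks of chunkSize bytes and returns the tokens seen. */
    static List<String> parse(byte[] doc, int chunkSize) {
        ToyAsyncScanner sc = new ToyAsyncScanner();
        List<String> tokens = new ArrayList<>();
        int offset = 0;
        while (true) {
            int type;
            while ((type = sc.next()) == ToyAsyncScanner.EVENT_INCOMPLETE) {
                if (offset >= doc.length) {
                    sc.endOfInput(); // nothing left: next() now returns END or throws
                } else {
                    int len = Math.min(chunkSize, doc.length - offset);
                    sc.addInput(doc, offset, len);
                    offset += len;
                }
            }
            if (type == ToyAsyncScanner.END) {
                return tokens;
            }
            tokens.add(sc.getText()); // complete token, available without blocking
        }
    }
}
```

Calling FeedLoopDemo.parse() with chunk size 1 or 3 yields the same
tokens; only the number of EVENT_INCOMPLETE round trips changes.
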
Memory usage will be bounded by the amount of memory needed for a single
event (plus some state for nesting). Specifically, the length of
individual text segments will be limited to the chunk size that the
application gives. That is, CHARACTERS/CDATA is returned as soon as at
least one character has been decoded (and contains at most the contents
of the whole chunk passed).
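
One consequence is that an application wanting the full text of an
element has to accumulate segments itself, much as it already must with
a non-coalescing blocking reader. A minimal sketch using only the
standard blocking Stax API follows; TextCollector and its method names
are illustrative. With the non-blocking reader, the loop would
presumably also see EVENT_INCOMPLETE between segments and feed more
input at that point.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class TextCollector {
    /**
     * Expects the reader to point at START_ELEMENT; appends every following
     * CHARACTERS/CDATA segment until some other event type shows up.
     */
    static String collectText(XMLStreamReader sr) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        while (sr.hasNext()) {
            int type = sr.next();
            if (type != XMLStreamConstants.CHARACTERS && type != XMLStreamConstants.CDATA) {
                break;
            }
            // Each segment may be only part of the element's text
            sb.append(sr.getTextCharacters(), sr.getTextStart(), sr.getTextLength());
        }
        return sb.toString();
    }

    /** Convenience wrapper: text directly following the first start tag in xml. */
    static String firstElementText(String xml) {
        try {
            XMLStreamReader sr =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (sr.next() != XMLStreamConstants.START_ELEMENT) {
                    // skip anything before the root element
                }
                return collectText(sr);
            } finally {
                sr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```
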
Using such a non-blocking parser, it should be quite easy to build a
single-threaded (or N-threaded, for N cores/CPUs) server for handling
xml input; one that would perform nicely and could apply elaborate
throttling if need be.
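
As a rough illustration of why one thread is enough: each connection
only needs its own small parser state, and the thread can hand each of
them one chunk per turn. The sketch below is a toy under heavy
assumptions. ConnectionState counts "<...>" tokens instead of parsing
xml, byte arrays stand in for sockets, and a real server would use NIO
selectors rather than a round-robin loop.

```java
import java.util.List;

/** Per-connection state; toy logic that just counts complete "<...>" tokens. */
final class ConnectionState {
    final byte[] source;   // stands in for bytes arriving on a socket
    int received;          // how much of source has "arrived" so far
    boolean insideTag;     // parser state carried across chunks
    int tokens;

    ConnectionState(byte[] source) {
        this.source = source;
    }

    /** Consumes one chunk without blocking; state survives between calls. */
    void feed(byte[] buf, int offset, int len) {
        for (int i = offset, end = offset + len; i < end; ++i) {
            if (buf[i] == '<') {
                insideTag = true;
            } else if (buf[i] == '>' && insideTag) {
                insideTag = false;
                ++tokens;
            }
        }
    }

    boolean done() {
        return received >= source.length;
    }
}

final class SingleThreadedLoop {
    /**
     * Round-robins over all connections on the calling thread, giving each
     * at most chunkSize bytes per turn (a crude form of throttling).
     */
    static int[] serve(List<ConnectionState> conns, int chunkSize) {
        boolean progress = true;
        while (progress) {
            progress = false;
            for (ConnectionState c : conns) {
                if (c.done()) {
                    continue;
                }
                int len = Math.min(chunkSize, c.source.length - c.received);
                c.feed(c.source, c.received, len);
                c.received += len;
                progress = true;
            }
        }
        int[] counts = new int[conns.size()];
        for (int i = 0; i < counts.length; ++i) {
            counts[i] = conns.get(i).tokens;
        }
        return counts;
    }
}
```
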
One more thing that would be good to investigate is how easy it would
be to implement the SAX API on top of the non-blocking stream reader.
That should not be very hard -- the blocking stream reader can already
be used as a SAX parser via JAXP (or directly).
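
For reference, the core of a pull-to-push adapter over the standard
blocking API is small. This is a minimal sketch, not the existing
implementation: StaxToSax and trace are illustrative names, and prefix
mappings, comments, processing instructions and DTD events are simply
ignored. A non-blocking version would additionally have to return to
the caller on EVENT_INCOMPLETE instead of looping.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class StaxToSax {
    /** Pulls events from the reader and pushes them to the handler as SAX callbacks. */
    static void drive(XMLStreamReader sr, ContentHandler h)
            throws XMLStreamException, SAXException {
        h.startDocument();
        while (sr.hasNext()) {
            switch (sr.next()) {
            case XMLStreamConstants.START_ELEMENT: {
                AttributesImpl attrs = new AttributesImpl();
                for (int i = 0; i < sr.getAttributeCount(); ++i) {
                    String local = sr.getAttributeLocalName(i);
                    attrs.addAttribute(orEmpty(sr.getAttributeNamespace(i)), local,
                            qName(sr.getAttributePrefix(i), local),
                            "CDATA", sr.getAttributeValue(i));
                }
                h.startElement(orEmpty(sr.getNamespaceURI()), sr.getLocalName(),
                        qName(sr.getPrefix(), sr.getLocalName()), attrs);
                break;
            }
            case XMLStreamConstants.END_ELEMENT:
                h.endElement(orEmpty(sr.getNamespaceURI()), sr.getLocalName(),
                        qName(sr.getPrefix(), sr.getLocalName()));
                break;
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.CDATA:
            case XMLStreamConstants.SPACE:
                h.characters(sr.getTextCharacters(), sr.getTextStart(), sr.getTextLength());
                break;
            default:
                break; // comments, PIs, DTD etc. are ignored in this sketch
            }
        }
        h.endDocument();
    }

    // Stax may report "no namespace/prefix" as null; SAX expects ""
    private static String orEmpty(String s) {
        return (s == null) ? "" : s;
    }

    private static String qName(String prefix, String local) {
        return (prefix == null || prefix.isEmpty()) ? local : prefix + ":" + local;
    }

    /** Demo helper: replays the SAX callbacks for xml as a compact string. */
    static String trace(String xml) {
        final StringBuilder log = new StringBuilder();
        DefaultHandler h = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qn, Attributes a) {
                log.append('<').append(qn);
                for (int i = 0; i < a.getLength(); ++i) {
                    log.append(' ').append(a.getQName(i)).append('=').append(a.getValue(i));
                }
                log.append('>');
            }

            @Override
            public void endElement(String uri, String local, String qn) {
                log.append("</").append(qn).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                log.append(ch, start, len);
            }
        };
        try {
            XMLStreamReader sr =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            drive(sr, h);
            sr.close();
        } catch (XMLStreamException | SAXException e) {
            throw new IllegalArgumentException(e);
        }
        return log.toString();
    }
}
```
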
Thoughts, comments, suggestions?
-+ Tatu +-