There are several parts to this answer.
1. Well-formed XML
> > [Fatal Error] crsto12.txt:414:54: The entity name must immediately
> > follow the '&' in the entity reference.
> > Exception processing
> > file=data\raw\crsto12.txtorg.xml.sax.SAXParseException: The
> > entity name must immediately follow the '&' in the entity
> > reference.
> >
> > Basically, it is trying to parse 'Morrel & Son'
You need to replace instances of
'&', '<', '>' or '"'
with the entity references
"&amp;", "&lt;", "&gt;" and "&quot;"
respectively.
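If you'd rather script the replacement than do it by hand, here is a
minimal sketch in Python using only the standard library. It is not
part of LingPipe; the helper names to_xml_text and is_well_formed and
the <doc> wrapper element are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_xml_text(raw):
    """Escape raw text so it can sit inside an XML element (hypothetical helper)."""
    # escape() replaces &, < and > by default; the extra mapping
    # adds the double quote.
    return escape(raw, {'"': "&quot;"})


def is_well_formed(xml):
    """Return True if the string parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


raw = "Morrel & Son"
# The <doc> wrapper is only here to make a complete document.
print(is_well_formed("<doc>" + raw + "</doc>"))
print(is_well_formed("<doc>" + to_xml_text(raw) + "</doc>"))
print(to_xml_text(raw))
```

The first check should fail on the bare '&' and the second should
pass once the text has been escaped.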
2. Plain old text
To process "The Count of Monte Cristo", you could have just
used plain old text, depending on the command you're
using:
-contentType=text/plain
You can also declare the character set here, though
that didn't seem to be in the doc (I just added it for
the next release):
-contentType=text/plain charset=UTF-16
Depending on your invocation setup, you might need
to use quotes to make this a single argument. Ant
does this automatically, but most shells don't.
3. Genre Matching and Performance
Now, having said all of that, "The Count of Monte
Cristo" is not the kind of text for which LingPipe
was developed. The accuracy of named-entity annotation
depends directly on how closely the text being
processed matches the training text. So
it should do well on news of the sort found on the
AP newswire, in the NY Times, etc.
LingPipe can be trained to find other kinds of
entities in other genres. For instance, we've used it
to find gene, protein and species names
in biomedical texts.
- Bob
Bob Carpenter carp@...
181 North 11th Street, #401, Brooklyn, NY 11211
Vox: (718) 290 9170 Fax: (719) 290 9171