LingPipe
Message #62 of 796
XML Entity Problem, contentType and Genre Matching [was Re: Problem]

There are several parts to this answer.

1. Well-formed XML

> > [Fatal Error] crsto12.txt:414:54: The entity name must immediately
> > follow the '&' in the entity reference.
> > Exception processing
> > file=data\raw\crsto12.txtorg.xml.sax.SAXParseException: The
> > entity name must immediately follow the '&' in the entity
> > reference.
> >
> > Basically, it is trying to parse 'Morrel & Son'

You need to replace instances of

'&', '<', '>' or '"'

with entity references

"&amp;", "&lt;", "&gt;" and "&quot;"

respectively.
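
As a rough illustration of that replacement (a sketch in Python rather
than anything LingPipe itself provides), the standard library's
xml.sax.saxutils.escape handles '&', '<' and '>', and an extra mapping
covers '"'. The helper name escape_text and the <doc> wrapper element
are made up for this example. Note that this should only be applied to
the text content, not to the tags you inserted, or the tags will be
escaped too.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_text(text):
    """Escape &, <, > and " so the text is safe inside XML content."""
    # escape() replaces &, < and > on its own; the mapping adds the
    # double quote, which matters mainly inside attribute values.
    return escape(text, {'"': "&quot;"})


raw = "Morrel & Son"
escaped = escape_text(raw)
print(escaped)

# Wrap the escaped text in a hypothetical <doc> element and parse it
# back to check that the result is well-formed XML.
doc = ET.fromstring("<doc>" + escaped + "</doc>")
print(doc.text)
```

Parsing the unescaped string inside the same wrapper would fail with a
parse error much like the one quoted above.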


2. Plain old text

For processing "The Count of Monte Cristo", you could have
used plain text instead of XML, depending on the command
you're using:

-contentType=text/plain

You can also declare the character set here, though
that didn't seem to be in the doc (I just added it for
the next release):

-contentType=text/plain charset=UTF-16

Depending on your invocation setup, you might need
to use quotes to make this a single argument. Ant
does this automatically, but most shells don't.
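
To see why the quotes matter, here is a small Python sketch that uses
the standard shlex module to split a command line the way a POSIX
shell would. The command name some-command and the file input.txt are
placeholders, not real LingPipe invocations; only the -contentType
argument comes from the text above.

```python
import shlex

# Placeholder command lines; only the -contentType value is from the post.
unquoted = "some-command -contentType=text/plain charset=UTF-16 input.txt"
quoted = 'some-command "-contentType=text/plain charset=UTF-16" input.txt'


def split_args(command_line):
    """Return the arguments the program would receive, minus its name."""
    return shlex.split(command_line)[1:]


# Without quotes the content type is broken into two separate arguments.
print(split_args(unquoted))

# With quotes it arrives as one argument, space included.
print(split_args(quoted))
```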


3. Genre Matching and Performance

Now, having said all of that, "The Count of Monte
Cristo" is not the kind of text for which LingPipe
was developed. The accuracy of named-entity annotation
depends directly on how closely the text being
processed matches the training text. So LingPipe
should do well on news of the sort found on the
AP newswire, in the NY Times, etc.

LingPipe can be trained to find other kinds of
entities in other genres. For instance, we've used it
to find gene, protein and species names
in biomedical texts.

- Bob

Bob Carpenter carp@...
181 North 11th Street, #401, Brooklyn, NY 11211
Vox: (718) 290 9170 Fax: (719) 290 9171





Tue Jan 18, 2005 5:53 pm


Hi, I downloaded 'The Count of Monte Cristo' from www.gutenberg.org. I inserted appropriate tags and used lingpipe process it. Here is the error i got. [Fatal...
gargnavendu
Mar 30, 2004 1:48 am

... Navendu, The input format has to be in valid xml. You need to get an xml validator--generally findable on the web for free and use that. To check your...
reckb
Apr 3, 2004 2:27 pm

Hi, Try escaping all XML types of characters coming in since it appears LingPipe is trying to treat your document as XML. The way to do this is to convert all...
Kepler (harpman62)
Jan 18, 2005 4:57 am