Re: Training LingPipe [w. new corpus format]
Message #68 of 796

> Also, I'm guessing that there are about 30,000
> entities in these documents...Is that enough to use for retraining?

Yes, that should be more than enough. Are you
tagging different kinds of entities? The more entity
types there are, the more training data you need to
discriminate between them.

> What I need to know is how to retrain LingPipe using the documents
> that we've tagged already.

There are two ways to go, and the first is probably
the easiest and can be done in any programming language
or even by hand.

1. Convert your data into a format
LingPipe understands, such as the CoNLL
line-based text format or the MUC XML format.

I'd strongly recommend XML because it
deals with character set and well-formedness
issues for you and you don't have to worry
about tokenization. All of the other tools
in LingPipe can work with XML data.

All that's required for the MUC XML format is
that you produce well-formed XML overall
and mark entities within text content as follows:

The <ENAMEX TYPE="ORGANIZATION">USAir</ENAMEX> flight attendant in the rear
of the plane making a short flight to
<ENAMEX TYPE="LOCATION">Charlotte</ENAMEX>,
<ENAMEX TYPE="LOCATION">N.C.</ENAMEX>, kept peeking around the corner of
a seat in Row 21, making 9-month-old
<ENAMEX TYPE="PERSON">Danasia Brown</ENAMEX> laugh.

You can use your own set of entity types and
the commands in LingPipe are very flexible
as to where they grab text content to process.
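Before training, it can help to confirm that a converted file really is well-formed and to see how many examples each entity type has. The following is a minimal sketch using only Python's standard library, not anything from LingPipe. The <doc> root element and the function name count_entity_types are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_entity_types(xml_text):
    """Count ENAMEX elements by their TYPE attribute.

    ET.fromstring raises xml.etree.ElementTree.ParseError if the
    input is not well-formed, so this doubles as a sanity check.
    """
    root = ET.fromstring(xml_text)
    return Counter(el.get("TYPE") for el in root.iter("ENAMEX"))


# The <doc> wrapper is a hypothetical root element; any well-formed
# wrapper works for this check.
sample = (
    '<doc>The <ENAMEX TYPE="ORGANIZATION">USAir</ENAMEX> flight to '
    '<ENAMEX TYPE="LOCATION">Charlotte</ENAMEX>, '
    '<ENAMEX TYPE="LOCATION">N.C.</ENAMEX></doc>'
)
counts = count_entity_types(sample)
for entity_type, n in sorted(counts.items()):
    print(entity_type, n)
```

A type with very few examples compared to the others is a likely candidate for more annotation.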

The downside is that you need to keep a second
copy of your data in another format, and this kind
of duplication is pretty ugly. If possible, produce
the XML in our format automatically so it's easy to
maintain the link between your data and the XML.
You will probably correct your data, add more, and so on.
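Producing the XML automatically might look something like the sketch below. It assumes, purely for illustration, that your own annotations are available as plain text plus a list of (start, end, type) character offsets. That input format, the <doc> root element, the function name to_enamex, and the sample sentence are all hypothetical. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def to_enamex(text, spans, root="doc"):
    """Wrap non-overlapping (start, end, type) character spans in ENAMEX tags.

    Text outside and inside the spans is XML-escaped, so characters
    such as & and < do not break well-formedness.
    """
    out = []
    pos = 0
    for start, end, etype in sorted(spans):
        if start < pos or end <= start or end > len(text):
            raise ValueError(f"bad or overlapping span: {(start, end, etype)}")
        out.append(escape(text[pos:start]))
        out.append(
            f"<ENAMEX TYPE={quoteattr(etype)}>{escape(text[start:end])}</ENAMEX>"
        )
        pos = end
    out.append(escape(text[pos:]))
    return f"<{root}>{''.join(out)}</{root}>"


# Made-up sample input; the ampersand shows why escaping matters.
text = "AT&T flew to Charlotte, N.C."
spans = [(0, 4, "ORGANIZATION"), (13, 22, "LOCATION"), (24, 28, "LOCATION")]
xml_text = to_enamex(text, spans)

# Round-trip check: the result parses, and its text content is the original.
assert "".join(ET.fromstring(xml_text).itertext()) == text
print(xml_text)
```

Because the XML is generated from your own data rather than edited by hand, you can rerun the conversion whenever the annotations change.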

2. Write an implementation of TagParser for
your data format. The commands allow different
tag parsers to be plugged in through reflection.
Basically, a tag parser needs to
return two parallel arrays, one holding the
tokens and one holding the tag for each token.
The tags need to follow the BIO scheme:
begin-entity, in-entity, and out (not part of
any entity). If you think you want to do this,
you could send me a sample tagged document and
I could lend you a hand.
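To make the "two parallel arrays" idea concrete, here is a rough sketch in Python of what such a parser has to compute. It is not the TagParser interface itself, which is not shown in this message. The tag strings "B-TYPE", "I-TYPE", and "O" follow a common convention and are an assumption, as are the whitespace tokenization (a stand-in for whatever tokenizer LingPipe is configured with), the <doc> root, and the name bio_arrays.

```python
import xml.etree.ElementTree as ET


def bio_arrays(xml_text):
    """Return (tokens, tags): two parallel lists built from ENAMEX-marked XML.

    Assumes ENAMEX elements are direct children of the root element
    and are not nested inside one another.
    """
    root = ET.fromstring(xml_text)
    tokens, tags = [], []

    def add(text, etype):
        # Whitespace splitting is an assumption, not LingPipe's tokenizer.
        for i, tok in enumerate((text or "").split()):
            tokens.append(tok)
            if etype is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + etype)

    add(root.text, None)
    for el in root:
        etype = el.get("TYPE") if el.tag == "ENAMEX" else None
        add("".join(el.itertext()), etype)
        add(el.tail, None)  # text between this element and the next
    return tokens, tags


sample = (
    '<doc>making 9-month-old <ENAMEX TYPE="PERSON">Danasia '
    'Brown</ENAMEX> laugh.</doc>'
)
tokens, tags = bio_arrays(sample)
for tok, tag in zip(tokens, tags):
    print(tok, tag)
```

The two lists always have the same length, and the tag at position i belongs to the token at position i.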

- Bob Carpenter
Alias-i




Wed Mar 9, 2005 10:29 pm
