> Also, I'm guessing that there are about 30,000
> entities in these documents...Is that enough to use for retraining?
Yes, that should be more than enough. Are you
tagging different kinds of entities? The more entity
types there are, the more training data you need to
discriminate between them.
> What I need to know is how to retrain LingPipe using the documents
> that we've tagged already.
There are two ways to go, and the first is probably
the easiest and can be done in any programming language
or even by hand.
1. Convert your data into a format
LingPipe understands, such as the CoNLL
line-based text format or the MUC XML format.
I'd strongly recommend XML because it
deals with character set and well-formedness
issues for you and you don't have to worry
about tokenization. All of the other tools
in LingPipe can work with XML data.
All that's required for the MUC XML format is
that you produce well-formed XML overall
and mark entities within text content as follows:
The <ENAMEX TYPE="ORGANIZATION">USAir</ENAMEX> flight attendant in the rear
of the plane making a short flight to
<ENAMEX TYPE="LOCATION">Charlotte</ENAMEX>,
<ENAMEX TYPE="LOCATION">N.C.</ENAMEX>, kept peeking around the corner of
a seat in Row 21, making 9-month-old
<ENAMEX TYPE="PERSON">Danasia Brown</ENAMEX> laugh.
You can use your own set of entity types and
the commands in LingPipe are very flexible
as to where they grab text content to process.
The downside is that you end up keeping a second
copy of your data in another format, and this kind
of duplication is pretty ugly. If possible, generate
the XML in our format automatically so it's easy to
maintain the link between your data and the XML when
you correct your data, add more, and so on.
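To make option 1 concrete, here is a rough sketch in Python of
generating the ENAMEX markup automatically. It assumes your existing
annotations can be read as character offsets of the form
(start, end, type) over plain text. That input layout and the function
name to_enamex are made up for illustration, so adapt them to
whatever your tagging tool actually stores.

```python
from xml.sax.saxutils import escape, quoteattr


def to_enamex(text, spans):
    """Wrap each (start, end, entity_type) span of text in an ENAMEX element.

    Offsets are character positions with an exclusive end. The span layout
    is an assumption for this sketch, not something LingPipe prescribes.
    Spans must not overlap.
    """
    out = []
    pos = 0
    for start, end, etype in sorted(spans):
        if start < pos or end <= start or end > len(text):
            raise ValueError(f"bad or overlapping span: {(start, end, etype)}")
        out.append(escape(text[pos:start]))    # plain text before the entity
        out.append(f"<ENAMEX TYPE={quoteattr(etype)}>")
        out.append(escape(text[start:end]))    # entity text, with &, <, > escaped
        out.append("</ENAMEX>")
        pos = end
    out.append(escape(text[pos:]))             # plain text after the last entity
    return "".join(out)


text = "AT&T flew to Charlotte."
spans = [(0, 4, "ORGANIZATION"), (13, 22, "LOCATION")]
print(to_enamex(text, spans))
```

Escaping the text content is what keeps the output well-formed when
the source contains characters like & or <. The result still needs to
sit inside a single root element of your choosing to be a complete
XML document.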
2. Write an implementation of TagParser for
your data format. The commands allow different
tag parsers to be plugged in through reflection.
Basically, a tag parser needs to
return two parallel arrays, one representing
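tokens and one holding their tags. The tags need to
follow the BIO scheme: a begin tag on the first token
of an entity, an in tag on each later token of the
same entity, and an out tag on every token outside
an entity. If you think you want to do this,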
you could send me a sample tagged document and
I could lend you a hand.
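To illustrate the parallel-array idea in option 2, here is a small
Python sketch that turns token-level entity spans into BIO tags. It
only shows the data shape. It is not the TagParser interface itself,
and the tag spellings B-TYPE, I-TYPE and O follow the common
CoNLL-style convention, which is an assumption here. The exact
strings LingPipe expects may differ.

```python
def bio_encode(tokens, entities):
    """Return (tokens, tags) as two parallel lists of equal length.

    entities is a list of (first_token, end_token_exclusive, entity_type)
    tuples indexing into tokens. This layout is hypothetical.
    """
    tags = ["O"] * len(tokens)                 # default: outside any entity
    for start, end, etype in entities:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, etype)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, etype)}")
        tags[start] = f"B-{etype}"             # begin tag on the first token
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"             # in tag on the remaining tokens
    return list(tokens), tags


tokens = ["making", "9-month-old", "Danasia", "Brown", "laugh", "."]
toks, tags = bio_encode(tokens, [(2, 4, "PERSON")])
for tok, tag in zip(toks, tags):
    print(tok, tag)
```

Position i in the tag list always describes position i in the token
list, so the two can be handed over together or zipped as above.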
- Bob Carpenter
Alias-i