Message #70 of 796
RE: [LingPipe] Re: Training LingPipe [w. new corpus format]


> > 1. Convert your data into a format
> > LingPipe understands, such as the CoNLL
> > line-based text format or the MUC XML format.
>
> All of our documents are in XML format, but we are using TEI
> notation. Using NoteTabPro I wrote a simple clip that
> converts MUC to TEI notation. I believe I can do the reverse
> just as easily. So the conversion shouldn't be a problem.

Great. That's the hardest part of the whole process.
I have to do it all the time for evaluations.

I looked up TEI and found more than I bargained for.
There was an awful lot of material on document metadata,
but I didn't see a standard way of marking up named entities.
If there is one, I can write a tag parser for the format
for the next LingPipe release.
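
For the TEI-to-MUC direction that training needs, the conversion
could be sketched in Python roughly as below. This is only an
illustration under assumptions: it supposes entities are tagged as
<name type="...">, which is one TEI option and may not match your
markup, and it rewrites them as MUC-style <ENAMEX TYPE="...">. The
function name tei_names_to_muc and the sample sentence are made up
for the example.

```python
import xml.etree.ElementTree as ET


def tei_names_to_muc(xml_text, tei_tag="name", type_attr="type"):
    """Rewrite <name type="x"> elements (an assumed TEI convention)
    as MUC-style <ENAMEX TYPE="X"> elements."""
    root = ET.fromstring(xml_text)
    # Collect matches first so renaming does not disturb the iteration.
    for elem in list(root.iter(tei_tag)):
        entity_type = elem.attrib.get(type_attr, "")
        elem.tag = "ENAMEX"
        elem.attrib.clear()  # drop any other attributes on the element
        elem.set("TYPE", entity_type.upper())
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample sentence, not taken from any real corpus.
sample = (
    '<p>Talks with <name type="person">John Smith</name> of '
    '<name type="organization">Acme</name>.</p>'
)
converted = tei_names_to_muc(sample)
print(converted)
```

A real converter would also have to decide what to do with TEI
elements that have no MUC counterpart; this sketch leaves them as
they are.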

> Once the xml file is in the MUC format, do I need to modify
> it any further? And what about headers? We have a lot of
> archival information in the xml header. Should I delete the
> header for training?

If there's text content that you don't want to train on,
then yes, you should delete it for training. That can
also be done programmatically with the right filters,
but it sounds like it'll just be easiest for you to
get rid of the parts you don't want annotated.
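
If stripping the header by script is preferable to doing it by hand,
something along these lines might work. It is a minimal Python
sketch, assuming the archival information sits in a teiHeader
element with no XML namespace; strip_header and the sample document
are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_header(xml_text, header_tag="teiHeader"):
    """Return the document with every <header_tag> element removed."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == header_tag:
                # Removing the element also drops any text that
                # directly follows it (usually just whitespace).
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample document for illustration only.
sample_doc = (
    "<TEI.2><teiHeader><title>Archive record</title></teiHeader>"
    "<text><p>Body text.</p></text></TEI.2>"
)
stripped = strip_header(sample_doc)
print(stripped)
```

If the files declare a namespace, the tag to match would need the
namespace URI in braces in front of the element name.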

> Also, can I use multiple xml files..or
> should I combine all the documents together for training
> purposes.

They need to be specified on the command line, so I
don't think 3600 separate files will work that way.
LingPipe itself is more flexible than the command lets
on -- there's no limit to the number of files, per se,
just to how many characters can be on a command line.
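
A rough way to check whether a list of file names would fit is to
add up the characters. The Python sketch below does only that
arithmetic. The file names are invented, and the two limits are
assumptions for illustration: 8191 is the commonly cited cap for
the Windows command prompt, and 131072 stands in for a typical
Unix-style limit. Check your own system's actual value.

```python
def command_length(args):
    """Characters needed for the arguments, counting one separator each."""
    return sum(len(arg) + 1 for arg in args)


def fits_on_command_line(args, limit):
    """True if the arguments fit within the given character limit."""
    return command_length(args) <= limit


# Hypothetical names for 3600 separate training files.
files = ["corpus/doc%04d.xml" % i for i in range(3600)]
total = command_length(files)

print(total)
print(fits_on_command_line(files, 8191))     # assumed Windows prompt cap
print(fits_on_command_line(files, 131072))   # assumed Unix-style cap
```

The count ignores the java options that come before the file
names, so the real margin is a little smaller.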


> And finally what is the command I need to use to
> train? Is it this?
>
> java NETrainCommand -MUC myTrainingFile.xml

Not quite, but close. The getting started instructions
go through an example for CoNLL formatted data. They're
distributed as a text file and also available online at:

http://www.alias-i.com/lingpipe/getting_started.html

You just need to change the tagparser specification to MUC:

$JAVA
-server
-Xmx512M
-Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
-cp $LINGPIPE;$XERCES_API;$XERCES_IMPL
com.aliasi.ne.command.NETrainCommand
-model=modelFileName
-tagParser=com.aliasi.ne.muc.TagParserMUC
TrainingFile1 ... TrainingFileN

I'm using $JAVA as in the getting started doc to be
a pointer to the java executable. You should use
the "-server" option if you're using Sun's 1.4 or 1.5
JDKs. The "-Xmx512M" allows java to use up to 512 megabytes
of RAM. The next line tells it which XML parser to use.
The line after that sets the classpath, and its entries
should be paths to xercesImpl.jar, xml-apis.jar and
lingpipe_1_0_7.jar. That should be all you need.
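
Since the classpath separator differs by platform (";" on Windows,
":" on Unix-like systems), one option is to assemble the command in
a script. The Python sketch below only builds the argument list
from the options shown above; build_train_command, the jar
locations and the file names are placeholders, not part of
LingPipe.

```python
import os


def build_train_command(java, jars, model_file, training_files,
                        max_heap="512M", path_sep=os.pathsep):
    """Assemble the NETrainCommand invocation as an argument list."""
    return [
        java,
        "-server",
        "-Xmx" + max_heap,
        "-Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser",
        "-cp", path_sep.join(jars),  # platform separator by default
        "com.aliasi.ne.command.NETrainCommand",
        "-model=" + model_file,
        "-tagParser=com.aliasi.ne.muc.TagParserMUC",
    ] + list(training_files)


# Placeholder paths; substitute the real locations of the jars and data.
cmd = build_train_command(
    "java",
    ["lingpipe_1_0_7.jar", "xml-apis.jar", "xercesImpl.jar"],
    "modelFileName",
    ["TrainingFile1.xml", "TrainingFile2.xml"],
    path_sep=":",
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True)
to launch the training run without going through a shell.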

You may not need that much memory, so feel free to try
it with a lower cap if you don't have 512 megs.

This will give you default values for the tokenizer and
token categorizer, as well as for the model parameters.
Often tuning the parameters can lead to a boost in
performance, with bigger boosts accruing to poorer performing
systems.

Then you can use the model just as specified in the other
commands. For instance, you can parse docs and specify
which XML elements have their text content annotated.
I'm afraid the default output from the commands is also
in MUC format. That could be changed by rewriting code,
but not from our commands.

> As for the second option...it was all gibberish to me, so I
> think the first way is better. Thanks a lot for your help

Sorry. The second option is for Java wonks to build efficient
document processing server deployments.

- Bob




Thu Mar 10, 2005 5:58 pm