Re: Training LingPipe [w. new corpus format]
> 1. Convert your data into a format
> LingPipe understands, such as the CoNLL
> line-based text format or the MUC XML format.
All of our documents are in XML format, but we are using TEI notation.
Using NoteTabPro I wrote a simple clip that converts MUC to TEI
notation. I believe I can do the reverse just as easily. So the
conversion shouldn't be a problem.
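
For reference, the TEI-to-MUC direction could be sketched roughly like this in Python rather than as a NoteTab clip. This is only a sketch under assumptions: it assumes the TEI files mark names with persName, placeName and orgName, and that the target is MUC-style ENAMEX tags with TYPE set to PERSON, LOCATION or ORGANIZATION. The mapping table and the function name tei_to_muc are made up for illustration, and I have not checked the output against LingPipe's own MUC parser.

```python
import re

# Assumed mapping from TEI name elements to MUC ENAMEX types.
# Hypothetical: adjust to whatever tag set the corpus really uses.
TEI_TO_MUC = {
    "persName": "PERSON",
    "placeName": "LOCATION",
    "orgName": "ORGANIZATION",
}

# Matches an opening or closing tag for any of the mapped TEI elements.
_TAG = re.compile(r"<(/?)(persName|placeName|orgName)\b[^>]*>")


def tei_to_muc(text):
    """Rewrite TEI name tags as MUC-style ENAMEX tags; other markup is untouched."""
    def swap(match):
        closing, name = match.group(1), match.group(2)
        if closing:
            return "</ENAMEX>"
        # Any attributes on the TEI tag are dropped in this sketch.
        return '<ENAMEX TYPE="%s">' % TEI_TO_MUC[name]
    return _TAG.sub(swap, text)


if __name__ == "__main__":
    sample = "<persName>John Smith</persName> visited <placeName>Boston</placeName>."
    print(tei_to_muc(sample))
```

Self-closing tags and nested names are not handled here, so a real converter would probably want an XML parser instead of a regular expression.
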
Once the XML file is in MUC format, do I need to modify it any
further? And what about headers? We have a lot of archival
information in the XML header. Should I delete the header for
training? Also, can I use multiple XML files, or should I combine all
the documents together for training purposes? And finally, what is the
command I need to use to train? Is it this?
java NETrainCommand -MUC myTrainingFile.xml
As for the second option, it was all gibberish to me, so I think the
first way is better. Thanks a lot for your help.
Jason Goltermann