Tunde Adegbola wrote on 30 Dec 2008 at 00:26:
> Kindly give me some more information on the platform you are working
> in and how the relaxer is implemented.
Andrew Cunningham wrote on 30 Dec 2008 at 02:35:
> I'd be interested in the approaches Trond is taking as well for some
> Sudanese languages I'm working on.
Dear Tunde, Andrew and others
Here comes a short intro on what we do, dealing with morphology as
well as with the spellrelax issue itself. Before reading on, keep in
mind that hunspell is a good platform for making _automata_ (fine for
concatenative morphology without extensive morphophonological
processes), and that a hunspell enriched with some Unicode
normalisation scheme will thus be the best way to keep your projects
happy.
What we do is make _transducers_, not automata. Automata can be
seen as filters (they return the input or nothing), whereas a
transducer gives a different output (for input a, it returns b or
nothing). Typically, an automaton returns "feet" for "feet" and
nothing for the typo "fee5t". This is all you need for a speller
(which draws a red line when nothing is returned, and goes on to
calculate correction candidates), but with transducers you have more
possibilities. For "feet" you may return e.g. "foot+N+Pl" (a
grammatical analysis), or "pieds" (a grammatically enriched
dictionary). Transducers may be used in grammar checkers, in machine
translation, etc.
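The difference can be sketched in a few lines of Python. This is a toy
illustration only: plain lookup tables stand in for real finite-state
machines, and the tag "foot+N+Sg" and the function names are
assumptions made for the example.

```python
# Toy stand-ins for finite-state machines (illustration only).
WORDS = {"foot", "feet"}                 # automaton: a set of accepted strings
ANALYSES = {                             # transducer: input -> output
    "foot": "foot+N+Sg",                 # assumed tag, by analogy with +Pl
    "feet": "foot+N+Pl",
}

def automaton(word):
    """Return the input itself if it is accepted, otherwise None."""
    return word if word in WORDS else None

def transducer(word):
    """Return an analysis of the input, otherwise None."""
    return ANALYSES.get(word)

print(automaton("feet"))    # accepted: the input comes back
print(automaton("fee5t"))   # rejected: a speller would draw a red line
print(transducer("feet"))   # a grammatical analysis instead of the input
```

A speller only needs the first function; the second is what makes
grammar checking or dictionary lookup possible.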
Now to the presentation of our work:
We actually make two different spellers, one for MS Word, and one for
OpenOffice.
Basis:
======
We do our morphology as finite-state transducers. Our basic code is
written with Xerox tools, lexc, twolc, xfst (http://fsmbook.com/).
As a part of those transducers we have a separate spellrelax file,
which acts as I described: For each glyph (or even set of letters,
like ä/æ) with several characters associated with it, we generate a
parallel form with the variant character. The file in question is a
tiny script (cf. attachment), and our transducers are thus tolerant
with respect to character variation. A parallel treatment for English
would be, e.g., to have ç and ô in the source code (façade, rôle), but
then tolerate c for ç etc. (always accept facade, role as well).
The method is thus: compile your relaxing transducer on top of your
ordinary transducer:
yourlanguage.fst .o. spellrelax.fst = yourspellrelaxedlanguage.fst
where the files are transducers:
yourlanguage.fst reads "geese" and gives "goose+N+Pl"
spellrelax.fst reads ä and interprets it as both ä and æ
The combination of the two transducers will give a reading for both
"täksta" and "tæksta" (two varieties of the word for "text"), although
only "tæksta" is found in the lexicon.
But if, say, a Swedish name (Gävle) is incorrectly entered as "Gævle",
the system will return an error, since the base contains "tæksta" and
"Gävle", and only ä is relaxed, not æ.
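The asymmetry in this example can be sketched in Python. This is not
our xfst code, only a minimal imitation of the composition, assuming
precomposed characters and a two-word base lexicon; the names RELAX,
candidates and accepted are made up for the illustration.

```python
from itertools import product

LEXICON = {"tæksta", "Gävle"}       # base forms, as in the example above
RELAX = {"ä": ("ä", "æ")}           # a typed ä may stand for ä or æ; æ is not relaxed

def candidates(typed):
    """All lexicon-side strings that the typed string may stand for."""
    options = [RELAX.get(ch, (ch,)) for ch in typed]
    return {"".join(combo) for combo in product(*options)}

def accepted(typed):
    """True if at least one candidate is found in the base lexicon."""
    return bool(candidates(typed) & LEXICON)

for word in ("täksta", "tæksta", "Gävle", "Gævle"):
    print(word, accepted(word))
```

Only "Gævle" is rejected: it contains no ä to relax, and the base has
the name with ä only.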
The book referred to at http://fsmbook.com/ discusses this (with
examples for two versions of Portuguese), and our documentation pages
(http://giellatekno.uit.no/doc/index.html, search for "spellrelax")
discuss our implementation of it.
The resulting transducers are then converted into spellcheckers along
different paths.
(Our spellers are downloadable from http://divvun.no)
MS Word:
========
Here, we needed to go to a Microsoft subcontractor, in our case
Polderland. Polderland's file format was a fullform
OpenOffice:
===========
Here, we use hunspell.
Hunspell is, as we see it, an enriched version of ispell, hence an
automaton (returns input or not), and not a transducer (for input a,
returns b or nothing). Our transducers are cascaded, so that we can
handle suffixation in one component and inner inflection (umlaut-like
processes) in another. An automaton forces us to "cut early" (I use
German umlaut as an example, assuming it to be more familiar; the
actual processes we model are Sami consonant gradation, but for the
principle the differences do not matter):
Kind + er -> Kinder
Buch + er -> Bücher
In our approach, we give both the same suffix -er, but Buch is given
an UML "suffix" as well, and then we have an umlaut rule in a parallel
transducer:
Buch + UML + er -> BuchUMLer (via rule u:ü <=> _ * UML ;) -> Bücher
With simple automata, Buch must have the single-letter "stem" B, and
then add either "uch" or "ücher".
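The two-step idea (concatenate first, apply the umlaut rule
afterwards) can be imitated in Python. This is a simplified sketch,
not the twolc rule itself: the marker spelling "^UML" and the function
name are assumptions, and only the last u before the marker is
changed.

```python
UML = "^UML"   # hypothetical spelling of the umlaut trigger

def realise(stem, umlaut, suffix):
    """Concatenate stem (+ UML) + suffix, then apply the umlaut rule."""
    underlying = stem + (UML if umlaut else "") + suffix
    if UML in underlying:
        before, after = underlying.split(UML, 1)
        i = before.rfind("u")                 # last u before the marker
        if i != -1:
            before = before[:i] + "ü" + before[i + 1:]
        underlying = before + after           # the marker itself disappears
    return underlying

print(realise("Kind", False, "er"))
print(realise("Buch", True, "er"))
```

Both nouns take the same suffix -er; the only lexical difference is
whether the stem carries the trigger.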
Rather than writing a separate automaton for hunspell, we enrich our
transducer with hunspell-type continuation-lexicon marks, in a version
which keeps the UML etc. marks. Hunspell then generates the stems Buch
and BuchUML (with different continuation lexica, the latter to /er,
the former not); then we use our morphophonological transducer to
change the stem BuchUML to Büch, and we have the hunspell automaton.
The spellrelax issue (the question you asked) we fix in the same way
as we did for the MS Word version: The transducer
yourspellrelaxedlanguage.fst generates both forms, i.e. (see above:
tæksta, täksta, Gävle, ...) and all their hunspell continuation lexica.
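Generating the double lists amounts to expanding each base form into
all its relaxed spellings. A minimal Python sketch, assuming the same
two-word base as above and leaving out the continuation lexica; the
names VARIANTS, surface_forms and expand are made up for the
illustration.

```python
from itertools import product

BASE = ["tæksta", "Gävle"]
VARIANTS = {"æ": ("æ", "ä")}    # a lexicon æ may be written as æ or ä

def surface_forms(word):
    """All spellings to be listed for one base form."""
    options = [VARIANTS.get(ch, (ch,)) for ch in word]
    return {"".join(combo) for combo in product(*options)}

def expand(lexicon):
    """The full word list handed on to the speller."""
    forms = set()
    for word in lexicon:
        forms |= surface_forms(word)
    return sorted(forms)

print(expand(BASE))
```

The list grows with every relaxed character in a word, which is why
doing the relaxation at lookup time would be preferable.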
Ideally, we would like to use our transducers directly as
spellcheckers. This is what e.g. lingsoft.fi does, and if you want an
open source version for that you might try the Stuttgart sfst or
Helsinki hfst transducer platforms. Then we would not need to generate
double lists at all; we would just include spellrelax.fst as part of
the filter.
---
As you can see from our homepages, http://divvun.no (our spellers) and
http://giellatekno.uit.no (our transducers), we already work with
several languages, and we are as a matter of fact interested in
looking at African languages as well (that is why I follow this list).
Arvi Hurskainen of course has a very advanced solution for Kiswahili,
and several of his students work on other Bantu languages, as do Sonja
Bosch and Laurette Pretorius in South Africa. I have (either alone or
with colleagues and students) looked at Kinyarwanda, Nama and Amharic
(and there are good analyses of the Amharic object conjugation within
this framework).
The methods I have described here go back to Kimmo Koskenniemi's
trailblazing dissertation from 1983. It is no coincidence that a
Finnish scholar was the one to solve problems for morphology-rich
languages. Such languages are abundant in Africa, and in my opinion
approaches along (some version of) the lines sketched here should form
the cornerstone of all language technology work on morphology-rich
languages, in Africa and elsewhere.
Feel free to contact me for follow-ups.
Trond.
_______________________________________________
A12n-collaboration mailing list
A12n-collaboration@...
http://lists.kabissa.org/mailman/listinfo/a12n-collaboration