Thanks Trond.
I have read through your post and taken a look at your website and I find your work quite interesting. I believe ours has a lot to learn from yours.
In agreement, and as Friedel Wolf, with whom we are collaborating on the Anloc Spell Checking project, has also suggested, Unicode normalization for Hunspell will be an important step in solving the order-of-diacritics problem in our spell checker.
However, reading through your post, it is clear that we need to look more closely at what you are doing so that we do not reinvent the wheel while solving some of our other, more pressing problems. The morphological processes we are dealing with go far beyond mere concatenation. For example, some Yoruba verb phrases fuse into single words, thereby producing a new verb. In this process of making one word out of two (or sometimes more) words, some vowels are elided and tones are transferred from one vowel to another. We have analyzed hundreds of these verb phrases using Prolog and have thereby developed a compact set of rules that govern these processes. But we have not been able to code such rules in Hunspell. For the last six months we have been in regular communication with Laszlo Nemeth, the developer of Hunspell, and he has been very supportive, giving interesting tips on how to coax Hunspell into doing some things that it was not explicitly designed to do.
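
To make the order-of-diacritics problem concrete, here is a minimal Python sketch (not Hunspell code). The example strings and the helper name same_word are only assumptions for illustration. It uses the standard unicodedata module to show that a vowel typed with its under-dot and its tone mark in either order can be brought to one canonical form before lookup.

```python
import unicodedata

# "e" + acute (U+0301) + dot below (U+0323): tone mark typed first
tone_first = "e\u0301\u0323"
# "e" + dot below (U+0323) + acute (U+0301): under-dot typed first
dot_first = "e\u0323\u0301"

def same_word(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw code point sequences differ, so a plain string lookup would fail.
print(tone_first == dot_first)
# After normalization the two spellings compare equal.
print(same_word(tone_first, dot_first))
```

A speller that normalizes both its word list and the user's input in this way would no longer depend on the order in which the marks were typed.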
We have made useful progress on our project but we know there can be more efficient ways of doing some of the things we have done. I would like to keep in touch with you on these issues.
Tunde
> From: trond.trosterud@...
> To: a12n-collaboration@...
> Subject: Re: [A12n-Collab] Re:Tech support for Yoruba orthography
> Date: Wed, 31 Dec 2008 18:26:38 +0100
>
>
> Tunde Adegbola wrote on 30 Dec. 2008 at 00.26:
> > Kindly give me some more information on the platform you are working
> > in and how the relaxer is implemented.
>
> Andrew Cunningham wrote on 30 Dec. 2008 at 02.35:
> > I'd be interested in the approaches Trond is taking as well for some
> > Sudanese languages I'm working on.
>
>
> Dear Tunde, Andrew and others
>
> Here comes a short intro on what we do, dealing with morphology as
> well as with the spellrelax issue itself. Before reading on, keep in
> mind that hunspell is a good platform for making _automata_ (fine for
> concatenative morphology without extensive morphophonological
> processes), and that a hunspell enriched with some unicode
> normalisation scheme thus will be the best way to keep your projects
> happy.
>
> What we do is making _transducers_, not automata. Automata can be
> seen as filters (return the input or not), whereas a transducer gives
> a different output (for input a, it returns b or nothing). Typically, an
> automaton returns "feet" for "feet" and nothing for the typo "fee5t".
> This is all you need for a speller (which draws red lines when nothing
> is returned, and goes on to calculate correction candidates), but
> for transducers you have more possibilities. For "feet" you may return
> e.g. "foot+N+Pl" (a grammatical analysis), or "pieds" (a grammatically
> enriched dictionary). Transducers may be utilised in grammar checkers,
> in machine translation, etc.
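
The distinction above can be sketched in a few lines of Python. This is only a toy illustration with dictionaries standing in for real finite-state machines; the names WORDS, ANALYSES, automaton and transducer are hypothetical, and the "foot+N+Sg" entry is an assumption added for symmetry.

```python
from typing import Optional

# Toy data: a word list for the automaton, analyses for the transducer.
WORDS = {"feet", "foot"}
ANALYSES = {"feet": "foot+N+Pl", "foot": "foot+N+Sg"}

def automaton(word: str) -> Optional[str]:
    """Filter: return the input unchanged if it is accepted, else nothing."""
    return word if word in WORDS else None

def transducer(word: str) -> Optional[str]:
    """Mapping: return a different string (an analysis) for the input, else nothing."""
    return ANALYSES.get(word)

print(automaton("feet"))    # accepted word comes back as it went in
print(automaton("fee5t"))   # typo: nothing comes back
print(transducer("feet"))   # the analysis rather than the word itself
```

A speller only needs the first function; the second carries the extra information that grammar checkers or translation systems can use.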
>
> Now to the presentation of our work:
>
> We actually make two different spellers, one for MS Word, and one for
> OpenOffice.
>
>
> Basis:
> ======
>
> We do our morphology as finite-state transducers. Our basic code is
> written with Xerox tools, lexc, twolc, xfst (http://fsmbook.com/).
>
> As a part of those transducers we have a separate spellrelax file,
> which acts as I described: For each glyph (or even set of letters,
> like ä/æ) with several characters associated with it, we generate a
> parallel form with the variant character. The file in question is a
> tiny script (cf. attachment), and our transducers thus are tolerant
> wrt. character variation. A parallel treatment for English would e.g.
> be to have ç and ô in the source code (façade, rôle), but then
> tolerate c for ç etc. (always accept facade, role as well).
>
> The method is thus: compile your relaxing transducer on top of your
> ordinary transducer:
>
> yourlanguage.fst .o. spellrelax.fst = yourspellrelaxedlanguage.fst
>
> where the files are transducers:
>
> yourlanguage.fst reads "geese" and gives "goose+N+Pl"
>
> spellrelax.fst reads ä and interprets it as both ä and æ
>
> the combination of the two transducers will give a reading for both
> "täksta" and "tæksta" (two varieties of the word for "text", although
> both are found in the lexicon).
> But if, say a Swedish name (Gävle) incorrectly is added as "Gævle",
> the system will return an error, since the base contains "tæksta,
> Gävle", and only ä is relaxed, not æ.
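
The one-directional relaxation described above can be imitated in Python as follows. This is a rough sketch, not the xfst composition itself: a set stands in for yourlanguage.fst, a small table stands in for spellrelax.fst, and the names RELAX, LEXICON, relaxed_candidates and accepts are hypothetical.

```python
from itertools import product

# Stand-in for spellrelax.fst: an input "ä" may be read as "ä" or as "æ".
# There is deliberately no entry for "æ", so "æ" is never relaxed.
RELAX = {"ä": ("ä", "æ")}

# Stand-in for the base lexicon, as in the example above.
LEXICON = {"tæksta", "Gävle"}

def relaxed_candidates(word: str) -> set:
    """All lexical forms the typed word may stand for under RELAX."""
    options = [RELAX.get(ch, (ch,)) for ch in word]
    return {"".join(choice) for choice in product(*options)}

def accepts(word: str) -> bool:
    """True if at least one relaxed reading of the word is in the lexicon."""
    return bool(relaxed_candidates(word) & LEXICON)

for typed in ("täksta", "tæksta", "Gävle", "Gævle"):
    print(typed, accepts(typed))
```

With this data the first three inputs are accepted and "Gævle" is rejected, because nothing maps a typed "æ" back to the "ä" stored in the lexicon.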
>
> The book referred to at http://fsmbook.com/ discusses this
> (examples for two versions of Portuguese), and our documentation pages
> (http://giellatekno.uit.no/doc/index.html; search for "spellrelax")
> discuss our implementation of it.
>
>
> The resulting transducers are then converted into spellcheckers along
> different paths.
> (our spellers are downloadable from http://divvun.no)
>
> MS Word:
> ========
> Here, we needed to go to a Microsoft subcontractor, in our case
> Polderland. Polderlands file format was a fullform
>
>
> OpenOffice:
> ===========
> Here, we use hunspell.
>
> Hunspell is, as we see it, an enriched version of ispell, hence an
> automaton (returns input or not), and not a transducer (for input a,
> returns b or nothing). Our transducers are cascaded, so that we can
> handle suffixation in one component and inner inflection (umlaut-like
> processes) in another. An automaton forces us to "cut early" (I use
> German Umlaut as an example, assuming it to be more familiar; the actual
> processes we model are Sami consonant gradation, but for the underlying
> principle the differences do not matter):
>
> Kind + er -> Kinder
> Buch + er -> Bücher
>
> In our approach, we give both the same suffix -er, but Buch is given
> an UML "suffix" as well, and then we have an umlaut rule in a parallel
> transducer:
> Buch + UML + er -> BuchUMLer (via rule u:ü <=> _ * UML ;) -> Bücher
>
> With simple automata, Buch must have the single-letter "stem" B, and
> then adds either "uch" or "ücher".
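
The cascade described above (plain suffixation first, then a separate umlaut rule triggered by a marker) might be sketched like this in Python. It is a simplified stand-in for the lexc/twolc setup, not a translation of it: the marker string "^UML", the vowel list and the function names are assumptions made for the illustration.

```python
import re

# Hypothetical marker attached to stems that undergo umlaut.
UML = "^UML"

def concatenate(stem: str, *parts: str) -> str:
    """Step 1: pure concatenation of stem, optional marker and suffix."""
    return stem + "".join(parts)

def apply_umlaut(form: str) -> str:
    """Step 2: turn u into ü when only consonants separate it from the
    marker, then delete the marker."""
    form = re.sub(r"u(?=[^aeiouäöü]*\^UML)", "ü", form)
    return form.replace(UML, "")

print(apply_umlaut(concatenate("Kind", "er")))
print(apply_umlaut(concatenate("Buch", UML, "er")))
```

Both nouns take the same suffix "er" in step 1; only the marked stem is changed in step 2, so there is no need to cut the stem down to "B" as a simple automaton would require.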
>
> Rather than writing a separate automaton for hunspell, we enrich our
> transducer with hunspell-type continuation-lexicon marks, in a version
> which keeps the UML etc. marks. Hunspell then generates the stems Buch
> and BuchUML (with different continuation lexica, the latter to /er,
> the former not), then we use our morphophonological transducer to
> change the stem BuchUML to Büch, and we have the hunspell automaton.
>
> The spellrelax issue (the question you asked) we fix in the same way
> as we did for the MS Word version: The transducer
> yourspellrelaxedlanguage.fst generates both forms, i.e. (see above:
> tæksta, täksta, Gävle, ...) and all their hunspell continuation lexica.
>
> Ideally, we would like to use our transducers directly as
> spellcheckers. This is what e.g. lingsoft.fi does, and if you want an
> open source version for that you might try the Stuttgart sfst or
> Helsinki hfst transducer platforms. Then we would not need to generate
> double lists at all, we would just include the spellrelax.fst as part
> of the filter.
>
> ---
>
> As you can see from our homepage: http://divvun.no (our spellers) and http://giellatekno.uit.no
> (our transducers), we already work with several languages, and we
> are as a matter of fact interested in looking at African lgs as well
> (that is why I follow this list).
>
> Arvi Hurskainen of course has a very advanced solution for Kiswahili,
> and several of his students work on other Bantu lgs, as do Sonja
> Borsch and Laura Pretorius in SA. I have (either alone or with
> colleagues and students) looked at Kinyarwanda, Nama and Amharic (and
> there are good analyses of the Amharic object conjugation within this
> framework).
>
> The methods I have described here go back to Kimmo Koskenniemi's
> trailblazing dissertation from 1983; it is no coincidence that a
> Finnish scholar was the one to solve problems for morphology-rich
> languages. Such languages are abundant in Africa, and in my opinion
> approaches along (some version of) the lines sketched here should form
> the cornerstone of all lg technology work on morphology-rich
> languages, in Africa and elsewhere.
>
> Feel free to contact me for follow-ups.
>
> Trond.
>
>
> _______________________________________________
> A12n-collaboration mailing list
> A12n-collaboration@...
> http://lists.kabissa.org/mailman/listinfo/a12n-collaboration
-----------------------------------------------------------------------------------------------
Tunde Adegbola (Ph.D.)
Executive Director
African Languages Technology Initiative
(Alt-I ... Inserting African issues into the agenda of the knowledge age)
11 Oluyole Way, New Bodija Ibadan, Nigeria.
+234 8034019398
------------------------------------------------------------------------------------------------