a12n-archives · A12n = Africanization of ICT
Re: [A12n-Collab] Re:Tech support for Yoruba orthography


Tunde Adegbola wrote on 30 Dec 2008 at 00:26:
> Kindly give me some more information on the platform you are working
> in and how the relaxer is implemented.

Andrew Cunningham wrote on 30 Dec 2008 at 02:35:
> I'd be interested in the approaches Trond is taking as well for some
> Sudanese languages I'm working on.


Dear Tunde, Andrew and others

Here is a short intro to what we do, dealing with morphology as
well as with the spellrelax issue itself. Before reading on, keep in
mind that hunspell is a good platform for making _automata_ (fine for
concatenative morphology without extensive morphophonological
processes), and that a hunspell enriched with some Unicode
normalisation scheme will thus be the best way to keep your projects
happy.

What we make are _transducers_, not automata. Automata can be
seen as filters (return the input or not), whereas a transducer gives
a different output (for input a, it returns b or nothing). Typically, an
automaton returns "feet" for "feet" and nothing for the typo "fee5t".
This is all you need for a speller (which draws a red line when
nothing is returned, and goes on to calculate correction candidates), but
with transducers you have more possibilities. For "feet" you may return
e.g. "foot+N+Pl" (a grammatical analysis), or "pieds" (a grammatically
enriched dictionary). Transducers may be utilised in grammar checkers,
in machine translation, etc.
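
To make the contrast concrete, here is a minimal sketch in Python. It is
not how our tools work internally; the toy lexicon and the function names
(accepts, transduce) are assumptions made up for illustration only.

```python
# Toy lexicon mapping surface forms to analyses (illustrative entries only).
ANALYSES = {
    "feet": ["foot+N+Pl"],
    "foot": ["foot+N+Sg"],
    "geese": ["goose+N+Pl"],
}


def accepts(word):
    """Automaton view: return the input itself if it is known, else None."""
    return word if word in ANALYSES else None


def transduce(word):
    """Transducer view: map the input to zero or more different outputs."""
    return ANALYSES.get(word, [])


if __name__ == "__main__":
    for w in ("feet", "fee5t"):
        print(w, accepts(w), transduce(w))
```

A speller only needs accepts(); a grammar checker or a translation
system needs what transduce() returns.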

Now to the presentation of our work:

We actually make two different spellers, one for MS Word, and one for
OpenOffice.


Basis:
======

We do our morphology as finite-state transducers. Our basic code is
written with Xerox tools, lexc, twolc, xfst (http://fsmbook.com/).

As a part of those transducers we have a separate spellrelax file,
which acts as I described: For each glyph (or even set of letters,
like ä/æ) with several characters associated with it, we generate a
parallel form with the variant character. The file in question is a
tiny script (cf. attachment), and our transducers are thus tolerant
with respect to character variation. A parallel treatment for English
would e.g. be to have ç and ô in the source code (façade, rôle), but then
tolerate c for ç etc. (always accept facade, role as well).

The method is thus: compile your relaxing transducer on top of your
ordinary transducer:

yourlanguage.fst .o. spellrelax.fst = yourspellrelaxedlanguage.fst

where the files are transducers:

yourlanguage.fst reads "geese" and gives "goose+N+Pl"

spellrelax.fst reads ä and interprets it as both ä and æ

the combination of the two transducers will give a reading for both
"täksta" and "tæksta" (two varieties of the word for "text", although
both are found in the lexicon).
But if, say, a Swedish name (Gävle) is incorrectly written as "Gævle",
the system will return an error, since the base contains "tæksta,
Gävle", and only ä is relaxed, not æ.
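
The one-directional relaxation can be sketched in plain Python as below.
This is only an illustration of the idea, not our xfst code: the names
LEXICON, RELAX, relax and lookup are hypothetical, and the lexicon is
reduced to the two words mentioned above.

```python
from itertools import product

# Base lexicon as described above: it contains "tæksta" and "Gävle".
LEXICON = {"tæksta", "Gävle"}

# Relaxation table: a typed ä may stand for lexical ä or æ.
# There is no entry for æ, so a typed æ is only ever read as æ.
RELAX = {"ä": ("ä", "æ")}


def relax(word):
    """Return every lexical spelling the typed word may stand for."""
    options = [RELAX.get(ch, (ch,)) for ch in word]
    return {"".join(chars) for chars in product(*options)}


def lookup(word):
    """Return the lexicon entries matched by the relaxed input."""
    return sorted(relax(word) & LEXICON)


if __name__ == "__main__":
    for w in ("täksta", "tæksta", "Gävle", "Gævle"):
        print(w, lookup(w))
```

Here "täksta", "tæksta" and "Gävle" are all accepted, while "Gævle"
finds nothing, because nothing maps a typed æ back to ä.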

The book referred to at http://fsmbook.com/ discusses this
(examples for two versions of Portuguese), and our documentation pages
(http://giellatekno.uit.no/doc/index.html, search for "spellrelax")
discuss our implementation of it.


The resulting transducers are then converted into spellcheckers along
different paths.
(our spellers are downloadable from http://divvun.no)

MS Word:
========
Here, we needed to go to a Microsoft subcontractor, in our case
Polderland. Polderland's file format was a fullform


OpenOffice:
===========
Here, we use hunspell.

Hunspell is, as we see it, an enriched version of ispell, hence an
automaton (returns input or not), and not a transducer (for input a,
returns b or nothing). Our transducers are cascaded, so that we can
handle suffixation in one component and inner inflection (umlaut-like
processes) in another. An automaton forces us to "cut early", as in
the following example (I use German umlaut, assuming it to be more
familiar; the actual processes we model are Sami consonant gradation,
but for the principle the difference does not matter):

Kind + er -> Kinder
Buch + er -> Bücher

In our approach, we give both the same suffix -er, but Buch is given
an UML "suffix" as well, and then we have an umlaut rule in a parallel
transducer:
Buch + UML + er -> BuchUMLer (via rule u:ü <=> _ * UML ;) -> Bücher
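
As a rough sketch of this two-step cascade (not our twolc rule itself),
the same idea can be written in Python. The function names are made up
for illustration, and the rule is simplified to "umlaut the last u of
the stem standing before the UML mark".

```python
# Only the u:ü pair from the rule above is modelled here.
UMLAUT = {"u": "ü"}
MARK = "UML"


def concatenate(stem, suffix, umlaut=False):
    """Step 1: plain concatenation, with an abstract UML mark if needed."""
    return stem + (MARK if umlaut else "") + suffix


def apply_umlaut(form):
    """Step 2: realise the UML mark on the stem vowel, then delete the mark."""
    if MARK not in form:
        return form
    stem, rest = form.split(MARK, 1)
    # Walk backwards to the last vowel of the stem that can take umlaut.
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            stem = stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
            break
    return stem + rest


if __name__ == "__main__":
    print(apply_umlaut(concatenate("Kind", "er")))
    print(apply_umlaut(concatenate("Buch", "er", umlaut=True)))
```

Both nouns share the same suffix -er; only the lexical entry for Buch
carries the extra mark, and the sound change lives in one place.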

With simple automata, Buch must have the single-letter "stem" B, and
then add either "uch" or "ücher".

Rather than writing a separate automaton for hunspell, we enrich our
transducer with hunspell-type continuation-lexicon marks, in a version
which keeps the UML etc. marks. Hunspell then generates the stems Buch
and BuchUML (with different continuation lexica, the latter leading to
/er, the former not). Then we use our morphophonological transducer to
change the stem BuchUML to Büch, and we have the hunspell automaton.

The spellrelax issue (the question you asked) we fix in the same way
as we did for the MS Word version: The transducer
yourspellrelaxedlanguage.fst generates both forms, i.e. (see above:
tæksta, täksta, Gävle, ...) and all their hunspell continuation lexica.
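
Seen from the generation side, this amounts to expanding each lexicon
form into all spellings the relaxation allows. The following Python
sketch shows only that expansion step; the names SURFACE_VARIANTS,
surface_forms and expand are hypothetical, and attaching the hunspell
continuation lexica is left out.

```python
from itertools import product

# A lexical æ may be written as æ or ä; a lexical ä stays ä.
SURFACE_VARIANTS = {"æ": ("æ", "ä")}


def surface_forms(lexical):
    """Return all spellings under which one lexicon form is accepted."""
    options = [SURFACE_VARIANTS.get(ch, (ch,)) for ch in lexical]
    return {"".join(chars) for chars in product(*options)}


def expand(lexicon):
    """Return the full relaxed word list for a collection of forms."""
    expanded = set()
    for form in lexicon:
        expanded |= surface_forms(form)
    return expanded


if __name__ == "__main__":
    for word in sorted(expand(["tæksta", "Gävle"])):
        print(word)
```

The list grows with every relaxable character in a word, which is why
we would rather apply the relaxation at lookup time.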

Ideally, we would like to use our transducers directly as
spellcheckers. This is what e.g. lingsoft.fi does, and if you want an
open source version for that you might try the Stuttgart sfst or
Helsinki hfst transducer platforms. Then we would not need to generate
double lists at all, we would just include the spellrelax.fst as part
of the filter.

---

As you can see from our homepages, http://divvun.no (our spellers) and
http://giellatekno.uit.no (our transducers), we already work with
several languages, and we are as a matter of fact interested in looking
at African languages as well (that is why I follow this list).

Arvi Hurskainen of course has a very advanced solution for Kiswahili,
and several of his students work on other Bantu languages, as do Sonja
Borsch and Laura Pretorius in SA. I have (either alone or with
colleagues and students) looked at Kinyarwanda, Nama and Amharic (and
there are good analyses of the Amharic object conjugation within this
framework).

The methods I have described here go back to Kimmo Koskenniemi's
trailblazing dissertation from 1983. It is no coincidence that a
Finnish scholar was the one to solve problems for morphology-rich
languages. Such languages are abundant in Africa, and in my opinion
approaches along (some version of) the lines sketched here should form
the cornerstone of all language technology work on morphology-rich
languages, in Africa and elsewhere.

Feel free to contact me for follow-ups.

Trond.


_______________________________________________
A12n-collaboration mailing list
A12n-collaboration@...
http://lists.kabissa.org/mailman/listinfo/a12n-collaboration



Wed Dec 31, 2008 5:26 pm

Trond.Trosterud@...