OmegaT · Free Computer Assisted Translation
Messages 16758 - 16787 of 16787
#16787 From: Jean-Christophe Helary <jean.christophe.helary@...>
Date: Wed Feb 10, 2010 10:42 am
Subject: Re: [OmT] Re: segmentation rules for decorated lists
jc_helary
 
On 10 Feb. 10, at 18:43, Didier Briel wrote:

>>> So, do you make the RFE?
>>

Done. I hope there are no mistakes in the text.

Jean-Christophe Helary
---------------------------------
fun: mac4translators.blogspot.com
work: www.doublet.jp (ja/en > fr)
tweets: @brandelune

#16786 From: "Didier Briel" <d.briel@...>
Date: Wed Feb 10, 2010 9:43 am
Subject: RE: [OmT] Re: segmentation rules for decorated lists
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
Jean-Christophe Helary
>Sent: Wednesday, February 10, 2010 10:10 AM
>To: OmegaT@yahoogroups.com
>Subject: Re: [OmT] Re: segmentation rules for decorated lists
>
>
>On 10 Feb. 10, at 17:24, Didier Briel wrote:
>
>>>>>> ^\s*[\d\.]+\s*
>
>>>> Replace \d by \p{Nd} to get all numeric decimal (regardless of the script), I think it
>>>> should get your full-width digits, and Thai, etc.
>>>
>>> Thank you! But in this case I was more thinking of "．" which looks like a
>>> ". " but is not :) And because there are no spaces following it (the space is
>>> "included") the regexp would need to be edited to reflect that...
>>
>> So, do you make the RFE?
>
>No. The generic rule should work like that.

I was speaking of the general rule.

Didier

#16785 From: Jean-Christophe Helary <jean.christophe.helary@...>
Date: Wed Feb 10, 2010 9:09 am
Subject: Re: [OmT] Re: segmentation rules for decorated lists
jc_helary
 
On 10 Feb. 10, at 17:24, Didier Briel wrote:

>>>>> ^\s*[\d\.]+\s*

>>> Replace \d by \p{Nd} to get all numeric decimal (regardless of the script), I think it
>>> should get your full-width digits, and Thai, etc.
>>
>> Thank you! But in this case I was more thinking of "．" which looks like a
>> ". " but is not :) And because there are no spaces following it (the space is
>> "included") the regexp would need to be edited to reflect that...
>
> So, do you make the RFE?

No. The generic rule should work like that. If I add one it would be specific
for Japanese.



Jean-Christophe Helary
---------------------------------
fun: mac4translators.blogspot.com
work: www.doublet.jp (ja/en > fr)
tweets: @brandelune

#16784 From: "Didier Briel" <d.briel@...>
Date: Wed Feb 10, 2010 8:24 am
Subject: RE: [OmT] Re: segmentation rules for decorated lists
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
Jean-Christophe Helary
>Sent: Wednesday, February 10, 2010 12:57 AM
>To: OmegaT@yahoogroups.com
>Subject: Re: [OmT] Re: segmentation rules for decorated lists
>
>
>On 10 Feb. 10, at 07:44, Yves Savourel wrote:
>
>>>> No issue with the right one:
>>>> ^\s*[\d\.]+\s*
>>>>
>>>> To the best of readers' knowledge, would this segmentation
>>>> apply to all languages, without too much side effect?
>>>
>>> It would work for Japanese in most cases, for the cases where
>>> the text use double bytes equivalents it would be enough to
>>> add them manually I suppose.
>>
>> Replace \d by \p{Nd} to get all numeric decimal (regardless of the script), I think it
>> should get your full-width digits, and Thai, etc.
>
>Thank you! But in this case I was more thinking of "．" which looks like a
>". " but is not :) And because there are no spaces following it (the space is
>"included") the regexp would need to be edited to reflect that...

So, do you make the RFE?

Didier

#16783 From: Jean-Christophe Helary <jean.christophe.helary@...>
Date: Wed Feb 10, 2010 5:11 am
Subject: Re: [OmT] Re: segmentation rules and RegEx on general
jc_helary
 
On 10 Feb. 10, at 13:03, Bruce Miller wrote:

> Except that the latest versions are compiled for Windows only.  :-(

What you can do is test the regexp in a text editor that accepts them.

That's what I usually do. On Mac I use TextWrangler.


Jean-Christophe Helary
---------------------------------
fun: mac4translators.blogspot.com
work: www.doublet.jp (ja/en > fr)
tweets: @brandelune

#16782 From: Bruce Miller <subscribe@...>
Date: Wed Feb 10, 2010 4:03 am
Subject: Re: [OmT] Re: segmentation rules and RegEx on general
brm_ottawa
 
>From: Maynard Hogg <maynard.hogg@...>
>To: OmegaT@yahoogroups.com
>Sent: Tue, February 9, 2010 10:36:05 PM
>Subject: Re: [OmT] Re: segmentation rules and RegEx on general
>
>
>On Wed, Feb 10, 2010 at 05:44, smo <smolejv@...> wrote:
>>> There's all kinds of Regex sandboxes available for download, where one can
>>> test the regular expression against the target text.
>
>>> I have not checked the problem, so I can not comment. But if I wanted to, I
>>> would use regex coach for instance (see OmegaT documentation 2.0.x) to get some
>>> idea what works / does not work and why it does/does not.
>
>http://weitz.de/regex-coach/
>
>>Looks good! Now I can experiment without reloading my current project.

Except that the latest versions are compiled for Windows only.  :-(

--
Bruce Miller, Ottawa, Ontario, Canada
bruce@...; (613) 745-1151

Just when you think your software is idiot proof, somebody comes up with a
better idiot

Keyboard not found...Press any key to continue.

#16781 From: Maynard Hogg <maynard.hogg@...>
Date: Wed Feb 10, 2010 3:44 am
Subject: Re: [OmT] Re: segmentation rules for decorated lists
maynard_hogg
 
On Wed, Feb 10, 2010 at 12:26, Maynard Hogg <maynard.hogg@...> wrote:
>
^\s*[①-⑳ⅰ-ⅹ●→←↑↓⇔⇒★☆※▼▽▲△■□◆◇○●]+\s*

Regex Coach (http://weitz.de/regex-coach/) has trouble displaying all
these Unicode doodads, but seems to work just fine otherwise.

#16780 From: Jean-Christophe Helary <jean.christophe.helary@...>
Date: Wed Feb 10, 2010 3:45 am
Subject: Re: [OmT] Re: segmentation rules for decorated lists
jc_helary
 
On 10 Feb. 10, at 12:26, Maynard Hogg wrote:

> ^\s*[\p{Nd}\.．]+\s*
> (Thanks, Yves and JCH, for the updates.)
>
> I was going to object that it would split things like the following.
> (I'm working on Specifications tables this morning.)
>
> 5.7Km
> 3.14Kg

Well, you can always create rules that exclude specific strings. That's what
rules are for :) Depending on how often the strings appear that may be too much
work for little benefit.



Jean-Christophe Helary
---------------------------------
fun: mac4translators.blogspot.com
work: www.doublet.jp (ja/en > fr)
tweets: @brandelune
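
The idea of an exception for specific strings can be tried out before touching any project. The sketch below is an illustration only: it uses Python's re module rather than the Java regex engine OmegaT runs on, it folds the exception into the pattern with a negative lookahead instead of using a separate OmegaT exception rule, and the unit list (Km, Kg) is a hypothetical one taken from the two examples quoted above.

```python
import re

# Hypothetical unit list, taken from the two examples quoted above.
UNITS = r"(?:Km|Kg)"

# Same decoration rule, but refuse to match when the digits and dots
# are immediately followed by one of the units.
RULE_WITH_EXCEPTION = re.compile(r"^\s*(?![\d\.]+" + UNITS + r")[\d\.]+\s*")


def leading_decoration(text):
    """Return the matched list decoration, or None when the rule does not apply."""
    match = RULE_WITH_EXCEPTION.match(text)
    return match.group(0) if match else None


results = {text: leading_decoration(text) for text in ("5.7Km", "3.14Kg", "5. Kilometres")}
print(results)
```

The lookahead sits before the character class on purpose: placed after it, the engine could backtrack to a shorter run of digits and still match. Whether this is worth it depends, as said above, on how often such strings appear.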

#16779 From: Maynard Hogg <maynard.hogg@...>
Date: Wed Feb 10, 2010 3:36 am
Subject: Re: [OmT] Re: segmentation rules and RegEx on general
maynard_hogg
 
On Wed, Feb 10, 2010 at 05:44, smo <smolejv@...> wrote:
> There's all kinds of Regex sandboxes available for download, where one can
> test the regular expression against the target text.

> I have not checked the problem, so I can not comment. But if I wanted to, I
would use regex coach for instance (see OmegaT documentation 2.0.x) to get some
idea what works / does not work and why it does/does not.

http://weitz.de/regex-coach/

Looks good! Now I can experiment without reloading my current project.

#16778 From: Maynard Hogg <maynard.hogg@...>
Date: Wed Feb 10, 2010 3:26 am
Subject: Re: [OmT] Re: segmentation rules for decorated lists
maynard_hogg
 
On Sat, Feb 6, 2010 at 17:50, Jean-Christophe Helary
<jean.christophe.helary@...> wrote:
> Just thinking out loud. What about:
> ^\s*[\d\.]+\s*

^\s*[\p{Nd}\.．]+\s*
(Thanks, Yves and JCH, for the updates.)

I was going to object that it would split things like the following.
(I'm working on Specifications tables this morning.)

5.7Km
3.14Kg

But would that be such a bad thing?
Especially if OT put the space in the translation.

Alas, Japanese typesetting doesn't seem to have the North American
rule about never starting a sentence with numerals. (A rule regularly
broken at 3com.com.)

Next step: Numerals in various brackets: (1), [2], 3),...

P.S. Circled numerals should be easy.

^\s*[①-⑳ⅰ-ⅹ●→←↑↓⇔⇒★☆※▼▽▲△■□◆◇○●]+\s*
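
For anyone who wants to check both points from this message without reloading a project, here is a rough sketch. It assumes Python's re module as a stand-in for OmegaT's Java regex engine (the two agree on these simple constructs, but that is an assumption worth verifying in OmegaT itself), and the sample strings are made up for illustration.

```python
import re

# The numeric rule from the thread and the symbol rule from the P.S.
NUMERIC = re.compile(r"^\s*[\d\.]+\s*")
SYMBOLS = re.compile(r"^\s*[①-⑳ⅰ-ⅹ●→←↑↓⇔⇒★☆※▼▽▲△■□◆◇○●]+\s*")


def decoration(pattern, text):
    """Return the leading decoration matched by pattern, or '' if there is none."""
    match = pattern.match(text)
    return match.group(0) if match else ""


spec_cell = decoration(NUMERIC, "5.7Km")       # the specification-table case
circled = decoration(SYMBOLS, "① 概要")         # circled numeral plus a space
reference_mark = decoration(SYMBOLS, "※注意")   # symbol with no space after it
plain = decoration(SYMBOLS, "概要")             # no decoration at all

print(repr(spec_cell), repr(circled), repr(reference_mark), repr(plain))
```

The first result shows the objection raised above: the numeric rule does treat "5.7" in "5.7Km" as a decoration.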

#16777 From: Jean-Christophe Helary <jean.christophe.helary@...>
Date: Tue Feb 9, 2010 11:57 pm
Subject: Re: [OmT] Re: segmentation rules for decorated lists
jc_helary
 
On 10 Feb. 10, at 07:44, Yves Savourel wrote:

>>> No issue with the right one:
>>> ^\s*[\d\.]+\s*
>>>
>>> To the best of readers' knowledge, would this segmentation
>>> apply to all languages, without too much side effect?
>>
>> It would work for Japanese in most cases, for the cases where
>> the text use double bytes equivalents it would be enough to
>> add them manually I suppose.
>
> Replace \d by \p{Nd} to get all numeric decimal (regardless of the script), I think it
> should get your full-width digits, and Thai, etc.

Thank you! But in this case I was more thinking of "．", which looks like a
". " but is not :) And because there are no spaces following it (the space is
"included"), the regexp would need to be edited to reflect that...


Jean-Christophe Helary
---------------------------------
fun: mac4translators.blogspot.com
work: www.doublet.jp (ja/en > fr)
tweets: @brandelune

#16776 From: "Yves Savourel" <yves@...>
Date: Tue Feb 9, 2010 10:44 pm
Subject: RE: [OmT] Re: segmentation rules for decorated lists
yves_savourel
 
>> No issue with the right one:
>> ^\s*[\d\.]+\s*
>>
>> To the best of readers' knowledge, would this segmentation
>> apply to all languages, without too much side effect?
>
> It would work for Japanese in most cases, for the cases where
> the text use double bytes equivalents it would be enough to
> add them manually I suppose.

Replace \d with \p{Nd} to match all decimal digits (regardless of the script).
I think it should catch your full-width digits, Thai digits, etc.

-ys
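
A small sketch of the difference, with caveats: Python's re module has no \p{Nd}, so the example leans on two facts about Python 3 instead. With the re.ASCII flag, \d means [0-9] (like Java's default \d); without it, \d matches any character in Unicode category Nd, which is what \p{Nd} selects in Java. The sample strings are invented for illustration.

```python
import re
import unicodedata

# \d restricted to [0-9], comparable to Java's default \d.
ASCII_DIGITS = re.compile(r"^\s*[\d\.]+\s*", re.ASCII)
# \d covering every Unicode decimal digit (category Nd), comparable to \p{Nd}.
UNICODE_DIGITS = re.compile(r"^\s*[\d\.]+\s*")

samples = {
    "ascii": "3. Item",
    "fullwidth": "\uff13. Item",  # FULLWIDTH DIGIT THREE
    "thai": "\u0e53. Item",       # THAI DIGIT THREE
}

# General category of the first character of each sample.
categories = {name: unicodedata.category(text[0]) for name, text in samples.items()}
ascii_hits = {name: bool(ASCII_DIGITS.match(text)) for name, text in samples.items()}
unicode_hits = {name: bool(UNICODE_DIGITS.match(text)) for name, text in samples.items()}

print(categories)
print(ascii_hits)
print(unicode_hits)
```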

#16775 From: Jean-Christophe Helary <jean.christophe.helary@...>
Date: Tue Feb 9, 2010 10:25 pm
Subject: Re: [OmT] Re: segmentation rules for decorated lists
jc_helary
 
On 10 Feb. 10, at 06:47, Didier Briel wrote:

> No issue with the right one:
> ^\s*[\d\.]+\s*
>
> To the best of readers' knowledge, would this segmentation apply to all
> languages, without too much side effect?

It would work for Japanese in most cases. For the cases where the text uses
double-byte equivalents, it would be enough to add them manually, I suppose.



Jean-Christophe Helary
---------------------------------
fun: mac4translators.blogspot.com
work: www.doublet.jp (ja/en > fr)
tweets: @brandelune

#16774 From: "Didier Briel" <d.briel@...>
Date: Tue Feb 9, 2010 9:47 pm
Subject: RE: [OmT] Re: segmentation rules for decorated lists
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
Bruce Miller
>Sent: Tuesday, February 09, 2010 7:37 PM
>To: OmegaT@yahoogroups.com
>Subject: Re: [OmT] Re: segmentation rules for decorated lists

>I readily understand Didier's misgivings. My perspective is that of a
not-quite retired bureaucrat which means that I deal with numbered lists far
more frequently than with references to software versions. Didier's
situation is the opposite.
>
>I have two questions:
>1. I would have expected the first four characters of the string, namely
^\s* , to have prevented the problem that Didier is reporting. Why do they
not do so?

You're right, I had forgotten about that (this afternoon, I just wanted to
finish my translation).
Actually, I was using a wrong expression (one which had been discussed in
this thread, but not the final one).
No issue with the right one:
^\s*[\d\.]+\s*

>2. For my own education, can someone remind me (I knew once but have
forgotten), what purpose is served by the square brackets around " \d\. ?"

That's defining a "character class". I.e., you are not defining the sequence
digit-followed-by-a-dot, you are defining the class "any digit or a dot".

>What is the undesired result if they are omitted?

Without the brackets, you are describing a strict sequence "digit then dot"
repeated one or more times. Depending on what you actually have in your text,
that's not the same thing.
To the best of readers' knowledge, would this segmentation apply to all
languages, without too much side effect?

Didier
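
To make the bracket question concrete, here is a minimal comparison. It is only a sketch: it assumes Python's re module instead of OmegaT's Java engine, the variant without brackets is written as (?:\d\.)+ to express "digit then dot, repeated", and the sample headings are invented.

```python
import re

# With brackets: any run of digits and dots, in any order.
CLASS = re.compile(r"^\s*[\d\.]+\s*")
# Without brackets: strictly one digit, one dot, one digit, one dot, ...
SEQUENCE = re.compile(r"^\s*(?:\d\.)+\s*")

samples = ["1. Introduction", "12. Appendix", "1.2 Scope", "... continued"]

results = {}
for text in samples:
    class_match = CLASS.match(text)
    sequence_match = SEQUENCE.match(text)
    results[text] = (
        class_match.group(0) if class_match else None,
        sequence_match.group(0) if sequence_match else None,
    )
    print(repr(text), results[text])
```

The two patterns agree on "1. Introduction" and diverge on everything else: the strict sequence misses "12." entirely and stops after "1." in "1.2 Scope".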

#16773 From: "Didier Briel" <d.briel@...>
Date: Tue Feb 9, 2010 9:34 pm
Subject: RE: [OmT] forcing Unicode instead of ANSI in target properties files
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
smo
>Sent: Tuesday, February 09, 2010 9:52 PM
>To: OmegaT@yahoogroups.com
>Subject: [OmT] forcing Unicode instead of ANSI in target properties files
>
>In doing the properties files, the project_save.tmx shows correct
characters, for instance
>
>      </tuv>
>      <tuv lang="Sl-SI">
>        <seg>Najdeni so podatki iz naslednjih programov na vašem računalniku.</seg>
>
>The target segment, however, looks like this:
>
>Najdeni so podatki iz naslednjih programov na va\u0161em ra\u010dunalniku.
>
>i.e. the "non-ANSI" characters are transliterated. It's a nuisance,
although probably sensible and done with good intentions.

It's because it's a filter for Java properties files, which are supposed to
only contain "ISO 8859-1" (actually ASCII, for most purposes) characters.

>Fact is, the file is ANSI-encoded and there's no way I can convince the
properties file filter to allow UTF8 for output instead of <auto>.
>
>It must be so simple that it's easy to overlook (g).

You can use the .ini filter, which shouldn't do that.

There are very few differences between the two filters (I don't remember
exactly what they are right now), but check anyway that it causes no issues.

Didier
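
For readers unfamiliar with the \uXXXX notation, the following sketch reproduces the effect on the sentence quoted above. It is not OmegaT's filter code: it is a simplified Python approximation that escapes every non-ASCII character, handles only characters in the Basic Multilingual Plane, and assumes the text contains no literal backslashes. The function names are hypothetical.

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with a \\uXXXX escape (BMP only)."""
    return "".join(ch if ord(ch) < 128 else "\\u%04x" % ord(ch) for ch in text)


def unescape(text):
    """Turn \\uXXXX escapes back into the characters they stand for."""
    return text.encode("ascii").decode("unicode_escape")


original = "Najdeni so podatki iz naslednjih programov na va\u0161em ra\u010dunalniku."
escaped = escape_non_ascii(original)
restored = unescape(escaped)

print(escaped)
print(restored == original)
```

The escaped form carries the same information as the original, so nothing is lost; it is just harder to read in a text editor.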

#16772 From: "smo" <smolejv@...>
Date: Tue Feb 9, 2010 8:52 pm
Subject: forcing Unicode instead of ANSI in target properties files
smolejv
 
In doing the properties files, the project_save.tmx shows correct characters,
for instance

       </tuv>
       <tuv lang="Sl-SI">
         <seg>Najdeni so podatki iz naslednjih programov na vašem računalniku.</seg>

The target segment, however, looks like this:

Najdeni so podatki iz naslednjih programov na va\u0161em ra\u010dunalniku.

i.e. the "non-ANSI" characters are transliterated. It's a nuisance, although
probably sensible and done with good intentions. Fact is, the file is
ANSI-encoded and there's no way I can convince the properties file filter to
allow UTF8 for output instead of <auto>.

It must be so simple that it's easy to overlook (g).

TiA

smo

#16771 From: "smo" <smolejv@...>
Date: Tue Feb 9, 2010 8:44 pm
Subject: [OmT] Re: segmentation rules and RegEx on general
smolejv
 
There's all kinds of Regex sandboxes available for download, where one can test
the regular expression against the target text.

I have not checked the problem, so I can not comment. But if I wanted to, I
would use regex coach for instance (see OmegaT documentation 2.0.x) to get some
idea what works / does not work and why it does/does not.

Regards

smo

#16770 From: Bruce Miller <subscribe@...>
Date: Tue Feb 9, 2010 6:37 pm
Subject: Re: [OmT] Re: segmentation rules for decorated lists
brm_ottawa
 
>From: Didier Briel <d.briel@...>
>To: OmegaT@yahoogroups.com
>Sent: Tue, February 9, 2010 12:33:09 PM
>Subject: RE: [OmT] Re: segmentation rules for decorated lists
>
>-----Original Message-----
>>>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
>>Didier Briel
>>>Sent: Saturday, February 06, 2010 9:56 AM
>>>To: OmegaT@yahoogroups.com
>>>Subject: RE: [OmT] Re: segmentation rules for decorated lists
>
>>>-----Original Message-----
>>>>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
>>>Jean-Christophe Helary
>>>>^\s*[\d\.]+\s*
>>>>
>>>>I just tried it in a text editor and it seems to work well.
>>>
>>>It seems to work fine
>
>>I seemed to remember remotely there was an issue with this.
>
>>I just stumbled upon it again.
>
>>If you have texts about software versions (that means nearly always for me),
>>the sentences are split in the middle. Which I think would be worse, by
>>default, than not segmenting.
>
>>Real example:
>
>>The file is created from version 5.3.2 on.
>
>>With the rule above, I get
>
>>The file is created from version 5.3.2
>>on.
>
>>Didier

^\s*[\d\.]+\s*

I started this thread last week and am grateful to all who have contributed to
it.

I readily understand Didier's misgivings. My perspective is that of a not-quite
retired bureaucrat, which means that I deal with numbered lists far more
frequently than with references to software versions. Didier's situation is the
opposite.

I have two questions:
1. I would have expected the first four characters of the string, namely ^\s* ,
to have prevented the problem that Didier is reporting. Why do they not do so?
2. For my own education, can someone remind me (I knew once but have forgotten),
what purpose is served by the square brackets around " \d\. ?" What is the
undesired result if they are omitted?

#16769 From: "Didier Briel" <d.briel@...>
Date: Tue Feb 9, 2010 5:33 pm
Subject: RE: [OmT] Re: segmentation rules for decorated lists
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
Didier Briel
>Sent: Saturday, February 06, 2010 9:56 AM
>To: OmegaT@yahoogroups.com
>Subject: RE: [OmT] Re: segmentation rules for decorated lists

>-----Original Message-----
>>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
>Jean-Christophe Helary
>>^\s*[\d\.]+\s*
>>
>>I just tried it in a text editor and it seems to work well.
>
>It seems to work fine

I vaguely remembered there was an issue with this.

I just stumbled upon it again.

If you have texts about software versions (that means nearly always for me),
the sentences are split in the middle. Which I think would be worse, by
default, than not segmenting.

Real example:

The file is created from version 5.3.2 on.

With the rule above, I get

The file is created from version 5.3.2
on.

Didier
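
A quick way to check this example outside OmegaT, offered as a sketch under assumptions: Python's re module stands in for OmegaT's Java regex engine, and the unanchored variant is a guess at what a rule without the leading ^ would look like (Didier's follow-up in this thread attributes the split to a different expression from the one quoted, without saying which). The list-item sample is invented.

```python
import re

# The rule quoted in this message, and a hypothetical variant without the anchor.
ANCHORED = re.compile(r"^\s*[\d\.]+\s*")
UNANCHORED = re.compile(r"[\d\.]+\s*")

sentence = "The file is created from version 5.3.2 on."
list_item = "1.2. Scope of the document"

# search() scans the whole string, so the unanchored pattern finds the version number.
unanchored_hit = UNANCHORED.search(sentence)
# With ^ the pattern can only match at the very start of the text.
anchored_hit = ANCHORED.search(sentence)
list_hit = ANCHORED.search(list_item)

print(repr(unanchored_hit.group(0)))
print(anchored_hit)
print(repr(list_hit.group(0)))
```

With the anchor in place the version number in the middle of the sentence is left alone, while a real list decoration at the start of a line is still picked up.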

#16768 From: Lachlan Musicman <datakid@...>
Date: Tue Feb 9, 2010 5:52 am
Subject: Re: [OmT] OmegaT at FOSDEM 2010
datakid23
 
On Tue, Feb 9, 2010 at 16:47, smo <smolejv@...> wrote:

>
>
> I had a 5-min lightning talk Sunday afternoon at FOSDEM 2010 / Mozilla
> session (https://wiki.mozilla.org/Fosdem:2010/Lightning_talks). There were
> about 100 listeners and quite a number of localizers.
>
> It seems that Notepad++ is still the tool of choice for localizers (with a
> sprinkle of pootle cases), which is absolutely crazy. I did well and so did
> OmegaT.
>
>
Wow, above lokalize, poedit, multilizer or pootle. Weird - I think that
lokalize in particular is a fantastic localisation product.


cheers
L.

--
The essence of a rant, in fact, is that the ranter has no idea how to fix
the thing being ranted about.
- Clay Shirky



#16767 From: "smo" <smolejv@...>
Date: Tue Feb 9, 2010 5:47 am
Subject: OmegaT at FOSDEM 2010
smolejv
 
I had a 5-min lightning talk Sunday afternoon at FOSDEM 2010 / Mozilla session
(https://wiki.mozilla.org/Fosdem:2010/Lightning_talks). There were about 100
listeners and quite a number of localizers.

It seems that Notepad++ is still the tool of choice for localizers (with a
sprinkle of pootle cases), which is absolutely crazy. I did well and so did
OmegaT.

What is FOSDEM?

FOSDEM is a free and non-commercial event organized by the community for the
community. The goal is to provide Free Software and Open Source developers and
communities a place to meet. Attendance this year

This year marks the introduction of the for-all FOSDEM dance - see
http://tinyurl.com/y9eatph

Regards

Vito

#16766 From: "Didier Briel" <d.briel@...>
Date: Mon Feb 8, 2010 10:28 pm
Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
Anthony Baldwin
>Sent: Monday, February 08, 2010 8:30 PM
>To: OmegaT@yahoogroups.com
>Subject: RE: diversion/question RE: [OmT] Glossary files extracted from
Wiktionary
>> You don't have to change them, just renaming them to .csv
>> (e.g.,
>> xxx.utf8.csv) is enough.

>Really?
>I don't have to put commas in them?
>huh...

Yes, it's that simple. After that, Calc will display a dialog in which you
can select Tab as the separator.

Didier

#16765 From: Anthony Baldwin <anthonyebaldwin@...>
Date: Mon Feb 8, 2010 9:49 pm
Subject: Re: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
anthonyebaldwin
 
--- On Mon 8-Feb-10, Sabine Emmy Eller <s.cretella@...> wrote:

> From: Sabine Emmy Eller <s.cretella@...>
> Subject: Re: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
> To: OmegaT@yahoogroups.com
> Date: Monday, 8 February 2010, 2:03 pm
> For some reason, I am not able to
> open them in oocalc.
>
> > When I try, oowriter opens them. I'm bummed.
> >
> > would it be useful to change them csv files, then open
> in oocalc?
> > (like this?
> > rename 's/\t/,\g' *.utf8
> > then
> > rename 's/.utf8/.csv/g' *.utf8
> > )
> >
> > Then work the magic in oocalc, and, of course, save as
> .utf8 or .tab
> >
> >
> > Which language combinations do you need? Don't get
> crazy to do things
> manually.
>
> Cheers, Sabine
>
>

I work in Portuguese>English, Spanish>English and French>English.

thanks,
tony

--
http://www.baldwinlinguas.com
translations & interpreting

http://www.baldwinsoftware.com
tcl yer os with a feather


      

#16764 From: Anthony Baldwin <anthonyebaldwin@...>
Date: Mon Feb 8, 2010 7:30 pm
Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
anthonyebaldwin
 
--- On Mon 8-Feb-10, Didier Briel <d.briel@...> wrote:

> From: Didier Briel <d.briel@...>
> Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
> To: OmegaT@yahoogroups.com
> Date: Monday, 8 February 2010, 2:13 pm
> -----Original Message-----
> >From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of Anthony Baldwin
> >Sent: Monday, February 08, 2010 7:57 PM
> >To: OmegaT@yahoogroups.com
> >Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
>
> >> Open the glossary in a spreadsheet, swap the first two columns, and save
> >> again as UTF-8 text.
> >>
> >For some reason, I am not able to open them in oocalc.
> >When I try, oowriter opens them. I'm bummed.
> >
> >would it be useful to change them csv files, then open in oocalc?
>
> You don't have to change them, just renaming them to .csv (e.g.,
> xxx.utf8.csv) is enough.
>
> Didier
>

Really?
I don't have to put commas in them?
huh...

/tony

--
http://www.baldwinlinguas.com
translations & interpreting

http://www.baldwinsoftware.com
tcl yer os with a feather


      

#16763 From: "Didier Briel" <d.briel@...>
Date: Mon Feb 8, 2010 7:13 pm
Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
didier_briel
 
-----Original Message-----
>From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
Anthony Baldwin
>Sent: Monday, February 08, 2010 7:57 PM
>To: OmegaT@yahoogroups.com
>Subject: RE: diversion/question RE: [OmT] Glossary files extracted from
Wiktionary

>> Open the glossary in a spreadsheet, swap the first two
>> columns, and save
>> again as UTF-8 text.
>>
>For some reason, I am not able to open them in oocalc.
>When I try, oowriter opens them. I'm bummed.
>
>would it be useful to change them csv files, then open in oocalc?

You don't have to change them, just renaming them to .csv (e.g.,
xxx.utf8.csv) is enough.

Didier
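
For those who would rather not go through a spreadsheet at all, the column swap described above can also be scripted. This is only a sketch under assumptions: it treats the glossary as plain tab-separated UTF-8 text with the source term in the first column and the target term in the second, the function names are hypothetical, and the sample entries are invented for illustration.

```python
def swap_first_two_columns(tsv_text):
    """Swap the first two tab-separated fields on every line; keep the rest."""
    swapped = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            fields[0], fields[1] = fields[1], fields[0]
        swapped.append("\t".join(fields))
    return "\n".join(swapped) + "\n"


def reverse_glossary_file(source_path, target_path):
    """Read a UTF-8 glossary, swap source and target terms, write UTF-8 again."""
    with open(source_path, encoding="utf-8") as source:
        reversed_text = swap_first_two_columns(source.read())
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(reversed_text)


# Invented sample: term, translation, optional comment.
sample = "cherry\tKirsche\tfruit\nsour cherry\tSauerkirsche\n"
reversed_sample = swap_first_two_columns(sample)
print(reversed_sample, end="")
```

Calling reverse_glossary_file with an input and an output file name would write the reversed glossary next to the original; any third column (comments) stays where it is.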

#16762 From: Sabine Emmy Eller <s.cretella@...>
Date: Mon Feb 8, 2010 7:03 pm
Subject: Re: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
s.cretella@...
 
For some reason, I am not able to open them in oocalc.

> When I try, oowriter opens them. I'm bummed.
>
> would it be useful to change them csv files, then open in oocalc?
> (like this?
> rename 's/\t/,\g' *.utf8
> then
> rename 's/.utf8/.csv/g' *.utf8
> )
>
> Then work the magic in oocalc, and, of course, save as .utf8 or .tab
>
>
> Which language combinations do you need? Don't get crazy to do things
manually.

Cheers, Sabine



#16761 From: Anthony Baldwin <anthonyebaldwin@...>
Date: Mon Feb 8, 2010 6:56 pm
Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary
anthonyebaldwin
 
--- On Mon 8-Feb-10, Didier Briel <d.briel@...> wrote:

> From: Didier Briel <d.briel@...>
> Subject: RE: diversion/question RE: [OmT] Glossary files extracted from Wiktionary

> >Quick question re: the above mentioned glossaries.
> >They all seem to be EN->$language.
> >
> >Are they, like .tmx files, useful in reverse?
>
> Yes, but you have to reverse them "by hand" (see below).
>
> All you need to reverse them is a spreadsheet (like for any OmegaT glossary
> file).
> Open the glossary in a spreadsheet, swap the first two columns, and save
> again as UTF-8 text.
>
> Didier

For some reason, I am not able to open them in oocalc.
When I try, oowriter opens them. I'm bummed.

Would it be useful to change them to csv files, then open them in oocalc?
(like this?
rename 's/\t/,\g' *.utf8
then
rename 's/.utf8/.csv/g' *.utf8
)

Then work the magic in oocalc, and, of course, save as .utf8 or .tab?
tony


--
http://www.baldwinlinguas.com
translations & interpreting

http://www.baldwinsoftware.com
tcl yer os with a feather



      

#16760 From: "Desilets, Alain" <alain.desilets@...>
Date: Mon Feb 8, 2010 3:56 pm
Subject: RE: [OmT] Glossary files extracted from Wiktionary and VH datacollection online
alain_desilets
 
> I know the work of Daniel, aka Duesentrieb, which is good work done. We
> decided not to use it, because it still requires much manual work
> afterwards.

That's good to know. I thought the work looked promising, but I didn't know how
well it would work in practice, so your experience with that stuff is useful
info for me.

Alain

#16759 From: Sabine Emmy Eller <s.cretella@...>
Date: Mon Feb 8, 2010 3:01 pm
Subject: Re: [OmT] Glossary files extracted from Wiktionary and VH datacollection online
s.cretella@...
 
> Someone (I think his name was Daniel Kinzler) did some work to try and
> automatically resolve this kind of issue, by analyzing the topology of
> interlingual links in order to come up with clean multilingual term entries.
> Do you know of his work?
>
> Alain
>
>
Hi Alain,

Well, I have been involved in multilingual dictionaries for many years now,
and initially much of my experience went into a project you probably know
well, up to the moment when I saw that getting data out of what had been
created was almost impossible. For me/us, the multilingual data I/we work on
only makes sense when I/we can easily hand the data out to people. All these
projects are collaborative and require help from people, so one of the first
things to consider is "how to pay them back for their invested time" in such
a way that, with everyone contributing, everyone benefits.

I know the work of Daniel, aka Duesentrieb, which is good work done. We
decided not to use it, because it still requires much manual work
afterwards.

We, that is Vox Humanitatis, work with so-called regions. We have our own
one, and Agrovoc, for example, which we have permission to integrate, is
another one. There are a number of others, but listing them all would
take too long. en.wiktionary could become a region of its own; we are
considering it, but it is still very much a "could be" situation. For now I
am working with an external table.

It would also be too much to explain in detail what we are working on here.
If you are interested, you might want to read the (admittedly a bit
outdated) documentation here:
http://www.voxhumanitatis.org/content/ambaradan-owm2-storage-engine

I hope this helps to give some insight into what we do :-)

Cheers, Sabine



#16758 From: "Desilets, Alain" <alain.desilets@...>
Date: Mon Feb 8, 2010 1:50 pm
Subject: RE: [OmT] Glossary files extracted from Wiktionary and VH datacollection online
alain_desilets
 
> > Very cool. Do you have plans to do the same thing with Wikipedia
> > entries?
>
> Well that is a good question ... you can do that by extracting the
> interwiki links, but there is an actual problem to it: for example you
> have the page
>
> sour cherry
>
> which for some languages links to
>
> cherry (Kirsche, ciliegia etc.)
>
> and this is not just some pages, this is loads of it. Also for chemical
> compounds - some wikipedias have one page per compound and others have
> collective pages.

Someone (I think his name was Daniel Kinzler) did some work to try to
automatically resolve this kind of issue by analyzing the topology of
interlingual links in order to come up with clean multilingual term entries. Do
you know of his work?

Alain
