edict-jmdict · The JMdict/EDICT Group

Group Information

  • Members: 140
  • Category: Other
  • Founded: Jul 18, 2006
  • Language: English

Messages

Messages 4537 - 4566 of 4980
#4537 From: Jim Breen <jimbreen@...>
Date: Sat Sep 3, 2011 1:51 am
Subject: Re: Re: [abbr=...] (Abbreviation cross-references?)
breen_jim
 
On 31 August 2011 13:36, Stuart McGraw <smcg4191@...> wrote:

> I have not looked at the code for a while so I reserve the
> right to retract this, but I think the xref/abbr change
> won't require any changes to the database structure, just
> the code and db contents will need changing.

I suspected that would be the case, however to carry it cleanly
through into JMdict I'd need to change the DTD and generator,
and if there are a few such changes in the pipeline it would be
best to batch them.

> Not specifically for DB changes (that is, changes to
> the definitions of the tables, views, and other objects
> in the database) but there is a list of outstanding
> tasks:
>
> http://www.edrdg.org/~smg/jmdict/TODO.html
>
> In some cases it is not clear without further thinking or
> sometimes experimentation, whether a db structure change
> will be needed to complete a task.

I/we should go through the list, esp the high/medium ones
and select a batch to concentrate on.

One feature I'd like to see is email notification associated with
edits: if X has proposed a change to an entry, then they get emailed
the edit history, either automatically or at the request of an editor.
I often email people directly to advise the outcome of an edit or to
ask for further information, and having this (semi)automated would be
great.
>
>> 2. in general, on what time scale do we expect the DB format
>>  to change? Annually? 5 years? 20 years?
>
> DB is changed whenever needed -- there is no particular
> schedule although obviously it is not something done casually.
> When a change is made, the scripts that create a new database
> are changed, and a sql patch file (or set of same) are made
> that will update an existing database to the new structure.
> This hopefully keeps things working and in sync whether
> installing the software for the first time, or updating an
> existing install (as the wwwjdict submission system is.)

One year in, it's working pretty well from my point of view. Of course
it depends on some dedicated editors.

> Of course, all these considerations are modulo the time
> needed from Jim, and availability of same for him.

Naruhodo.

One thing I have noticed is that we often get into lengthy
discussions via the comments on the entries. I think these
form a very valuable record, and it's great to have them there.
I think they'd be even more valuable if they were more visible
to the community. Some other dictionary systems have a front
page with summaries, recent additions, recent comments, etc.
Something like that sitting at the front of the system would
be great. At present the raw functionality is great, but the PR
aspects are not so clear.

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4538 From: Stuart McGraw <smcg4191@...>
Date: Sun Sep 4, 2011 10:59 pm
Subject: Re: Re: [abbr=...] (Abbreviation cross-references?)
smcgr4444
 
On 09/02/2011 07:51 PM, Jim Breen wrote:
> On 31 August 2011 13:36, Stuart McGraw <smcg4191@...> wrote:
>
>> I have not looked at the code for a while so I reserve the
>> right to retract this, but I think the xref/abbr change
>> won't require any changes to the database structure, just
>> the code and db contents will need changing.
>
> I suspected that would be the case, however to carry it cleanly
> through into JMdict I'd need to change the DTD and generator,
> and if there are a few such changes in the pipeline it would be
> best to batch them.

A lot of the proposed xml changes IIRC would affect the
xml only, not the database.  And database changes are not
a big problem -- besides changing the definition
files, I also create "upgrade" files to migrate an existing
database to the new version, and there aren't a large number
of external users of the database AFAIK.
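
As a rough illustration of the "upgrade" file idea, here is a minimal
Python sketch. It is not the actual JMdictDB mechanism: the in-memory
SQLite database, the dbversion table, the entr table and the inlined
patch SQL are all assumptions made up for the example. It only shows
the general pattern of recording a schema version and applying newer
patches in order.

```python
import sqlite3

# Hypothetical upgrade patches as (target version, SQL). In practice these
# would be separate .sql files; they are inlined to keep the sketch runnable.
PATCHES = [
    (1, "CREATE TABLE entr (id INTEGER PRIMARY KEY, seq INTEGER);"),
    (2, "ALTER TABLE entr ADD COLUMN notes TEXT;"),
]

def current_version(conn):
    """Return the schema version recorded in the database (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS dbversion (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM dbversion").fetchone()
    return row[0] or 0

def upgrade(conn):
    """Apply, in order, every patch newer than the recorded version."""
    start = current_version(conn)
    for version, sql in sorted(PATCHES):
        if version > start:
            conn.executescript(sql)
            conn.execute("INSERT INTO dbversion VALUES (?)", (version,))
    conn.commit()
    return current_version(conn)

conn = sqlite3.connect(":memory:")  # stand-in for the real database
final_version = upgrade(conn)
```

Running upgrade() a second time applies nothing, which is the property
that lets a fresh install and an existing install end up at the same
structure.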

So the limiting factor is probably the desire not to change
the dtd too often (which I understand).  Perhaps looking at
how often the dtd has been changed in the past and what
changes were made each time could give guidance as to when
another change is acceptable and how big/small it can be?

>> Not specifically for DB changes (that is, changes to
>> the definitions of the tables, views, and other objects
>> in the database) but there is a list of outstanding
>> tasks:
>>
>>  http://www.edrdg.org/~smg/jmdict/TODO.html
>>
>> In some cases it is not clear without further thinking or
>> sometimes experimentation, whether a db structure change
>> will be needed to complete a task.
>
> I/we should go through the list, esp the high/medium ones
> and select a batch to concentrate on.

If you can give some feedback as to which ones you'd like
to see prioritized, that would be useful.  I'll see if I
can summarize the ones on my list, and scan past list posts
for discussions.  I recall I was particularly keen to see
revision of the xref tags (which would include the xref
abbr type) because it would eliminate the need for resolving
xref targets heuristically and I could get rid of a lot of
very complex and non-confidence-inspiring import code.

Perhaps by focusing on a set of changes there is mostly
agreement on, a large enough set of changes can be found
to justify a dtd change?

I fear that if we seek a too big "super-update", then there
will never be agreement on all the details, and nothing at
all will ever get done.

> One feature I'd like to see is email notification associated with
> edits: if X has proposed a change to an entry, then they get emailed
> the edit history, either automatically or at the request of an editor.
> I often email people directly to advise the outcome of an edit or to
> ask for further information, and having this (semi)automated would be
> great.

That's pretty doable I think.  (But see my comment below re
web frameworks.)  A problem (which is also a problem for the
url on the submission "thank you" page) is that a link to a
specific entry (which one would want to send in the email)
may become invalid if the entry is approved, and a link to
the edit tree as a whole can make it hard to find one's entry
if there is much branching.  A to-do item is to find a way of
presenting edits without all the comment duplication that is
present in the current "updates" pages.  Whatever solution is
found here would be applicable to the urls used in email
responses.

>>> 2. in general, on what time scale do we expect the DB format
>>>   to change? Annually? 5 years? 20 years?
>>
>> DB is changed whenever needed -- there is no particular
>> schedule although obviously it is not something done casually.
>> When a change is made, the scripts that create a new database
>> are changed, and a sql patch file (or set of same) are made
>> that will update an existing database to the new structure.
>> This hopefully keeps things working and in sync whether
>> installing the software for the first time, or updating an
>> existing install (as the wwwjdict submission system is.)
>
> One year in, it's working pretty well from my point of view. Of course
> it depends on some dedicated  editors.

I am working on a fully-automated AI editor but I still
have a few bugs to work out.  :-)

>> Of course, all these considerations are modulo the time
>> needed from Jim, and availability of same for him.
>
> Naruhodo.
>
> One thing I have noticed is that we often get into lengthy
> discussions via the comments on the entries. I think these
> form a very valuable record, and it's great to have them there.
> I think they'd be even more valuable if they were more visible
> to the community. Some other dictionary systems have a front
> page with summaries, recent additions, recent comments, etc.
> Something like that sitting at the front of the system would
> be great. At present the raw functionality is great, but the PR
> aspects are not so clear.

Agreed.  I did not appreciate the social networking aspects
of it when I came up with the original design.  I was
envisioning something like a simple source code control
system with the comments being terse rationales for the
changes made in an edit, not the sort of discussion board
it seems to be being used as.

(As an aside, since the comments and references *are* generally
useful, I would be happy to see them distributed in some form;
if not in the xml, then perhaps as auxiliary files.  Information
accessible only from someone's web site puts the information
somewhat at risk.)

One of the things I've been wondering about, and slightly looking
into, is redoing the web pages with some kind of web framework.
Such a framework would have a lot of features like authentication
and sessions that I've implemented in a half-assed way.
Probably email responses would be another feature so we might
want to look into frameworks before spending a lot of time
implementing email responses by hand first.

A mixed blessing with Python is that there is no canonical
package which means one can pick among a bunch of contenders
with different strengths and weaknesses but that also means
a big time commitment to evaluate them in depth.

Another possibility would be to go up another level and use
some kind of prebuilt discussion forum / social networking
package into which the code for the database updates could
be integrated.  However, I have no idea what the options are.

#4539 From: Jim Breen <jimbreen@...>
Date: Wed Sep 7, 2011 1:34 am
Subject: Re: Re: [abbr=...] (Abbreviation cross-references?)
breen_jim
 
I'll try and be brief here, but there are a number of issues to cover.
For me timing is an issue too. I'm off to Japan in two weeks or so,
and I'll be there for 3 weeks, so with things that have to be done
before going, I won't be able to do much before late October.

On 5 September 2011 08:59, Stuart McGraw <smcg4191@...> wrote:
> On 09/02/2011 07:51 PM, Jim Breen wrote:
>> On 31 August 2011 13:36, Stuart McGraw <smcg4191@...> wrote:
[...]
> A lot of the proposed xml changes IIRC would affect the
> xml only, not the database. And database changes are not
> a big problem -- besides changing the definition
> files, I also create "upgrade" files to migrate an existing
> database to the new version and there aren't a large number
> of external users of the database AFAIK.
>
> So the limiting factor is probably the desire not to change
> the dtd too often (which I understand). Perhaps looking at
> how often the dtd has been changed in the past and what
> changes were made each time could give guidance as to when
> another change is acceptable and how big/small it can be?

The JMdict DTD has been frozen for years. This could be an
indication that it's not too bad, but possibly it's more that no-one
uses it much, and hence there's not much pressure for change.

For me the main changes I'd like are:
(a)  making much more use of attributes rather than the myriad
of entity-types (consolidating xref and ant would be part of this.)
(b) fixing the broken dot-separator in xrefs (#208)
(c) adding an entry-wide visible comment. At present the <s_inf/>
is really tied to a sense.
[...]
>>> http://www.edrdg.org/~smg/jmdict/TODO.html
[...]
>> I/we should go through the list, esp the high/medium ones
>> and select a batch to concentrate on.
>
> If you can give some feedback as to which ones you'd like
> to see prioritized, that would be useful. I'll see if I
> can summarize the ones on my list, and scan past list posts
> for discussions. I recall I was particularly keen to see
> revision of the xref tags (which would include the xref
> abbr type) because it would eliminate the need for resolving
> xref targets heuristically and I could get rid of a lot of
> very complex and non-confidence-inspiring import code.

Yes, very much that one. A lot was needed in the early days
when targets were vague, but we have no more orphans so
it can be tightened.

> Perhaps by focusing on a set of changes there is mostly
> agreement on, a large enough set of changes can be found
> to justify a dtd change?
>
> I fear that if we seek a too big "super-update", then there
> will never be agreement on all the details, and nothing at
> all will ever get done.

Amen to both above.

>> One feature I'd like to see is email notification associated with
>> edits: if X has proposed a change to an entry, then they get emailed
>> the edit history, either automatically or at the request of an editor.
>> I often email people directly to advise the outcome of an edit or to
>> ask for further information, and having this (semi)automated would be
>> great.
>
> That's pretty doable I think. (But see my comment below re
> web frameworks.) A problem (which is also a problem for the
> url on the submission "thank you" page) is that a link to a
> specific entry (which one would want to send in the email)
> may become invalid if the entry is approved, and a link to
> the edit tree as a whole can make it hard to find one's entry
> if there is much branching. A to-do item is to find a way of
> presenting edits without all the comment duplication that is
> present in the current "updates" pages. Whatever solution is
> found here would be applicable to the urls used in email
> responses.

Yes, tracking what has happened can be a bit messy. Part
of the trouble is that the "thank you" link seems to be to the
(ephemeral) database page rather than the entry number.
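
To make the idea concrete, here is a minimal Python sketch of such a
notification that links by entry number rather than by the ephemeral
database page. It uses only the standard library's email.message
module. The URL pattern, the addresses, the function name and the
wording of the message are all hypothetical, and actually sending the
message (e.g. via smtplib) is left out.

```python
from email.message import EmailMessage

# Hypothetical URL pattern keyed on the entry (sequence) number, which
# stays the same across edits, unlike the per-edit database page.
ENTRY_URL = "http://www.example.org/entry?seq={seq}"

def build_notification(to_addr, seq, outcome, history):
    """Compose (but do not send) an email reporting the outcome of an edit."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = "noreply@example.org"  # placeholder sender address
    msg["Subject"] = f"Your edit to entry {seq}: {outcome}"
    lines = [
        f"Outcome: {outcome}",
        f"Entry: {ENTRY_URL.format(seq=seq)}",
        "",
        "Edit history:",
    ]
    lines += [f"  - {step}" for step in history]
    msg.set_content("\n".join(lines))
    return msg

note = build_notification(
    "submitter@example.org",
    1234567,  # made-up entry number
    "approved",
    ["submitted", "amended by editor", "approved"],
)
```

Whether the message goes out automatically or only when an editor asks
for it would then just be a question of where build_notification() is
called from.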

[...]

>> One thing I have noticed is that we often get into lengthy
>> discussions via the comments on the entries. I think these
>> form a very valuable record, and it's great to have them there.
>> I think they'd be even more valuable if they were more visible
>> to the community. Some other dictionary systems have a front
>> page with summaries, recent additions, recent comments, etc.
>> Something like that sitting at the front of the system would
>> be great. At present the raw functionality is great, but the PR
>> aspects are not so clear.
>
> Agreed. I did not appreciate the social networking aspects
> of it when I came up with the original design. I was
> envisioning something like a simple source code control
> system with the comments being terse rationales for the
> changes made in an edit, not the sort of discussion board
> it seems to be being used as.

I wasn't surprised, as the old system behaved a bit like that too.

> (As an aside, since the comments and references *are* generally
> useful, I would be happy to see them distributed in some form;
> if not in the xml, then perhaps as auxiliary files. Information
> accessible only from someone's web site puts the information
> somewhat at risk.)

I'm not sure about "distributed". Archived and visible, certainly.

> One of the things I've been wondering about, and slightly looking
> into, is redoing the web pages with some kind of web framework.
> Such a framework would have a lot of features like authentication
> and sessions that I've implemented in a half-assed way.
> Probably email responses would be another feature so we might
> want to look into frameworks before spending a lot of time
> implementing email responses by hand first.

That would be great, but I fear it could be a major task.

> A mixed blessing with Python is that there is no canonical
> package which means one can pick among a bunch of contenders
> with different strengths and weaknesses but that also means
> a big time commitment to evaluate them in depth.
>
> Another possibility would be to go up another level and use
> some kind of prebuilt discussion forum / social networking
> package into which the code for the database updates could
> be integrated. However, I have no idea what the options are.

I use a couple, and even administer an out-of-the-box phpBB
system. Something like that wouldn't be bad, and could largely
replace this list, but I don't think it could/should replace the
dialogue attached to the edits.

Anyway, I'll check down your TODO list. I suspect many are
things that perhaps only bother us occasionally.

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4540 From: Nils Roland Barth <jdict.nbarth@...>
Date: Thu Sep 8, 2011 3:29 am
Subject: Entry Edit Form: Usability improvements (focus, reset, tab)
nils_barth
 
Hi all,

I’d like to propose some usability improvements to the Entry Edit Form
and hear everyone’s thoughts on them (or other fixes, while we’re at it).

In brief:
1. Focus – initial focus in Kanji edit box

2. Reset button – eliminate Reset button on Entry Edit Form
    (but do *not* eliminate it on the Search pages)

3. Tab order – specify that Tab moves between the text entry areas
    (skipping Help hyperlinks)


These would make editing significantly faster for frequent editors,
as you wouldn’t need to use the mouse to select the initial focus
or move between the entry fields (just Tab).
It would also remove the danger of nuking a carefully written entry
by accidentally hitting “Reset”.


There’s been a bit of offline back-and-forth, which I’ve incorporated
below, and Jim suggested I bring this to the list.


In detail:

To follow along, click on “Enter New Entry” at WWWJDIC
or just see this link:
http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&c=1


1. Focus – initial focus in Kanji edit box

This seems straightforward – we already default focus to the
search box on the main search page and results page, and it would be
helpful on the Entry Edit Form as well.
(It’s one line of JavaScript.)


2. Reset button – eliminate Reset button on Entry Edit Form
    (but do *not* eliminate it on the Search pages)

As Stu and Jim remark, the Reset button on the Search form
(and other search pages) is *very* useful, due to the many check boxes.
This is about the reset button on Entry *Edit*, not Search.

The Reset button on the Edit page has bitten me once – I accidentally
tabbed to it and thus lost my carefully written entry – and does not
serve a useful purpose, I believe.
(I’ve never used it intentionally, and am quite sure I won’t.)

Specifically:
* If you accidentally hit Reset, it deletes what you’ve written.

* For a new entry,
   rather than “Reset”, you can just click on “Enter New Entry” again,
   and this is what people normally do.

   Conceivably you could write a whole or partial entry, then
   decide to discard it, but this is more easily done by just closing
   the window. (Closing the window is also reversible/undoable in many
   browsers, such as Firefox, which saves form contents.)


* For an existing entry,
   I can’t see Reset ever being useful,
   because it erases everything (including the kanji and kana).

   If an entry should be deleted, there’s a separate check box for that;
   if an entry needs substantial rewriting, the relevant sections
   should be rewritten, and erasing everything doesn’t help this
   (e.g. “wait, what entry is this again?”).


Some very noted web usability experts (e.g. Jakob Nielsen)
recommend against them on the above grounds (might hurt, not useful).

Here’s an article on this point.
Reset and Cancel Buttons (considered harmful)
http://www.useit.com/alertbox/20000416.html

As Stu remarks, expert advice and blanket recommendations shouldn’t be
accepted unquestioningly (e.g. the Reset on the Search forms is useful),
but in this case I think the advice is applicable – there’s the real
risk of losing meaningful data.



3. Tab order – specify that Tab moves between the text entry areas
    (skipping Help hyperlinks)

Currently the default tab order goes through many help links (unless
you are using Mac OS X (thanks René) or have a customized browser),
which makes tab navigation difficult – you have to use the mouse or
press Tab *a lot*.

For reference, see:
http://en.wikipedia.org/wiki/Tabbing_navigation


I’m suggesting making the tab order start at Kanji (using JavaScript,
as above), then make the tab order go from one entry box to the next,
ending at the “Next” button (confirmation/pre-submission button).
* After this have the help (by text field, then the tags)
   – so still relatively easily accessible,
* before this have the login field (so can Shift-Tab to login),
* and then after all this the default tab order goes through all
   remaining links in the page.

This would make navigating the edit form much faster for Tabbing users;
the idea is that most of the time you’re not referring to the help, or
can use the mouse to do so.

Many people don’t use Tab at all, so this doesn’t affect them.

The one downside of this, as Stu points out, is that keyboard-only users
who frequently refer to the help (e.g. new or infrequent contributors)
would have a harder time – the help will still be accessible, but more
out of the way.

This is an inescapable tradeoff, since different users have different needs.
Mobility-impaired users or frequent contributors can use various customizations
(e.g. keyboard-controlled mouse pointer, spatial navigation,
  overriding tabbing order),
but this is about setting the default behavior.

I don’t have statistics, but I suspect that:
* Many people don’t use Tab, so it doesn’t affect them either way
* Many people use Tab, and this would help them
* Impaired frequent editors are also better off with this direct
   tabbing
* Impaired infrequent editors are a small proportion of edits.
   (No stats, but I’d guess 0.1% to 1% of edits; I doubt it’s as much
    as 5–10%.)
…so for almost all users this would either make no difference or be a
big benefit.


Added details:
Since we’re presumably using JavaScript to set initial focus,
I’d suggest having the *login* and *corpus* inputs come *before* the
main kanji etc. entry forms, so one can Shift-Tab to go back to these
if they need changing (but by default start at Kanji and tab forward).

The field-by-field “Help” links should come next (so you can access
them directly after the form by tabbing past the end),
followed by the “Tags Help” (which are long), and then everything else
(default order, needn’t be specified).

I’d also suggest leaving the “delete this entry” checkbox out of this
“direct route” tab order, as you usually do *not* want to check this
(it’s rare to delete an entry); should usually be selected by mouse.



What do people think?
Agree, disagree, suggestions, improvements?

(For explicitness and ease, I’ve included implementation details below.)

best,
   ~nils


       %  %      %  %      %  %      %  %      %  %      %  %

Implementation

(Presumably straightforward, but to make it easier.)
These are changes to edform.py.


1. Focus

Following wwwjdic.cgi (search), which has the following code:

(in <head>:)
<SCRIPT type="text/javascript">
<!--
function sf(){document.inp.dsrchkey.focus();}
-->
</SCRIPT>

then:
<BODY onLoad="sf()" … >


The corresponding code in edform.py would be:

(in <head>:)
<SCRIPT type="text/javascript">
<!--
function kf(){document.edconf.kanj.focus();}
-->
</SCRIPT>

then:
<BODY onLoad="kf()" … >



2. Reset
This is just “remove the element”.


3. Tab order

This is set with the “tabindex” attribute.
http://www.w3.org/TR/html4/interact/forms.html#adef-tabindex

Since we have a few groups, it’s easier to do it in blocks
(leaving spaces) rather than sequentially, so one can rearrange the
page (add entries) w/o renumbering everything.
I’ve listed blocks of 10, which leaves some space;
I doubt we’ll be making many small changes (add/remove a field),
so this should be enough (major changes would need a large rewrite anyway).
Safer would be blocks of 20 or 100, if that’s preferred.


First the login boxes (Group 0):
Username / Password / Login
<input name="username" tabindex="1">
<... tabindex="3">


Then the main text forms (Group 1):
<select name="src" tabindex="10" ...> (corpus)
<textarea name="kanj" tabindex="11" ...>
...
<input type="submit" value=" Next " tabindex="18">


Then the field help (Group 2 – just add 10 to corresponding text field):
<a href="edhelp.py?svc=jmdict&sid=#corpus" class="helplink"
  tabindex="20" target="edhelp">help</a>
...


Then the tags help (Group 3 – start at 30, work up)
<a href="edhelp.py?svc=jmdict&sid=#kw_kinf" class="helplink"
  tabindex="30" target="edhelp">kanji info</a><br>
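
The block numbering above could be generated rather than typed in by
hand. The following Python sketch is only an illustration of that
scheme, not code from edform.py: the function name is invented, and of
the control names only "username", "src" and "kanj" come from the
snippets above; the rest are placeholders.

```python
BLOCK = 10  # size of each tabindex block, as proposed above

def assign_tabindex(groups, block=BLOCK):
    """Give each control a tabindex, one numeric block per group.

    Group 0 starts at 1 rather than 0, because controls with
    tabindex="0" are visited after all positive values.
    """
    result = {}
    for g, names in enumerate(groups):
        start = max(1, g * block)
        if start + len(names) > (g + 1) * block:
            raise ValueError(f"group {g} does not fit in a block of {block}")
        for offset, name in enumerate(names):
            result[name] = start + offset
    return result

# Placeholder control names, apart from "username", "src" and "kanj".
groups = [
    ["username", "password", "login"],                     # Group 0: login
    ["src", "kanj", "rdng", "sens", "next"],               # Group 1: main form
    ["help_src", "help_kanj", "help_rdng", "help_sens"],   # Group 2: field help
    ["tags_kinf"],                                         # Group 3: tags help
]
tabindex = assign_tabindex(groups)
```

With this layout each field help link ends up exactly one block (10)
above its text field, and adding a field to a group only shifts the
numbers inside that group. Controls left out of the groups (such as the
"delete this entry" checkbox) get no tabindex and fall into the default
order after everything listed here.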

Thanks in advance!

#4541 From: Nils Roland Barth <jdict.nbarth@...>
Date: Thu Sep 8, 2011 3:33 am
Subject: Entry Edit Form: Usability improvements (focus, reset, tab)
nils_barth
 
(Mail delivery seemed to cut off the bottom,
though whole message displays online at:
http://tech.groups.yahoo.com/group/edict-jmdict/message/4540
Forwarding cut off part again.)

   ~nils

----- Forwarded message from Nils Roland Barth <jdict.nbarth@...> -----

To: edict-jmdict@yahoogroups.com
Subject: Entry Edit Form: Usability improvements (focus, reset, tab)

<snip: Tabbing; the forwarded text from "This is an inescapable tradeoff"
through the Implementation notes is identical to the end of message #4540>

----- End forwarded message -----

#4542 From: Jim Breen <jimbreen@...>
Date: Wed Sep 14, 2011 1:38 am
Subject: Some correspondence formulae
breen_jim
 
Greetings,

I came across a Japan Post page with a summary of the
many opening/closing passages used in letters.

http://www.post.japanpost.jp/navi/mame_dear.html

Some of these are in JMdict already, but if anyone has
time and would like to draft entries for the others, they'd
be very welcome.

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4543 From: René Malenfant <rene_malenfant@...>
Date: Wed Sep 14, 2011 1:52 am
Subject: Re: Some correspondence formulae
reneneedsser...
 
I'll handle any that have entries in the major dictionaries.  It's how I temporarily distract myself from polar bear genomics.


Rene


On 2011-09-13, at 7:38 PM, Jim Breen wrote:

 

Greetings,

I came across a Japan Post page with a summary of the
many opening/closing passages used in letters.

http://www.post.japanpost.jp/navi/mame_dear.html

Some of these are in JMdict already, but if anyone has
time and would like to draft entries for the others, they'd
be very welcome.

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne



#4544 From: Jim Breen <jimbreen@...>
Date: Wed Sep 14, 2011 4:25 am
Subject: Re: Re: WWWJDIC's kanji details display
breen_jim
 
On 17 August 2011 14:51, Francis Bond <bond@...> wrote:
> On 17 August 2011 09:23, Jim Breen <jimbreen@...> wrote:
>> Now that it's less cluttered, those "SOD" and "SODA" links look a bit ugly.
>> If I could find some little images of brushes, I'd see about putting
>> them in instead.
>
> There are some possible SVG images here:
http://www.openclipart.org/search/?query=brush&page=2

Thanks everyone for the image suggestions. I chose two small brushes.

I have now turned the revised format loose - it had been hanging
around for a month.

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4545 From: "scott.edict" <scottn.canada@...>
Date: Wed Sep 14, 2011 2:26 pm
Subject: Re: Some correspondence formulae
scott.edict
 
A bit offtopic, but have you ever considered adding a large number of entries by
parsing a corpus for the most frequent words not included in EDICT? I know that
the whole Japanese Wikipedia is available freely and some users here probably
have access to other commercial corpuses. I imagine it wouldn't be too hard for
some of the people here to generate a list of the words missing from EDICT in
order of frequency. Contributors could then add the words to the dictionary.
Just a thought.

Scott

--- In edict-jmdict@yahoogroups.com, Jim Breen <jimbreen@...> wrote:
>
> Greetings,
>
> I came across a Japan Post page with a summary of the
> many opening/closing passages used in letters.
>
> http://www.post.japanpost.jp/navi/mame_dear.html
>
> Some of these are in JMdict already, but if anyone has
> time and would like to draft entries for the others, they'd
> be very welcome.
>
> Jim

#4546 From: Jim Breen <jimbreen@...>
Date: Thu Sep 15, 2011 12:25 am
Subject: Re: Re: Some correspondence formulae
breen_jim
 
On 15 September 2011 00:26, scott.edict <scottn.canada@...> wrote:
> A bit offtopic, but have you ever considered adding a large number of entries by
> parsing a corpus for the most frequent words not included in EDICT?

The thought has occurred to me from time to time.

(Go to http://www.cs.mu.oz.au/research/lt/ and look at the caption
under my picture,
about 1/4 of the way down.)

> I know that the whole Japanese Wikipedia is available freely

Indeed. I have been collecting a complete copy every 6 months or so, as
I will be doing some tracking of word usage over time.

> and some users  here probably have access to other commercial corpuses.
> I imagine it wouldn't be too hard for some of the people here to generate a
> list of the words missing from EDICT in order of frequency. Contributors could
> then add the words to the dictionary. Just a thought.

Parsing at the "word" level in Japanese is far from trivial. (A good place to
start is with a definition of "word".) The good parsers, such as MeCab, are
actually morphological analyzers, which means they go below the "word"
level. Quite a lot of work is needed if you want to concentrate on the sorts
of words/compounds/multi-words/etc. that actually get into dictionaries.
The lexeme annotation exercise I have posted about here (more on that
later today) is part of testing a system for such "word" identification.
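
As a rough sketch of the counting step only, and assuming the corpus has already been segmented into tokens (for example by a morphological analyzer such as MeCab, which is not shown here), a frequency list of items absent from a known-word set could be built as below. The function name `missing_by_frequency` and the sample data are hypothetical; deciding what counts as a "word" is the hard part and is not addressed.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def missing_by_frequency(tokens: Iterable[str], known: Set[str]) -> List[Tuple[str, int]]:
    """Count tokens that are absent from `known`, most frequent first."""
    counts = Counter(t for t in tokens if t not in known)
    # Sort by descending count, then by token so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up, already-segmented input; real segmentation is assumed, not shown.
sample_tokens = ["猫", "ググる", "犬", "ググる", "猫", "ディスる"]
sample_known = {"猫", "犬"}
print(missing_by_frequency(sample_tokens, sample_known))
```

The output is only as good as the segmentation fed into it, which is the point made above.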

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4547 From: Jim Breen <jimbreen@...>
Date: Thu Sep 15, 2011 5:08 am
Subject: Japanese Sentence Annotation
breen_jim
 
[Apologies if you get this twice - I sent it earlier to people
who have carried out annotations.]

Hi,

I want to provide some feedback on where we have got to
in the Japanese sentence annotation exercise. It's still
going, although it's slowed down a lot, and I need to
keep it moving.

Of the 2,000 sentences, 749 have been completely processed,
i.e. checked and possibly annotated by two (or more)
people. Of the remaining 1,251, 621 have been checked by one
person (me), and 631 have not been marked yet. I am pushing
on with them, but I really need people to join in. The field
is clear - the 1,251 remaining have not been marked by
anyone but me.

[For interest, of the 749 that have been cleared, 222 had
no additional lexemes, in 315 there were extra lexemes that
both annotators agreed about, and in 212 the annotators
disagreed and I need to organize an adjudication.]
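
For reference, the figures above can be tallied as in the sketch below. The breakdown of the 749 cleared sentences adds up, while the two sub-counts given for the remaining sentences (621 and 631) sum to 1,252 rather than 1,251, so one of them is presumably off by one. The variable names are illustrative only.

```python
total = 2000
cleared = 749
remaining = total - cleared  # 1251, as stated in the post

# Breakdown of the cleared sentences.
no_extra, agreed, disagreed = 222, 315, 212
assert no_extra + agreed + disagreed == cleared

# Breakdown of the remaining sentences, using the numbers as posted.
checked_once, unmarked = 621, 631
discrepancy = (checked_once + unmarked) - remaining
print(remaining, discrepancy)
```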

In coming weeks I will continue to be first annotator
of as-yet unmarked sentences, and I can be second annotator
when others mark a sentence. (I have scripts monitoring this.)

I know this is an imposition, but I hope people can find the
time to look at some more sentences.

The URL for the annotation site is:

http://www.csse.monash.edu.au/~jwb/cgi-bin/annotate/filelist.cgi

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4548 From: "scott.edict" <scottn.canada@...>
Date: Thu Sep 15, 2011 2:47 pm
Subject: Re: Some correspondence formulae
scott.edict
 
> On 15 September 2011 00:26, scott.edict <scottn.canada@...> wrote:
> > A bit offtopic, but have you ever considered adding a large number of entries by
> > parsing a corpus for the most frequent words not included in EDICT?
>
> The thought has occurred to me from time to time
>
> (Go to http://www.cs.mu.oz.au/research/lt/ and look at the caption
> under my picture,
> about 1/4 of the way down.)

Yes, I had some notion that you were working on this because of your related
lexeme annotation website. But missing words need not be neologisms.

Another approach could be to find words included in
Daijirin/Daijisen/Nikkoku/GG5/Kojien/etc. but not present in EDICT. This could
be done with a simple comparison. Their frequency could then be determined from
a search in corpora, or using Google n-grams. This would not find any
neologisms, but would simply tell us which of the most frequent words/spellings
are missing from EDICT.
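
A minimal sketch of that comparison, assuming the headword lists have already been extracted into plain sets of strings and that a frequency table (from a corpus or n-gram data) is available as a dictionary. The function name `rank_missing` and all sample data are hypothetical and are not a claim about what EDICT actually contains.

```python
from typing import Dict, List, Set


def rank_missing(reference: Set[str], edict: Set[str], freq: Dict[str, int]) -> List[str]:
    """Return headwords in `reference` but not in `edict`, most frequent first."""
    missing = reference - edict
    # Words without frequency data count as 0 and sort last; ties break by headword.
    return sorted(missing, key=lambda w: (-freq.get(w, 0), w))


# Made-up inputs for illustration only.
reference_headwords = {"思う", "拝啓", "敬具", "草々"}
edict_headwords = {"思う", "拝啓"}
corpus_freq = {"敬具": 120, "草々": 45}
print(rank_missing(reference_headwords, edict_headwords, corpus_freq))
```

Matching on exact strings ignores spelling variants and readings, so a real comparison would need some normalization first.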

>
> > I know that the whole Japanese Wikipedia is available freely
>
> Indeed. I have been collecting a complete copy every 6 months or so, as
> I will be doing some tracking of word usage over time.
>
> > and some users  here probably have access to other commercial corpuses.
> > I imagine it wouldn't be too hard for some of the people here to generate a
> > list of the words missing from EDICT in order of frequency. Contributors could
> > then add the words to the dictionary. Just a thought.
>
> Parsing at the "word" level in Japanese is far from trivial. (A good place to
> start is with a definition of "word".) The good parsers, such as Mecab, are
> actually morphological analyzers, which means they go below the "word"
> level. Quite a lot of work is needed if you want to concentrate on the sorts
> of words/compounds/multi-words/etc. that actually get into dictionaries.
> The lexeme annotation exercise I have posted about here (more on that
> later today), is part of testing a system for such "word" identification.
>
> Cheers
>
> Jim

Thanks for your answer and I hope that you do find a good way to identify
neologisms.

Scott

#4549 From: Jim Breen <jimbreen@...>
Date: Thu Sep 15, 2011 11:10 pm
Subject: Re: Re: Some correspondence formulae
breen_jim
 
On 16 September 2011 00:47, scott.edict <scottn.canada@...> wrote:
>> (Go to http://www.cs.mu.oz.au/research/lt/ and look at the caption
>> under my picture,
>> about 1/4 of the way down.)
>
> Yes, I had some notion that you were working on this because of
> your related lexeme annotation website. But missing words need
> not be neologisms.

I'm stretching the meaning of "neologisms" to include things that are
not in dictionaries, but could/should be.

> Another approach could be to find words included in Daijirin/Daijisen/
> Nikkoku/GG5/Kojien/etc. but not present in EDICT. This could be done
> with a simple comparison. Their frequency could then be determined
> from a search in corpora, or using Google n-grams. This would not
> find any neologisms, but simply tell us what most frequent words/spellings
> are missing from EDICT.

That would be an interesting thing to do, but as an
application/project, not research. In the course of my
work I have collected a lot of material, including file
copies of GG5, Daijirin, Kojien, etc., and built a number
of tools that could be used for the sort of task you mention.
Also the big UniDic lexicon, which I use with MeCab, has
a number of otherwise unrecorded words and word variants.

I have to push all thought of heading in that direction aside,
and concentrate on my main goal for the next few years, but
after that, if someone hasn't done it first, I'd like to build an
online dictionary-entry-builder that could be used to construct
skeletal JE entries from multiple sources ready for human
edit. The greater the mixing and dilution, the less the problem
of violating the copyright of existing dictionaries.

Of course if anyone wants to put together such a beastie, I
won't object. In fact I could contribute some resources.

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4550 From: Ben Bullock <benkasminbullock@...>
Date: Fri Oct 7, 2011 5:11 am
Subject: O'Neill "Japanese Names" bugs in Kanjidic
benkasminbul...
 
Looking at kanjidic,

暑 3D6B U6691 B72 G3 S12 XO1996 F1442 J3 N2138 V2484 H2473 DK1600 L1260
K1254 O1738A DO476 MN14031P MP5.0911 E313 IN638 DS247 DF190 DH329
DT400 DJ119 DG976 DM1268 P2-4-8 I4c8.5 Q6060.4 DR3878 ZPP3-8-4 Yshu3
Wseo ショ あつ.い {sultry} {hot} {summer heat}

There is an entry here for XO1996, but in the original book entry 1996
just points back to 1738A, and there is no reference from 1738A to 1996.
Entry 1996 was a misclassification of the same kanji rather than a
cross-reference to a different one.

I also found a list of other dead-end cross-references:

Ф [H1960] not found
׮ [H3394] not found
 [H3396] not found
 [H3444] not found
 [O433] not found
 [H2446] not found
洌 [O810] not found
ɱ [O1598] not found
 [O1379] not found
 [H2268] not found
 [O1552] not found
δ [O1562] not found
 [O1968] not found
 [O1556] not found
暑 [O1996] not found
 [O2078A] not found
 [O2507] not found
 [H2889] not found
 [O2369] not found
 [H2147] not found
ѿ [H2101] not found
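
A check of this kind can be sketched as follows. This is a minimal illustration, assuming each kanjidic record is one whitespace-separated line, that O and H fields are index codes, and that XO and XH fields are cross-references to them. The function name `dead_end_xrefs` is hypothetical, and the two sample records are abbreviated from the ones quoted in this message (kanji taken from their U6691 and U6d0c fields).

```python
from typing import Iterable, List, Set, Tuple


def dead_end_xrefs(lines: Iterable[str], prefixes: Tuple[str, ...] = ("O", "H")) -> List[Tuple[str, str]]:
    """Return (kanji, code) pairs whose X cross-reference matches no index code."""
    indexed: Set[str] = set()
    xrefs: List[Tuple[str, str]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        kanji = fields[0]
        for field in fields[1:]:
            if field.startswith("{"):
                break  # English glosses start here; no more codes follow
            if field[0] == "X" and field[1:2] in prefixes:
                xrefs.append((kanji, field[1:]))  # e.g. XO1996 -> O1996
            elif field[0] in prefixes and field[1:2].isdigit():
                indexed.add(field)  # e.g. O1738A, H2473
    return [(k, ref) for k, ref in xrefs if ref not in indexed]


# Abbreviated sample records; a real run would read the whole kanjidic file.
sample = [
    "暑 3D6B U6691 B72 G3 S12 XO1996 H2473 O1738A Yshu3 Wseo {sultry}",
    "洌 5E30 U6d0c B85 S9 XJ05158 XO810 N2536 O549 Ylie4 Wryeol {pure}",
]
print(dead_end_xrefs(sample))
```

On the two sample records alone, both cross-references are reported because no record carries O1996 or O810 as an index.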

Spot-checking,

 [O433] not found

is just a record of a mistake. O433 is just a non-existent kanji, and
there is no pointer FROM  to 433 in the original book.

O810 is the 洌 kanji, which is in kanjidic, but it is mislabelled:

洌 5E30 U6d0c B85 S9 XJ05158 XO810 N2536 V3115 O549 MN17364 MP6.1086
P1-3-6 I3a6.8 Q3210.0 Ylie4 Wryeol レツ レイ きよ.い {pure}

O549 is actually 冽 (ice radical) not 洌 (water radical). For some
reason it's given as a cross-reference above.

#4551 From: Jim Breen <jimbreen@...>
Date: Mon Oct 10, 2011 9:49 pm
Subject: Re: O'Neill "Japanese Names" bugs in Kanjidic
breen_jim
 
Hi Ben,

I'll have a look at them when I am back in Australia
and have my references. (I am in Japan at present,
with only limited net access.)

There's quite a backlog of kanjidic amendments.

Jim





--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4552 From: Alexandru Pojoga <apojoga@...>
Date: Sun Oct 23, 2011 1:09 pm
Subject: Directly link to an entry in WWWJDIC
alexpojoga
 
Hi,

Is there a way to get a direct link to an Edict entry on WWWJDIC? I've
seen them in the wild but wasn't able to find a way to do it from the
interface, for the life of me.
All I get is http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1E

Thank you,
Alex

#4553 From: Hans-Jörg Bibiko <bibiko@...>
Date: Sun Oct 23, 2011 1:18 pm
Subject: Re: Directly link to an entry in WWWJDIC
hansjoergbibiko
 
On 23 Oct 2011, at 15:09, Alexandru Pojoga wrote:

> Is there a way to get a direct link to an Edict entry on WWWJDIC? I've
> seen them in the wild but wasn't able to find a way to do it from the
> interface, for the life of me.
> All I get is http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1E

Hi Alexandru,



The URL is http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MKUxxxx

where xxxx is replaced by the kanji's hexadecimal UCS-2 (Unicode) code.

For example, the kanji 二 (ni, "two") has the hex code 4E8C, so you can link
to it via:

http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MKU4E8C

or to the raw data via:

http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1ZKU4E8C
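
As a small sketch, the two URL forms above can be generated from a kanji like this. The helper name `kanji_url` is hypothetical; only the `1MKU` and `1ZKU` prefixes and the base URL come from this message, and characters outside the 16-bit range are rejected because the scheme is described in terms of a UCS-2 code.

```python
BASE = "http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic"


def kanji_url(kanji: str, raw: bool = False) -> str:
    """Build a direct kanji link; raw=True gives the raw-data form."""
    if len(kanji) != 1 or ord(kanji) > 0xFFFF:
        raise ValueError("expected a single character with a 16-bit code point")
    code = f"{ord(kanji):04X}"  # e.g. 二 -> 4E8C
    mode = "1ZKU" if raw else "1MKU"
    return f"{BASE}?{mode}{code}"


print(kanji_url("二"))
print(kanji_url("二", raw=True))
```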


Note: see also http://www.csse.monash.edu.au/~jwb/wwwjdicinf.html#backdoor_tag
(but use the new URL domain)


Kind regards,
--Hans

**********************************************************
Hans-Joerg Bibiko
Max Planck Institute for Evolutionary Anthropology
Department of Linguistics
Deutscher Platz 6     phone:   +49 (0) 341 3550 341
D-04103 Leipzig       fax:     +49 (0) 341 3550 333
Germany               e-mail:  bibiko[-at-]eva.mpg.de
**********************************************************

#4554 From: Jim Breen <jimbreen@...>
Date: Sun Oct 23, 2011 10:39 pm
Subject: Re: Directly link to an entry in WWWJDIC
breen_jim
 
Hi Alexandru

On 24 October 2011 00:09, Pojoga <apojoga@...> wrote:

> Is there a way to get a direct link to an Edict entry on WWWJDIC? I've
> seen them in the wild but wasn't able to find a way to do it from the
> interface, for the life of me.
> All I get is http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1E

The method Hans-Jörg mentioned is correct for kanjidic entries, but going
for an edict entry needs a bit more.

Where a Japanese kanji-part or yomikata is unique, e.g. for 店を開く, you
can use
http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1MUJ%E5%BA%97%E3%82%92%E9%96%8B%E3%81%8F
(change the site URL to suit the server). That example is
in UTF8, but other codings are possible.

Where the Japanese key is not unique, you may get several entries displayed
by WWWJDIC. A way around that is to specify the JMdict ent-seq. In the
version of EDICT used by WWWJDIC, it is coded as "EntLnnnnnnn", so
http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1MDEentl1910280
will display just that entry. To find out the ent-seq numbers, you'd
probably need to look at a copy of JMdict. Alternatively, find the entry
you want in WWWJDIC, then look at the HTML of the page. The ent-seq is in
the second sub-field of the "jukugosel" radio button coding. Or just click
on the "Edit" link and see what entry comes up in the edit form.
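
Both link styles can be put together with a few lines of Python, sketched here under the assumption that the `1MUJ` and `1MDEentl` prefixes behave as described in this message. The helper names `key_url` and `ent_seq_url` are hypothetical.

```python
from urllib.parse import quote

# Server base URL as used in the message; adjust to suit the mirror in use.
BASE = "http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi"


def key_url(japanese_key: str, base: str = BASE) -> str:
    """Link by a (unique) kanji or reading key, percent-encoded as UTF-8."""
    return f"{base}?1MUJ{quote(japanese_key, safe='')}"


def ent_seq_url(ent_seq: int, base: str = BASE) -> str:
    """Link to a single entry by its seven-digit JMdict ent-seq number."""
    return f"{base}?1MDEentl{ent_seq:07d}"


print(key_url("店を開く"))
print(ent_seq_url(1910280))
```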

HTH

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4555 From: Mark R <kiwitotoro2000@...>
Date: Tue Oct 25, 2011 2:30 pm
Subject: Can you please review my site?
kiwitotoro2000
 
Hello,
I joined the group a while back and asked whether custom changes to the
lists were viable. I didn't explain myself very well, but I have a (slightly)
working version of my site up online. Feel free to have a look and tell me what
you think (if you go to the site, be aware it will download a 7 MB font). The URL
is: 2hon5.net

It uses the two JIS-encoded character files and the word dictionary to generate
the output on the home page. There is a login, but there is not much point as it
does not change the functionality.

I want to eventually make it so a user can have a working list of known
words/characters and simplify hard Japanese based on those settings. Any
feedback would be greatly appreciated. Also, when you hover over a character
there is an error button. I was thinking that could link back to your system (in
the future when mine works properly), but I don't think it should make a change
request on your system, because obviously it is aimed at Japanese learners.
People could use it to report if there is a word not in the dictionary, for example.

Regards

Mark

#4556 From: Alexandru Pojoga <apojoga@...>
Date: Tue Oct 25, 2011 6:19 pm
Subject: Re: Directly link to an entry in WWWJDIC
alexpojoga
 
Thank you, very much appreciated!

Sincerely,
Alex




#4557 From: Nils Roland Barth <jmdict.nbarth@...>
Date: Wed Nov 2, 2011 6:52 am
Subject: Meaning of [iK] (irregular kanji)?
nils_barth
 
Hi all,

What’s the precise meaning of the kanji tag [iK]
(irregular kanji)?

Does it mean
“not listed in major dictionaries,
  or listed with a triangle”?

I’m thinking concretely of 思う here,
which is the only spelling given in dictionaries,
but one often sees 想う used (also in derived terms),
and various lesser-used variants (憶う etc.) are also
listed at JMdict.


In more detail:

Concretely, dictionaries may list one or more spellings,
and may mark some with a triangle as “kinda off”.
Other spellings are found in actual use with different
degrees of frequency.


Specifically, kanji spellings can be:
* standard – best practice (one or more)
* accepted variants (not preferred, but ok)
* commonly used variants, but considered irregular/incorrect/iffy
* less commonly used variants
* completely idiosyncratic ones (used by one or a few authors)

Setting aside [oK] (old) characters, we’ve 3 levels in the dict:
(P) (preferred)
() (nothing, ok)
[iK] (irregular)
(Note that normal editing can’t change (P) – that’s separate.)

My sense is:
* If only one spelling, unmarked – it’s the one correct spelling
* If multiple spellings and some are preferred, mark preferred with (P)
* If multiple spellings and some are not ok, mark [iK]

…where this is decided basically as:
* Any spellings given as head entries in one or more major dicts
   (or consensus if debatable) are (P)
* 交ぜ書き (writing some kanji as kana) doesn’t get a (P),
   but isn’t marked.
* Spellings that are listed in major dicts but marked with a
   triangle, or that are not listed but appear in many
   sources (e.g., via Google hits) or are discussed in usage books
   or articles
   – are included
   – are *always* marked with an [iK].
   – …and perhaps get a (brief) note on usage.
* Idiosyncratic spellings (used by one or few people)
   are not included.

Specific points:
* Weird spellings don’t need to be listed in dictionaries,
   and indeed often won’t.
* We don’t distinguish between *how* unusual a spelling is,
   but just list in order:
   e.g., 想う is a pretty common variant of 思う,
   but 憶う seems rarer,
   so they are listed in that order, but not otherwise marked.

(Similarly, 呑む is a v. common variant of 飲む,
  but 服む and 煙む are pretty unusual.)

Is this correct? Am I missing anything?

best,
   ~nils

#4558 From: René Malenfant <rene_malenfant@...>
Date: Wed Nov 2, 2011 7:31 am
Subject: Re: Meaning of [iK] (irregular kanji)?
reneneedsser...
 
I interpret it as "the reading is wrong", i.e. it does not appear as a headword in any major kokugo dictionary and does not have that reading in a kanwa dictionary.  Otherwise it can be used when "the reading is right but it doesn't carry the meaning suggested by the English glosses", but arguably this could be called [ateji] in some cases.

The kanji you marked as [iK] in おもう are present in multiple kokugo dictionaries (Kojien, Daijirin, Daijisen, Meikyo), and appear in both of my kanwa dictionaries with that reading-meaning combo.  I have unmarked them as [iK].

White triangles in Daijirin/Daijisen simply mean "non-joyo kanji reading".

Rene



#4559 From: Nils Roland Barth <jmdict.nbarth@...>
Date: Wed Nov 2, 2011 7:45 am
Subject: Re: Meaning of [iK] (irregular kanji)?
nils_barth
Send Email Send Email
 
Thanks René!

(Some of the dictionaries, esp. 大辞林, often bury alt. spellings
in the entry – despite including several in the heading! – so I missed
them when not looking carefully.)

I don’t have a 漢和辞典 (and am not familiar with all the
possible readings), so I’ll be careful about [iK] markings
(i.e., largely refrain).

   ~nils

René Malenfant:
> I interpret it as "the reading is wrong", i.e. it does not appear as a
> headword in any major kokugo dictionary and does not have that reading in a
> kanwa dictionary.  Otherwise it can be used when "the reading is right but it
> doesn't carry the meaning suggested by the English glosses", but arguably this
> could be called [ateji] in some cases.
>
> The kanji you marked as [iK] in おもう are present in multiple kokugo
> dictionaries (Kojien, Daijirin, Daijisen, Meikyo), and appear in both of my
> kanwa dictionaries with that reading-meaning combo.  I have unmarked them as
> [iK].
>
> White triangles in Daijirin/Daijisen simply mean "non-joyo kanji reading".
>
> Rene

#4560 From: Jim Breen <jimbreen@...>
Date: Wed Nov 2, 2011 11:40 am
Subject: Re: Meaning of [iK] (irregular kanji)?
breen_jim
Send Email Send Email
 
I agree with Rene, although I would word it a little differently. I'd
say it is used when it's the "wrong" kanji, i.e. it has a different reading
or meaning, but for some reason it is being used. The reasons could be:
- it looks like the correct kanji, and is a common mistake;
- it has the more-or-less correct meaning;
- it is a visual pun;
- etc. etc.

As an example of a pun, see the entry for おふくろ:

お袋(P); 御袋; お母(iK) 【おふくろ】 (n) (col) one's mother

There is no way 母 is read ふくろ. It's all a bit tongue in cheek.

Another example is:

圧濾器; 圧瀘器(iK) 【あつろき】 (n) filter press

瀘 looks a bit like 濾. It even has the same reading, but its meaning
has nothing to do with filters.
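
For anyone wanting to pull the tagged spellings out of display lines like the two above, here is a minimal sketch. It assumes the layout shown (spellings separated by ";", tags in parentheses, reading in 【】); the function name `parse_headwords` is hypothetical and this is not how WWWJDIC itself processes entries.

```python
import re
from typing import List, Tuple


def parse_headwords(line: str) -> Tuple[List[Tuple[str, List[str]]], str]:
    """Split 'A(P); B; C(iK) 【reading】 ...' into tagged spellings and the reading."""
    m = re.match(r"(?P<kanji>.+?)\s*【(?P<reading>.+?)】", line)
    if m is None:
        raise ValueError("no 【reading】 part found")
    spellings = []
    for part in m.group("kanji").split(";"):
        part = part.strip()
        tags = re.findall(r"\(([^)]+)\)", part)          # e.g. ["P"] or ["iK"]
        text = re.sub(r"\([^)]*\)", "", part).strip()    # spelling without tags
        spellings.append((text, tags))
    return spellings, m.group("reading")


entry = "お袋(P); 御袋; お母(iK) 【おふくろ】 (n) (col) one's mother"
spellings, reading = parse_headwords(entry)
irregular = [text for text, tags in spellings if "iK" in tags]
print(reading, irregular)
```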

HTH

Jim


--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4561 From: Nils Roland Barth <jmdict.nbarth@...>
Date: Wed Nov 2, 2011 12:06 pm
Subject: Re: Meaning of [iK] (irregular kanji)?
nils_barth
 
Hi Jim,

Thanks – “wrong/mistakes” is clear,
but I’ve got a question about puns and [iK] vs. [ateji] (below).


That does help – so the point is it’s not for:
* less common but accepted spellings (like 憶う for 思う),
but basically for:
* mistakes/misspellings
   (looks roughly right, meaning roughly right, etc.).
As you say, for *wrong* kanji.


Regarding punning, how do we distinguish this from ateji?

I ask b/c, AFAICT, in Japanese “ateji” is used broadly to mean
both “using characters for sound only” and “using characters
for meaning only” (more generally, assigning kanji w/o regard
to the etymology); in English it’s often used only narrowly
to refer to sound-only (like 天麩羅 てんぷら).

I’m basing this on the ja WP:
http://ja.wikipedia.org/wiki/当て字
(After “reading only, meaning disregarded”, it continues:)
「漢字の読み方を無視し、字義のみを考慮して漢字を当てる場合。」
Cases where, disregarding the character’s reading, one assigns the
character considering only its meaning.

「日本語の熟字訓も含まれる。」
Jukujikun are also included.

OTOH, we seem to use “ateji” just for sound:
http://www.csse.monash.edu.au/~jwb/wwwjdicinf.html#code_tag
ateji   kanji used as phonetic symbol(s)
(Not sure if there’s a general term for “sound ateji”.)

There’s also:
gikun   gikun (meaning) reading
…which seems to be the other kind of ateji.
(I’m not sure exactly what “gikun” means in Japanese;
  it seems to be used for idiosyncratic punning.
  I’m also not sure if it generally falls under ateji,
  though it seems to be a special case.)
FWIW, WP’s take:
http://ja.wikipedia.org/wiki/義訓


So while this example:
> 圧濾器; 圧瀘器(iK) 【あつろき】 (n) filter press
…is clearly a mistake (hence unambiguously [iK]),

this example:
> お袋(P); 御袋; お母(iK) 【おふくろ】 (n) (col) one's mother
…seems like it could be considered ateji (broadly),
or gikun (in the sense we’re using it).

So to state it simply:
* what do we mean by [ateji] and [gikun]?
   Do they cover punny cases, or how are they distinguished from [iK]?

   ~nils


Jim Breen:
> I agree with Rene, although I would word it a little differently. I'd
> say it's used when it's the "wrong" kanji, i.e. it has a different
> reading or meaning, but for some reason it is being used. The reasons
> could be:
> - it looks like the correct kanji, and is a common mistake;
> - it has the more-or-less correct meaning
> - it is a visual pun;
> - etc. etc.
>
> As an example of a pun, see the entry for おふくろ:
>
> お袋(P); 御袋; お母(iK) 【おふくろ】 (n) (col) one's mother
>
> There is no way 母 is read ふくろ. It's all a bit tongue in cheek.
>
> Another example is:
>
> 圧濾器; 圧瀘器(iK) 【あつろき】 (n) filter press
>
> 瀘 looks a bit like 濾. It even has the same reading, but its meaning
> is nothing to do with filters.
>
> HTH
>
> Jim

#4562 From: Stuart McGraw <smcg4191@...>
Date: Wed Nov 2, 2011 3:26 pm
Subject: Re: Meaning of [iK] (irregular kanji)?
smcgr4444
 
On 11/02/2011 12:52 AM, Nils Roland Barth wrote:
>[...]
> Setting aside [oK] (old) characters, we’ve 3 levels in the dict:
> (P) (preferred)
> () (nothing, ok)
> [iK] (irregular)
> (Note that normal editing can’t change (P) – that’s separate.)
>
> My sense is:
> * If only one spelling, unmarked – it’s the one correct spelling
> * If multiple spellings and some are preferred, mark preferred with (P)
> * If multiple spellings and some are not ok, mark [iK]
>
> …where this is decided basically as:
> * Any spellings given as head entries in one or more major dicts
>   (or consensus if debatable) are (P)

Just a comment on my interpretation of your interpretation
of the "P" tag... :-)  I think of it as "popular" and not "preferred".
It distinguishes (at least ideally) between very commonly used words
and the vast majority which are not used exceptionally commonly, across
the entire set of entries in the dictionary.
(It only indirectly distinguishes between different spellings of the
same word, insofar as those other spellings are not common among
all entries; commonness within a given entry is not relevant to a
spelling being tagged with a "P".)

The P tag is actually a composite derived from the other frequency-
of-use tags, ichi1-2, gai1-2, nf1-32, etc, which in turn come (mostly)
from published sources.  More details are in the "Word Priority Marking"
section of http://www.csse.monash.edu.au/~jwb/edict_doc.html or in the
comments at the top of the JMdict XML file.  So,

   > Any spellings given as head entries in one or more major dicts
   > (or consensus if debatable) are (P)

is not really true.
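
To make the "composite" idea concrete, here is a minimal sketch of how a
(P) could be derived from the priority tags. The tag list is an assumption
based on my reading of the JMdict DTD comments (news1, ichi1, spec1, spec2
and gai1 earn a "(P)"); please check the DTD before relying on it. The
function name is made up for illustration.

```python
# Priority tags that (as far as I recall from the JMdict DTD comments)
# earn a "(P)" in EDICT/EDICT2. Assumption: verify against the DTD.
P_TAGS = {"news1", "ichi1", "spec1", "spec2", "gai1"}


def is_popular(pri_tags):
    """Return True if any priority tag on a kanji/reading element implies (P)."""
    return any(tag in P_TAGS for tag in pri_tags)


# A form tagged gai1 and ichi1 gets (P); one carrying only a lower-band
# tag such as nf30, or no tag at all, does not.
print(is_popular(["gai1", "ichi1"]))
print(is_popular(["nf30"]))
print(is_popular([]))
```

Note that nothing here looks at the other spellings in the same entry,
which is the point being made above.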

As always, this is subject to correction by those knowing more
than me about it.

#4563 From: Nils Roland Barth <jmdict.nbarth@...>
Date: Thu Nov 3, 2011 10:53 am
Subject: Re: Meaning of [iK] (irregular kanji)? (+ Non-Joyo *readings*)
nils_barth
 
Hi Stuart, (and all)

Concrete question/proposal:
* Could we mark non-Joyo *readings*?
(Currently non-Joyo *characters* are marked in purple 人名用
  or green 表外字 – this is v. useful.)

As René notes, this is done via a triangle in some dictionaries,
and presumably this could be determined automatically,
assuming we have a list of Joyo readings.

This would be useful in (automatically) flagging potentially
confusing readings.

As a first step, we’d need to assign grades to *readings*
in the kanji dic (currently it has grades for *characters*,
and sorts readings by on/kun/name, but doesn’t grade readings).


Together with figuring out what reading is implied by a spelling,
this should allow us to automatically mark:
* non-Joyo readings (needs grading)
* non-standard readings (specific category of [iK]) (doable already)
…and also presumably:
* non-standard okurigana usage?
(Presumably able to be done automatically.)

I’m not proposing this be done now – this is a lot of work – but it
would be an interesting project longer-term.


<snip: Stuart explains (P)>

Thanks for clearing that up – so:
* (P) is “Popularity” (at the level of a given spelling),
   determined by given sources
   (hence not editable)
   It’s about absolute popularity of words (concretely,
   of given strings of characters), not relative popularities
   of spellings.

* [iK] is (as René and Jim wrote) for *errors* or irregular
   uses, determined manually by referring to dictionaries
(This can be semi-automated by using list of readings
  in the kanji dict to find pronunciations that don’t work,
  but semantic errors/mismatches require human judgment.)

As a concrete example of an “expressive” [iK],
writing のむ(飲む/呑む) as 服む (for taking medicine)
(which I added and marked as [iK]) seems to fit:
Both JMdict and 広辞苑 list only the 音読み 「フク」
and no 訓読み so (by this criterion) it’s [iK]
b/c it’s not a valid reading.
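
A rough sketch of the semi-automated check, under loud assumptions: the
reading table is a hand-entered sample in KANJIDIC notation (a real run
would load it from kanjidic2), the function name is hypothetical, and it
only handles the simple case of one kanji followed by okurigana.

```python
# Hand-entered sample of KANJIDIC-style readings (assumption: a real run
# would load these from kanjidic2). Kun readings use "." before okurigana.
READINGS = {
    "飲": {"イン", "オン", "の.む"},
    "呑": {"トン", "ドン", "の.む"},
    "服": {"フク"},
}


def reading_is_listed(spelling, kana):
    """Check a one-kanji-plus-okurigana spelling against the sample table.

    Compounds, on readings given in hiragana, rendaku and so on would
    need a proper aligner; this is deliberately minimal.
    """
    kanji, okurigana = spelling[0], spelling[1:]
    if not kana.endswith(okurigana):
        return False
    stem = kana[: len(kana) - len(okurigana)]
    candidate = stem + "." + okurigana if okurigana else stem
    return candidate in READINGS.get(kanji, set())


for spelling in ("飲む", "呑む", "服む"):
    status = "ok" if reading_is_listed(spelling, "のむ") else "candidate [iK]"
    print(spelling, status)
```

This only flags candidates: whether a flagged spelling is really [iK],
ateji or gikun still needs a human.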


More detailed thoughts on popularity:

Spellings should be listed in order of popularity, but
beyond that there’s no indication of *how* much more
popular a spelling is.


For example, crudely Googling for lemma forms of おもう 思う yields:
* 思う 1,320,000,000
* 想う    27,000,000
* 念う        58,700

I.e., each spelling is about 2 or 3 orders of magnitude less frequent
than the one before it (log10 of the ratios: 1.7 & 2.7), which I’d
summarize as:
* 思う is the standard spelling
* 想う is a reasonably common variant
   (e.g., typical native speakers would recognize and may use it)
* 念う is pretty uncommon, but accepted
   (e.g., some native speakers probably wouldn’t recognize it,
    and it would be sensible to use furigana)
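
For anyone wanting to redo the arithmetic, a small sketch: the "orders
of magnitude" are log10 of the ratio between successive hit counts. The
counts are the crude Google figures quoted above, not corpus data, and
the function name is just for illustration.

```python
import math

# Hit counts as quoted above (crude Google figures, not a corpus count).
hits = [("思う", 1_320_000_000), ("想う", 27_000_000), ("念う", 58_700)]


def orders_of_magnitude(a, b):
    """log10 of a/b, i.e. how many powers of ten separate the two counts."""
    return math.log10(a / b)


# Gap between each spelling and the next most common one.
gaps = [
    round(orders_of_magnitude(hits[i][1], hits[i + 1][1]), 1)
    for i in range(len(hits) - 1)
]
print(gaps)  # [1.7, 2.7]
```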

This is partly reflected in standards: only 思う is a Joyo reading
(others marked with triangle in my 大辞泉), but OTOH I don’t
know how you’d know other than by Googling that 想う is much more
common than 念う – maybe 漢検 level?

This is admittedly rather fine-grained popularity
information, and rather laborious to determine and tricky to
present, but it would be nice to include somehow someday.

Referring to standards (what grade is a reading – e.g.,
is it in Joyo? What level on the kanken?) gives a clear
and objective way to do this w/o reinventing the wheel;
real-world popularity would be interesting but lots of work.


   ~nils

#4564 From: Alexandre Courbot <gnurou@...>
Date: Thu Nov 3, 2011 2:51 pm
Subject: JMdict internationalization effort - let's (finally) do it!
gnurou
 
Hi everybody,

I come up with this topic about once every year, so since 2011 is
coming to an end I thought I should bring it back on the table. ;)

Some may remember (3 years ago already) that, as the author of
software that uses JMdict and is also used by non-English speakers
(many French people notably), I got lots of requests for more complete
French translations in the dictionary. This raised my interest as to
how current translations of JMdict are handled and how it is possible
to contribute to them. If I remember correctly, Jim is currently
handling translations through various files of various formats and
merges them into the JMdict file (the same applies to kanjidic2),
which makes it hard and inconsistent to maintain. Thanks to the great
work done by Stuart, we now have a good way to add new entries and
amend existing ones, but it still does not handle translations in
languages other than English. At the same time, it is perfectly
understandable that JMdict, as a project, wants to focus on English
instead of spreading into as many directions as there are spoken
languages on the planet.

So at that time I thought it would be nice to have an interface
similar to what Stuart did directed at translators of the JMdict, so
that people can collaboratively translate the dictionary in other
languages, à la Tatoeba. This would allow us to move all existing
translations into a single format (which would probably simplify Jim's
life), to effectively improve non-English languages coverage, while -
most important maybe - not getting in the way of the English JMdict
effort.

Well, it seems like we actually have all the tools we need to do that
now: meet Transifex (https://www.transifex.net), an online platform
for the collaborative translation of software projects using
internationalization libraries like GNU Gettext or Qt .ts format. For
those who are not familiar, the principle is that a software team
extracts all the strings it uses into a special format and uploads it
there so that people can translate it into their language of choice.
The developers then get the translated strings back and bundle them
with their software so that it can choose the right language at
runtime. Transifex's interface is very well designed, with a fast and
efficient AJAX form for translations, language teams and managers,
etc.

The idea is that, if we can do that with software, why couldn't we do
the same with JMdict? I have thus written a small Python script that
extracts all the glosses in the JMdict file, associates them with their
existing translations, adds some context information to make them
unambiguous (entry id/sense number/gloss number) and puts them into
Gettext .po files. Upload that result to Transifex, and voilà:

https://www.transifex.net/projects/p/jmdict-i18n/resource/jlpt5/ (for
demonstration purposes I only extracted a subset of the JMdict - the
partial translations also come from the JMdict itself)

Now anybody with a Transifex account can translate individual glosses
online, or download the whole .po file to do it with his favorite
translation tool. Since every entry keeps an unambiguous reference to
the gloss it translates, all the translations can be merged back into
the JMdict file. I tried by extracting the existing translated glosses
and merging them back and ended with an identical JMdict.
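
To give an idea of the extraction step, here is a minimal sketch (not
the actual script). Assumptions: the sample entry is cut down by hand,
the function names are made up, the context is written as "entry
id/sense number/gloss number" in msgctxt, and existing translations are
paired with English glosses purely by position within a sense.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-trimmed sample entry (assumption: real input is the full JMdict file).
SAMPLE = """<JMdict><entry>
<ent_seq>1030630</ent_seq>
<r_ele><reb>エレベーター</reb></r_ele>
<sense><gloss>elevator</gloss><gloss>lift</gloss>
<gloss xml:lang="fre">ascenseur</gloss></sense>
</entry></JMdict>"""


def po_escape(text):
    """Escape backslashes and double quotes for a .po string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entries_to_po(root, lang):
    """Yield one .po entry per English gloss, keyed by seq/sense/gloss number."""
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        for s_no, sense in enumerate(entry.iter("sense"), start=1):
            glosses = sense.findall("gloss")
            english = [g.text for g in glosses if g.get(XML_LANG, "eng") == "eng"]
            translated = [g.text for g in glosses if g.get(XML_LANG) == lang]
            for g_no, gloss in enumerate(english, start=1):
                # Assumption: pair by position; unpaired glosses stay untranslated.
                msgstr = translated[g_no - 1] if g_no <= len(translated) else ""
                yield (
                    f'msgctxt "{seq}/{s_no}/{g_no}"\n'
                    f'msgid "{po_escape(gloss)}"\n'
                    f'msgstr "{po_escape(msgstr)}"\n'
                )


po_text = "\n".join(entries_to_po(ET.fromstring(SAMPLE), "fre"))
print(po_text)
```

Merging back is the reverse walk: split each msgctxt on "/" and put the
msgstr into the matching sense with the right xml:lang attribute.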

I think we could use that to
1) Use a single, open, standard format for handling JMdict gloss
translations instead of the various hacks Jim is currently relying on;
2) Have a single translation effort that would not interfere with the
actual JMdict;
3) Finally allow all the people who want to see JMdict translated into
other languages to do it.

... and the same could be done with kanjidic2, of course.

If the idea suits Jim, I'm willing to finalize my scripts and start
maintaining the effort on Transifex.

Right now the script acts by creating a translation entry for every
gloss, and putting the keb & reb of the entry + english gloss in the
message field, so the translator has a glimpse of both the gloss and
the Japanese word it refers to (see this for instance:
https://www.transifex.net/projects/p/jmdict-i18n/resource/jlpt5/l/fr/view/
). This may not be the most suitable solution, since it may not always
be desirable to have a 1:1 match for every gloss. An alternative would
be to have one translation entry per sense, with a special character
to separate the translator's input into several glosses.

So, this is my latest crazy idea to get more non-English stuff into
JMdict. What do you guys think?

Alex.

#4565 From: Jim Breen <jimbreen@...>
Date: Fri Nov 4, 2011 1:41 am
Subject: Re: Meaning of [iK] (irregular kanji)? (+ Non-Joyo *readings*)
breen_jim
 
I'm going to top-post (sorry) and try and keep it short.

In general I like the idea of having extra information
about the status of readings in words available. It comes
down to what information is available, how and where to
record it, and how to display it in a dictionary client without
creating a huge visual clutter. The colours used for the kanji
in WWWJDIC are an example of something that was easy to do
as the information is readily available, and there was no clutter.

With readings things get messier. For a start there is no real canon
of what is and is not an approved/recognized/etc. reading.
Virtually every 漢和字典 has a different take on them. For 常用漢字
there is a list of "standard" readings (see
http://www.csse.monash.edu.au/~jwb/jouyoureadings.html) but
these really just mean that in textbooks where a word uses a
reading not on the list, it should be written out in kana. AFAIK
there is no "grading" of these readings, apart from the grades
associated with first 1,000 or so 教育 kanji.

Then there's the question of how to display this information in a
meaningful way. Consider 飲む, which Nils mentions below. The
WWWJDIC display starts: 飲む(P); 呑む; 飮む(oK); 服む(iK) 【のむ】
with 呑 and 飮 in green showing they are non-常用.
What can be said meaningfully about the の.む reading? It's
a recognized 訓読み  of 飲, 呑 and 飮, and for 飲 it's even on
the standard reading list. For 服 it's not recognized at all.
How can that be shown without turning a reasonably succinct
entry display into a mess?

There is a way of getting this information, and it's a click away.
Clicking on "[Examine] the kanji ...." takes you to the kanji
details, which really has most of the extra information.
The only thing lacking is the status of the のむ reading of 飲.
One thing I'd like to do eventually is get that into KANJIDIC in
some way. I want to hold off until maintenance of KANJIDIC
gets into an online database. The present kanji database
system is reasonably complex, and further categorization  of
readings is something I simply cannot address in its current
form.

What I could do without too much hooplah is to draw on the
list of "standard" readings of 常用漢字 kanji and coax WWWJDIC
into highlighting them in the kanji display (e.g. putting them in red.)
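
Something along these lines, as a sketch only: this is not WWWJDIC code,
the function name is made up, and the readings and the "standard" set
for 飲 are entered by hand for the example.

```python
from html import escape


def render_readings(readings, joyo_readings):
    """Return an HTML fragment with the standard readings shown in red."""
    parts = []
    for reading in readings:
        text = escape(reading)
        if reading in joyo_readings:
            text = f'<span style="color:red">{text}</span>'
        parts.append(text)
    return " ".join(parts)


# Hand-entered sample for 飲: all readings, then the ones on the standard list.
html_line = render_readings(["イン", "オン", "の.む"], {"イン", "の.む"})
print(html_line)
```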

Getting onto "popularity" briefly, note that the "P" in EDICT2 is not
really a popularity flag on surface forms; it's an attempt to flag the
20k or so most common entries and, where there is a choice of surface
form or reading, to show which one is the common one. It is derived from
slightly more fine-grained data in JMdict. For the general "for the
masses" interfaces I wouldn't suggest going beyond it. Finer-grained
details should be available, but at the price of a click or two to look
at the underlying data. (At present you can go off to the database by
clicking on "Edit", but perhaps there should be a full view, taking you
to something like:
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=1076749
for the hard-core people.)

Well, I failed to be brief.

Jim

On 3 November 2011 21:53, Nils Roland Barth <jmdict.nbarth@...> wrote:
> Hi Stuart, (and all)
>
> Concrete question/proposal:
> * Could we mark non-Joyo *readings*?
> (Currently non-Joyo *characters* are marked in purple 人名用
>  or green 表外字 – this is v. useful.)
>
> As René notes, this is done via a triangle in some dictionaries,
> and presumably this could be determined automatically,
> assuming we have a list of Joyo readings.
>
> This would be useful in (automatically) flagging potentially
> confusing readings.
>
> As a first step, we’d need to assign grades to *readings*
> in the kanji dic (currently it has grades for *characters*,
> and sorts readings by on/kun/name, but doesn’t grade readings).
>
>
> Together with figuring out what reading is implied by a spelling,
> this should allow us to automatically mark:
> * non-Joyo readings (needs grading)
> * non-standard readings (specific category of [iK]) (doable already)
> …and also presumably:
> * non-standard okurigana usage?
> (Presumably able to be done automatically.)
>
> I’m not proposing this be done now – this is a lot of work – but it
> would be an interesting project longer-term.
>
>
> <snip: Stuart explains (P)>
>
> Thanks for clearing that up – so:
> * (P) is “Popularity” (at the level of a given spelling),
>  determined by given sources
>  (hence not editable)
>  It’s about absolute popularity of words (concretely,
>  of given strings of characters), not relative popularities
>  of spellings.
>
> * [iK] is (as René and Jim wrote) for *errors* or irregular
>  uses, determined manually by referring to dictionaries
> (This can be semi-automated by using list of readings
>  in the kanji dict to find pronunciations that don’t work,
>  but semantic errors/mismatches require human judgment.)
>
> As a concrete example of an “expressive” [iK],
> writing のむ(飲む/呑む) as 服む (for taking medicine)
> (which I added and marked as [iK]) seems to fit:
> Both JMdict and 広辞苑 list only the 音読み 「フク」
> and no 訓読み so (by this criterion) it’s [iK]
> b/c it’s not a valid reading.
>
>
> More detailed thoughts on popularity:
>
> Spellings should be ordered in list of popularity, but
> beyond that there’s no indication of *how* much more
> popular a spelling is.
>
>
> For example, crudely Googling for lemma forms of おもう 思う yields:
> * 思う 1,320,000,000
> * 想う    27,000,000
> * 念う        58,700
>
> I.e., there’s about 2 or 3 orders of magnitude (1.7 & 2.7)
> difference in frequency of these spellings, which I’d
> summarize as:
> * 思う is the standard spelling
> * 想う is a reasonably common variant
>  (e.g., typical native speakers would recognize and may use it)
> * 念う is pretty uncommon, but accepted
>  (e.g., some native speakers probably wouldn’t recognize it,
>   and would be sensible to use furigana)
>
> This is partly reflected in standards: only 思う is a Joyo reading
> (others marked with triangle in my 大辞泉), but OTOH I don’t
> know how you’d know other than by Googling that 想う is much more
> common than 念う – maybe 漢検 level?
>
> This is admittedly rather fine-grained popularity
> information, and rather laborious to determine and tricky to
> present, but it would be nice to include somehow someday.
>
> Referring to standards (what grade is a reading – e.g.,
> is it in Joyo? What level on the kanken?) gives a clear
> and objective way to do this w/o reinventing the wheel;
> real-world popularity would be interesting but lots of work.
>
>
>  ~nils
>



--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

#4566 From: Jim Breen <jimbreen@...>
Date: Sat Nov 5, 2011 12:45 am
Subject: Re: JMdict internationalization effort - let's (finally) do it!
breen_jim
 
Greetings,

Thanks to Alexandre for (re)opening this topic. It's
something I think is very important, and I'd love to
see some progress with it.

I have interpolated a couple of comments, and added a
longer one at the bottom.

On 4 November 2011 01:51, Alexandre Courbot <gnurou@...> wrote:
> I come up with this topic about once every year, so since 2011 is
> coming to an end I thought I should bring it back on the table. ;)

Let's hope there is some movement.

> Some may remember (3 years ago already) that, as the writer of a
> software that uses JMdict but is also used by non-English speakers
> (many French people notably), I got lots of requests for more complete
> French translations in the dictionary. This raised my interest as to
> how current translations of JMdict are handled and how it is possible
> to contribute to them. If I remember correctly, Jim is currently
> handling translations through various files of various formats and
> merges them into the JMdict file (the same applies to kanjidic2),
> which makes it hard and inconsistent to maintain.

Quite true. Actually things with the French translations are getting worse,
because the blending in of the French glosses (from Jean-Marc
Desperrier's project) is done using sequence numbers and sense
numbers, and as we delete/merge entries and reorder senses, the
number of failed or screwed-up merges grows. (About 270 glosses
are failing at present.)
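
To illustrate why a merge keyed on sequence and sense numbers decays
over time, here is a toy sketch. The data shapes, the function name and
the sequence number 9999999 are made up for the example; this is not the
actual merge code.

```python
def merge_by_position(french, jmdict_sense_counts):
    """Split French glosses into those with a target sense and those without.

    french: {(seq, sense_no): [glosses]}
    jmdict_sense_counts: {seq: number of senses in the current JMdict}
    """
    merged, failed = {}, {}
    for (seq, sense_no), glosses in french.items():
        if sense_no <= jmdict_sense_counts.get(seq, 0):
            merged[(seq, sense_no)] = glosses
        else:
            failed[(seq, sense_no)] = glosses
    return merged, failed


french = {
    (1030630, 1): ["ascenseur"],
    (1030630, 3): ["(gloss for a sense that has since been merged away)"],
    (9999999, 1): ["(gloss for an entry that has since been deleted)"],
}
merged, failed = merge_by_position(french, {1030630: 2})
print(len(merged), "merged,", len(failed), "failed")
```

The failures are at least visible. A reordered sense is worse: the key
still matches, so the gloss lands silently on the wrong sense.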

> ......Thanks to the great
> work done by Stuart, we now have a good way to add new entries and
> amend existing ones, but it still does not handle translations in
> languages other than English. At the same time, it is perfectly
> understandable that JMdict, as a project, wants to focus on English
> instead of spreading into as many directions as there are spoken
> languages on the planet.
>
> So at that time I thought it would be nice to have an interface
> similar to what Stuart did directed at translators of the JMdict, so
> that people can collaboratively translate the dictionary in other
> languages, à la Tatoeba. This would allow us to move all existing
> translations into a single format (which would probably simplify Jim's
> life), to effectively improve non-English languages coverage, while -
> most important maybe - not getting in the way of the English JMdict
> effort.

I have some thoughts/comments on this, which I'll add later.
The Tatoeba example is useful. That project certainly has enabled
a huge amount of parallel translation of sentences, and does it
with very loose controls, and only after-the-event quality control,
something I'm not sure would work that well in a dictionary.

> Well, it seems like we actually have all the tools we need to do that
> now: meet Transifex (https://www.transifex.net), an online platform
> for the collaborative translation of software projects using
> internationalization libraries like GNU Gettext or Qt .ts format. For
> those who are not familiar, the principle is that a software team
> extracts all the strings it uses into a special format and uploads it
> there so that people can translate it into their language of choice.
> The developers then get the translated strings back and bundle them
> with their software so that it can choose the right language at
> runtime. Transifex's interface is very well designed, with a fast and
> efficient AJAX form for translations, language teams and managers,
> etc.

It's an excellent platform. I'm particularly interested in it from the
position of WWWJDIC's interface, which uses a Gettext-like approach
to text strings to drive its English and Japanese versions. I'd love to
see a French version, for example.

> The idea is that, if we can do that with software, why couldn't we do
> the same with JMdict? I have thus written a small Python script that
> extracts all the glosses in the JMdict file, associates them with their
> existing translations, adds some context information to make them
> unambiguous (entry id/sense number/gloss number) and puts them into
> Gettext .po files. Upload that result to Transifex, and voilà:
>
> https://www.transifex.net/projects/p/jmdict-i18n/resource/jlpt5/ (for
> demonstration purposes I only extracted a subset of the JMdict - the
> partial translations also come from the JMdict itself)
>
> Now anybody with a Transifex account can translate individual glosses
> online, or download the whole .po file to do it with his favorite
> translation tool. Since every entry keeps an unambiguous reference to
> the gloss it translates, all the translations can be merged back into
> the JMdict file. I tried by extracting the existing translated glosses
> and merging them back and ended with an identical JMdict.
>
> I think we could use that to
> 1) Use a single, open, standard format for handling JMdict gloss
> translations instead of the various hacks Jim is currently relying on;
> 2) Have a single translation effort that would not interfere with the
> actual JMdict;
> 3) Finally allow all the people who want to see JMdict translated into
> other languages to do it.
>
> ... and the same could be done with kanjidic2, of course.
>
> If the idea suits Jim, I'm willing to finalize my scripts and start
> maintaining the effort on Transifex.
>
> Right now the script acts by creating a translation entry for every
> gloss, and putting the keb & reb of the entry + english gloss in the
> message field, so the translator has a glimpse at both the gloss and
> Japanese word it refers to (see this for instance:
> https://www.transifex.net/projects/p/jmdict-i18n/resource/jlpt5/l/fr/view/
> ). This may not be the most suitable solution, since it may not always
> be desirable to have a 1:1 match for every gloss. An alternative would
> be to have one translation entry per sense, with a special character
> to separate the translator's input into several glosses.
>
> So, this is my latest crazy idea to get more non-English stuff into
> JMdict. What do you guys think?

Several comments.

First, anything that could see JMdict move away from its present
approach of being a JE base file with other languages hacked in later
would be a Good Thing, if not a Great Thing.

Second, a very key issue is how it would be seen and handled in the
database. The easy thing would be to replicate the databases, having
a jmdictdb_en, jmdictdb_fr, etc. and squish the glosses together later.
That would result in a lot of the problems with the current approach
just continuing, although it may well simplify the translation process.

The ideal approach would be for the database itself to be truly JM. For
example, we have at present (just looking at the Eng and Fre bits)

<entry>
<ent_seq>1030630</ent_seq>
<r_ele>
<reb>エレベーター</reb>
<re_pri>gai1</re_pri>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>エレベータ</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>elevator</gloss>
<gloss>lift</gloss>
<gloss xml:lang="fre">ascenseur</gloss>
</sense>
<sense>
<xref>昇降舵</xref>
<gloss>elevator (aviation)</gloss>
<gloss xml:lang="fre">gouvernail de profondeur (aviat)</gloss>
</sense>
</entry>

[I added that second French sense using Collins Robert...]
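
As a sketch of what a client sees when reading such an entry, the
snippet below groups the glosses of each sense by language. Assumptions:
the entry is trimmed by hand, untagged glosses are treated as "eng", the
function name is made up, and the &n; entity is declared in a small
inline DOCTYPE because the fragment is parsed without the real JMdict
DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The inline DOCTYPE stands in for the JMdict DTD, which defines &n; etc.
DOC = """<!DOCTYPE JMdict [<!ENTITY n "noun (common) (futsuumeishi)">]>
<JMdict><entry>
<ent_seq>1030630</ent_seq>
<sense><pos>&n;</pos><gloss>elevator</gloss><gloss>lift</gloss>
<gloss xml:lang="fre">ascenseur</gloss></sense>
<sense><gloss>elevator (aviation)</gloss>
<gloss xml:lang="fre">gouvernail de profondeur (aviat)</gloss></sense>
</entry></JMdict>"""


def glosses_by_language(entry):
    """Return one {lang: [glosses]} dict per sense, in sense order."""
    senses = []
    for sense in entry.iter("sense"):
        by_lang = defaultdict(list)
        for gloss in sense.iter("gloss"):
            # Glosses without xml:lang are assumed to be English.
            by_lang[gloss.get(XML_LANG, "eng")].append(gloss.text)
        senses.append(dict(by_lang))
    return senses


entry = ET.fromstring(DOC).find("entry")
senses = glosses_by_language(entry)
for number, sense in enumerate(senses, start=1):
    print(number, sense)
```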

To enable this to work, the database interface needs to be able to
handle the extra language aspects. At the very least, non-English glosses
need to have language tags. Perhaps that is enough?

At present the JEL (edit language) for the above is:

[1][n]
   elevator; lift
[2][n]
   elevator (aviation)
   [see=1938460・昇降舵[1]]

It could just have something like: "[l:fre] ascenseur" and
"[l:fre] gouvernail de profondeur (aviat)" added to make it
work.
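
A sketch of how such tagged lines might be read. The "[l:xxx]" syntax is
only the suggestion floated above, not something JEL supports today, and
the regular expression and function name are made up for illustration.

```python
import re

# Hypothetical "[l:xxx]" prefix with a three-letter language code.
LANG_TAG = re.compile(r"^\[l:([a-z]{3})\]\s*(.*)$")


def parse_gloss_line(line, default_lang="eng"):
    """Split one gloss line into (language, [glosses])."""
    line = line.strip()
    match = LANG_TAG.match(line)
    if match:
        lang, body = match.group(1), match.group(2)
    else:
        lang, body = default_lang, line
    # JEL separates glosses within a line with ";".
    return lang, [g.strip() for g in body.split(";") if g.strip()]


for text in ("   elevator; lift", "[l:fre] ascenseur"):
    print(parse_gloss_line(text))
```

Untagged lines default to English, so existing JEL text would keep
working unchanged.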

Ideally the interface could be made a bit
friendlier to people from non-English backgrounds, for example by
using colours for languages.

Anyway, I might stop there, and let others join in the discussion.
I'd be very interested in Stuart's views.

Thanks for raising the topic (yet again.)

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
