IntelliWebSearch-l · IntelliWebSearch Users' Group
Japanese Issues - digest
Message #114 of 967
Here is a summary of what was happening to cause problems with
Japanese in IWS and the solutions, followed by a bit more detail from
Mike.

Characters are encoded in UTF-8 by IWS, so for example the character
魚 (fish) is E9AD9A, but when I sent it to www.alc.co.jp, which is a
searchable dictionary requiring a finish string, it was decoded as a
completely different character, 鬲, which has a Shift-JIS value of
E9AD, the same as the first two bytes of the UTF-8 encoding of the
'fish' character (the final 9A apparently doesn't encode anything in
Shift-JIS).
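
As a rough illustration of that mis-decoding, here is a minimal Python sketch. Python is not part of IWS and is used here only to inspect the bytes; the variable names are made up for the example.

```python
# The character from the example above and its UTF-8 bytes.
fish = "魚"
utf8_bytes = fish.encode("utf-8")
print(utf8_bytes.hex().upper())  # E9AD9A

# A Shift-JIS decoder pairs up the first two bytes (E9 AD) as one character.
misread = utf8_bytes[:2].decode("shift_jis")
print(misread)

# The leftover 9A is a Shift-JIS lead byte with nothing after it,
# so a strict Shift-JIS decode of all three bytes is rejected.
strict_failed = False
try:
    utf8_bytes.decode("shift_jis")
except UnicodeDecodeError:
    strict_failed = True
print("strict Shift-JIS decode failed:", strict_failed)
```

How the site itself handles the stray final byte is not known; the sketch only shows why the first two bytes can be taken for a valid Shift-JIS character.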

However, when I manually added a space (%20) to the front of the
finish string, this somehow persuaded the site to decode the fish
character correctly as UTF-8, and the search was successful.

But for some reason, even with the added %20, other characters such as
日 (day) were still being decoded as Shift-JIS by the site. Mike
figured out that part of the finish string itself was Shift-JIS
encoded, which was confusing the site on some occasions when it read
UTF-8 characters. So Mike's solution was to convert this bit of the
finish string to UTF-8 at this site:
http://code.cside.com/3rdpage/us/url/converter.html and repaste it
into the settings window. This has solved the problem completely.

The other problem was with Japanese characters in the settings window
for modifying Google searches, etc. Mike also solved this by
suggesting I convert the Japanese at the above site and paste the code
into the settings window. So for example, a Google search with
intitle:%E7%94%A8%E8%AA%9E will actually come out in Google with the
correct characters.
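
If the online converter is not to hand, the same percent-encoded text can be produced locally. The following is only a sketch using Python's standard library, not something IWS provides; the word 用語 is what the code in the example above decodes to.

```python
from urllib.parse import quote, unquote

# Japanese text you would like to type into the settings window.
term = "用語"

# Percent-encode it as UTF-8; safe="" means every non-alphanumeric
# byte is escaped.
encoded = quote(term, safe="", encoding="utf-8")
print("intitle:" + encoded)  # intitle:%E7%94%A8%E8%AA%9E

# Round trip: decoding the escaped form gives the original text back.
decoded = unquote(encoded, encoding="utf-8")
```

The printed string is what would be pasted into the settings window in place of the Japanese characters.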

So with these minor workarounds IWS is now fully functioning with
Japanese. Cheers Mike!


***************************

Mike writes:

The site is written in shift-JIS. IntelliWebSearch can only produce
UTF-8 (and Windows 1252, which is no use for Japanese).
IntelliWebSearch can therefore only access the site if it (or maybe
the server) is set up to accept UTF-8 too, which is highly likely.
However we've got to let it know we are sending UTF-8, because if we
send anything that can be (mis)interpreted as shift-JIS, that is what
it will see it as. Your trick of adding a UTF-8 space seems to work in
some cases, but sometimes the site interprets it as shift-JIS anyway.
We'll have to work out why. If you look more closely at your examples,
you'll notice they all have "&word_in2=%82%A9%82%AB%82%AD%82%AF%82%B1"
in them. "%82%A9%82%AB%82%AD%82%AF%82%B1" is probably in shift-JIS,
since that is the site's native encoding.
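
One way to test that hunch is to decode the escaped bytes both ways. This is a minimal sketch in Python (an assumption of this example, not a tool mentioned in the thread):

```python
from urllib.parse import unquote_to_bytes

# The suspicious part of the finish string, turned back into raw bytes.
snippet = "%82%A9%82%AB%82%AD%82%AF%82%B1"
raw = unquote_to_bytes(snippet)

# Try UTF-8 first: 0x82 cannot start a UTF-8 sequence, so this fails.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Shift-JIS decodes cleanly to five hiragana.
as_sjis = raw.decode("shift_jis")

print("valid UTF-8:", is_utf8)
print("as Shift-JIS:", as_sjis)
```

Bytes that fail as UTF-8 but decode cleanly as Shift-JIS are consistent with the snippet being in the site's native encoding, although this check alone does not prove it.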

If my hunch is right, this would explain the site's erratic behaviour.
It is probably seeing a piece of shift-JIS and doing its best to
interpret the rest that way too. What we need to do is send a pure
UTF-8 URL so it can't get it wrong.

Using the converter I suggested
(http://code.cside.com/3rdpage/us/url/converter.html), we can convert
the snippet to UTF-8 (%E3%81%8B%E3%81%8D%E3%81%8F%E3%81%91%E3%81%93).
The other code-like part ("PVawEWi72JXCKoa0Je") only uses Western European
characters (there are no percentage signs), so it is probably the same in
shift-JIS and UTF-8 (it is - I checked).
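
The conversion done by the online converter can also be sketched in a few lines of Python. The function name below is hypothetical and this is not how the converter site is implemented; it simply decodes the escapes as Shift-JIS and re-escapes the text as UTF-8.

```python
from urllib.parse import quote, unquote

def sjis_escapes_to_utf8(escaped: str) -> str:
    """Re-encode a percent-encoded Shift-JIS string as percent-encoded UTF-8."""
    # Decode the %XX escapes as Shift-JIS; fail loudly on invalid bytes.
    text = unquote(escaped, encoding="shift_jis", errors="strict")
    # Escape the resulting text again, this time as UTF-8.
    return quote(text, safe="", encoding="utf-8")

converted = sjis_escapes_to_utf8("%82%A9%82%AB%82%AD%82%AF%82%B1")
print(converted)  # %E3%81%8B%E3%81%8D%E3%81%8F%E3%81%91%E3%81%93

# Plain ASCII passes through unchanged, as noted above.
unchanged = sjis_escapes_to_utf8("PVawEWi72JXCKoa0Je")
print(unchanged)  # PVawEWi72JXCKoa0Je
```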

If you put the converted snippet back into one of your examples you
get:
http://www2.alc.co.jp/ejr/index.php?word_in=%E9%AD%9A&word_in2=%E3%81%8B%E3%81%8D%E3%81%8F%E3%81%91%E3%81%93&word_in3=PVawEWi72JXCKoa0Je

Bingo, at least on my system. So the IntelliWebSearch settings are:

Encoding = UTF-8
Quotes always off = yes
Start = http://www2.alc.co.jp/ejr/index.php?word_in=
Finish = &word_in2=%E3%81%8B%E3%81%8D%E3%81%8F%E3%81%91%E3%81%93&word_in3=PVawEWi72JXCKoa0Je

In short, the trick is to convert every part of the URL to UTF-8.
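
To see how those settings combine into a pure UTF-8 URL, here is a small Python sketch. It assumes the final URL is simply Start + encoded search term + Finish; that concatenation and the function name are assumptions for illustration, not a description of IntelliWebSearch's internals.

```python
from urllib.parse import quote

# Values taken from the settings listed above.
START = "http://www2.alc.co.jp/ejr/index.php?word_in="
FINISH = (
    "&word_in2=%E3%81%8B%E3%81%8D%E3%81%8F%E3%81%91%E3%81%93"
    "&word_in3=PVawEWi72JXCKoa0Je"
)

def build_search_url(term: str) -> str:
    # Encode only the search term; START and FINISH are already escaped.
    return START + quote(term, safe="", encoding="utf-8") + FINISH

url = build_search_url("魚")
print(url)
```

With the term 魚 this reproduces the example URL given above, and every escaped byte sequence in it is UTF-8.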

Steven wrote:

> the double byte characters aren't recognized at all as such in the
> memory and just become converted into question marks


Mike writes:

AutoHotKey scripts don't interpret multi-byte characters in text input
boxes correctly. As you say, they are read as a series of question
marks. I got round this problem in the search window with a
complicated workaround which only works there because there is only
one text input box (IntelliWebSearch doesn't read what you type into
the box, but what is displayed on the monitor). I can't use the same
workaround in the settings window because there are several text input
boxes.

I hope one day AutoHotKey will be made fully Unicode compliant. In the
meantime I'm afraid you will have to convert anything you want to type
into the settings window manually with the converter I suggested
(http://code.cside.com/3rdpage/us/url/converter.html).

Posted by Steven Fraser Smith (glia164)
Tue Jan 23, 2007 6:14 pm