Group: International Macintosh Users Group (IMUG)
(A Forum for Multilingual / Multiscript Computing)
Date: April 21, 2005, 7-9 p.m.
Speaker: Markus Scherer (IBM Corporation)
Topic: Analyzing Unicode Text: Regular Expressions, Boundaries,
Sets and More
Location: Apple Computer, Apple Campus, 1 Infinite Loop, Cupertino
Take Saratoga/Sunnyvale exit off 280, turn South
to Cupertino, turn left onto Mariani Avenue, left
to Infinite Loop.
Admission: $4, free for IMUG members
Contact: Roger Sherman, (650) 859-5981
roger [dot] sherman [at] sri [dot] com
Regular expressions have been widely used for many years to analyze,
parse, or extract desired information from text data. They are used in
applications large and small, from simple search operations in word
processors to scripting languages such as Perl to queries on large
databases.
Traditional regular expressions cannot easily deal with a character
set of the size and complexity of Unicode. To address this
shortcoming, the Unicode Consortium has published Technical Report
#18, a set of guidelines for extending regular expressions to handle
Unicode data. Following these guidelines allows organizations to
correctly handle data in different languages and scripts.
This paper will review the issues and techniques involved in writing
regular expressions for Unicode data. The guidelines from TR 18 will
be reviewed, including a discussion of Unicode encoding forms,
character properties and classes, text boundaries, case sensitivity
and normalization, and the implications of all of these for handling
different languages in regular expressions. The paper will also
survey the capabilities and limitations of those regular expression
implementations known to provide significant support for Unicode.
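
As a rough illustration of why normalization and case sensitivity
matter here, the following is a minimal sketch using only Python's
standard library. It is not taken from the presentation. The helper
names normalize_for_match and find_words are hypothetical, and Python's
built-in re module is only one example of an engine: it offers a
Unicode-aware \w but not the \p{...} property syntax described in
TR 18.

```python
import re
import unicodedata


def normalize_for_match(text: str) -> str:
    """Hypothetical helper: NFC-normalize, then case-fold, so that
    canonically equivalent spellings compare equal regardless of case."""
    return unicodedata.normalize("NFC", text).casefold()


def find_words(text: str) -> list[str]:
    """Hypothetical helper: extract runs of word characters.
    For str patterns, \\w in Python's re module is Unicode-aware."""
    return re.findall(r"\w+", text)


composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings look identical on screen but differ code point by code point.
print(composed == decomposed)

# After normalization and case folding they compare equal.
print(normalize_for_match(composed) == normalize_for_match(decomposed.upper()))

# Without normalization, \w+ stops at the combining mark and drops the accent.
print(find_words(decomposed))
print(find_words(unicodedata.normalize("NFC", decomposed)))
```

Normalizing before matching is only one possible approach. An engine
that follows more of TR 18 may handle canonical equivalence and
character properties itself.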
The presentation is intended primarily for users of regular
expressions rather than implementers of regular expression engines.
Note: This is a repeat of an IUC presentation -
http://www.global-conference.com/iuc27/program.html
Markus Scherer is the current ICU team manager and a software
engineer at IBM developing ICU and other Unicode/Globalization
solutions. He has contributed to many parts of ICU including
character conversion, bidi, normalization, Unicode properties and
collation. After graduating in computer science from the University of
Kaiserslautern, Germany, he worked on projects for wireless and mobile
computing with IBM. A strong interest in languages drew him to the
internationalization parts of those projects, followed by his current
focus on Unicode and globalization.
-----------------------------------------------------------------
IMUG has its own site on the World Wide Web:
http://www.imug.org.
Check it out! It is not currently up to date, but we're working on
fixing that.
For a map of our meeting location go to:
http://www.imug.org/events.html
and click on the map link.
We also post our meeting announcements and handouts at Yahoo! Groups:
http://www.yahoogroups.com, under the group name "imugi18n"
(IMUG-i-eighteen-n).
---------------------------------------------------------------------
To be added to the IMUG mailing list, please send an email to:
imugi18n-subscribe@yahoogroups.com