>Rick McGowan and Deborah Anderson



>Unicode and the Script Encoding Initiative: The Importance of Encoding

>Scripts for the Future



>Rick McGowan, a former engineer for NeXT and Apple, became involved

>in the development of the international character encoding standard

>Unicode after dealing with the chaos caused by multiple encoding

>standards. He is now a Vice President of the Unicode Consortium.

>Deborah Anderson is a researcher in the Dept. of Linguistics at UC

>Berkeley who, in cooperation with Rick McGowan, leads the Script

>Encoding Initiative, which aims to get the missing scripts -- modern

>minority scripts and historical scripts -- into the Unicode Standard.



>Rick will deliver a brief overview of Unicode -- its background and

>purpose -- and will explain why covering the world's scripts in

>Unicode is important. Deborah will address how the Script Encoding

>Initiative (SEI) is accomplishing the goal of encoding scripts. Both

>Rick and Deborah were featured in an article in the San Jose Mercury

>News, and the SEI project has been discussed in MIT's Technology

>Review and Multilingual Computing.


Rick began his talk by explaining that Unicode is a character encoding system much like ASCII, except that ASCII has codes for only the 128 characters commonly used in North America, while Unicode has codes for almost everyone's character sets (some 94,000 letters, dingbats, and ideographic characters, with Chinese being by far the biggest single subset). Unicode is the plumbing beneath the text handling of many modern systems, including Windows 2000 and XP, Microsoft Office 2000 and XP, Mac OS X, and Internet Explorer. It often goes by the names of its encoding forms, UTF-8 and UTF-16.
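The point about encoding forms can be illustrated with a small Python sketch (not from the talk): a single Unicode code point is an abstract number, and UTF-8 and UTF-16 are just different ways of serializing that number to bytes. The example character is my own choice.

```python
# One code point, two byte-level representations.
char = "\u0915"  # DEVANAGARI LETTER KA, code point U+0915

print(hex(ord(char)))            # the abstract code point: 0x915
print(char.encode("utf-8"))      # three bytes in UTF-8: b'\xe0\xa4\x95'
print(char.encode("utf-16-be"))  # two bytes in UTF-16 (big-endian): b'\x09\x15'

# Plain ASCII characters are unchanged under UTF-8,
# which is one reason UTF-8 spread so easily.
print("A".encode("utf-8"))       # b'A'
```

Any Unicode-aware system stores or transmits one of these byte forms, but the code point itself is what the standard assigns to each character.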

Unicode was created after the personal computer market grew well beyond the world where ASCII was dominant, and the large number of incompatible character coding systems became a political and social issue as well as a technical one. The hope was that by standardizing the language layer, economies of scale could be brought to bear so that people would not have to learn a "dominant local language" (like English) just to use a computer. They could simply plug in the right fonts and rule modules and use the machine in the language they already knew. Unicode now supports every speech community on the planet with more than five million native speakers, as well as a number of others, including some ancient scripts.

Unicode is the product of many minds working together to make interoperability among computers possible. The first attempt at a universal character encoding happened at Xerox in Palo Alto, where one system encapsulated Japanese, Arabic, and some Indian scripts. It was not commercially successful, but it started the discussion. In 1988 Dr. Joe Becker wrote "Unicode 88", a paper describing a system to which character codes could keep being added until everything was specified. Unicode, Inc. was founded in 1991, and a "merger" between that standard and ISO 10646 paved the way for a truly universal script coding system. Unicode 1.0 was published in two volumes in 1991 and 1992, and since then it has been a matter of filling in the more obscure written forms, one at a time.

Deborah Anderson then took over to explain that about 53 scripts have been encoded so far. She showed us a table of 95 scripts that have not yet been encoded, including some like the hieroglyphs of the Pharaohs in Egypt. She explained that most minority scripts get encoded because somebody goes off to the big city to get an education and starts wanting to use a computer in their native language. The minimum required to get a script entered is about $5,000 plus a lot of volunteer time from people who know the language well. Scripts that currently need financial sponsors for encoding include Balinese, Ol Chiki (used for Santali, spoken in eastern India and Bangladesh), and N'Ko (a West African script).

During Q&A a number of points came up:

There are a number of countries where dominant languages are spoken by populations that discriminate against speakers of other languages and their scripts. Typically those groups deny that a minority script matters until it is implemented, and then diplomatically ignore the fact.

One of the scripts they have encoded is a syllabic one that covers the sounds of the far-north native Canadian languages such as Inuit and Cree. It is expected that any story speakers of those languages have to tell can be told using that script.

One of the reasons Unicode takes a "go slow" approach to adding scripts is the need to get each one right the first time. Not doing so could have dramatically expensive consequences for people who invest in encoding their data, only to have it become incompatible with the standard.

For more information, please visit:



Tian Harter