Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard
List Price: $54.99
Price: $44.40
Product Description
Offers an in-depth introduction to the encoding standard and provides the tools and techniques necessary to create today's globally interoperable software systems. Presents strategies for implementing various aspects of the standard. Softcover.
Product Details
- Amazon Sales Rank: #742178 in Books
- Published on: 2002-09-26
- Original language: English
- Number of items: 1
- Binding: Paperback
- 896 pages
Editorial Reviews
From the Back Cover
"Rich has a clear, colloquial style that allows him to make even complex Unicode matters understandable. People dealing with Unicode will find this book a valuable resource."
--Dr. Mark Davis, President, The Unicode Consortium
As the software marketplace becomes more global in scope, programmers are recognizing the importance of the Unicode standard for engineering robust software that works across multiple regions, countries, languages, alphabets, and scripts. Unicode Demystified offers an in-depth introduction to the encoding standard and provides the tools and techniques necessary to create today's globally interoperable software systems.
An ideal complement to specifics found in The Unicode Standard, Version 3.0 (Addison-Wesley, 2000), this practical guidebook brings the "big picture" of Unicode into practical focus for the day-to-day programmer and the internationalization specialist alike. Beginning with a structural overview of the standard and a discussion of its heritage and motivations, the book then shifts focus to the various writing systems represented by Unicode--along with the challenges associated with each. From there, the book looks at Unicode in action and presents strategies for implementing various aspects of the standard.
Topics covered include:
- The basics of Unicode--what it is and what it isn't
- The history and development of character encoding
- The architecture and salient features of Unicode, including character properties, normalization forms, and storage and serialization formats
- The character repertoire: scripts of Europe, the Middle East, Africa, Asia, and more, plus numbers, punctuation, symbols, and special characters
- Implementation techniques: conversions, searching and sorting, rendering, and editing
- Using Unicode with the Internet, programming languages, and operating systems
With this book as a guide, programmers now have the tools necessary to understand, create, and deploy dynamic software systems across today's increasingly global marketplace.
About the Author
Richard Gillam is a senior development engineer at Trilogy, a leading developer of large-enterprise e-commerce solutions. He is a former member of IBM's Globalization Center of Competency, where he was one of the original designers of the open-source International Components for Unicode and was responsible for several of the international frameworks in the Java Class Libraries. Rich is a former columnist for C++ Report, a regular presenter at the International Unicode Conferences, and a Specialist Member of the Unicode Consortium.
Excerpt. © Reprinted by permission. All rights reserved.
As the economies of the world continue to become more connected together, and as the American computer market becomes more saturated, computer-related businesses are increasingly looking to markets outside the United States to grow their businesses. At the same time, companies in other industries are not only beginning to do the same thing (or, in fact, have been doing so for a long time), but are also turning to computer technology, especially the Internet, to grow their businesses and streamline their operations.
The convergence of these two trends means that it's no longer just an English-only market for computer software. To an ever greater extent, computer software is being used not just by people outside the United States or by people whose first language isn't English, but by people who don't speak English at all. As a result, interest in software internationalization is growing in the software development community.
Many things are involved in software internationalization: displaying text in the user's native language (and in different languages depending on the user), accepting input in the user's native language, altering window layouts to accommodate expansion or contraction of text or differences in writing direction, displaying numeric values according to local customs, indicating events in time according to the local calendar system, and so on.
This book isn't about any of these things. It's about something more basic, which underlies most of the issues listed above: representing written language in a computer. There are many different ways to do this; in fact, several approaches exist for just about every language that's been represented in computers. And that's the problem, too. Designing software that's flexible enough to handle data in multiple languages (or at least multiple languages that use different writing systems) has traditionally meant not just keeping track of the text, but also keeping track of which encoding scheme is being used to represent it. If you want to mix text in multiple writing systems, this bookkeeping becomes even more cumbersome.
The Unicode standard was designed specifically to solve this problem. It aims to be the universal character encoding standard, providing unique, unambiguous representations for every character in virtually every writing system and language in the world. The most recent version of Unicode provides representations for more than 90,000 characters.
Unicode has been around for 12 years now and is in its third major revision, with support for more languages being added with each revision. It has gained widespread support in the software community and in a wide variety of operating systems, programming languages, and application programs. Each of the semiannual International Unicode Conferences is better attended than the previous one, and the number of presenters and sessions at the conferences grows correspondingly.
Representing text isn't as straightforward as it appears at first glance; it's not merely as simple as picking out a bunch of characters and assigning numbers to them. First, you have to decide what a "character" is, which isn't as obvious in many writing systems as it is in English. In addition, you have to contend with issues such as how to represent characters with diacritical marks applied to them, how to represent clusters of marks that represent syllables, when differently shaped marks on the page are considered different "characters" and when they're considered just different ways of writing the same "character," and in what order to store the characters when they don't proceed in a straightforward manner from one side of the page to the other (for example, some characters stack on top of each other or appear in two parallel lines, or the reading order of the text on the page may zigzag around the line because of differences in natural reading direction).
The decisions you make on each of these issues for every character affect how various processes, such as comparing strings or analyzing a string for word boundaries, are performed, making them more complicated. Also, the sheer number of different characters that can be represented using the Unicode standard complicates many processes on text.
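The effect of these decisions on string comparison can be illustrated with a minimal sketch. It assumes a Java runtime that provides `java.text.Normalizer` (an API newer than the book itself), and the class and method names are hypothetical, chosen only for this example. The same accented letter is stored two ways, once as a single precomposed character and once as a base letter plus a combining mark; a raw comparison treats them as different, while a comparison after normalization does not.

```java
import java.text.Normalizer;

class DiacriticComparison {
    // "e with acute" as one precomposed code point: U+00E9
    static final String PRECOMPOSED = "\u00E9";
    // The same letter as U+0065 followed by U+0301 COMBINING ACUTE ACCENT
    static final String DECOMPOSED = "e\u0301";

    // Compares two strings after converting both to Normalization Form C.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Raw comparison looks only at the stored code units.
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));            // false
        // Normalizing first makes the two spellings compare as equal.
        System.out.println(canonicallyEqual(PRECOMPOSED, DECOMPOSED)); // true
    }
}
```

This is only one way to handle the problem; which normalization form to use, and when to apply it, depends on the application.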
For all of these reasons, the Unicode standard is a large, complex affair. Unicode 3.0, the last version published as a book, is 1,072 pages long. Even at this length, many of the explanations are fairly concise and assume the reader already has some degree of familiarity with the problems to be solved. It can be kind of intimidating.
This book aims to provide an easier entree into the world of Unicode. It arranges things in a more pedagogical manner, takes more time to explain the various issues and how they're solved, fills in pieces of background information, and adds information on implementation and existing Unicode support. It is this author's hope that this book will be a worthy companion to the standard itself, and will provide the average programmer and the internationalization specialist alike with all the information they need to effectively handle Unicode in their software.
About This Book
There are a few things you should keep in mind as you go through this book:
- This book assumes that the reader either is a professional computer programmer or is familiar with most computer-programming concepts and terms. Most general computer science jargon isn't defined or explained here.
- It's helpful, but not essential, if the reader has some understanding of the basic concepts of software internationalization. Many of those concepts are explained here, but if they're not central to one of the book's topics, they're not given a lot of time.
- This book covers a lot of ground, and it isn't intended as a comprehensive and definitive reference for every single topic it discusses. In particular, it does not repeat the entire text of the Unicode standard; the idea is to complement the standard, not replace it. In many cases, this book summarizes a topic or attempts to explain it at a high level, leaving it to other documents (typically the Unicode standard or one of its technical reports) to fill in all the details.
- The Unicode standard changes rapidly. New versions come out yearly, and small changes, new technical reports, and other things happen even more quickly. In Unicode's history, terminology has changed many times, and this will probably continue to happen occasionally. In addition, many other technologies use or depend on Unicode, and they are also constantly changing. I'm certainly not an expert on every single topic I discuss here. (In my darker moments, I'm not sure I'm an expert on any of them!) I have made every effort to ensure that this book is complete, accurate, and up-to-date, but I can't guarantee I've succeeded in every detail. In fact, I can almost guarantee that some information given here is either outdated or just plain wrong. Nevertheless, I have made every effort to minimize this problem, and I pledge to continue, with each future version, to try to bring it closer to being fully accurate.
- At the time of this writing (January 2002), the newest version of Unicode, Unicode 3.2, was in beta and thus still in flux. The Unicode 3.2 specification is scheduled to be finalized in March 2002, well before this book actually hits the streets. With a few exceptions, I don't expect major changes between now and March, but they're always possible. As a consequence, the Unicode 3.2 information in this book may wind up wrong in some details. I've tried to flag all of the Unicode 3.2-specific information, and I've tried to indicate the areas that I think are still in the greatest amount of flux.
- Sample code in this book is almost always in Java. This choice was made partially because Java is the language I use in my regular job, and thus the programming language I think in these days. I also chose Java because of its increasing importance and popularity in the programming world in general and because Java code tends to be somewhat easier to understand than, say, C (or at least no more difficult). Because of Java's syntactic similarity to C and C++, I also hope the examples will be reasonably accessible to C and C++ programmers who don't also program in Java.
- The sample code is provided for illustrative purposes only. I've gone to the trouble, at least with the examples that can stand alone, to ensure that the examples compile, and I've tested them to verify that I didn't make any obvious stupid mistakes, but they haven't been tested comprehensively. They were also written with more of an eye toward explaining a concept than being directly usable in any particular context. Incorporate them into your code at your own risk!
- I've tried to define jargon the first time I use it or to indicate a full explanation is coming later. There's also an enormous glossary at the end of this book to which you can refer if you come across an unfamiliar term that isn't defined.
- Numeric constants--especially numbers representing characters--are generally shown in hexadecimal notation. Hexadecimal numbers in the text are always written using the 0x notation familiar to C and Java programmers.
- Unicode code point values are shown using the standard Unicode notation, U+1234, where "1234" is a hexadecimal number containing four to six digits. In many cases, a character is referred to by both its Unicode code point value and its Unicode name: for example, "U+0041 LATIN CAPITAL LETTER A." Code unit values in one of the Unicode transformation formats are shown using the 0x notation.
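The two notations described in the last two items can be sketched in a few lines of Java. This is an illustration only, not code from the book: the class and method names are hypothetical, and it assumes a Java version in which `Character.toChars` and `Character.getName` are available. Code points are printed in U+ notation, and the UTF-16 code units that represent them are printed in 0x notation.

```java
class CodePointNotation {
    // Formats a code point in U+ notation, padded to at least four hex digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Lists the UTF-16 code units for a code point in 0x notation.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            // Cast to int: the %X conversion does not accept a char argument.
            sb.append(String.format("0x%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A character in the Basic Multilingual Plane: one code unit.
        System.out.println(toUPlus(0x0041) + " " + Character.getName(0x0041)
                + " -> " + utf16Units(0x0041));
        // A supplementary character: five hex digits, two code units.
        System.out.println(toUPlus(0x1D11E) + " -> " + utf16Units(0x1D11E));
    }
}
```

For U+0041 the single code unit is 0x0041; for U+1D11E the code units are the surrogate pair 0xD834 0xDD1E.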
The Author's Journey
Like many people in the field, I fell into software internationalization by happenstance. I've always been interested in language--written language in particular--and (of course) in computers. But my job had never really dealt with this issue directly.
In the spring of 1995, that situation changed when I went to work for Taligent. Taligent, you may remember, was the ill-fated joint venture between Apple C...
Customer Reviews
Want to understand the Unicode standard? Start here!
The book has three main parts:
(1) Unicode in essence: an architectural overview of the Unicode standard (six chapters) where you also get bits of terminology and history.
(2) Unicode in depth: A guided tour of the character repertoire (six chapters) where you get a lot about writing systems that can be represented in Unicode, and less about the Unicode characters.
(3) Unicode in action: implementing and using the Unicode standard (five chapters) where you get information aimed at computer programmers that wish to implement parts of the standard or write applications dealing with multilingual text.
Though this book is very long (~800 pages), it is still shorter and much clearer than the Unicode standard itself (over 1000 pages).
Code examples are in Java, but they are not meant to be complete solutions, so there is no accompanying website or CD.
Professional programmers are the target audience of this book. The reader is faced with many topics in linguistics, history and data structures. Readers with a computer science background will probably appreciate how classic algorithms were adapted and how data structures are used for character sets with significantly more than 256 characters.
The author of the book states that the book is about "representing written language in a computer", which may be misleading to some readers. The book is about the Unicode standard. Obviously, there are many other ways to represent written language other than the methods described in the book. As chapter 2 teaches... There are always more ways (sometimes better ways) to represent your data.
Part 2 of the book will not cover every writing system of the world. A better book for that would be "The world's writing systems".
Part 3 is probably the most interesting and useful part for programmers (though the first part is important, in my opinion, to those who want to UNDERSTAND Unicode).
You can learn about a lot of things and skip many too (depending on your interest and need). I believe that most readers will skip most of the topics.
This is not a book that is read lightly, but it is hellovalot easier and more fun to read than the Unicode standard itself. It appears that once you read this book and get what you want from it, you will end up going to read the Unicode standard only to see updates, hopefully, not for clarifications.
I work in Natural Language Processing, and as a Hebrew speaker I also have a lot of text in Hebrew (almost always Hebrew mixed with other languages, e.g. documents that contain Hebrew with some English). This book helps you understand the difficulties and the current implementations, and gives you solid ground to start thinking about how you can make things better. Current infrastructure for Hebrew is either poor or imperfect, and in most cases the better solutions are proprietary. There always seem to be problems representing 'plain' text in more than one language without stepping into the trap of the soup of different ways to do it. Unicode is one way to do it (arguably not the best, yet it is alive and growing). I hope this book can help more people understand what they are up against, clear the fog, and do better implementations.
Perfect Companion Volume to the Standard Itself.
This book is an outstanding companion volume to the Unicode standard. In fact, if you had to pick one, you'd quite possibly be better off owning this book INSTEAD of the standard. The author displays an impressive knowledge of the world's writing systems and of the inner workings of the Unicode standardization process.
Part I of this book starts with the history of character encoding standards, from Morse code to today. It then presents a thorough review of the Unicode architecture and associated standards. The information presented was mostly excellent, although I found the section describing SCSU a little bit too sketchy (and the actual code in part III not entirely satisfactory to fill in the gaps).
Part II gives an overview of the various writing systems and character ranges represented in Unicode. Even for a nontechnical audience, this part would be fascinating with all the typographical and historical trivia it presents.
Part III discusses various algorithms applicable to text processing in a Unicode context. I must admit that I found this part a bit of a letdown. Many of the algorithms are only sketched out because discussing them in detail would be beyond the scope of the book. Quite possibly, the pages dedicated to these algorithms would have been better spent presenting examples of code using the various existing APIs for handling Unicode (Java, ICU, Perl, Windows, MacOS X).
This does not take away from the fact that this is a great book that any programmer interested in Unicode should own.
A great manual for the practical use of Unicode
Unicode Demystified is a great manual and a good read. It earns a place on the bookshelf of programmers who deal with modern text processing, which is based on the Unicode standard. It is a great resource for anyone involved in software internationalization and localization.
Gillam provides a lot of useful details, history and explanations for the structure of the character set, and shows how to use it. The book is a companion to the print and online resources of the Unicode standard itself, and provides the glue to many of the pieces, the how-to's and basic data structures.
For example, the Unicode encodings UTF-8/16/32 (and BOM) are explained very well, bidirectional text is discussed with a lot of insight, and the family of Indic scripts with their special features is presented with examples for how to encode Indic text.
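As a rough illustration of the encodings and the BOM mentioned above (a sketch only, not taken from the book; the class and helper names are hypothetical), the following Java snippet serializes the same two-character string with several of the standard charsets and prints the resulting bytes. UTF-32 is left out because `java.nio.charset.StandardCharsets` does not define a constant for it.

```java
import java.nio.charset.StandardCharsets;

class EncodingForms {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A\u00E9"; // U+0041 followed by U+00E9
        // UTF-8: one byte for the ASCII letter, two for the accented one.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));    // 41 C3 A9
        // UTF-16 with an explicit byte order and no BOM.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41 00 E9
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // 41 00 E9 00
        // Java's plain "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41 00 E9
    }
}
```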




