Chapter 8. Text Processing

Introduction

You may recall a time when the string was seemingly a very simple data type. Computing the length of a string or converting it to lowercase or uppercase was a trivial exercise. (However, your trivial solution almost certainly worked for only one particular language or locale.)

Well, no more. Unicode is considerably more complex than the strings of yore. With characters that occupy one or many bytes, simple operations like computing the string length are no longer so simple. There are special cases like the famous “Turkish I”: in Turkish, the ordinary capital letter I (U+0049) lowercases to a special dotless ı (U+0131) instead of the usual dotted i (U+0069). Changing the case of a string ...
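To make these two pitfalls concrete, here is a minimal sketch in Python. Python is not this chapter's language and is used only for illustration. The helper turkish_lower is a hypothetical name introduced here, not a standard library function; it assumes that only the two Turkish-specific capitals need special handling before default lowercasing.

```python
import unicodedata

# The same visible character "é" can be stored in two ways.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# "Length" depends on what you count: code points or encoded bytes.
print(len(composed), len(composed.encode("utf-8")))      # 1 2
print(len(decomposed), len(decomposed.encode("utf-8")))  # 2 3

# Normalizing to NFC makes the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase text using the Turkish I rules.

    Assumption: mapping the two Turkish-specific capitals by hand and then
    applying the default, locale-independent lower() is sufficient.
    """
    text = text.replace("I", "\u0131")   # capital I      -> dotless ı (U+0131)
    text = text.replace("\u0130", "i")   # capital İ      -> dotted i (U+0069)
    return text.lower()


# Python's built-in lower() ignores locale, so it always yields a dotted i.
print("ISTANBUL".lower())         # istanbul
print(turkish_lower("ISTANBUL"))  # ıstanbul
```

The point of the sketch is that neither "length" nor "lowercase" has a single answer: the first depends on the unit being counted, the second on the language of the text.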
