Stringing along
2007-09-17 01:10
I've come to the conclusion that most programming languages' idea of a string - an array or list of characters - is wrong, or at least too simplistic.
Most languages designed before Unicode have the idea that characters are bytes, which is hopelessly inadequate. Languages designed before Unicode was mature (e.g. Java) have the idea that characters are 16 bits, which is still inadequate. The next reasonable step is to use a 32-bit type for characters, but that wastes at least a third of the memory you use for strings, since Unicode needs at most 21 bits.
If your language's character type is too narrow then you have to use an encoding scheme (such as a Unicode Transformation Format or ISO-2022) to fit a larger repertoire into the available space. However, once you have done that, your char is no longer a character. This causes a number of problems:
- strlen() no longer counts characters.
- random access into strings can lead to nonsense, e.g. if you miss a code shift sequence, don't land on the start of a UTF-8 sequence, or accidentally strip off a BOM. (Both of these problems are illustrated in the sketch after this list.)
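To make this concrete, here's a minimal C sketch (it assumes the string literal ends up UTF-8 encoded in the object code, which depends on your compiler and source encoding):

    /* strlen() reports bytes, not characters, and blind indexing can
     * land inside a multi-byte sequence. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *s = "na\xc3\xafve"; /* "naïve" in UTF-8: 5 characters, 6 bytes */
        printf("strlen = %zu\n", strlen(s));            /* 6, not 5 */
        printf("s[2] = 0x%02x\n", (unsigned char)s[2]); /* 0xc3: a lead byte, not a character */
        return 0;
    }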
In fact, even in the absence of encodings you have similar problems, e.g.
- the length of a string is not the same as the number of glyphs, because of combining characters.
- even in a fixed-width font, the number of glyphs is not the same as the width of the string on the display, because some characters are double-width. (Both effects appear in the sketch after this list.)
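A quick sketch of the mismatch, using the POSIX wide-character functions (it assumes an en_US.UTF-8 locale is installed, and wcswidth is an XSI extension, so not every libc has it):

    /* Byte count vs. codepoint count vs. display columns. */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void) {
        if (!setlocale(LC_ALL, "en_US.UTF-8")) return 1;
        /* 'e' + U+0301 combining acute, then fullwidth A and B */
        const char *s = "e\xcc\x81\xef\xbc\xa1\xef\xbc\xa2";
        wchar_t w[16];
        size_t n = mbstowcs(w, s, 16);          /* codepoints */
        printf("bytes=%zu codepoints=%zu columns=%d\n",
               strlen(s), n, wcswidth(w, n));   /* 9 bytes, 4 codepoints, 5 columns */
        return 0;
    }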
Even ASCII is not simple enough to be immune from problems:
- random access into strings can produce garbage if the string contains escape sequences.
- some logical characters are not represented as a single byte, e.g. newline as CR LF.
- in some contexts you can use typewriter-style backspace/overstrike to implement something like combining characters.
On the other hand, if your language's string type is not based on bytes, then you have the problem that strings need to be transcoded from their external form into whatever the libraries prefer - and since UTF-8 is winning the encoding wars, that external form is usually UTF-8.
I think that the solution to this problem is to embrace it, instead of trying to hide behind an inadequate abstraction. The essence of the bug is that, while it is meaningful to talk about a Unicode codepoint in isolation, a codepoint is not a character and a string is not an array of codepoints: there is a layer of encoding between the conceptual sequence of codepoints and the representation of strings. (This implies that, although D is refreshingly agnostic about the size of its characters, it still falls into the array trap and is chauvinistic towards a subset of Unicode transformation formats.)
What this means is that the language should not have a character or string type, but instead a type for binary blobs like Erlang's. The programmer manipulates binary values using pattern matching (it should probably support Erlang-style matching and some flavour of regular expressions) and some reasonably efficient way of construction by concatenation (as well as its syntax for binary comprehensions, Erlang has the concept of IO lists, which support scatter/gather IO). The idea is to shift from thinking of strings as sequences of discrete characters to thinking of them as structured mass data.
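For a rough C flavour of the kind of matching I mean, here is the Erlang pattern <<Len:16/big, Body:Len/binary, Rest/binary>> written out as an explicit destructuring step over a (pointer, length) view; the blob type is invented for illustration:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct blob { const uint8_t *ptr; size_t len; };

    /* Match a 16-bit big-endian length prefix followed by that many
     * bytes; on success fill in body and rest, like the Erlang pattern. */
    static bool match_len16(struct blob in, struct blob *body, struct blob *rest) {
        if (in.len < 2) return false;
        size_t n = ((size_t)in.ptr[0] << 8) | in.ptr[1];
        if (in.len - 2 < n) return false;
        body->ptr = in.ptr + 2; body->len = n;
        rest->ptr = in.ptr + 2 + n; rest->len = in.len - 2 - n;
        return true;
    }

A language with pattern syntax would generate code like this; the interesting design question is how much of the (pointer, length) plumbing it hides.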
Note that (so far) this design has no built-in preference for any particular string encoding, which should help to make it flexible enough to live gracefully in environments that favour all sorts of textual data, including ones not designed yet. However, you need to support string constants reasonably gracefully, which either means that your source encoding is the same as the encoding you deal with at run time (so that the blobs in the source become identical blobs in the object code) or that you need a good way to transcode them. Ideally transcoding should happen only once, preferably at compile time, but it's probably OK to do it during static initialization. (Note that similar considerations apply to compiling regular expressions.)
To a large extent, what I'm doing is lifting the string representation chauvinism problem from the language level to the library level, and libraries are just as good as languages at growing ugly calluses around poorly-designed features - though you usually have to do less reimplementation when switching libraries. I'm also aiming to encourage a style of string manipulation more like perl's than C's. To succeed properly, though, it'll have to go further with encouraging good use of library support for international text - but at that point I'm stepping on to thin ice.
no subject
Date: 2007-09-17 02:05 (UTC)
I'm not sure that's true of the UTF-8 case. It's always possible to resynchronise UTF-8, and any competent parser ought to do so rather than just generating garbage. The worst case is that you end up with a single invalid character (though, of course, if you were expecting a useful string, the effect of that might be somewhat worse).
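Resynchronisation is cheap because lead bytes and continuation bytes are disjoint bit patterns; a sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* After an error at index i, skip to the next byte that is not a
     * continuation byte (10xxxxxx), i.e. the next possible lead byte. */
    static size_t utf8_resync(const uint8_t *s, size_t i, size_t len) {
        while (i < len && (s[i] & 0xC0) == 0x80)
            i++;
        return i;
    }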
no subject
Date: 2007-09-17 07:03 (UTC)
Also, it's pragmatically useful to treat UTF-8 parse errors as ISO-8859-1, since that makes the software less brittle. Unix makes your locale setting essentially global - in particular, it does not support charset and language metadata on text files - so if you have a UTF-8 locale but sometimes need to process non-UTF-8 data, you want your traditional text processing tools to have relaxed parsers so that you don't have to keep frobbing your environment to make them work correctly.
However, if you do that, you sacrifice UTF-8's ability to resynchronize.
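A sketch of such a relaxed decoder (simplified to 2-byte sequences at most, with no overlong checks; the point is that every byte now decodes to something, so there are no errors left to resynchronize from):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one character at *i (precondition: *i < len): try UTF-8
     * first, and treat any invalid byte as ISO-8859-1, whose codepoints
     * coincide with Latin-1 byte values. */
    static uint32_t next_char_relaxed(const uint8_t *s, size_t len, size_t *i) {
        uint8_t b = s[*i];
        if (b < 0x80) { (*i)++; return b; }              /* ASCII */
        if ((b & 0xE0) == 0xC0 && *i + 1 < len
            && (s[*i + 1] & 0xC0) == 0x80) {             /* valid 2-byte sequence */
            uint32_t cp = ((uint32_t)(b & 0x1F) << 6) | (s[*i + 1] & 0x3F);
            *i += 2; return cp;
        }
        (*i)++; return b;                                /* Latin-1 fallback */
    }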
no subject
Date: 2007-09-17 10:29 (UTC)
Provided you retain knowledge of the extent of the entire string it's possible to determine whether you've plonked your pointer at a byte in the middle of a UTF-8 sequence or a byte that's an 8859 fallback. In my experience (which isn't extensive, but is at least recent and vivid) you have to retain that knowledge anyway for a whole host of other reasons.
In our code base, we model strings as forward-iterable sequences of characters. This lets us keep stuff as UTF-8 internally and, perhaps surprisingly, doesn't get in the way much even though we spend quite a bit of time picking strings apart. We've not even bothered with a function to step backwards yet.
Yes, we'd have been happier if the standard library had given us a cheap way of manipulating UTF-8, but we're in an embedded environment and don't want to bring in however many megabytes of locale guff.
no subject
Date: 2007-09-17 10:07 (UTC)
no subject
Date: 2007-09-17 06:07 (UTC)
This meant that all stream-opening constructs grew the concept of external coding, so you can transcode on input or output (or, indeed, have a binary rather than string-based external source/sink).
no subject
Date: 2007-09-17 07:23 (UTC)
If the idea of external encoding is similar to Unix locales then it is inadequate. You need to be able to transcode different parts of the input data in different ways depending on the data itself, for example when dealing with MIME objects transferred via HTTP or email.
A lot of software will want to work entirely in one encoding, and it makes sense for it to do transcoding and I/O together. But this should be understood as a convenient wrapper around two separate layers, where the I/O layer just deals with binary streams without any textual semantics.
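In POSIX terms the transcoding layer is just a byte-to-byte function, e.g. iconv, which any I/O wrapper can then be built on (error handling mostly elided in this sketch):

    #include <iconv.h>
    #include <stddef.h>

    /* Convert inlen bytes of ISO-8859-1 to UTF-8; returns the number
     * of output bytes, or (size_t)-1 on failure. No file descriptors
     * in sight: transcoding is separate from I/O. */
    static size_t latin1_to_utf8(const char *in, size_t inlen,
                                 char *out, size_t outlen) {
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) return (size_t)-1;
        char *ip = (char *)in, *op = out;
        size_t il = inlen, ol = outlen;
        size_t r = iconv(cd, &ip, &il, &op, &ol);
        iconv_close(cd);
        return r == (size_t)-1 ? (size_t)-1 : (size_t)(op - out);
    }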
no subject
Date: 2007-09-17 08:04 (UTC)
The reason SBCL does what it does is that the underlying language standard specifies a string as "An ARRAY of CHARACTER", and using variable-size characters depending on what the programmer asks for is the least broken model you can fit into that straitjacket.
no subject
Date: 2007-09-17 09:34 (UTC)
Yes, you are right that stackable I/O layers can give you a nice API to some of MIME, but it's too heavyweight for RFC 2047 encoded words, for example. Text transcoding must be separate from I/O but pluggable into it, like compression, encryption, base64, etc.
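For example, an encoded word like =?ISO-8859-1?Q?caf=E9?= needs charset conversion applied to a small slice of a header, nowhere near an I/O boundary. A simplified sketch of just the Q-decoding step (the =?charset?Q?...?= framing is assumed to have been parsed off already, and the resulting bytes still need charset conversion):

    #include <ctype.h>
    #include <stddef.h>

    static int hexval(int c) {
        return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
    }

    /* Decode the Q-encoded payload: '_' is a space, '=XX' is a hex
     * escape, everything else passes through. */
    static size_t q_decode(const char *in, size_t len, unsigned char *out) {
        size_t o = 0;
        for (size_t i = 0; i < len; i++) {
            if (in[i] == '_') out[o++] = ' ';
            else if (in[i] == '=' && i + 2 < len
                     && isxdigit((unsigned char)in[i+1])
                     && isxdigit((unsigned char)in[i+2])) {
                out[o++] = (unsigned char)(hexval(in[i+1]) * 16 + hexval(in[i+2]));
                i += 2;
            } else out[o++] = (unsigned char)in[i];
        }
        return o;
    }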
no subject
Date: 2007-09-17 09:02 (UTC)
Have I got this right? What you're suggesting is that strings should be an array or list of pointers.
There are logical reasons why this is preferable to the current mess - library dependence is better than language dependence - but I can't help thinking that this particular solution is worse than the original problem: a lot of common string operations will become rather difficult. Strlen() is the obvious one - how much do we have to do before we know the size of the blobs? - but there are more subtle ones like Ucase() and casting to numbers, which, while arguably bad programming practice, are very widely used.
Plus the whole idea of hoping that all participants have the same version of a particular string library replaces the problem of different parties using different string encoding schemes with a new problem, with the only saving grace being that all parties involved can at least compile... And maybe run if the library standard provides enough information for a flexible memory allocation scheme to manage the varying sizes of the blobs.
no subject
Date: 2007-09-17 10:50 (UTC)
strlen has unclear semantics. If it's the size of the blob then it has a trivial O(1) implementation, since blobs need to know their sizes. Alternatively it might be the number of codepoints, the number of glyphs, the displayed width, etc., all of which depend on contextual information (encoding, charset, font). The library needs to make it convenient to deal with this context, but I don't have any suggestions how. Given a convenient library, operations like case-coercion and parsing out numbers are straightforward. (I dislike string/number type ambiguity and the idea that it makes sense to hide parsing and formatting behind language-level cast operations - they make more sense as library calls.)
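To make the distinction concrete (the blob type here is illustrative):

    #include <stddef.h>
    #include <stdint.h>

    struct blob { const uint8_t *ptr; size_t len; };

    /* The size in bytes is O(1): blobs know their length. */
    static size_t blob_size(struct blob b) { return b.len; }

    /* The number of codepoints is O(n), and only meaningful once you
     * know the encoding - here, counting UTF-8 lead bytes. */
    static size_t utf8_codepoints(struct blob b) {
        size_t n = 0;
        for (size_t i = 0; i < b.len; i++)
            if ((b.ptr[i] & 0xC0) != 0x80) n++;
        return n;
    }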
The exact representation of blobs and how their memory is managed would have to depend on how high-level the language is, i.e. how much complexity it's willing to hide. A lower-level language probably wants to expose the representation of its scatter/gather data structure, and when copying and sharing happen. A higher-level language probably wants to hide these details under the blob abstraction. For example, Erlang's implementation of binaries is fairly complicated and varies depending on the binary's size, according to memory management and copying trade-offs, but the language still has a programmer-visible scatter/gather structure for fast appends.
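A low-level sketch of the scatter/gather idea using POSIX primitives - the point being that concatenation is O(1) list surgery, and bytes are only walked at the system-call boundary (illustrative, not production code):

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    struct chunk { struct iovec iov; struct chunk *next; };

    /* Flatten a chain into an iovec array and let the kernel gather
     * the pieces, with no userspace copying. Chains longer than 64
     * chunks would need several calls. */
    static ssize_t write_chunks(int fd, const struct chunk *c) {
        struct iovec v[64];
        int n = 0;
        for (; c != NULL && n < 64; c = c->next)
            v[n++] = c->iov;
        return writev(fd, v, n);
    }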
The more I look at it the simpler it gets...
Date: 2007-09-17 13:54 (UTC)
I expect you've considered this already: a standard format for the library, in which calls 0 to 127 return the corresponding ASCII characters.
Yes, this will be abused horribly, but a lot of the common objections to this scheme will probably go away. However, I doubt that it is possible (and I am certain that it is far, far less desirable) to achieve any further backward-compatibility with the existing UTF-8 'standard'.
...Or is it? UTF-8 would become just another library and it may as well be the system default when right-to-left late-period Ptolemaic Coptic can't be found.
Speaking of which, how (or where, as in 'which layer') should rtl and ltr be implemented? If you're using linked lists rather than arrays, there is scope to tidy up a truly biblical mess. And to please the deep-sea-geeks by encoding a suitably exotic branched-list schema for Classical Klingon.
Back in the real world, I am entirely in agreement with your point here:
Nothing you can do will ever eliminate bad coding but, in higher-level languages, there is a clear need for a better structure for strings. This may sound counterintuitive - after all, the complexity will be hidden - but a day spent inserting Unicode API calls into C++ libraries in an environment designed for Excel's VBA 'everything's-a-string' macro generator is a masterclass in the limitations of the current state of play.
Re: The more I look at it the simpler it gets...
Date: 2007-09-17 14:28 (UTC)
Handling rtl and ltr is (I think) mostly the job of the text rendering library - Pango, or something like that. You're probably right that it's useful to be able to segment binaries into logical sections (e.g. different lines or paragraphs, as well as segments of different directions) whilst still chaining them together in a meaningful way (e.g. DOM). So you probably don't want to entirely hide scatter/gather behind the blob abstraction.
Strings and blobs need different APIs
Date: 2007-09-17 10:30 (UTC)
Cocoa's NSData class (http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSData_Class/Reference/Reference.html) (for storing raw binary data) is much simpler than its string class: it only stores a raw buffer of bytes and a buffer size.
You can, of course, create strings from raw binary data, but when you do, an encoding must be specified. Ditto for transforming strings to binary data: you have to give the output encoding format.
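In C terms the shape of that API would be something like the following - all of these names are hypothetical, just by analogy with Cocoa:

    /* Raw bytes have no textual semantics; conversions in either
     * direction must name an encoding and can fail. */
    typedef struct data data_t;      /* bytes + length */
    typedef struct string string_t;  /* text, representation private */
    typedef enum { ENC_UTF8, ENC_UTF16, ENC_LATIN1 } encoding_t;

    string_t *string_from_data(const data_t *d, encoding_t enc); /* NULL on bad input */
    data_t   *data_from_string(const string_t *s, encoding_t enc);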
For C++ mavens, the ICU C++ UnicodeString class (http://icu-project.org/apiref/icu4c/classUnicodeString.html) has an excellent API, similar to NSString.
Re: Strings and blobs need different APIs
Date: 2007-09-17 12:14 (UTC)
Re: Strings and blobs need different APIs
Date: 2007-09-17 23:54 (UTC)
I think the biggest problem is that it takes a long time to educate programmers that strings are a lot more complicated than just raw bytes of memory ending in NULL, and Unicode is a lot more than "16-bit strings". C strings are horrible to use not only because their API is horrible, but because their innate representation (NULL termination) is error-prone and slow.
Also, in most real-world usage, accessing individual characters in a string is very rare. I work on a website-building application where we have extremely heavy text manipulation, and we have a total of eighteen lines of code that index into strings for individual characters. Five of those lines are for checking the program's registration code and six are for syntax highlighting, leaving a measly seven lines for every other aspect of the program. I think having a character type as well as a string type is fine, although, as with strings, it's important to assign that character some sort of code point (i.e. use Unicode or whatever).
I think it's also OK (and healthy) to have multiple string classes, since strings can be used in so many different ways. Ideally you'd want all the different string classes to have the same API, but it's not completely necessary.
Re: Strings and blobs need different APIs
Date: 2007-09-19 07:33 (UTC)
Manipulating strings is a fundamental part of writing software, and any language that doesn't provide a suitable abstraction for doing so is quite simply going to fail. By getting rid of built-in string types, all you're doing is shifting around the cognitive load.
The problem here, I think, is that C strings are a terrible starting-point for any discussion of how to do things better. If all you have to start with is a dumb byte-sequence that just happens to be called "string" for historical reasons, then choosing to go back to treating it like a dumb byte sequence seems like a good idea. :)
Re: Strings and blobs need different APIs
Date: 2007-09-20 08:26 (UTC)
no subject
Date: 2007-09-17 11:56 (UTC)
"aiming to encourage a style of string manipulation more like perl's than C's"
This seems like a good way to sell the idea, given that people are often happy (rightly or not!) with their quick-fix solutions to their character encoding problems.
no subject
Date: 2007-09-17 22:07 (UTC)
The standard C library functions strcmp, strncpy and snprintf were also useful. For video game localization, there's not much in the way of sophisticated string processing to do.