[personal profile] fanf

I've come to the conclusion that most programming languages' idea of a string - an array or list of characters - is wrong, or at least too simplistic.

Most languages designed before Unicode have the idea that characters are bytes, which is hopelessly inadequate. Languages designed before Unicode was mature (e.g. Java) have the idea that characters are 16 bits, which is still inadequate. The next reasonable step is to use a 32 bit type for characters, but that wastes at least a third of the memory you use for strings, since Unicode needs at most 21 bits.

If your language's character type is too narrow then you have to use an encoding scheme (such as a Unicode Transformation Format or ISO-2022) to fit a larger repertoire into the available space. However, once you have done that, your char is no longer a character. This causes a number of problems:

  • strlen() no longer counts characters.
  • random access into strings can lead to nonsense, e.g. if you miss a code shift sequence, don't land on the start of a UTF-8 sequence, or accidentally strip off a BOM.
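
These two failure modes are easy to demonstrate. The post is language-agnostic, but here is a small Python sketch (Python is just the illustration vehicle): with UTF-8 as the encoding, the byte count and the codepoint count disagree, and slicing at an arbitrary byte offset can land mid-sequence.

```python
# A char is no longer a character: in UTF-8 the byte length, the
# codepoint count, and the valid slice boundaries all disagree.
s = "naïve"                      # five codepoints
b = s.encode("utf-8")            # six bytes: "ï" encodes as two

assert len(s) == 5
assert len(b) == 6

# Slicing the byte string at offset 3 lands inside the two-byte
# encoding of "ï", leaving bytes that are not valid UTF-8 on their own.
try:
    b[:3].decode("utf-8")
except UnicodeDecodeError:
    pass  # b"na\xc3" ends with a truncated multi-byte sequence
```

So a byte-oriented strlen over-counts, and a byte-oriented substring can produce garbage, exactly as the bullets above describe.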

In fact, even in the absence of encodings you have similar problems, e.g.

  • the length of a string is not the same as the number of glyphs, because of combining characters.
  • even in a fixed width font, the number of glyphs is not the same as the width of the string on the display, because some characters are double-width.
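
Both encoding-free problems can be shown with Python's standard unicodedata module (again just as an illustration): an accented letter can be one codepoint or two depending on normalization, and East Asian width data marks which characters take two columns.

```python
import unicodedata

# "é" as base letter plus combining acute: two codepoints, one glyph.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
assert len(decomposed) == 2
assert len(composed) == 1

# Double-width: a CJK ideograph occupies two columns even in a
# fixed-width font, so glyph count is not display width either.
assert unicodedata.east_asian_width("漢") == "W"   # Wide
assert unicodedata.east_asian_width("a") == "Na"   # Narrow
```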

Even ASCII is not simple enough to be immune from problems:

  • random access into strings can produce garbage if the string contains escape sequences.
  • some logical characters are not represented as a single byte: a logical newline is often the two-byte sequence CR LF.
  • in some contexts you can use typewriter-style backspace/overstrike to implement something like combining characters.

On the other hand, if your language's string type is not based on bytes, then you have the problem that strings need to be transcoded from their external form into whatever the libraries prefer, since UTF-8 is winning the encoding wars.

I think that the solution to this problem is to embrace it, instead of trying to hide behind an inadequate abstraction. The essence of the bug is that, while it is meaningful to talk about a Unicode codepoint in isolation, a codepoint is not a character and a string is not an array of codepoints: there is a layer of encoding between the conceptual sequence of codepoints and the representation of strings. (This implies that, although D is refreshingly agnostic about the size of its characters, it still falls into the array trap and is chauvinistic towards a subset of Unicode transformation formats.)

What this means is that the language should not have a character or string type, but instead a type for binary blobs like Erlang's binaries. The programmer manipulates binary values using pattern matching (it should probably support Erlang-style matching and some flavour of regular expressions) and some reasonably efficient way of construction by concatenation (as well as its syntax for binary comprehensions, Erlang has the concept of IO lists, which support scatter/gather IO). The idea is to shift from thinking of sequences of discrete characters to thinking about strings as structured mass data.
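
Erlang's binary syntax has no direct analogue in most languages, but the style of programming it enables can be roughly sketched in Python over plain bytes objects: destructure a blob by pattern matching rather than by character indexing, and build output by concatenating fragments once at the end, in the spirit of an IO list.

```python
import re
import struct

# Rough sketch of "strings as structured mass data": parse a
# length-prefixed record (2-byte big-endian length, then payload)
# by matching structure, not by walking characters.
blob = struct.pack(">H", 5) + b"hello" + b"rest"
(length,) = struct.unpack_from(">H", blob, 0)
payload = blob[2:2 + length]
assert payload == b"hello"

# Regular-expression pattern matching works directly on the binary.
m = re.match(rb"(?P<word>[a-z]+)", payload)
assert m.group("word") == b"hello"

# Construction by concatenation, like an IO list flattened at the
# end: accumulate fragments, join once.
fragments = [b"GET ", b"/index.html", b" HTTP/1.1\r\n"]
request = b"".join(fragments)
assert request == b"GET /index.html HTTP/1.1\r\n"
```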

Note that (so far) this design has no built-in preference for any particular string encoding, which should help to make it flexible enough to live gracefully in environments that favour all sorts of textual data, including ones not designed yet. However, you need to support string constants reasonably gracefully, which either means that your source encoding is the same as the encoding you deal with at run time (so that the blobs in the source become identical blobs in the object) or you need a good way to transcode them. Ideally transcoding should happen only once, preferably at compile time, but it's probably OK to do it during static initialization. (Note that similar considerations apply to compiling regular expressions.)
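
The "transcode once, at static initialization" idea can be sketched in Python, where module-level constants are evaluated once at import time (the names here are hypothetical, chosen only for the sketch): the constant is transcoded and the regex compiled before the hot path ever runs.

```python
import re

# Transcoded once at import time ("static initialization"),
# not on every use; the hot path only reuses the result.
GREETING_UTF16 = "hello".encode("utf-16-le")

# Likewise the regex is compiled exactly once.
WORD = re.compile(rb"[a-z]+")

def handler(data: bytes) -> bytes:
    # Hot path: no per-call transcoding or recompilation.
    return GREETING_UTF16 if WORD.match(data) else b""
```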

To a large extent, what I'm doing is lifting the string representation chauvinism problem from the language level to the library level, and libraries are just as good as languages at growing ugly calluses around poorly-designed features - though you usually have to do less reimplementation when switching libraries. I'm also aiming to encourage a style of string manipulation more like Perl's than C's. To succeed properly, though, it'll have to go further with encouraging good use of library support for international text - but at that point I'm stepping onto thin ice.

From: [identity profile] hairyears.livejournal.com

I expect you've considered this already: a standard format for the library, in which calls 0 to 127 return the corresponding ASCII characters.

Yes, this will be abused horribly, but a lot of the common objections to this scheme will probably go away. However, I doubt that it is possible (and I am certain that it is far, far less desirable) to achieve any further backward-compatibility with the existing UTF-8 'standard'

...Or is it? UTF-8 would become just another library and it may as well be the system default when right-to-left late-period Ptolemaic Coptic can't be found.

Speaking of which, how (or where, as in 'which layer') should RTL and LTR be implemented? If you're using linked lists rather than arrays, there is scope to tidy up a truly biblical mess. And to please the deep-sea-geeks by encoding a suitably exotic branched-list schema for Classical Klingon.

Back in the real world, I am entirely in agreement with your point here:

I dislike string/number type ambiguity and the idea that it makes sense to hide parsing and formatting behind language-level cast operations - they make more sense as library calls.

Nothing you can do will ever eliminate bad coding but, in higher-level languages, there is a clear need for a better structure for strings. This may sound counterintuitive - after all, the complexity will be hidden - but a day spent inserting Unicode API calls into C++ libraries in an environment designed for Excel's VBA 'everything's-a-string' macro generator is a masterclass in the limitations of the current state of play.
