[personal profile] fanf

I've come to the conclusion that most programming languages' idea of a string - an array or list of characters - is wrong, or at least too simplistic.

Most languages designed before Unicode have the idea that characters are bytes, which is hopelessly inadequate. Languages designed before Unicode was mature (e.g. Java) have the idea that characters are 16 bits, which is still inadequate. The next reasonable step is to use a 32 bit type for characters, but that wastes at least a third of the memory you use for strings, since Unicode needs at most 21 bits.

If your language's character type is too narrow then you have to use an encoding scheme (such as a Unicode Transformation Format or ISO-2022) to fit a larger repertoire into the available space. However, once you have done that, your char is no longer a character. This causes a number of problems:

  • strlen() no longer counts characters.
  • random access into strings can lead to nonsense, e.g. if you miss a code shift sequence, don't land on the start of a UTF-8 sequence, or accidentally strip off a BOM.
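
These two failure modes are easy to demonstrate. The post is language-agnostic, but here is a small Python sketch (Python is just the illustration vehicle): with UTF-8 as the encoding, the byte count and the codepoint count disagree, and slicing at an arbitrary byte offset can land mid-sequence.

```python
# A char is no longer a character: in UTF-8 the byte length, the
# codepoint count, and the valid slice boundaries all disagree.
s = "naïve"                      # five codepoints
b = s.encode("utf-8")            # six bytes: "ï" encodes as two

assert len(s) == 5
assert len(b) == 6

# Slicing the byte string at offset 3 lands inside the two-byte
# encoding of "ï", leaving bytes that are not valid UTF-8 on their own.
try:
    b[:3].decode("utf-8")
except UnicodeDecodeError:
    pass  # b"na\xc3" ends with a truncated multi-byte sequence
```

So a byte-oriented strlen over-counts, and a byte-oriented substring can produce garbage, exactly as the bullets above describe.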

In fact, even in the absence of encodings you have similar problems, e.g.

  • the length of a string is not the same as the number of glyphs, because of combining characters.
  • even in a fixed width font, the number of glyphs is not the same as the width of the string on the display, because some characters are double-width.
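
Both encoding-free problems can be shown with Python's standard unicodedata module (again just as an illustration): an accented letter can be one codepoint or two depending on normalization, and East Asian width data marks which characters take two columns.

```python
import unicodedata

# "é" as base letter plus combining acute: two codepoints, one glyph.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
assert len(decomposed) == 2
assert len(composed) == 1

# Double-width: a CJK ideograph occupies two columns even in a
# fixed-width font, so glyph count is not display width either.
assert unicodedata.east_asian_width("漢") == "W"   # Wide
assert unicodedata.east_asian_width("a") == "Na"   # Narrow
```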

Even ASCII is not simple enough to be immune from problems:

  • random access into strings can produce garbage if the string contains escape sequences.
  • some logical characters are not represented as a single byte: a logical newline is often the two-byte sequence CR LF.
  • in some contexts you can use typewriter-style backspace/overstrike to implement something like combining characters.

On the other hand, if your language's string type is not based on bytes, then you have the problem that strings need to be transcoded from their external form into whatever the libraries prefer, since UTF-8 is winning the encoding wars.

I think that the solution to this problem is to embrace it, instead of trying to hide behind an inadequate abstraction. The essence of the bug is that, while it is meaningful to talk about a Unicode codepoint in isolation, a codepoint is not a character and a string is not an array of codepoints: there is a layer of encoding between the conceptual sequence of codepoints and the representation of strings. (This implies that, although D is refreshingly agnostic about the size of its characters, it still falls into the array trap and is chauvinistic towards a subset of Unicode transformation formats.)

What this means is that the language should not have a character or string type, but instead a type for binary blobs like Erlang's binaries. The programmer manipulates binary values using pattern matching (it should probably support Erlang-style matching and some flavour of regular expressions) and some reasonably efficient way of construction by concatenation (as well as its syntax for binary comprehensions, Erlang has the concept of IO lists, which support scatter/gather IO). The idea is to shift from thinking of sequences of discrete characters to thinking about strings as structured mass data.
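
Erlang's binary syntax has no direct analogue in most languages, but the style of programming it enables can be roughly sketched in Python over plain bytes objects: destructure a blob by pattern matching rather than by character indexing, and build output by concatenating fragments once at the end, in the spirit of an IO list.

```python
import re
import struct

# Rough sketch of "strings as structured mass data": parse a
# length-prefixed record (2-byte big-endian length, then payload)
# by matching structure, not by walking characters.
blob = struct.pack(">H", 5) + b"hello" + b"rest"
(length,) = struct.unpack_from(">H", blob, 0)
payload = blob[2:2 + length]
assert payload == b"hello"

# Regular-expression pattern matching works directly on the binary.
m = re.match(rb"(?P<word>[a-z]+)", payload)
assert m.group("word") == b"hello"

# Construction by concatenation, like an IO list flattened at the
# end: accumulate fragments, join once.
fragments = [b"GET ", b"/index.html", b" HTTP/1.1\r\n"]
request = b"".join(fragments)
assert request == b"GET /index.html HTTP/1.1\r\n"
```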

Note that (so far) this design has no built-in preference for any particular string encoding, which should help to make it flexible enough to live gracefully in environments that favour all sorts of textual data, including ones not designed yet. However, you need to support string constants reasonably gracefully, which either means that your source encoding is the same as the encoding you deal with at run time (so that the blobs in the source become identical blobs in the object) or you need a good way to transcode them. Ideally transcoding should happen only once, preferably at compile time, but it's probably OK to do it during static initialization. (Note that similar considerations apply to compiling regular expressions.)
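
The "transcode once, at static initialization" idea can be sketched in Python, where module-level constants are evaluated once at import time (the names here are hypothetical, chosen only for the sketch): the constant is transcoded and the regex compiled before the hot path ever runs.

```python
import re

# Transcoded once at import time ("static initialization"),
# not on every use; the hot path only reuses the result.
GREETING_UTF16 = "hello".encode("utf-16-le")

# Likewise the regex is compiled exactly once.
WORD = re.compile(rb"[a-z]+")

def handler(data: bytes) -> bytes:
    # Hot path: no per-call transcoding or recompilation.
    return GREETING_UTF16 if WORD.match(data) else b""
```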

To a large extent, what I'm doing is lifting the string representation chauvinism problem from the language level to the library level, and libraries are just as good as languages at growing ugly calluses around poorly-designed features - though you usually have to do less reimplementation when switching libraries. I'm also aiming to encourage a style of string manipulation more like Perl's than C's. To succeed properly, though, it'll have to go further with encouraging good use of library support for international text - but at that point I'm stepping onto thin ice.

From: [identity profile] hairyears.livejournal.com

I expect you've considered this already: a standard format for the library, in which calls 0 to 127 return the corresponding ASCII characters.

Yes, this will be abused horribly, but a lot of the common objections to this scheme will probably go away. However, I doubt that it is possible (and I am certain that it is far, far less desirable) to achieve any further backward-compatibility with the existing UTF-8 'standard'

...Or is it? UTF-8 would become just another library and it may as well be the system default when right-to-left late-period Ptolemaic Coptic can't be found.

Speaking of which, how (or where, as in 'which layer') should RTL and LTR be implemented? If you're using linked lists rather than arrays, there is scope to tidy up a truly biblical mess. And to please the deep-sea-geeks by encoding a suitably exotic branched-list schema for Classical Klingon.

Back in the real world, I am entirely in agreement with your point here:

I dislike string/number type ambiguity and the idea that it makes sense to hide parsing and formatting behind language-level cast operations - they make more sense as library calls.

Nothing you can do will ever eliminate bad coding but, in higher-level languages, there is a clear need for a better structure for strings. This may sound counterintuitive - after all, the complexity will be hidden - but a day spent inserting Unicode API calls into C++ libraries in an environment designed for Excel's VBA 'everything's-a-string' macro generator is a masterclass in the limitations of the current state of play.
