Franklin Pezzuti Dyer


A shitpost about Unicode's superscript digits

I was writing some C++ code to do computations involving polynomials. For my own gratification, I wanted to write a function that would pretty-print polynomials to my terminal, given a vector of coefficients. Naturally, I thought I'd use the special Unicode characters ⁰¹²³⁴⁵⁶⁷⁸⁹ to make the exponents look good, because x⁵ looks a lot nicer than x^5. I started by writing a function for translating a digit between 0 and 9 into its Unicode superscript counterpart.

char digit_superscript(int d) {
   // some stuff here...
}

But no, that won't work. Unicode characters encoded using UTF-8 can occupy up to four bytes, and a char can only hold one byte, so I had to return a string instead.

string digit_superscript(int d) {
   // some stuff here...
}

The easy (and probably performant) way of doing this would be to list out all ten cases and return the corresponding Unicode character. But that would be really ugly and repetitive code, and I don't care about how efficiently my polynomials are printed. I remembered an old trick that has served me well when working with ASCII: to get the $n$th letter of the alphabet, you can start with 'a' and simply add $n$, because the letters a..z are consecutive. So perhaps I could get the superscript digit for $d$ by starting with the code point for ⁰ and adding $d$.
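For the record, the ASCII trick looks something like the sketch below. The name nth_letter is made up for this example, and it assumes a character set in which a..z are consecutive (true for ASCII, not guaranteed everywhere).

```cpp
// Hypothetical helper, for illustration only.
// Returns the nth lowercase letter, counting from n = 0 for 'a'.
// Assumes 'a'..'z' occupy consecutive values, as they do in ASCII.
char nth_letter(int n) {
    return static_cast<char>('a' + n);
}
```

With this, nth_letter(0) is 'a' and nth_letter(25) is 'z'. Nothing checks that n stays within 0..25.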

The character ⁰ is code point U+2070, which is encoded in UTF-8 as the bytes e2 81 b0. So I gave this a try:

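string digit_superscript(int d) {
   // UTF-8 bytes of U+2070 (superscript zero)
   string num = "\xe2\x81\xb0";
   // bump the last byte by d, hoping to land on the dth superscript digit
   num.at(2) += d;
   return num;
}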
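The three bytes in that string literal come straight from the standard UTF-8 bit layout. Here is a rough sketch of that layout, in case it helps. The helper utf8_encode is hypothetical (it is not part of the polynomial code), and it does no validation of surrogates or out-of-range values.

```cpp
#include <string>

// Hypothetical helper, for illustration only.
// Encodes one Unicode code point as a UTF-8 byte string.
// Sketch only: surrogates and values above U+10FFFF are not rejected.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

U+2070 falls in the three-byte range, and its low six bits end up in the last byte. That is why adding a small $d$ to the last byte has the same effect as adding $d$ to the code point, at least as long as those six bits don't overflow.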

And I tested the function by trying to print out the digits from zero through nine:

int main() {
   for (int i = 0; i < 10; i++)
      cout << digit_superscript(i);
   cout << "\n";
}

The output was ⁰ⁱ⁲⁳⁴⁵⁶⁷⁸⁹. What the fuck?

That's right: the Unicode superscript digits are not consecutive. The superscript zero is U+2070, but U+2071 is the superscript "i" character ⁱ. Even better, the code points U+2072 and U+2073 are not mapped to any character at all right now, just "reserved for future use". The superscript digits ⁴⁵⁶⁷⁸⁹, however, are fortunate enough to occupy the consecutive code points from U+2074 through U+2079.

So what's the deal with ¹²³? According to Wikipedia they are inherited from the encoding ISO-8859-1, which was a single-byte encoding extending ASCII that possessed these three superscript digits and none of the other seven. Unicode was considerate enough to allow the ISO-8859-1 characters to retain their code points in its Latin-1 block, with the unfortunate consequence that the superscript digits ¹²³ are separated from their brethren by 8000 or so characters. But hey, at least ⁰ has ⁱ to keep it company. They're vaguely related, right? You know, because they're both superscripted... things?

Not only that, but ¹²³ were not even consecutive in ISO-8859-1. The characters ²³ are U+00B2 and U+00B3, but ¹ is U+00B9. Lovely.


Alright, back to coding. I was able to use the offset trick for the digits ⁰⁴⁵⁶⁷⁸⁹, but the digits ¹²³ have to be treated as their own special case.

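string digit_superscript(int d) {
   switch (d) {
      // ¹²³ live in the Latin-1 block, so each gets its own case
      case 1:
         return "\u00b9";
      case 2:
         return "\u00b2";
      case 3:
         return "\u00b3";
      // ⁰ and ⁴ through ⁹ sit at U+2070 + d, so the offset trick works
      default: {
         string num = "\xe2\x81\xb0";
         num.at(2) += d;
         return num;
      }
   }
}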

How disgusting...
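For comparison, the "list out all ten cases" approach dismissed earlier could be sketched as a lookup table. This is only one possible version: the names digit_superscript_table and superscript are invented for this example, the range check is an added assumption, and the multi-digit helper is a hypothetical extension for exponents above 9.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical table-driven variant: one UTF-8 string per digit.
std::string digit_superscript_table(int d) {
    static const char* const table[10] = {
        "\xe2\x81\xb0",  // U+2070 superscript zero
        "\xc2\xb9",      // U+00B9 superscript one
        "\xc2\xb2",      // U+00B2 superscript two
        "\xc2\xb3",      // U+00B3 superscript three
        "\xe2\x81\xb4",  // U+2074 superscript four
        "\xe2\x81\xb5",  // U+2075 superscript five
        "\xe2\x81\xb6",  // U+2076 superscript six
        "\xe2\x81\xb7",  // U+2077 superscript seven
        "\xe2\x81\xb8",  // U+2078 superscript eight
        "\xe2\x81\xb9",  // U+2079 superscript nine
    };
    // Added assumption: reject anything that is not a single digit.
    if (d < 0 || d > 9)
        throw std::out_of_range("digit must be between 0 and 9");
    return table[d];
}

// Hypothetical extension: superscript a whole non-negative exponent
// by converting its decimal digits one at a time, least significant first.
std::string superscript(unsigned n) {
    std::string out;
    do {
        out.insert(0, digit_superscript_table(static_cast<int>(n % 10)));
        n /= 10;
    } while (n > 0);
    return out;
}
```

It is every bit as repetitive as feared, but the irregular code points are at least confined to one table, and spelling out the bytes avoids depending on how the compiler encodes \u escapes.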


The posts on this website are licensed under CC-by-NC 4.0.