An Answer to Character Encoding Problems

A long while back I wrote a short series of posts on a small range of topics centered around the creation of characters, both modern and old. At the end of one such post, I mentioned that I had a solution to the problem; however, I never got around to posting it, in part because I felt I couldn’t articulate the idea as completely as it exists in my head. Then a recent comment by 慈逢流 got me thinking that an answer was due. This post is my attempt to provide one.

The Problem: Limited Characters

There are a number of characters attested in traditional sources that simply cannot exist on computers today, at least not in any wide use. There are obscure characters like the rare family name ben 㡷, which is composed of 本 under 广. These are characters which are encoded in Unicode but unavailable, at least on the device with which I am currently writing this post. That’s primarily a font issue, but it goes beyond that. There exists a character, for example, composed of 林 written four times in a square arrangement. Even if one were to create a font with this character, one would need to either have it replace another existing glyph, or assign it to a private use area and then do some fancy string-replacement coding for it to be shown. Neither is really a solution. Font encoding as we currently know it is insufficient for the full range of Sinitic characters. Even though more glyphs are constantly being added to the Unicode standard, it remains insufficient.

Part of the reason for that is how fonts work. To the computer displaying this post, any given letter, symbol or character is not at all what you see. Each character is essentially assigned an address. That address is then given a vector-based image to be displayed when called upon. Different fonts are just different collections of images, each an outline of the shape seen on screen. This, I assume, is common knowledge, so I won’t go into it further here. Feel free to check Wikipedia for more.

The gist of it is that for each Sinitic character, an outline of how the character should look needs to be drawn. I’ve made a number of fonts over the years for the Latin alphabet and Arabic script. Those are time consuming enough. It’s no wonder that the vast majority of non-standard fonts on different Chinese font sites are no more than filters applied to existing Song/Ming, Hei, Kai or other common typefaces.

So that’s the problem. Outlines are time consuming to create for each glyph, and even then they’re incomplete. The solution, as I see it, is what I will call “skeletal glyphs” and a sort of flexible encoding.

The Solution
Part 1: Strokes & Syntax

The basic components of any character are ⼀⼁⼂⼃⼄ and ⼅. Those then combine to create more complex but very common compound components, such as ⼍ ⼏ or ⼇. My solution is a system in which each character, existing or imagined, can be specified in terms of these components. Such decomposition data already exists in any number of dictionaries and databases. 木, for example, is ⼀⼁⼂ and ⼃ in a set arrangement.

The system I propose would also include macros. 林 is that, twice, horizontally. So 木 would exist as a sort of macro that could be called in any instance where it was needed. 李 calls on the macros of 木 and 子, which the system parses into their component parts. Again, this sort of thing exists already. This, however, is the flexible encoding. You want to call up 本, fine. 本 already exists as a macro. It could be called as 本, or as 木一, or as ⼀⼁⼂⼃⼀. For simpler characters it matters less. 林|林/林|林 may be a way to call in that eight-木 character. At this point the specific syntax isn’t important. Let’s move on.
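
To make the macro idea concrete, here is a minimal sketch in TypeScript. Everything in it (the names, the stroke choices for 子, the absence of any positional data) is illustrative only, not a proposed standard:

    const PRIMITIVES = new Set(["⼀", "⼁", "⼂", "⼃", "⼄", "⼅"]);

    // Each macro expands to primitives and/or other macros; where the
    // pieces sit relative to one another is omitted to keep this short.
    const MACROS: Record<string, string[]> = {
      "木": ["⼀", "⼁", "⼂", "⼃"],
      "本": ["木", "⼀"],       // 木 plus one more horizontal
      "子": ["⼄", "⼅", "⼀"],  // stroke choices here are illustrative
      "李": ["木", "子"],
    };

    // Recursively expand a character down to primitive strokes.
    function expand(glyph: string): string[] {
      if (PRIMITIVES.has(glyph)) return [glyph];
      const parts = MACROS[glyph];
      if (!parts) throw new Error(`no decomposition for ${glyph}`);
      return parts.flatMap(expand);
    }

    console.log(expand("李").join("")); // ⼀⼁⼂⼃⼄⼅⼀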

Part 2: Skeletal Forms & Rendering

This is the part that is most relevant to the actual display of the characters. Sorry if it gets a little unclear.

Rather than each character being rendered as an individual outline, each piece of each stroke and each interaction between strokes is designed and rendered. In Song/Ming typefaces, the upper right hand corner of a box is done in a consistent way. The stroke shared by ⼉ and ⼔ has a typical top, a typical curve and a typical end point. The syntax about which I wrote above calls on the components of characters, but in skeletal form. That is, it actually treats the strokes as strokes, not as outlined components.

Then there’s another layer above that: a rendering system, which fills in the strokes based on their interactions, according to set rules. A horizontal meeting a vertical to form the upper left corner of a right angle gets a specific treatment. Instead of having to design countless characters, the font designer could design 20-40 interactions, which are then compiled and rendered as needed. I’ve actually gone through this and determined which interactions would be needed, but I can’t find the Moleskine in which I drew them and don’t have a scanner anyway. It’s not that hard to figure out, though. I say 20-40, which is actually far more than I originally felt were needed, because I think that as this got tried out, bugs would become evident, and the extra forms would be for specific interactions that needed tweaking. Maybe it turns out that some part of 臣 gets rendered wrong if not addressed specifically, as an example.
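
As a sketch of what a designer-supplied interaction table might look like, here is one in TypeScript. The path strings are placeholders for real vector fragments, and all the names and the API itself are hypothetical:

    // Each interaction is a small outline fragment the designer draws once.
    type Interaction =
      | "corner-top-left" | "corner-top-right"
      | "corner-bottom-left" | "corner-bottom-right";
      // ...a real face would enumerate the remaining 20-40 kinds

    const songFace: Record<Interaction, string> = {
      "corner-top-left":     "M0 2 L0 0 L2 0",
      "corner-top-right":    "M-2 0 L0 0 L0 2", // with the Song-style flag
      "corner-bottom-left":  "M0 -2 L0 0 L2 0",
      "corner-bottom-right": "M-2 0 L0 0 L0 -2",
    };

    // The engine classifies each meeting point in a skeleton and stamps
    // the matching fragment there; the straight runs between stamps are
    // filled in by simple rules (not shown).
    function render(meetings: { at: [number, number]; kind: Interaction }[]) {
      return meetings.map(m => ({ at: m.at, fragment: songFace[m.kind] }));
    }

    // 口 has four corner interactions; 回 reuses the same kinds twice over.
    render([
      { at: [0, 0],   kind: "corner-top-left" },
      { at: [10, 0],  kind: "corner-top-right" },
      { at: [0, 10],  kind: "corner-bottom-left" },
      { at: [10, 10], kind: "corner-bottom-right" },
    ]);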

A designer could then determine very specific changes to the mostly standard forms, opening up a mess of new typeface possibilities. Spend a couple weeks tweaking the specific interactions to render how you want them, and then let the system take over.

What needs to happen

Execution of this would of course be hugely time consuming. It would involve programming an engine to work on the end user’s computer. It would involve a new font format designed to work specifically within that engine.

However, it would be possible to have the engine spit out a TTF file of outlines to cover the glyphs encoded in Unicode. So even if this were only done for the sake of designers, and not on every end user’s computer (i.e. for live rendering), it would still open up a lot more possibilities for typefaces. Of course, live rendering based on a new syntax for calling components is what I’d really like to see.

Computers are fast enough, and since the most common characters would exist as macros anyway, it would not likely require much processing power to render if done right. Requiring even less effort from most involved, the TTF/OTF files could be created for anything in Unicode, and the syntax could then be called on for rarer or custom characters.

This is the solution to which I referred long ago. Hopefully this was clear enough to follow. If I’ve left any gaps, please let me know in the comments and I’ll fill them in.

19 responses to “An Answer to Character Encoding Problems”

  1. Carl says:

    I could have sworn that the system you are describing already exists, but I can’t find it through searching right now, so maybe I just misremembered how TRON and Mojikyo work.

    Anyway, as you no doubt are aware, this method of encoding would be nice in theory, but it wouldn’t be very practical: file sizes would balloon, it would be hard to search for text, there’d be no way to specify that an ‘a’ with the storey and one without are the same letter written differently (or the various 高, 骨, etc.)… Also, the algorithm to lay out the characters would be a beast, and it would probably be hard to get independent implementations of it to be cross-compatible.

  2. Carl,

    I agree with everything you said. The export of a TTF/OTF would be the quick fix for cross-compatibility. While the algorithm would be a bitch, I imagine the existing setup in dictionaries where there is stored data that X is Y next to Z could be leveraged to automate a good portion of setting up the necessary database.

    But yeah, it wouldn’t be ideal for every end user.

  3. Sima says:

    I’m unable to view several of the characters on this page. I’m using XP and Firefox and have run through a bunch of character encodings. Could anyone tell me which font I need or how I can get it to display properly?

  4. SimSun is a good one. It’s a Ming/Song typeface. Can’t think of a good Hei face at the moment.

  5. 慈逢流 says:

    as for CJK character display troubles, i recommend using Sun-ExtA and Sun-ExtB. these two fonts provide almost 100% coverage of Unicode 5+, a very consistent, close-to-the-standard and correctly crafted song typeface with only very minor deficiencies. i am currently writing a search engine for chinese characters (a web page where you can find characters according to reading, strokeorder, and components) and have chosen these two fonts as the reference renderings for all CJK codepoints. i have repeatedly checked back with the unicode pdf codecharts and i am not aware of any noticeable differences between the rendering chosen for the standard and the one delivered by Sun-ExtA/B.

    next, there is @font-face, a quite recent addition to the cascading stylesheets standard (CSS), which has been anticipated for around 10 years but only implemented by some browsers of this decade. it basically allows you to tell the browser precisely which font file to use for which characters; if you want to design your site with a fancy custom font, you can put the *.ttf file onto your server, write a few style rules, and when a user with a modern browser like firefox or chrome views your page, the font file will be downloaded in the background, just like a picture or any other web resource would be (of course, the browser will not install the font system-wide; it is just used for that single site). i must point out that, however happy i am about this development, it is yet another example of the eurocentrism that goes on in the world of computing. while the solution is appropriate for a language that uses but a few hundred glyphs, file sizes for a complete rendering of CJK are in another ballpark; the two Sun fonts together weigh in at 38.7 MB, which is 120 times the size of a 2010 web page (acc. to google). there is sadly no mechanism to transfer single characters.
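
    for illustration, a minimal sketch of such a rule, injected from typescript; the family name, file path and codepoint ranges are made-up examples, and note that unicode-range only defers the download, it does not subset the file:

        // point the browser at a ttf for the CJK ranges only
        const rule = `
        @font-face {
          font-family: "MyHanziFace";
          src: url("/fonts/myhanziface.ttf") format("truetype");
          unicode-range: U+3400-4DBF, U+4E00-9FFF; /* fetch only if CJK occurs */
        }`;
        const style = document.createElement("style");
        style.textContent = rule;
        document.head.appendChild(style);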

    as for the complexity / speed / compatibility questions for an algorithmic character generator raised above, i just want to say, fear not; it has been done before, long ago, for small devices such as home theater subtitling boxes. the central issue with automated typographic design derived from a sparse formula is artistic pleasingness (for lack of a better term); the various solutions i have seen so far tend to make characters that look like a first grader’s stick figures. and it is also not necessarily true that file sizes would balloon; truetype fonts are already basically small programs that contain very exact instructions for how to paint characters under different circumstances. a higher-level description can reasonably be expected to result in smaller files that are somewhat more computing-intensive to process. i assume that speed should not be a very serious obstacle for any device that can reproduce video and sound in agreeable quality.

    i do not share the concerns about the searchability of texts. the character generator would just work in the background; users will continue to work with unicode (or whatever encoding they choose). i do not understand carl’s concern that “there’d be no way to specify that an ‘a’ with the storey and one without are the same letter written differently (or the various 高, 骨)”. yes, of course there would be ways to specify such variants. the whole idea of a character generator is to enable users to specify nonce forms from (typo-)graphic descriptions, so 高 could be notated as ⿳亠口冋 and 髙 as ⿳亠&jzr#xe109;冋 (the jzr#xe109 part demonstrates one difficulty with this approach, namely the number of components needed, which is quite substantial and exceeds what is currently available in unicode; in this case, it refers to the component that looks a bit like a ladder). now if you, say, write a piece on dunhuang texts and encounter a character not available in your encoding, then for sure you have to include a description instead of a unicode codepoint, which does make the text somewhat harder to search (but you can still choose to create a custom codepoint).

    the real difficulty in searching CJK texts is really the rampant orthographic freedom. once i tried to collect all the ways that taipei bus drivers indicated the ticket price on their coin boxes. i never finished the collection: 10NT$, 10$, 十元, 什圓, 拾圓, 10¥, anything goes. i never saw 円 or 塊 used for this purpose, but the number of possible, understandable ways to write so short a phrase is truly staggering. this is one aspect you simply cannot, and should not, solve within a character encoding scheme. sure, programming oldtimers are somewhat used to switching bits in the code to transform an ‘a’ (0x61; 0b01100001) into an ‘A’ (0x41; 0b01000001), but even with latin letters the applicability of an algorithmic transformation is extremely limited. basically, you have to use tables for anything going beyond this toy example. i think there has been one major attempt to bake the complicated relationships between characters (think 正體字, 異體字, 簡體字, 古字, 俗字 and so on) right into the encoding (so that variants get placed into different ‘planes’, but otherwise share codepoints), but i have concluded that this is not the way.

    there is, btw, a very exciting invention by Lin Yutang from the nineteen-forties, while he was living in new york: the Ming Kwai Typewriter (have a look here: http://en.wikipedia.org/wiki/Chinese_typewriter). this is a purely electromechanical beast; it looks like a western typewriter, maybe a tad bulkier; the user presses several keys to characterize the intended glyph and gets presented, in a small vertical window, with a choice of matches. the decomposition system seems to be the same as the one employed by Lin for his 林語堂當代漢英詞典. sadly, only one or two of these machines were ever built, after which Lin was broke and returned to taiwan.

  6. Bruce says:

    The main obstacle to the creation or adoption of “new” hanzi is not about fonts or anything technical per se.

    It is all about the Chinese traditions regarding the written word. Ever since hanzi were used for divination, they have been regarded as something akin to sacred.

    As such, any changes to hanzi meet with ferocious resistance, and that includes the dumbing down of written Chinese since 1949 in the PRC. Even today, many businesspeople in China print their name cards in traditional Chinese because they crave the sense of authenticity it provides.

    I have been told countless times that dialect cannot be represented in hanzi. Yet HK newspapers run much of their copy, particularly about leisure topics like gourmet eating and movie star gossip, using Cantonese characters — some quite innovative — that are not understandable to most Chinese. Even when I point this out, my friends in China still refuse to “recognize” the existence of these characters or any written form of Cantonese.

    The traditions I note above are what suppress the creation and wide use of “new” hanzi, not technical concerns.

  7. Kellen says:

    I agree. This is not an attempt to make any changes to hanzi. This is instead an attempt to increase the number of characters that can be shown on computers, allowing for the display of rare characters and the potential creation of new ones. For the latter, think a digital version of 天书.

    I do think dialects/languages of Sinitic can be written in hanzi. I would just like to see that be more convenient. A good example is

  8. Sima says:

    Thanks for the font tips. Sun-ExtA and B did the trick.

    I think this is a fascinating post and I’m sorry I’d somehow missed your previous pieces on related matters.

    I think you’re doing a pretty good job of describing something with huge implications and hope you’ll expand a little more.

    How confident are you of the 20-40 interactions involved? Presumably, stroke order and component order would be critical. Do you envisage any problems with characters containing the same components in different places, e.g. 旮旯, or even the same (or similar) strokes in the same order but in different places, e.g. 只 and 叭, or 土 and 士? I guess I’m thinking of this from the user/input end of things.

    In terms of how big a project you imagine this would be, do you have any idea?! It strikes me that there are an awful lot of uncertainties, but I guess it ought to be possible to run some kind of small-scale trial to see how it might pan out. Looks like you (Kellen) and
    慈逢流 might be on the same wavelength…could there be a case for seeing whether this could be developed?

    Bruce, I think the point about the “sacred” nature of Chinese characters is interesting, but my impression is that this can be rather mixed; sometimes people will play pretty free and easy with them. I used to often see 艹 above 么, in place of 蘑, on restaurant specials boards in the Northeast. I’ve not had the chance to check the history of this form (possibly one of the later discarded official simplifications) but I suspect there are a good number of “folk” characters around which we are likely to see less and less as computing preserves a fixed character set.

  9. Sima,

    Thanks for the positive feedback. I held off on this for months because I was convinced I wouldn’t be able to describe it the way it is in my head. I’ve still missed some points but I’ll see if I can’t elaborate.

    I’m pretty sure that 20-40 is actually way more than needed, but I quoted that high to give myself a buffer. Keep in mind that with these ‘interactions’, what I’m referring to is not the syntax but the rendering. It’s 20-40 ways that two strokes can cross paths. 口 has 4 interactions, one at each corner. Each corner is rendered in a slightly different way. 回 has twice as many strokes, but the same corner interactions, just used twice over.

    For things like 旮旯, Unicode already has a set of characters that describe these positions, under “Ideographic Description Characters”, to which 慈逢流 has already referred in his comment. They are ⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺ and ⿻. The idea is that the syntax would cover these sorts of locations. So 国 could be ⿴⼞玉, in which case 玉 is already defined as a macro combining 王 and ⼂, or else simply as ⼀⼀⼁⼀⼂. 土 and 士 are another issue, but I’m sure there are few enough of these types of differences that the syntax could be written to include them. The thing is, a lot of this data already exists in 笔画 IMEs. Everyone is carrying it around on their mobile phones, most probably without realising it.
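
    As a sketch of how the syntax might consume those description characters, here is a toy recursive-descent parser. The arity table and all the names are mine, and a real description would bottom out in macros like 玉 above:

        // IDCs taking two parts vs. three parts
        const IDC_ARITY: Record<string, number> = {
          "⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3, "⿴": 2, "⿵": 2,
          "⿶": 2, "⿷": 2, "⿸": 2, "⿹": 2, "⿺": 2, "⿻": 2,
        };

        type Desc = string | { idc: string; parts: Desc[] };

        // Parse one description starting at position i.
        function parseDesc(s: string[], i = 0): [Desc, number] {
          const head = s[i];
          const arity = IDC_ARITY[head];
          if (!arity) return [head, i + 1]; // a component or macro name
          const parts: Desc[] = [];
          let next = i + 1;
          for (let k = 0; k < arity; k++) {
            const [part, j] = parseDesc(s, next);
            parts.push(part);
            next = j;
          }
          return [{ idc: head, parts }, next];
        }

        const [guo] = parseDesc([..."⿴⼞玉"]);
        // guo = { idc: "⿴", parts: ["⼞", "玉"] }; 玉 then expands as a macro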

    For how big a project it would be, the answer is big. I know a couple of people who are working with this sort of data as far as the combinatorial nature of the characters is concerned. But writing the whole engine to parse the data and fine-tuning the syntax would be a big undertaking. The actual font creation would likely be the smallest part, despite being the most visible part to the end user.

  10. Georg says:

    What some seem to be confusing here is whether we are talking about generic character composition at font level only (e.g., to reduce font file sizes) or about an encoding scheme alternative to Big5/GB/Unicode, analogous to encoding an alphabet rather than all the words in a dictionary. The real issue, however, might well turn out to be character input. By the way,

  11. Georg says:

    The new font file format I had in mind in my previous comment would not only keep font file sizes to a minimum, it would first and foremost ensure generic support for all already assigned and yet-to-be-encoded Hanzi by referring to their standardized CDL descriptions stored at OS level. The question remains whether a computer can and should replace a type designer by automatically generating hanzi.

    A new character encoding, on the other hand, would by no means solve the fundamental problem of missing fonts or glyphs; it would make it worse (think OpenType support, think backwards compatibility). This is why the Koreans insisted that their Hangul be encoded as precomposed blocks rather than as conjoining Jamo. There is a Unicode Normalization Form for decomposed Hangul, allowing for arbitrary generic block formations, but AFAIK browser and font support is still lagging behind. Nick Nicholas makes the case for polytonic Greek, describing the myriad ways of writing the letter .

    To date, unassigned scripts such as Tangut could be incorporated in Unicode as combining radicals, but a) it’s not gonna happen, and b) at font level, this wouldn’t spare the type designer the work of handcrafting every single character individually.

    A better example of what you are suggesting is maybe Egyptian, which, I believe, is precomposed neither in the encoding nor at font level, but is basically single hieroglyphs stacked together into words. Makes sense from a western perspective, but this is probably just not how East Asian writing systems are meant to work.

    Granted, some characters will never be encoded, and Private Use Areas cannot always provide the benefits of an official standard: correct display and rendering across platforms, searchability, sortability, etc. But as of Unicode 6.0, most hanzi one will ever need are already included (even their stylistic variations, using variation selectors, something that until very recently was expected to be handled at font level). We only need to find them.

  12. I’m certainly not suggesting replacing the designer. The designer would have just as important a job as he/she has now. If anything, it would demand more of the designer in terms of aesthetics.

    I feel some of what I’m attempting to explain is being lost here. I unfortunately cannot remedy that at the moment. However, tomorrow I promise to update this post, probably through the comments section, to address the parts of my proposal that I think aren’t being properly understood, what I’m calling interactions being the main one. I think that should clear up some of the confusion.

    So, until tomorrow.

  13. Chad says:

    The Wenlin Institute is developing a descriptive database called Character Description Language, or CDL. I think it may be the CDL Georg referred to. It can create SVG or BMP files for an arbitrary character, although I don’t know about font files. However, there’s no mention of making the data public.

  14. Kellen says:

    This may explain things a bit better.

    Engine: The thing that makes it work. Each character is a collection of skeletal strokes, or macros, which are in turn collections of skeletal strokes. This is all programmed within the engine and would be largely based on data that exists already.

    Interactions: This refers to parts of a character, either a place where two strokes meet or the ends of a single stroke. For example, ⼀ has two interactions, one on each end, while ⼄ has three or four. For 旮 and 旯, which I’ll refer to again below, there are 6 for 九 and 6 for 日. I say ⼄ has three or four because the curve below can be a single interaction that includes the point, or the point itself can be one. These are details to be worked out by the typeface designer. The idea is that the interactions are programmed in the engine.

    Font: A set of 30-40 pieces of strokes or intersections, and then basic rules to cover the spaces between. The engine calls on the font in order to render each interaction. The font provides the vector images for each component and then combines them with connecting lines, the behaviour of which is also covered in the font.
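
    Put in rough code terms, just to pin down which layer owns what (all names here are hypothetical):

        type Path = string; // stand-in for real vector outline data

        // Engine data: a character's skeleton, built up from macros.
        interface Skeleton {
          strokes: { from: [number, number]; to: [number, number] }[];
          interactions: { at: [number, number]; kind: string }[];
        }

        // Font: the 30-40 designer-drawn fragments, plus a rule for the
        // plain stroke bodies connecting them.
        interface SkeletalFont {
          fragments: Record<string, Path>;
          connect(from: [number, number], to: [number, number]): Path;
        }

        // The engine renders by stamping a fragment at each interaction
        // and filling the stretches between them with the connect rule.
        function renderGlyph(sk: Skeleton, font: SkeletalFont): Path[] {
          const stamps = sk.interactions.map(i => font.fragments[i.kind]);
          const bodies = sk.strokes.map(s => font.connect(s.from, s.to));
          return [...stamps, ...bodies];
        }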

    As I said, I have no intention of replacing the font designer. It’s just a new way of designing the fonts, which are integral to the system but still secondary to the skeletal system.

    Chad & Georg,
    The CDL is similar on a basic level, but what I propose, i.e. the skeleton system, is the primary feature of this system and is the main thing that separates it from the Wenlin system (which is sweet too, by the way).

  15. 慈逢流 says:

    certainly the one most accessible platform to implement and distribute a character generator is the web. i have already successfully used a technique where i paint character outlines to the browser using javascript and the html canvas element; alternatively, one could use svg for this purpose. this way, i can display characters in their reference form (which i stored in the database) in any modern browser.

    although those are images, it is possible to include the represented text in a hidden place; this allows copying the textual content (in unicode) to the clipboard. as for the 外字 (characters not found in unicode), it is possible to include them symbolically as ‘extended character references’.

    such extended character references can look like ‘&jzr#xe109;’. in this case, there is a reference to character set ‘jzr’ (where i put all the missing parts of CJK characters) and codepoint (hexadecimal) 0xe109. since all the codepoints of this character set are numerically equivalent to codepoints in the unicode private use area, it suffices to change the notation from &jzr#xe109; to &#xe109;, which turns the reference into a standard character reference (in the PUA). given appropriate markup and a CSS3 @font-face rule, it then becomes possible for browsers to render correct, custom-made shapes using the usual font rendering, without canvas. so the web already has a number of viable options for displaying text in custom shapes.
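
    the rewrite itself is a one-liner; a sketch (the function name is mine):

        // turn extended references like &jzr#xe109; into standard numeric
        // character references in the PUA, which ordinary font rendering
        // (with the custom font) can then handle
        function toStandardRefs(text: string): string {
          return text.replace(/&jzr#x([0-9a-fA-F]+);/g, "&#x$1;");
        }

        toStandardRefs("⿳亠&jzr#xe109;冋"); // "⿳亠&#xe109;冋"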

    sure, a font rendering that only works on web pages is not as attractive as a system-wide font rendering system. however, it is lightyears easier to handle web application issues than to tinker with any of the various font rendering engines of the major platforms. every single aspect around fonts, from file formats to binary executables, is full to the brim with millions of hairy technical details which are hard to get right. just building a correct font that works correctly under all circumstances is far from trivial and much harder than it should be. mobile devices have a fairly long history of introducing additional woes; in this field, too, a web application is the easiest way to get content onto the device.

    as for the CJK character description language, i have unfinished drafts of what would be needed beyond what unicode provides right now.

    the most central data on characters i have are (1) strokeorders (written in what i call the 札字五筆法: e.g. 札 itself has the rare strokeorder 一丨丿丶乙 = 12345; 星 is filed under 丨乙一一丿一一丨一 = 251131121) and (2) character decompositions (札 is analyzed as ⿰木乚, 星 as ⿱日生).

    these data are derived from the kanji-database project, http://kanji-database.sourceforge.net, and are overall of a fairly good and consistent quality. one systematic weakness of the data is a certain mismatch between strokeorders and decompositions: whereas 札 12345 matches what i get when i replace the elements in the formula ⿰木乚 with their respective strokeorders, 1234 and 5, this doesn’t work for 这 4134454, as this character is decomposed as ⿺辶文, which gives the wrong impression that 454 辶 (辵; 走之) is to be written *before* 4134 文. obviously, no-one in the unicode CJK department thought it necessary to devise a way to write down a character decomposition formula that will always preserve the correct strokeorder.
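
    to make the mismatch concrete, a tiny check (data layout and names are mine):

        // stroke orders in the five-class notation, illustrative data only
        const STROKES: Record<string, string> = {
          "木": "1234", "乚": "5",   "札": "12345",
          "文": "4134", "辶": "454", "这": "4134454",
        };

        // does concatenating the parts, in formula order, give the whole?
        function ordersMatch(char: string, parts: string[]): boolean {
          return parts.map(p => STROKES[p]).join("") === STROKES[char];
        }

        ordersMatch("札", ["木", "乚"]); // true:  "1234" + "5" = "12345"
        ordersMatch("这", ["辶", "文"]); // false: ⿺ names 辶 first, but it
                                         // is written last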

    fortunately, these differences are in themselves again rather systematic: in most cases, predictable elements are involved, like 辶 in 这 or 囗 in 國. both cases warrant the introduction of an additional ideographic description character (IDC; u-idc-2ff0..2fff) each: one to replace ⿺ in cases where the upper right is written first (so 这 becomes 〄文辶; here using 〄 as a placeholder), and another one to specify that the first n strokes of a component are to be written *before*, but the remaining strokes *after* writing out an intervening element; examples are 衍:⿰〶3行氵, 裏:⿱〶2衣里, and 國:⿴〶2囗或 (numbers indicating how many strokes of the first component are written before the intervening element).

    using these and probably a number of further augmentations to the ideographic description language, it might be possible to arrive at a description that is precise enough for a character layout algorithm to produce characters from the descriptions. of course, the 札字五筆法 is rather abstract with its mere five stroke categories; in order to obtain correct-looking characters, one would need a notation that is much more specific. there is, to my knowledge, no definitive listing of CJK strokes. restricting myself to what is writable in unicode 5.1, i can say that instead of five classes, at least around 50 classes would be needed, as class 1 can mean either 一 or ㇀; 2 can mean 丨 or 亅; 3 can mean 丿 or ㇒; 4 can mean 丶 or ㇏; and class 5 can mean any of 乁乙乚乛.

  16. Christoph says:

    Hey, interesting post. Sorry for checking in late (for web standards).

    I’m happy to read about a different perspective on a similar take on characters. If I understand you correctly, your goal would be to build a font out of the knowledge of components and strokes you have. I actually built a very basic prototype some months ago for a completely different task. As you can see from the image, I composed a valid character from its 5 handwritten, single components. It’s pretty simplistic in that it assigns all components a rectangular, equally sized box and doesn’t account for interactions with stroke changes.

    The goal of this undertaking then was not to get a good-looking or usable font, but instead to create more handwriting data for a handwriting system for Japanese & Chinese named Tegaki. (While the models look bad visually, it worked well for the underlying computational system.) The data used in this image is basically component structure data as hosted under http://characterdb.cjklib.org
    Missing for your task would be the stroke interaction rules, on which I already had a good conversation with a guy interested in improving the stroke information in said database by deriving information from knowledge about the characters’ components. I would be happy to get any more insights into these interaction rules; they do seem to have a bag of special cases. I don’t have the examples around right now, but if you are interested feel free to come back to me.

    For the font I would really suggest creating a TTF or similar font file. While there are font rendering systems out there, I’m unsure if this is the way to go for unencoded (and never-going-to-get-encoded?) characters.

  17. Andurriales says:

    Hello everybody.

    I’m trying to find a list of Chinese character decompositions. Can you please provide a link to where I can find one? I don’t want software for decomposition, nor technical data (I’m new to this). Please!

    (Sorry for my English.)
