An Answer to Character Encoding Problems
A long while back I wrote a short series of posts on a small range of topics centered around the creation of characters, both modern and old. At the end of one such post, I mentioned that I had a solution to the problem, but I never got around to posting it, in part because I felt I couldn’t articulate the idea as completely as it exists in my head. Then a recent comment by 慈逢流 got me thinking that an answer was only fair. This post is my attempt to provide one.
The Problem: Limited Characters
There are a number of characters found in traditional sources that simply cannot be used on computers today, at least not widely. There are obscure characters like the rare family name ben 㡷, which is composed of 本 under 广. Characters like this are encoded in Unicode but unavailable, at least on the device on which I am currently writing this post. That’s primarily a font issue, but it goes beyond that. There exists, for example, a character composed of 林 written four times in a square arrangement, eight 木 in all. Even if one were to create a font with this character, one would need either to have it replace another existing glyph, or to assign it to a private use area and then do some fancy string-replacement coding for it to be shown. Neither solution is really a solution. Font encoding as we currently know it is insufficient for the full range of Sinitic characters, and even with more glyphs constantly being added to the Unicode standard, it will remain insufficient.
Part of the reason for that is how fonts work. Any given letter, symbol, or character is, to the computer displaying this post, not at all what you see on screen. Each character is essentially assigned an address, a code point. That address is then mapped to a vector-based image to be displayed when called upon. Different fonts are just different collections of images, each an outline of the shape which is seen on screen. This, I assume, is common knowledge, so I won’t go into it further here. Feel free to check Wikipedia for more.
The gist of it is that for each Sinitic character, an outline of how the character should look has to be drawn. I’ve made a number of fonts over the years for the Latin alphabet and the Arabic script. Those are time-consuming enough. It’s no wonder that the vast majority of non-standard fonts on different Chinese font sites are no more than filters applied to existing Song/Ming, Hei, Kai, or other common typefaces.
So that’s the problem. Outlines are time-consuming to create for each glyph, and even then the resulting character set is incomplete. The solution, as I see it, is what I will call “skeletal glyphs” paired with a sort of flexible encoding.
Part 1: Strokes & Syntax
The basic components of any character are ⼀⼁⼂⼃⼄ and ⼅. Those then combine to create more complex but very common compound components, such as ⼍, ⼏, or ⼇. My solution is a system by which each character, existing or imagined, can be specified in terms of these components. Such decompositions already exist in any number of dictionaries and databases. 木, for example, is ⼀⼁⼂ and ⼃ in a set arrangement.
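To make the decomposition idea concrete, here is a minimal Python sketch. The stroke labels, table names, and function are all placeholders of my own, not part of any real standard or database:

```python
# The six basic strokes named above, keyed by short illustrative labels.
BASIC_STROKES = {
    "heng": "⼀",   # horizontal
    "shu":  "⼁",   # vertical
    "dian": "⼂",   # dot
    "pie":  "⼃",   # left-falling
    "yi":   "⼄",   # bend
    "jue":  "⼅",   # hook
}

# 木 as an ordered list of basic strokes, matching the example in the text.
DECOMPOSITION = {
    "木": ["heng", "shu", "dian", "pie"],
}

def strokes_of(char):
    """Return the basic-stroke glyphs that make up a character."""
    return [BASIC_STROKES[s] for s in DECOMPOSITION[char]]

print("".join(strokes_of("木")))  # ⼀⼁⼂⼃
```

In a real system this table would of course be populated from one of the existing decomposition databases rather than by hand.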
The system I propose would also include macros. 林 is 木 twice, side by side, so 木 would exist as a sort of macro that could be called in any instance where it was needed. 李 calls on the macros for 木 and 子, which the system parses down to their component parts. Again, this sort of thing exists already. What is new is the flexible encoding. You want to call up 本? Fine. 本 already exists as a macro, but it could equally be called as 木一 or as ⼀⼁⼂⼃⼀. Simpler characters matter less. 林|林/林|林 might be a way to call up that eight-木 character. At this point the specific syntax isn’t important. Let’s move on.
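A toy macro expander shows how the flexible encoding could bottom out in basic strokes. The layout markers ("h" for side-by-side, "v" for stacked), the decomposition of 子, and all the names here are my own stand-ins, not a proposed syntax:

```python
# Macros map a character either to raw strokes or to layout nodes
# that reference other macros.
MACROS = {
    "木": ["⼀", "⼁", "⼂", "⼃"],       # base macro of raw strokes
    "本": ["木", "⼀"],                   # 木 plus a horizontal
    "林": [("h", "木", "木")],            # 木 twice, side by side
    "子": ["⼅", "⼄", "⼀"],             # placeholder decomposition
    "李": [("v", "木", "子")],            # 木 over 子
}

def expand(item):
    """Recursively expand a macro down to layout markers and basic strokes."""
    if isinstance(item, tuple):           # layout node: (op, part, part, ...)
        op, *parts = item
        return [op] + [s for p in parts for s in expand(p)]
    if item in MACROS:
        return [s for sub in MACROS[item] for s in expand(sub)]
    return [item]                         # already a basic stroke

print(expand("本"))  # ['⼀', '⼁', '⼂', '⼃', '⼀']
```

The point is just that 本 can be reached either as a macro of its own, as 木 plus a stroke, or as five raw strokes, and the expander doesn’t care which.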
Part 2: Skeletal Forms & Rendering
This is the part that is most relevant to the actual display of the characters. Sorry if it gets a little unclear.
Rather than rendering each character as an individual outline, each piece of each stroke and each interaction between strokes is designed and rendered. In Song/Ming typefaces, the upper right-hand corner of a box is done in a consistent way. The stroke shared by ⼉ and ⼔ has a typical top, a typical curve, and a typical end point. The syntax I described above calls on the components of characters, but in skeletal form. That is, it actually treats the strokes as strokes, not as outlined components.
Then there’s another layer above that: a rendering system, which fills in the strokes based on their interactions according to set rules. A horizontal meeting a vertical to form the upper left corner of a right angle gets a specific treatment. Instead of having to design countless characters, the font designer could design 20–40 interactions, which are then compiled and rendered as needed. I’ve actually gone through this and determined what interactions would be needed, but I can’t find the Moleskine in which I drew them and don’t have a scanner anyway. It’s not that hard to figure out, though. I say 20–40, which is actually far more than I originally felt were needed, but I suspect that once this got tried out, bugs would surface, and the extra forms would cover specific interactions that needed tweaking. Maybe it turns out that some part of 臣 gets rendered wrong unless addressed specifically, for example.
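The interaction table might look something like the following sketch: a rule set keyed by which stroke types meet and where, mapping to a named treatment the renderer would apply. Every name here is hypothetical; the real work is in designing the treatments themselves:

```python
# (stroke A, stroke B, junction) -> treatment applied at render time.
INTERACTIONS = {
    ("heng", "shu", "upper_left"):  "square_corner",
    ("heng", "shu", "upper_right"): "serif_corner",   # the Song/Ming box corner
    ("shu",  "heng", "crossing"):   "plain_crossing",
    # ... the remaining 20-40 cases the designer draws by hand
}

DEFAULT = "butt_join"  # fallback when no specific rule exists yet

def treatment(a, b, junction):
    """Look up how two skeletal strokes should be joined."""
    return INTERACTIONS.get((a, b, junction), DEFAULT)

print(treatment("heng", "shu", "upper_right"))  # serif_corner
print(treatment("dian", "pie", "touching"))     # butt_join (no rule yet)
```

A default fallback is also where the bug-fixing would live: when something like 臣 comes out wrong, the designer adds one more specific key rather than redrawing whole characters.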
A designer could then make very specific changes to the mostly standard forms, opening up a mess of new typeface possibilities. Spend a couple of weeks tweaking the specific interactions to render how you want them, and then let the system take over.
What needs to happen
Execution of this would of course be hugely time-consuming. It would involve programming an engine to work on the end user’s computer, and a new font format designed to work specifically within that engine.
However, it would be possible to have the engine spit out a TTF file of outlines covering those glyphs already encoded in Unicode. So even if this were only done for the sake of designers, and not on every end user’s computer (i.e. for live rendering), it would still open up a lot more possibilities for typefaces. Of course, the live rendering based on a new syntax for calling components is what I’d really like to see.
Computers are fast enough, and since the most common characters would exist as macros anyway, it would not likely require much processing power to render if done right. Requiring even less effort from most involved, TTF/OTF files could be created for everything in Unicode, and the syntax could then be called on for rarer or custom characters.
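That two-tier lookup could be sketched like this, with a precompiled outline preferred whenever the character exists in Unicode and live composition as the fallback. OUTLINE_CACHE and compose_live() are stand-ins for real components, not anything that exists today:

```python
# Stand-in for outlines precompiled into a generated TTF/OTF.
OUTLINE_CACHE = {"木": "<precompiled outline for 木>"}

def compose_live(syntax):
    """Stand-in for the live skeletal renderer described above."""
    return f"<live-rendered from {syntax}>"

def render(char=None, syntax=None):
    """Prefer the cached outline; fall back to live composition."""
    if char is not None and char in OUTLINE_CACHE:
        return OUTLINE_CACHE[char]
    return compose_live(syntax or char)

print(render(char="木"))             # served from the precompiled outlines
print(render(syntax="林|林/林|林"))   # composed on the fly from the syntax
```

Common characters never touch the live renderer at all, which is why the processing cost should stay low.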
This is the solution to which I referred long ago. Hopefully this was clear enough to follow. If I’ve left any gaps, please let me know in the comments and I’ll fill them in.