A Question About Text Input Predictions

I’m posting this here because we have a few readers who’d really know about this sort of thing, and I’m hoping to pick their brains.

I’m doing a ton of data entry these days for Phonemica. This week that means typing city names all over China.

Here’s the setup:
I’m on a Mac, OS X 10.8, using the default simplified character input method for hanyu pinyin.

So, I type W-U and the first choice is 无. Makes sense because I type that character quite a bit, probably much more than 五.

Normally, I type W-U-X-I and get 无锡, which makes sense as well. It’s a city that speaks Wu and is next to Changzhou with a similar dialect. It’s a city name I type more than most other city names.

W-U-S-H-A-N could give me a number of things: 武山, 吴山, 巫山, etc. I just now typed 巫山. Then, shortly after, I typed W-U-X-I, and the first choice was 巫溪. Now, that’s exactly the name I wanted, and I am 100% certain I’ve never typed those two characters together on this computer before. Ever.

So why was it the first choice? Is there some predictive algorithm that knows that 巫山 and 巫溪 are related (they’re neighbouring counties in Chongqing, which was part of Sichuan until 1997)? Or, maybe more likely, does it just take the fact that I just typed 巫 for 巫山 and so favour that character whenever it’s a potential option afterwards?
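
To make that second guess concrete, here’s a rough sketch in Python of how a simple recency boost could produce what I saw. To be clear, I have no idea whether the OS X input method works anything like this. The candidate table, the base scores, the boost value and the `RecencyRanker` class are all made up for illustration.

```python
from collections import deque

# Hypothetical candidate table: pinyin -> list of (word, base score).
# The scores are invented for this example, not taken from any real IME.
CANDIDATES = {
    "wushan": [("武山", 5.0), ("吴山", 4.0), ("巫山", 3.0)],
    "wuxi": [("无锡", 10.0), ("巫溪", 1.0)],
}


class RecencyRanker:
    """Ranks candidates by base score plus a bonus for recently typed characters."""

    def __init__(self, boost=20.0, memory=50):
        self.boost = boost  # assumed bonus per recently typed character
        self.recent_chars = deque(maxlen=memory)  # remembers the last `memory` characters

    def rank(self, pinyin):
        def score(item):
            word, base = item
            # Add the bonus once for every character of the word seen recently.
            bonus = sum(self.boost for ch in word if ch in self.recent_chars)
            return base + bonus

        ranked = sorted(CANDIDATES[pinyin], key=score, reverse=True)
        return [word for word, _ in ranked]

    def commit(self, word):
        # Record each character of the word the user actually picked.
        self.recent_chars.extend(word)


ranker = RecencyRanker()
print(ranker.rank("wuxi")[0])  # 无锡: the habitual choice wins on base score
ranker.commit("巫山")           # the user picks 巫山 for W-U-S-H-A-N
print(ranker.rank("wuxi")[0])  # 巫溪: the recent 巫 outweighs the base score
```

Under these assumptions nothing needs to know that the two places are neighbours. Having just typed 巫 is enough to push 巫溪 past 无锡.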

I’m curious because there’s really no reason my computer should have thought W-U-X-I meant anything other than the city in Jiangsu, 无锡.

Thoughts?

6 responses to “A Question About Text Input Predictions”

  1. Carl says:

    On my iPad:

    Wuxi 無錫
    Wushan 巫山
    Wuxi 無錫

    Nope, it didn’t move 巫溪 into the primary position. I do find that with the Japanese IME, results can vary wildly, often in response to perceived part of speech or what you last selected.

  2. Tezuk says:

    I find it so much easier to type with tones, Taiwanese style. I was shocked when I found out mainlanders don’t use tones when typing. In this case, though, typing the tones (Microsoft 2010 Pinyin) doesn’t give me accurate place names (wu2xi1 吳溪, wu1xi1 烏溪).

    • Steve (Syz) says:

      Just to clarify (cuz I’ve never heard of this): it’s your experience that most Taiwanese indicate tone marks while using pinyin ime? In special situations, like this one, or in general?

      • Kellen says:

        I’m not Tezuk but I can answer this one. The short of it is that if you type using 注音符號/ㄅㄆㄇㄈ, you have to enter the tone. That’s regardless of whether you’re typing on your computer or your phone. It’s not something I plan to get used to while in Taiwan.

      • Tezuk says:

        It’s not just for special situations; every character is typed with a tone. I use the newest Microsoft IME that was installed on my laptop, but just changed the Romanisation to Hanyu Pinyin. With mine you don’t necessarily need to type the tone (not sure about zhuyin), but when you do, it very much limits your choices and picks the correct characters more frequently than non-tone input. Plus it is a good way of remembering all those tricky tones!

  3. Daan says:

    Both explanations would work, but I’m inclined to say the second one is more likely. You could definitely implement some sort of predictive algorithm like that, based on which words commonly occur close together. Assuming the IME’s corpus contained a few examples of 巫山 and 巫溪 in the same sentence, that might very well bias it towards 巫溪 given the input WUXI. But for that to work, you would need a sizable language model backing up your IME, which takes up quite a bit of storage space. There’s also the issue of computational load if you have to look up common collocations for every input string. I’m not saying it isn’t happening (they may have stored it in some sort of optimized format), but it’s harder to implement than your second explanation.
