Ngram this! — The 中文 Ngram challenge

Original title: The most fun you can have (legally) on a Saturday night in Beijing outside the fifth ring

If you haven’t already seen what Google has come up with…

Google Labs - Books Ngram Viewer

…then you’re probably in danger of becoming an offline recluse who lives in Beijing exurbia and considers “social interaction” to mean nodding to the elderly gentleman who walks by every morning as you exercise at 5:30am.

But if you’ve got that problem, then why not submit your favorite Ngram sets in the comments and win the Ngram challenge! (Award amount to be announced as soon as sponsor is finalized)

Need some background? Here’s what Google says:

When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years.

Pretty simple in the abstract. Mind-bogglingly cool if you get creative with your searches. Too bad I haven’t yet. Up above is my best effort: I like how 和谐 (“harmony”) in blue looks poised to overtake 共产党 (the Communist Party) itself. But a lot of other tries got zilch. For example, in 大哥大 vs 手机 (dàgēdà vs shǒujī, the out-of-date and modern words for mobile phone), 大哥大 simply didn’t register: no hits.

Let me know what you see. Looking forward to seeing some sweet charts.

——–

Update: selected charts from the comments (thanks, everyone) — click on link to get the Ngram chart

It’s also helpful to keep in mind that differences in scale can flatten the data for rare terms when they are charted alongside common ones. As Louis Platz says below…

If there is a large difference between the distribution of the items you graph, the scale on the y-axis may be too large to show variation in the less popular items. For example, when I graph 台湾, the maximum value on the y-axis is .0300% and the scale rises in .0020% increments. Now, compare this to when I graph 经济; the maximum y-axis value is .5% and the scale rises in .05% increments. When I search 台湾 and 经济 together, the y-axis scale is set at a max of .5% with .05% increments (经济’s setup). This scale is large enough to make the variation in 台湾 look like zero for much of its distribution. So then, if you want a specific understanding of an item’s distribution, you should graph it independently.
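
To put rough numbers on Louis’s point, here is a minimal Python sketch. It assumes the axis maxima he quotes (.03% and .5%) are fair stand-ins for each term’s peak, and the function name is made up for illustration; it is not anything Google exposes.

```python
def share_of_plot_height(series_peak: float, axis_max: float) -> float:
    """Fraction of the chart's height that a series reaches at its peak."""
    return series_peak / axis_max


# Axis maxima quoted in the comment, in percent (used here as rough peaks).
taiwan_peak = 0.03   # 台湾 graphed alone
economy_peak = 0.5   # 经济 graphed alone

# When both terms share a chart, the axis has to fit the larger one.
shared_axis_max = max(taiwan_peak, economy_peak)

alone = share_of_plot_height(taiwan_peak, taiwan_peak)
together = share_of_plot_height(taiwan_peak, shared_axis_max)

print(f"台湾 alone fills {alone:.0%} of the plot height")
print(f"台湾 next to 经济 fills {together:.0%} of the plot height")
```

On these numbers, 台湾 never climbs above the first .05% gridline of the shared chart, which is why its ups and downs seem to vanish.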

With respect to segmentation and what constitutes a word (always an interesting topic in itself) Chad reports:

Looking at their datasets, Google did word segmentation BEFORE N-gram compilation, but their segmentation is less than perfect. 圣诞节 expressed as “圣诞 节” returns results. Likewise, “俄 国” and “周 恩来” seem to be the right segmentation. “万 圣 节” vs. 万圣节 returns different results but similar in magnitude, for reasons I don’t yet understand.
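
A toy Python sketch of why segmenting first matters. The three-token sentence and the way it is split are hypothetical, chosen only to mirror Chad’s 圣诞 节 example; this is not Google’s actual pipeline.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Join every run of n consecutive tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical segmenter output: 圣诞节 has been split into 圣诞 + 节.
segmented = ["圣诞", "节", "快乐"]

counts = Counter()
for n in (1, 2):
    counts.update(ngrams(segmented, n))

# The unsegmented word was never a token, so it has no count at all,
# while the spaced 2-gram does.
print(counts["圣诞节"])   # 0
print(counts["圣诞 节"])  # 1
```

If the counts are built this way, a search only hits when it is typed the way the segmenter split the text, which would explain why the spaced form returns results and the unspaced one may not.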

Bonus question I: what search term below is represented by red (blue = 共产党, Communist party)? Question and answer courtesy of André in the comments below. Hint: look carefully at the years.

[Screenshot: Google Ngram Viewer chart for bonus question I]

Bonus question II, from Chad in the comments: 蓝 色,红色,绿色,黄色,白色,黑色 [blue, red, green, yellow, white, black] Without looking, can you guess which one starts to break away from the pack around 1990? (the space in 蓝 色 is intentional, as 蓝色 yields no results)

21 responses to “Ngram this! — The 中文 Ngram challenge”

  1. André says:

    I think this is quite telling:

  2. André says:

    I obviously did something wrong there. The comment above was supposed to include the following link:

    右派和左派的比较

  3. Katie says:

    Yeah, there seems to be something a little screwy with the Chinese version. I was playing around with it yesterday and tried 毛泽东, 周恩来, 邓小平 and for reasons unknown to me, got nothing for Zhou Enlai. I tried a few words my teacher had mentioned as being currently trendy and got no hits for those either, but that’s not as surprising since they are using books as their source.

    Now to think of something cool …

  4. Katie says:

    So far the best I’ve come up with is 先生 vs. 同志, which have just now met up with each other for the first time since 1959 or so. But 北京,上海,广州,程度,天津,大连 is kind of telling, as is 美国,日本,印度,苏联,俄国,朝鲜,韩国 (俄国 not actually registering for me, but some of the searches I’ve been doing seem to randomly include things and then leave them out again).

    (Off subject: why doesn’t it accept Chinese commas?)

  5. Limao Luo says:

    You can also screw around with smoothing and get completely different graphs (e.g. smoothing = 0 shows 共产党 actually got beaten by 和谐 in 2006, while smoothing = 50 shows that 和谐 didn’t even get close).

  6. Louis Platz says:

    @Katie:

    If there is a large difference between the distribution of the items you graph, the scale on the y-axis may be too large to show variation in the less popular items. For example, when I graph 台湾, the maximum value on the y-axis is .0300% and the scale rises in .0020% increments. Now, compare this to when I graph 经济; the maximum y-axis value is .5% and the scale rises in .05% increments. When I search 台湾 and 经济 together, the y-axis scale is set at a max of .5% with .05% increments (经济’s setup). This scale is large enough to make the variation in 台湾 look like zero for much of its distribution. So then, if you want a specific understanding of an item’s distribution, you should graph it independently.

  7. Chad says:

    N-grams aren’t always words and vice versa, which could explain why words like 圣诞节 don’t give any results.

    Their datasets are free to download, which is awesome.

  8. I think it’s interesting how ‘中国’ has a significant decline at the beginning of the 20th century, then picks up from the 1940s.

    中国 1900 to 2010 ngram

  9. Chad says:

    Looking at their datasets, Google did word segmentation BEFORE N-gram compilation, but their segmentation is less than perfect. 圣诞节 expressed as “圣诞 节” returns results. Likewise, “俄 国” and “周 恩来” seem to be the right segmentation. “万 圣 节” vs. 万圣节 returns different results but similar in magnitude, for reasons I don’t yet understand.

  10. André says:

    Interesting to look at the development of the third-person singular pronouns:

    他,她,它 – 1800-200

    (I guess this is one of the few websites out there where people would actually be excited to click on a link with that heading, hehe)

  11. Syz says:

    All: I’ve linked to some of the charts as well as pasted some of the comments into the post above. Great stuff.

    @Katie: I share the annoyance — why does Google force a switch to western imperialist commas?
    @Limao Luo: good point, but keep in mind that smoothing=50 is a running avg for 50 yrs. In terms of the data Google has at its disposal, a 50 yr avg would be almost equivalent to just a single average number. smoothing=0 is raw data, of course, so it’s going to show a lot of fluctuation.

  12. André says:

    The personal Cult of Mao Zedong graphed

    Pay attention to the year 1966, the first year of the Cultural Revolution.

  13. Syz says:

    Nice one, André. I added a graph to the post above.

  14. Claw says:

    程度? Don’t you mean, 成都?

  15. Chad says:

    黑色,红色,白色,绿色,蓝 色,黄色. Without looking, can you guess which one starts to break away from the pack around 1990? (the space in 蓝 色 is intentional, as 蓝色 yields no results)

  16. Katie says:

    Er, yes. Sorry about that.

  17. Syz says:

    @Claw: good catch, thx. fixed above. Off topic: somehow wordpress decided not to email me your comment, even though it emails me everyone else’s. weird
    @Chad: Good quiz question. I added it above and even (anally) reordered so that the colors mostly matched Google’s graphing scheme.

  18. Claw says:

    If you add 香港 to the mix of cities, you’ll see the obvious peak around 1997, but the interesting thing is that in recent years, it shows the same rise as 广州. The slope of the 广州 line appears to follow 香港 almost exactly by about one year. 深圳 does not appear to follow the same rise though.

  19. pc says:

    Why is the answer to Chad’s question the way it is?

    Anyone have any insights on how China’s been using numbers?
    The use of Roman numerals shows an unexpected spike in the late 60’s.

    And the use of standard Mandarin numerals just shows a sad depressing decline into nothingness.

    I wonder why?

  20. André says:

    @pc My best guess would be the standardization that Mao and the communists started in the early 50s. That’s when you got three different third-person singular pronouns and so on. One of the reasons behind the standardization was allegedly to create unity in order to reflect power. Allegedly, Mao realized that all the powerful nations of the West had standard languages and decided to create one for China.

    (I guess improved literacy was also an important factor.)

    If anyone is interested in more reading on the topic I can recommend a book called “Linguistic engineering: language and politics in Mao’s China”.

    A review of the book can be found here

  21. John Pasden says:

    Nice post! I’ve wanted to blog on this since I first saw the Google Ngram Viewer announcement, but I’m way too behind on my blogging. Well done.
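
On the smoothing back-and-forth in comments 5 and 11, a small Python sketch may help. It assumes smoothing=n averages each year with up to n neighbours on either side, which is how I read Google’s description of the setting; the yearly values are invented, and `smooth` is just a name for this sketch.

```python
def smooth(values: list[float], smoothing: int) -> list[float]:
    """Average each point with up to `smoothing` neighbours on each side."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)                 # clip the window at the start
        hi = min(len(values), i + smoothing + 1)   # and at the end of the series
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Invented yearly frequencies with a single one-year spike.
raw = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]

print(smooth(raw, 0))    # raw data: the spike is fully visible
print(smooth(raw, 1))    # the spike is spread over three years
print(smooth(raw, 50))   # window wider than the data: one flat average
```

With a window wider than the whole series, every year collapses to the same overall average, so a one-year spike like 和谐 in 2006 can disappear entirely.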
