Ngram this! — The 中文 Ngram challenge
Original title: The most fun you can have (legally) on a Saturday night in Beijing outside the fifth ring
If you haven’t already seen what Google has come up with…
…then you’re probably in danger of becoming an offline recluse who lives in Beijing exurbia and whose idea of “social interaction” is nodding to the elderly gentleman who walks by every morning as you exercise at 5:30am.
But if you’ve got that problem, then why not submit your favorite Ngram sets in the comments and win the Ngram challenge! (Award amount to be announced as soon as sponsor is finalized)
Need some background? Here’s what Google says:
When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years.
Pretty simple in the abstract. Mind-bogglingly cool if you get creative with your searches. Too bad I haven’t yet. Up above is my best effort: I like how 和谐 (“harmony”), in blue, looks poised to overtake 共产党 (the Communist Party) itself. But a lot of other tries got zilch. With 大哥大 vs 手机 (dàgēdà vs shǒujī, the out-of-date and modern words for mobile phone), for example, 大哥大 simply didn’t register: no hits.
Let me know what you see. Looking forward to seeing some sweet charts.
Update: selected charts from the comments (thanks, everyone) — click on link to get the Ngram chart
- 右派,左派 — “rightist, leftist”
- 先生,同志 — “Mr.” vs “Comrade”
- 北京,上海,广州,成都,天津,大连 — Beijing, Shanghai, Guangzhou, Chengdu, Tianjin, Dalian [esp. nice if you’re a Guangzhou fan!]
- 美国,日本,印度,苏联,俄国,朝鲜,韩国 — America, Japan, India, USSR, Russia, North Korea, South Korea
- 中国 — China [note I put smoothing at 10]
- 他,她,它 — he, she, it
Keep in mind, too, that when a rare term is graphed alongside a common one, the difference in scale can flatten the rare term’s line. As Louis Platz says below…
If there is a large difference between the distribution of the items you graph, the scale on the y-axis may be too large to show variation in the less popular items. For example, when I graph 台湾, the maximum value on the y-axis is .0300% and the scale rises in .0020% increments. Now, compare this to when I graph 经济; the maximum y-axis value is .5% and the scale rises in .05% increments. When I search 台湾 and 经济 together, the y-axis scale is set at a max of .5% with .05% increments, which is 经济’s setup. This scale is large enough to perceptually render the variation in 台湾 to zero for much of its distribution. So then, if you want a specific understanding of an item’s distribution, you should graph it independently.
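To put a rough number on Louis’s point, here is a small Python sketch. It only uses the two y-axis maxima he quotes (.03% for 台湾 and .5% for 经济), and it assumes the viewer sets the top of the shared axis to the largest peak being plotted. That is my guess at the behavior, not something Google documents here. The function name `share_of_axis` is made up for illustration.

```python
# Peak frequencies (in percent) taken from the y-axis maxima quoted above.
PEAKS = {"台湾": 0.03, "经济": 0.5}

def share_of_axis(terms, peaks=PEAKS):
    """Return each term's peak as a fraction of the shared y-axis height.

    Assumption (not confirmed by Google): the axis top equals the
    largest peak among the terms plotted together.
    """
    axis_top = max(peaks[t] for t in terms)
    return {t: peaks[t] / axis_top for t in terms}

# Graphed alone, 台湾 fills the whole chart height.
print(share_of_axis(["台湾"]))

# Graphed with 经济, 台湾 is squeezed into a thin band at the bottom.
print(share_of_axis(["台湾", "经济"]))
```

Under that assumption, 台湾 goes from using the full height of the chart to roughly 6% of it once 经济 is added, which is why its ups and downs seem to vanish.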
With respect to segmentation and what constitutes a word (always an interesting topic in itself) Chad reports:
Looking at their datasets, Google did word segmentation BEFORE N-gram compilation, but their segmentation is less than perfect. 圣诞节 expressed as “圣诞 节” returns results. Likewise, “俄 国” and “周 恩来” seem to be the right segmentation. “万 圣 节” vs. 万圣节 returns different results but similar in magnitude, for reasons I don’t yet understand.
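If you want to hunt for the segmentation Google actually used for a term, one low-tech approach is to list every way of putting spaces into it and try each candidate in the viewer by hand. The Python sketch below does only the listing part; it does not talk to Google at all, and the function name `segmentations` is my own.

```python
from itertools import product

def segmentations(term):
    """Yield every way to split `term` into space-separated pieces."""
    if not term:
        return
    gaps = len(term) - 1  # positions between characters where a space could go
    for choice in product(("", " "), repeat=gaps):
        pieces = [term[0]]
        for sep, ch in zip(choice, term[1:]):
            pieces.append(sep + ch)
        yield "".join(pieces)

# Candidates to paste into the Ngram Viewer one at a time.
for candidate in segmentations("圣诞节"):
    print(candidate)
```

A term of n characters has 2^(n-1) candidates, so this stays manageable for the short words and names discussed here (four candidates for 圣诞节 or 周恩来).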
Bonus question I: what search term below is represented by red (blue = 共产党, Communist party)? Question and answer courtesy of André in the comments below. Hint: look carefully at the years.
Bonus question II, from Chad in the comments: 蓝 色,红色,绿色,黄色,白色,黑色 [blue, red, green, yellow, white, black] Without looking, can you guess which one starts to break away from the pack around 1990? (the space in 蓝 色 is intentional, as 蓝色 yields no results)