New linguistic corpus of Sina Weibo messages

While Kellen and Steve are still working hard on their fascinating new project, I just wanted to tell Sinoglot readers about my new corpus of Sina Weibo messages.

Over the past few months, I’ve been building the Leiden Weibo Corpus (LWC), and I’m proud to announce that it is now publicly available. The LWC is an annotated linguistic corpus of 100 million words, containing 5.1 million messages from Sina Weibo, China’s Twitter-like microblogging service. It’s freely available online at http://lwc.daanvanesch.nl/.

Because I collected the data for the LWC in January 2012, the LWC contains many linguistic phenomena that may not be found in older corpora, such as suffixation with “-ing”, an aspectual marker borrowed from English (covered on the Log and on Pinyin.info). Furthermore, Sina Weibo messages come with valuable metadata, such as the gender of the user and their location. This means the LWC can show how often words are used in different provinces and cities across China, which may be useful if you’ve always wondered where that pesky 方 word in your dictionary is really used :)
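
To make the idea of per-province word frequencies concrete, here is a minimal Python sketch. It is not how the LWC itself is implemented: the function name, the (province, tokens) input format and the three toy messages are all assumptions made up for illustration, and the messages are assumed to be segmented into words already.

```python
from collections import Counter

def relative_frequency_by_province(messages, word):
    """Return {province: share of all tokens in that province equal to `word`}.

    `messages` is an iterable of (province, tokens) pairs. This input
    format is an assumption for this sketch, not the LWC's data model.
    """
    hits = Counter()    # occurrences of `word` per province
    totals = Counter()  # total number of tokens per province
    for province, tokens in messages:
        totals[province] += len(tokens)
        hits[province] += tokens.count(word)
    # Skip provinces with no tokens to avoid dividing by zero.
    return {p: hits[p] / totals[p] for p in totals if totals[p] > 0}

# Toy, made-up data: three already-segmented messages.
sample = [
    ("Guangdong", ["我", "睇", "书"]),
    ("Guangdong", ["你", "睇", "咩"]),
    ("Beijing", ["我", "看", "书"]),
]
freq = relative_frequency_by_province(sample, "睇")
```

Dividing by the total token count per province matters here: provinces with more Weibo users would otherwise look as if they used every word more often.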

Naturally, the LWC also supports searching for single words or grammar patterns, such as “any verb followed by an aspectual particle and then a noun”, or “any 被 construction followed by a noun and a verb”. Students and teachers of Mandarin who are looking for example sentences may like this feature.
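
For readers curious what a grammar-pattern search amounts to, here is a small Python sketch of the first pattern above. It assumes each message has already been split into (word, tag) pairs; the tag names VERB, ASP and NOUN and the function `find_pattern` are hypothetical and do not reflect the LWC's actual tagset or search interface.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

def find_pattern(tokens: List[Token], pattern: List[str]) -> List[List[str]]:
    """Return every run of consecutive words whose tags match `pattern`."""
    hits = []
    n = len(pattern)
    # Slide a window of len(pattern) tokens across the message.
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(tag == want for (_, tag), want in zip(window, pattern)):
            hits.append([word for word, _ in window])
    return hits

# Toy tagged message; the tags are illustrative assumptions.
message = [("我", "PRON"), ("吃", "VERB"), ("了", "ASP"), ("饭", "NOUN")]
matches = find_pattern(message, ["VERB", "ASP", "NOUN"])
```

In this toy message the only match is 吃 了 饭, a verb followed by the aspectual particle 了 and a noun.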

Another feature you may like is the map of China where you can click every city to see what its Sina Weibo users posted back in January. And there’s more, so why not go and explore for yourself? I’d love to hear what you think!

In the next few weeks, I’ll be posting about a few interesting words or grammatical phenomena I came across in the LWC. For example, did you know that -men 们 is very commonly attached to entire noun phrases? Here are a few examples to whet your appetite:

  • shēngbìng de men 生病的们 ‘people who are ill’
  • zài zuò chīkuáng mèng de men 在做痴狂梦的们 ‘people who are having crazy dreams’
  • ài fā bú ài fā duǎnxìn de men 爱不爱发短信的们 ‘people who like sending text messages and people who dislike sending text messages’
  • zuò èr hào xiàn de men 做二号线的们 ‘people who regularly take subway line 2’.

Mysteriously, you even find shéimen kàndào le jiù zhùfú shéimen! 谁们看到了,就祝福谁们!which seems to mean something like ‘congratulations to everyone who saw this!’ – but really?! ye olde shéi 谁 with -men 们?

I’d love to hear what you think, both about these examples and about the LWC! What interesting stuff can you find?


8 responses to “New linguistic corpus of Sina Weibo messages”

  1. You just made my inner linguistic nerd extremely happy. This is awesome! Thanks so much for this.

  2. Alexis says:

    COOL! I just took a corpus linguistics class, though it was all in English. This looks like an awesome resource. Thanks for putting it up! Can’t wait to see what you do with it. :)

  3. Katie says:

    Really, really cool. I guess China’s real ID requirements on the internet are good for something :)

  4. Nicki says:

    “zuò èr hào xiàn de men 做二号线的门 ”

    Should this one be 们 like the others?

  5. Daan says:

    Thanks for the nice words everyone! Glad to see you’re enjoying the database. There’s really been much more interest than I’d ever expected :)

    Nicki, good catch, thanks! Fixed. And Katie, the data for the corpus was actually collected in January, before the current real-name registration system was implemented. But even before the 實名制, users already had to tell Sina Weibo in which city they live :)

  6. John Pasden says:

    Great job, Daan! This is really excellent. Would be great to see some stats on 网络语言, 错别字, etc. (the kind of stuff you’d expect to see more of in a Weibo corpus). Of course, some of that wouldn’t be too easy to run stats on…

  7. Daan says:

    Thanks, John! That’s a good idea for a Sinoglot post – I’ll see if I can put something together to find 错别字 sometime soon.

  8. Kaiwen says:

    Cool resource–definitely had not encountered that use of -们 before today.

    -ING is not a new construction, at least not in Taiwan Mandarin. I saw it as an MSN messenger status as early as 2004 (睡觉ing, 读书ing, 发呆ing), and the popular 五月天 song 《恋爱ING》came out in 2005 as far as I know.
