New linguistic corpus of Sina Weibo messages
While Kellen and Steve are still working hard on their fascinating new project, I just wanted to tell Sinoglot readers about my new corpus of Sina Weibo messages.
Over the past few months, I’ve been building the Leiden Weibo Corpus (LWC), and I’m proud to announce that it is now publicly available. The LWC is an annotated 100-million-word linguistic corpus containing 5.1 million messages from Sina Weibo, China’s Twitter-like microblogging service. It’s freely available online at http://lwc.daanvanesch.nl/.
Because I collected the data for the LWC in January 2012, the LWC contains many linguistic phenomena that may not be found in older corpora, such as suffixation with “-ing”, an aspectual marker borrowed from English (covered on the Log and on Pinyin.info). Furthermore, Sina Weibo messages come with valuable metadata, such as the gender of the user and their location. This means the LWC can show how often words are used in different provinces and cities across China, which may be useful if you’ve always wondered where that pesky word marked 方 (“dialectal”) in your dictionary is really used.
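To make the per-province idea concrete, here is a minimal Python sketch of how a word's relative frequency could be computed per province. It is not how the LWC itself is implemented: the message layout (a dict with `province` and `tokens` keys), the function name and the three toy messages are all assumptions made up for illustration, not data from the corpus.

```python
from collections import Counter

def relative_frequency_by_province(messages, word):
    """Return {province: occurrences of `word` / total tokens in that province}."""
    hits = Counter()    # occurrences of `word` per province
    totals = Counter()  # total number of tokens per province
    for msg in messages:
        province = msg["province"]
        totals[province] += len(msg["tokens"])
        hits[province] += msg["tokens"].count(word)
    # Skip provinces with no tokens to avoid dividing by zero.
    return {p: hits[p] / totals[p] for p in totals if totals[p]}

# Invented toy messages (NOT taken from the LWC); the field names are hypothetical.
messages = [
    {"province": "Beijing", "tokens": ["你", "说", "啥"]},
    {"province": "Beijing", "tokens": ["吃", "饭", "了"]},
    {"province": "Guangdong", "tokens": ["你", "说", "什么"]},
]

freq = relative_frequency_by_province(messages, "啥")
for province, value in sorted(freq.items()):
    print(province, round(value, 3))
```

Dividing by the total token count per province matters here: raw counts would mostly reflect which provinces have the most Weibo users rather than where a word is actually preferred.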
Naturally, the LWC also supports searching for single words or grammar patterns, such as “any verb followed by an aspectual particle and then a noun”, or “any 被 construction followed by a noun and a verb”. Students and teachers of Mandarin who are looking for example sentences may like this feature.
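For readers curious what such a grammar-pattern search amounts to, here is a rough Python sketch of matching “any verb followed by an aspectual particle and then a noun” against a part-of-speech-tagged sentence. The tags (`v`, `u`, `n`, `r`, `w`), the helper names and the example sentence are assumptions for illustration only; the LWC's own tagset and query syntax may well differ.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); the tagset is hypothetical

ASPECT_PARTICLES = {"了", "着", "过"}

def is_verb(tok: Token) -> bool:
    return tok[1] == "v"

def is_aspect_particle(tok: Token) -> bool:
    return tok[1] == "u" and tok[0] in ASPECT_PARTICLES

def is_noun(tok: Token) -> bool:
    return tok[1] == "n"

def find_pattern(sentence: List[Token],
                 predicates: List[Callable[[Token], bool]]) -> List[List[str]]:
    """Return the words of every window of adjacent tokens matching all predicates in order."""
    width = len(predicates)
    matches = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        if all(pred(tok) for pred, tok in zip(predicates, window)):
            matches.append([word for word, _ in window])
    return matches

# Hand-tagged toy sentence: 我吃了饭。 'I have eaten.'
sentence = [("我", "r"), ("吃", "v"), ("了", "u"), ("饭", "n"), ("。", "w")]
matches = find_pattern(sentence, [is_verb, is_aspect_particle, is_noun])
print(matches)
```

Expressing each slot as a predicate rather than a fixed tag keeps the pattern flexible: a slot can test the word, the tag, or both, which is what a query like “an aspectual particle” needs.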
Another feature you may like is the map of China where you can click every city to see what its Sina Weibo users posted back in January. And there’s more, so why not go and explore for yourself? I’d love to hear what you think!
In the next few weeks, I’ll be posting about a few interesting words or grammatical phenomena I came across in the LWC. For example, did you know that -men 们 is very commonly attached to entire noun phrases? Here are a few examples to whet your appetite:
- shēngbìng de men 生病的们 ‘people who are ill’
- zài zuò chīkuáng mèng de men 在做痴狂梦的们 ‘people who are having crazy dreams’
- ài bú ài fā duǎnxìn de men 爱不爱发短信的们 ‘people who like sending text messages and people who dislike sending text messages’
- zuò èr hào xiàn de men 做二号线的们 ‘people who regularly take subway line 2’
Mysteriously, you even find shéimen kàndào le jiù zhùfú shéimen! 谁们看到了，就祝福谁们！which seems to mean something like ‘congratulations to everyone who saw this!’ – but really?! ye olde shéi 谁 with -men 们?
I’d love to hear what you think, both about these examples and about the LWC! What interesting stuff can you find?