New linguistic corpus of Sina Weibo messages

While Kellen and Steve are still working hard on their fascinating new project, I just wanted to tell Sinoglot readers about my new corpus of Sina Weibo messages.

Over the past few months, I’ve been building the Leiden Weibo Corpus (LWC), and I’m proud to announce that it is now publicly available. The LWC is an annotated 100-million-word linguistic corpus containing 5.1 million messages from Sina Weibo, China’s Twitter-like microblogging service. It’s freely available online at http://lwc.daanvanesch.nl/.

Because I collected the data for the LWC in January 2012, the LWC contains many linguistic phenomena that may not be found in older corpora, such as suffixation with “-ing”, an aspectual marker borrowed from English (covered on the Log and on Pinyin.info). Furthermore, Sina Weibo messages come with valuable metadata, such as the gender of the user and their location. This means the LWC can show how often words are used in different provinces and cities across China, which may be useful if you’ve always wondered where that pesky 方 word in your dictionary is really used :) Continue…

Interview with authors of 500 Common Chinese Idioms

Full disclosure: Sinoglot earns not even 一分钱 (one cent) if you click on the link below and buy the book. However, we do accumulate good vibes from the improvement of Zhonglish around the world.

Title: 500 Common Chinese Idioms (成语五百条)

I first found out about this book from Carl Gene, who gave it a ringing endorsement. When I received it for Christmas last year and started thumbing through, it wasn’t hard to see why: they have done chengyu right for the second language learner! The 500 are selected by frequency from six corpuses* of spoken and written language. For each chengyu, two example sentences are constructed – and very well constructed! And of course there is lots of detailed explanation about history and usage.

I was so smitten I wrote the authors a mash letter and asked for a Sinoglot** interview, which they were kind enough to accede to. Ladies and gentlemen, please welcome Liwei Jiao and Cornelius Kubler: Continue…

Hack free, up and running

As Kellen mentioned, we’ve been cleaning up the Sinoglot servers since our friends, the script hackers, visited.

We are now at a new hosting service and have sterilized every garment that we brought over. You may notice that some garments haven’t quite made it yet: broken links, no more email subscriptions, etc. Apologies. We’re working on it. In the meantime, if you notice something missing, let us know.

Now it’s back to our eclectic, sporadic posting.

Hacked!

Much like what happened to Sinosplice a while back, Sinoglot has been hacked. In fact it was the exact same hack that hit John’s site.

The hack is the eval(base64_decode JavaScript injection that you may be familiar with if you’re a web developer. It is fairly benign, as hacks go. You’ll notice it as a user when your browser is suddenly redirected to a website in Russia that you had no intention of visiting. Otherwise it’s business as usual.

What does this mean for readers?
It’s Saturday and a busy one at that. Steve and I are working to contain the problem and get clean installs of WP and other CMSs up, stripped of the offending files. The source of the problem, again mirroring John Pasden’s case, seems to be an outdated WP install that got left on the server.

Again, this hack operated by inserting a JavaScript redirect into PHP files, so there should be no risk to people reading the site. If you’re concerned, just make sure your anti-virus software is up to date.
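For anyone cleaning up a similar mess, here is a minimal sketch of how one might look for files carrying the eval(base64_decode signature. It is not the procedure we used; the function name `find_suspect_files`, the whitespace-tolerant pattern, and the choice to scan only `*.php` files are all assumptions made for illustration, and real injections may be obfuscated in ways this will miss.

```python
import re
from pathlib import Path

# Assumed signature: the literal call named in the post, with optional
# whitespace between tokens. Obfuscated variants will not match.
SIGNATURE = re.compile(rb"eval\s*\(\s*base64_decode\s*\(")


def find_suspect_files(root):
    """Return the .php files under `root` whose bytes contain the signature."""
    suspects = []
    for path in sorted(Path(root).rglob("*.php")):
        if not path.is_file():
            continue  # skip directories that happen to end in .php
        try:
            data = path.read_bytes()
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        if SIGNATURE.search(data):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Hypothetical usage: scan the current directory and list matches.
    for suspect in find_suspect_files("."):
        print(suspect)
```

A match only flags a file for a closer look by hand. Some legitimate code uses the same pair of calls, so this is a starting point, not a verdict.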

We will probably not be migrating to another server, as that won’t prevent future attacks. Instead we’re re-installing the different content management systems. As a result there may be some downtime this weekend.

Fortunately, we were able to find the problem, so now it’s just a matter of repairing the damage done. Unfortunately, those repairs are time-consuming. Please bear with us as we get this all cleared up. We thank you in advance for your patience.

Dyslexia

I’ve had this conversation a dozen times with friends. Chances are so have you. It starts when someone asks if there’s such a thing as dyslexia in the Mandarin-speaking world. After all, if dyslexia means mixing up letters, and there are no letters, then there must be no dyslexia, right?

This came across my Twitter radar earlier today. It was a quick conversation between Matthew Stinson and Kane Gao, the latter having provided fodder for posts in the past. Today it was about why there seem to be few (if any) known cases (in the mass-consumer English-speaking world) of dyslexia among Chinese speakers.

Continue…

Dialects & Kong Qingdong

It’s hard to research 方言. You want to talk to someone from outside Yangzhou about their 语言, about whether or not it’s 吴语. The term 吴语 inevitably causes confusion, and so you specify, but not by using the one thing you know would get to the point most quickly. You know you could just rephrase it as 吴方言 and that’d make things perfectly clear. But you resent the term 方言. So you say, “No, you know, 吴国的语言” but of course that doesn’t help either. “上海话,苏州话,温州话等。都是吴语” you say. “Ohhh. You mean 吴方言!” your interlocutor says.

So you give in. Maybe you argue that 方言 can be 语言 too. You tell him that in Tang times, 维语 was called a 方言, and that at times even English was called 方言 in official texts. But probably you don’t. Probably you just accept it and move on, knowing from experience that there’s little point in arguing this point.

Continue…

Bloody Fish

Regular readers may have noted that I’ve published mercifully little correspondence over recent weeks. To be honest, I’ve been a bit slow catching up with the backlog, and Auntie’s continuing health problems mean that I’ve little choice but to throw all but the most urgent items in the bin.

Anyway, the following item struck me as serious enough to read to Auntie when I visited her this morning. Unfortunately, as soon as I reached the part about the Canadian, she stuck her fingers in her ears and began mumbling something unprintable. Shortly afterwards, the doctor had to come and administer her medication. Continue…

Number Taboos in Sino-Korean

This post explores a bit of Sino-Korean etymology and the usage of certain vocabulary.

On the 22nd I wrote about the use of F in place of 4 on elevator keypads, even when it comes to Braille. Zrv made a good point about the pronunciation of 四 and 死, both 사 (sa) in Korean. From his comment:

I think it’s really not accurate to say that the homophones in this case are in a “foreign language”. Sino-Korean words are as much Korean as Latin-English words (like “very”) or Franco-English words (like “enter”) are English.

That’s absolutely true. While a Cantonese speaker would likely understand much of what was said around them while in Seoul, it’s all still Korean.

However, aware of the 죽다 verb form that’s most commonly used for “to die”, I wanted to look into the homophones. The question I left in the comments is this: By modern standards, can we consider ‘death’ and ‘four’ homophonous in Korean if 죽다 is the preferred word?

Continue…

Healthy Teeth Sanzijing

One thing that struck me early upon arriving in China and immersing myself in the language (almost ten years ago!) was how modern Chinese is permeated with classical Chinese. I soon came to the horrific realization that if I were to learn Chinese beyond a basic level, I would have to accept this fact. Of course the most common way this shows up is in chengyu, but we see references to this older language everywhere, especially if we examine how school kids are taught.

One thing that horrified me as a parent was that my kids were asked to blindly memorize many long classics. One of these is the Three Character Classic. Because of its three-character limit, it has less room for variation in syntax. There are only four possibilities for phrase structure: XXX, XXY, XYY, and XYZ. Here a repeated letter represents a multi-syllable word, with as many letters as the word has syllables, and a single letter represents a one-syllable word.
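To see why there are exactly four, note that a three-syllable line has two gaps between syllables, and each gap either is or isn't a word boundary. The sketch below enumerates those choices; the function name `phrase_structures` and the `LABELS` string are made up for illustration, and the code only counts ways of cutting a line into words, saying nothing about which cuts are grammatical.

```python
from itertools import product
import string

# Word labels: X, Y, Z first (matching the notation above), then A..W.
LABELS = "XYZ" + string.ascii_uppercase[:23]


def phrase_structures(n_syllables=3):
    """List every way to cut a line of n syllables into consecutive words."""
    patterns = []
    # One True/False choice per gap between neighbouring syllables.
    for breaks in product([False, True], repeat=n_syllables - 1):
        word_index = 0
        pattern = LABELS[word_index]
        for is_break in breaks:
            if is_break:
                word_index += 1  # a boundary starts the next word
            pattern += LABELS[word_index]
        patterns.append(pattern)
    return patterns


print(phrase_structures(3))  # ['XXX', 'XXY', 'XYY', 'XYZ']
```

Each extra syllable adds one more gap and so doubles the count, which gives a rough sense of how tightly the three-character form constrains things.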

I’m not opposed to studying these things; there is a lot of wisdom in them. But the way they are normally studied is ridiculous: they are memorized with very little explanation and recited in a banal rhythm at high speed. And that’s that. And they seem to be brainwashed into thinking that by doing so, these treasure troves of ancient wisdom will become part of them, slowly infusing them with beneficence throughout their lives.

Colgate seems to have picked up on this, and has made their own version. Continue…