Phonemica: a panorama of Chinese
Kellen and I are very excited to announce, first to our Sinoglot friends, the beta launch of an entirely new project* that we hope will be a rich source of scholarship, activity, and (geeky) entertainment for years to come.
Phonemica (乡音苑, xiāngyīnyuàn), to quote the tagline, is “a panorama of Chinese, painted by its speakers through their stories.” In less poetic terms, the website is a group-sourced collection of carefully transcribed, high-quality recordings of both Standard Mandarin (putonghua) and local varieties of Chinese.
Since it’s up and running, you could (should?) skip my description below and get started:
- Get a username (so you can edit transcripts)
- Listen to some recordings
- Read the Get Involved page for an intro, then go edit a transcript: putonghua, or, say, Changzhou dialect (a form of Wu)
- Subscribe to the Phonemica blog**
Overview
But if you want to get acquainted with the Phonemica concept first, the front page map is a good place to start. Here’s how it looks today with 12 recordings:
As you might imagine, each flag on the map indicates a recording. The location of the flag is the speaker’s hometown — where they grew up. If the interviewee told a story in their local version of Chinese, it’s flagged as “土 vernacular”. If it’s a version of Standard Mandarin, it gets the “普 putonghua” flag.
From the map and description, some of the long-term goals are probably evident — that Phonemica will provide…
- A language tour of Chinese. By this I mean simply that you’ll be able to fly, virtually, to every corner of the world where a version of Chinese is spoken, listening to the sounds. How different from Mandarin is Sichuan “dialect”, really? What does the putonghua of a speaker from Hunan sound like?
- Linguistic analysis. Many scholars and observers have written about differences between varieties of Chinese, differences in how Standard Mandarin is spoken from locale to locale, and so on. We want to build on that past work and pair it up with empirical data. How often does a Hunan putonghua speaker overcorrect /c/ to /ch/? Phonemica will have the analysis, and the samples.
- Language preservation. Anyone in China with an interest in language diversity is keenly aware that economic and social forces are rapidly sweeping away even sizable communities of speakers. Kellen explored this some in his blog about Wu Chinese, especially the variety spoken in Changzhou. On Beijing Sounds, I’ve talked a bit about distinct language communities that exist even within what is called “Beijing”. Phonemica is partly an effort to capture the enormous variety that still exists.
Considering the goals, it’s equally evident that the Phonemica undertaking is, well, hefty. We limit the scope somewhat as follows:
- Recordings only in the Chinese family of languages: Mandarin, Wu, Yue, Xiang, Hakka, Southern Min, Gan and Northern Min
- Linguistic analysis mostly limited to Standard Mandarin, for the time being
- Single-speaker recordings only (though obviously with questions from an interviewer); no dialogs or group recordings
- Native or near-native speakers only
That said, the endeavor is still epic, so I’ll talk a bit about how we’re approaching it through group collaboration.
How Phonemica is possible
Only the participation of many will make this project possible, so we’ve designed everything as a group effort. Start by taking a look at the collectively-edited transcript. Here’s a snippet from our first recording:
1. transcription in Chinese characters
2. a romanization (e.g. Hanyu Pinyin for Mandarin)
3. IPA
4. a translation into English
Every piece, 1-4, is editable online by any user — just click on the segment (when signed in), make your changes, and save. Voila! Do as little or as much as you want. Those who want to focus on any particular task (say, just translation) are welcome to do that, or you can attack everything at once.
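For the programmers among our readers, here is a rough way to picture one transcript segment and its four editable pieces. This is only a sketch under our own assumptions: the class name `Segment`, the field names, and the `missing_tiers` helper are all hypothetical and are not taken from the code the site actually runs, and the sample text is made up for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Segment:
    """One transcript segment; every name here is hypothetical."""
    characters: str = ""    # transcription in Chinese characters
    romanization: str = ""  # e.g. Hanyu Pinyin for Mandarin
    ipa: str = ""           # IPA
    translation: str = ""   # translation into English

    def missing_tiers(self) -> List[str]:
        """Return the names of the pieces that are still empty."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


# Made-up example: a contributor has filled in only the first two pieces.
segment = Segment(characters="你好", romanization="nǐ hǎo")
print(segment.missing_tiers())
```

Because every piece defaults to empty, a segment like this can be saved with whatever a contributor has done so far, and someone else can fill in the rest later.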
But the collaboration doesn’t stop at the transcript. Recordings, too, will be group-sourced. We have drafted some standards and a system, and eventually envision having dozens of people involved in gathering high-quality recordings. In particular, we will be working to get donated recording equipment and enlist local help from both interested foreigners and Chinese. One of the ideas we may borrow from Basic Oral Language Documentation, for example, is to collaborate with university programs and send students back to their hometowns with borrowed recorders, all the better to encourage recordings in local vernaculars.
Help wanted
That’s the grand plan, but we need to get people involved and helping us with the working and thinking. That’s where we’re hoping some of our Sinoglot readers come in. Of course we expect 😉 you to be excited about the basic tasks of recording, editing and translating — but if you want to be further involved, here are just a few areas we know we need help in:
- Making contacts with academic programs both in China and abroad
- Getting leads on possible sponsors & donations of audio equipment
- Native Mandarin speakers to help with translation of Phonemica pages into Mandarin, both simplified and traditional characters
As these activities get off the ground, we’ll also need management and logistical help. If you think of something else you can do, send us a note (“steve” or “kellen” <at> phonemica.net). Thanks for bearing with the slowness of the Sinoglot blogging for the last few months. We hope you’ll enjoy getting your fingers sticky in the new project.
——–
PS: Infrequently Asked Questions
I’ve got something I want to use Phonemica content for, can I? Probably yes. One of our goals, in fact, is to create great recordings, transcripts and so on, then to encourage people to use them in ways we haven’t thought of or don’t have time for. To that end we’re making all content on Phonemica — including recordings, transcriptions, comments, blogposts — available for public use under this Creative Commons copyright license, which should cover almost any non-profit / educational usage. If in doubt, send us a note.
What’s Phonemica built on? Well, it’s not software you’ll find on a virtual shelf somewhere. Kellen has built from scratch everything you see on the site — editing system, scrolling transcript, audio playback, history tracking, user functionality etc. — and a lot more that you can’t see that’s even more impressive: audio segmenting comes to mind. There’s a lot of great stuff in the pipeline, but feel free to send us thoughts about what you’d most like to see.
What’s not working yet in Beta? Plenty. The most obvious thing for many users in China, where Internet Explorer reigns supreme, is that the site doesn’t work at all with IE < 9.0. This is a big limiter to Chinese participation, but it’s also a programming nightmare, so we’re not sure yet how to approach it. Another critical piece that’s missing right now is the tools for doing linguistic analysis on putonghua recordings, marking features such as L/N swapping, sh/s blending, and so on. Kellen’s actually programmed a lot of it, but it will still be a while before it’s ready for primetime. Beyond that, there’s loads of smaller stuff. Eventually I’m hoping to make public a list of developments that are in the pipeline.
——–
*Yes, a project that was pre-announced at least as early as Dec 12 when I said we were “about to unveil” this project. If you held your breath, please have the bereaved contact our legal department for appropriate compensation.
**Another blog?! Believe me, the Phonemica blog will be a service to those Sinoglot readers who want to keep out of the details. No doubt Sinoglot will have plenty of posts that reference Phonemica work, but we’ll try to keep nuts-and-bolts articles (e.g. “how to get a good recording”) limited to the Phonemica blog.
Many congratulations, Steve and Kellen! This is going to be awesome. I look forward to listening to all the recordings you’ve already put up, and I’ll certainly be contributing to Phonemica in the near future.
thanks! we’ll be checking your profile regularly to see that you’re doing plenty of work 😉
Was there any thought that “土” might not be the best character to use?
ha. you have no idea
What a fantastic idea. I am really looking forward to seeing it in action.
Although I think a lot of the people who would potentially put lots of work into such a project, like the guys who contribute to Cantodict and Taiwanese experts on Southern Min, would not be attracted by your choice of the character ‘土’. How about ‘另’?
ok translit and tezuk: 土 is out, 方 is in. hope no one else is offended…
A lot of discussion (over the past two years) went into the terminology used throughout the site. It’s tough to be truly neutral in this situation, since there’s always going to be someone unhappy with which word was used for which whatever. We’ve changed the pin to 方, but I have a feeling this won’t be the last we hear of it. Even the many Chinese consultants we’ve spoken to about this don’t all agree on which term has which meaning.
Something to keep in mind. In the meantime we’ll keep making refinements that would make John Stuart Mill proud.
Not offended, just felt that perhaps some people would be. I agree 方 is a more neutral term.
Again, congratulations. I am very envious of people who are good with languages and building websites! I look forward to contributing.
I assumed 土 was chosen deliberately to avoid 方, so I’m a little surprised. There are probably no great solutions in a sensitive topic like this if you want to keep things as simple as possible. I think a 官/普 mark vs. something like 地/外/別/另/非 is obviously another choice you have, but I haven’t personally thought too much about it. I won’t bug you about it anymore. However, I will say
“If it’s a version of Standard Mandarin, it gets the “普 putonghua” flag.” /
如果是用普通话讲述的,就标示“普”字。
seems almost dangerously vague. What if their 方 is a version of 普? This runs the complex gamut from deciding how to classify 北京話 to whole areas of 官方話 which might as well be Russian. Are the speakers from Beijing supposed to try to speak authentic 北京話 for their 方 recording and CCTV-speak for their 普 recording? Doesn’t 北京話 already qualify as 普?
Why are the labels even required for the flags? Why not just have a flag and then let the recorder write down their own label(s) for what they are speaking? That way you click on the flag and can see how they chose to label it.
I’m just a bit confused/worried about the case where you get a bunch of people from all over speaking very, very similar CCTV 普通話, and you’re caught listening for very, very tiny things, as opposed to asking people to speak their own local attempts at 普通話, which may or may not be the same as their 方 recording, in the case of something like 北京話.
Perhaps I’m overly concerned, or perhaps this genuinely needs to be clarified and fleshed out more.
Good question:
Since stands must be taken with something like this, this is the one we’ve chosen: Wu, Cantonese, Min are not Mandarin dialects. A Shanghainese speaker speaking Mandarin would fall under 普 while the same speaker speaking Wu would be 方/土. We have one recording up already of a Uyghur man speaking Mandarin; however, outside of the recorded conversation, he mentioned that this is not the way he speaks Mandarin with his friends. The speaker from Northern Jiangsu is a friend of mine, and while the recording is how she speaks to her former high school classmates, it’s not how she speaks to me.
It’s not perfect, but basically, if the speaker is making efforts to standardise their 北方話, we’re calling it 普通話. If they’re not, or if it’s a different Sinitic language (e.g. Cantonese) then we’re calling it “other”. 普通話 is what they’d speak to coworkers from other areas or what they’d usually speak to the 外國人. This is all based on analysis done by Chinese linguists, so it’s not as simple as I’m putting it here, but this should help clear it up a little.
This gets more complex because we actually have two recordings from the Cantonese speaker. One is in her hometown dialect and one is of “Standard Cantonese” (her words, not mine). The decision to include 方言 that are different languages from Mandarin isn’t one we came to easily, but I think it was the right one, given our initial goal for this project from when it started three or four years ago.
To add to what Kellen said, I just worked thru an example on the Phonemica blog
This looks great!
One thing that you might want to think about is trying to expand the content creation to non-native speakers of lower competence.
Perhaps a pre-made speech that tries to bring out as many potential regional pronunciation differences as possible (e.g. l/n, n/ng, z/zh etc. etc.). This would a) give people a starting point when going out and collecting recordings and b) allow people of lesser Mandarin proficiency to get useful recordings.
I know that if I went out and tried getting recordings of some of the more “authentic” Harbin natives I would only be able to upload – there’d be minimal transcription/translation, which doesn’t seem worth the effort (or at least impolite by assuming other contributors would deign to fill in the missing details).
Thanks!
Actually a few years back we started out with the idea of a pre-made speech. We decided to ditch this plan for a number of reasons. The short reason is that it would cause people to standardise their speech too much and we’d get very unnatural results. The number of things we’d end up able to record would be cut down drastically.
For your specific concern, it’s actually by design that someone else could do the transcribing. This entry, for example, was transcribed entirely by new registrants in the past two days. A lot of people may well enjoy transcribing the files but not have much interest in uploading. Others may rock at getting solid recordings but not have a high enough Mandarin level to transcribe, or they may simply not know IPA. Or they might be an IPA pro but not speak a word of Chinese.
What I’m trying to say is, get out there and record some Harbin hua! It’ll get transcribed one way or another.