Carnegie Mellon Study on Censorship and Deletion Practices in Chinese Social Media

An interesting study was released earlier this month by a group at Carnegie Mellon University titled “Censorship and deletion practices in Chinese social media.” According to the authors, “[w]hile much work has looked at efforts to prevent access to information in China (including IP blocking of foreign Web sites or search engine filtering), we present here the first large–scale analysis of political content censorship in social media, i.e., the active deletion of messages published by individuals.” Also, instead of focusing on “hard censorship” (DNS blocking of entire sites, blocking of access to specific search terms), the study targets “soft censorship” (where users are allowed to post and to search politically sensitive topics, but where individual messages may be deleted soon afterwards by government censors).

The authors collected over 56 million messages broadcast on Sina Weibo – China’s wildly popular microblogging platform, akin to a hybrid between Twitter and Facebook – over a three-month period from June 27, 2011 to September 30, 2011. By querying Weibo’s public timeline API at fixed intervals, the authors collected 56,951,585 messages (approximately 600,000 messages per day). The authors then followed up three months later to see whether the messages released on a specific day had been deleted. 1
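The collect-then-recheck methodology lends itself to a simple sketch: sample the public timeline at fixed intervals, log each message ID, then revisit the IDs later and count the ones that no longer resolve. The sketch below is illustrative only – `fetch_page` and `still_exists` are hypothetical stand-ins for real Weibo API calls, which are not reproduced here.

```python
import json
import time

def snapshot_timeline(fetch_page, log_path, rounds=3, interval_s=60.0):
    """Poll a public-timeline endpoint at fixed intervals, recording each
    message ID with the time it was first seen.  `fetch_page` is any
    callable returning a list of {"id": ..., "text": ...} dicts; a real
    crawler would wrap the (hypothetical here) Weibo API at this point."""
    with open(log_path, "a") as log:
        for i in range(rounds):
            for msg in fetch_page():
                log.write(json.dumps({"id": msg["id"], "seen": time.time()}) + "\n")
            if i + 1 < rounds:
                time.sleep(interval_s)

def check_deletions(message_ids, still_exists):
    """Revisit previously captured IDs; any message that can no longer be
    retrieved is counted as deleted (whatever the reason)."""
    return {mid: not still_exists(mid) for mid in message_ids}
```

As the authors caution, a vanished message may have been removed by a censor, a spam filter, or the user, so this check alone cannot attribute a cause.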

The authors then built a list of terms from these messages, forming an index of deletion rates for each term. They found that Weibo had an overall deletion rate of 16.25%, which they treated as the baseline deletion rate. 2 From the messages, they then looked for terms deleted at a rate above this baseline. 3 Within this list of terms, the authors were able to find several politically sensitive terms.
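The indexing step described above can be sketched in a few lines: tally, for each term, how often the messages containing it were later deleted, and keep the terms whose rate exceeds the corpus-wide baseline. This is a minimal sketch under assumed inputs (pre-tokenized messages with deletion flags); the paper's actual tokenization and confidence thresholds are not reproduced here.

```python
from collections import Counter

def deletion_rate_index(messages):
    """Build a per-term deletion-rate index from (terms, was_deleted)
    records, where `terms` is the set of terms in a message.
    Returns (baseline_rate, {term: deletion_rate})."""
    total = deleted = 0
    term_total, term_deleted = Counter(), Counter()
    for terms, was_deleted in messages:
        total += 1
        deleted += bool(was_deleted)
        for t in terms:
            term_total[t] += 1
            term_deleted[t] += bool(was_deleted)
    baseline = deleted / total
    rates = {t: term_deleted[t] / n for t, n in term_total.items()}
    return baseline, rates

def above_baseline(rates, baseline):
    """Keep only terms deleted more often than the corpus-wide baseline."""
    return {t: r for t, r in rates.items() if r > baseline}
```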

According to the authors:

Several interesting categories emerge. One is the clear presence of known politically sensitive terms, such as 方滨兴 (Fang Binxing, the architect of the GFW), 真理部 (“Ministry of Truth,” a reference to state propaganda), and 法轮功 (Falun Gong, a banned spiritual group). Another is a set of terms that appear to have become sensitive due to changing real–world events. One example of this is the term 请辞 (to ask someone to resign); deleted messages containing this term call for the resignation of Sheng Guangzu, the Minister of Railways, following the Wenzhou train crash in July (Chan and Duncan, 2011). Another example is the term 两会 (two meetings): this term primarily denotes the joint annual meeting of the National People’s Congress and the Chinese People’s Political Consultative Conference, but also emerged as a code word for “planned protest” during the suppression of pro–democracy rallies in February and March 2011 (Kent, 2011).

The most topically cohesive of these are a set of terms found in messages relating to the false rumor of iodized salt preventing radiation poisoning following the Fukushima nuclear disaster of March 2011 (Burkitt, 2011). Highly deleted terms in this category include 防核 (nuclear defense/protection), 碘盐 (iodized salt), and 放射性碘 (radioactive iodine). Unlike other terms whose political sensitivity is relatively static, these are otherwise innocuous terms that become sensitive due to a dynamic real–world event: instructions by the Chinese government not to believe or spread the salt rumors (Xin, et al., 2011). Given recent government instructions to social media to quash false rumors in general (Chao, 2011), we believe that these abnormally high deletions constitute the first direct evidence for suppression of this rumor as well. In addition to specific messages relating to salt, we also observe more general news and commentary about the nuclear crisis appearing more frequently in deleted messages, leading to abnormally high deletion rates for terms such as “nuclear power plant,” “nuclear radiation,” and “Fukushima.”

However, just because some “politically sensitive terms” 4 were found does not mean we have established a case of government censorship. All we have proven is a potential government motive to delete. There is a heightened need for caution here because, as the authors suggested in their report (and confirmed in private correspondence), only a few of the identified deletions are politically sensitive; most appear to have nothing to do with politics. Thus, the authors noted:

In the absence of external corroborating evidence (such as reports of the Chinese government actively suppressing salt rumors, as above), these results can only be suggestive, since we can never be certain that a deletion is due to the act of a censor rather than other reasons. In addition to terms commonly associated with spam, some terms appear frequently in cases of clear personal deletions; examples include the names of several holidays (e.g., 元宵节, the Lantern Festival), and expressions of condolences (节哀顺变). Given this range of deletion reasons, we turn to incorporating other lexical signals to focus on politically sensitive keywords.

The authors next compared Chinese tweets on Twitter with messages on Weibo obtained at around the same time. 5 By looking for the terms that are most differentially expressed (terms that appeared in Twitter messages but rarely in Weibo messages), the authors homed in on a list that contained politically sensitive terms.
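Per footnote 7, “differentially expressed” is measured with a log-likelihood ratio (LLR) score. The paper's exact formulation isn't given in this excerpt; a common choice is Dunning's G² statistic over a 2×2 contingency table, sketched here under the assumption of simple token counts:

```python
import math

def llr_score(a, b, n1, n2):
    """Dunning log-likelihood ratio for a term occurring `a` times in a
    corpus of `n1` tokens (e.g., Twitter) and `b` times in a corpus of
    `n2` tokens (e.g., Weibo).  Larger scores mean the term's relative
    frequency differs more sharply between the two corpora."""
    def ll(k, n, p):
        # Binomial log-likelihood of k occurrences in n trials at rate p.
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return k * math.log(p) + (n - k) * math.log(1.0 - p)
    p = (a + b) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2.0 * (ll(a, n1, a / n1) + ll(b, n2, b / n2)
                  - ll(a, n1, p) - ll(b, n2, p))
```

Ranking terms by this score and taking the top of the list is what surfaces the Twitter-heavy (and hence, plausibly, Weibo-suppressed) vocabulary.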

By focusing on the deletion patterns of known politically-sensitive terms, the authors showed tantalizing trails of censorship. For example, in late June/early July 2011, when wild rumors began circulating in Chinese cyberspace that Jiang Zemin, general secretary of the Communist Party of China from 1989 to 2002, had died, government censorship appeared to go into high gear. The figure below shows the number of deleted messages and total messages containing the phrase Jiang Zemin on Sina Weibo during a time when the Wall Street Journal, Guardian and other Western media sources reported that Jiang’s name (江泽民) had been blocked in searches on Sina Weibo.

Focusing on the most differentially expressed terms between Twitter and Weibo, the authors identified twenty politically sensitive terms, fourteen of which appear to be blocked (at one time or another) on Sina search. 6

Table 1: Search block status on 24 October 2011 of the 20 terms with the highest Twitter/Sina log likelihood ratio scores. Search blocked terms are noted with a †.
term gloss
何德普 He Depu
刘晓波 Liu Xiaobo
  北京市监狱 Beijing Municipal Prison
零八宪章 Charter 08
  廖廷娟 Liao Tingjuan
  廖筱君 Liao Hsiao–chun
共匪 communist bandit
李洪志 Li Hongzhi, founder of the Falun Gong spiritual movement
柴玲 Chai Ling
方滨兴 Fang Binxing
法轮功 Falun Gong
大纪元 Epoch Times
刘贤斌 Liu Xianbin
艾未未 Ai Weiwei, Chinese artist and activist
  王炳章 Wang Bingzhang
  非公式 unofficial/informal (Japanese)
魏京生 Wei Jingsheng, Beijing–based Chinese dissident
  唐柏桥 Tang Baiqiao
鲍彤 Bao Tong
退党 to withdraw from a political party

Homing in on terms known to be politically sensitive, the authors identified 17 terms associated with statistically significantly higher rates of deletion.
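Whether, say, 5 deletions out of 5 messages is significant against a 16.25% baseline can be checked with a one-sided binomial tail probability. The paper's exact test isn't reproduced in this excerpt, so the following is just one plausible stdlib sketch:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    deletions out of n messages if each is deleted with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 5 deletions out of 5 against the 16.25% baseline:
# binom_tail(5, 5, 0.1625) == 0.1625**5, about 1.1e-4, well under 0.001.
```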

Table 2: Sensitive terms with statistically significant higher rates of message deletion (p < 0.001). Source designates whether the sensitive term originates in our Twitter LLR list (T) 7, Crandall, et al. (2007) (C) 8, or Wikipedia (Wikipedia, 2011) (W) 9.
δw deletions total term gloss source(s)
1.000 5 5 方滨兴 Fang Binxing T
1.000 5 5 真理部 Ministry of Truth T
0.875 7 8 法轮功 Falun Gong T
0.833 5 6 共匪 communist bandit T, W
0.717 38 53 盛雪 Sheng Xue C
0.500 13 26 法轮 Falun T, C, W
0.500 16 32 新语丝 New Threads C
0.379 145 383 反社会 antisociety C
0.374 199 532 江泽民 Jiang Zemin T, C, W
0.373 22 59 艾未未 Ai Weiwei T
0.273 41 150 不为人知的故事 “The Unknown Story” W
0.257 119 463 流亡 to be exiled W
0.255 82 321 驾崩 death of a king or emperor T
0.239 120 503 浏览 to browse C
0.227 112 493 花花公子 Playboy C, W
0.226 167 740 封锁 to blockade W
0.223 142 637 大法 (sc. Falun) Dafa W

However, while some known politically sensitive terms such as the above have been deleted, many terms known to be politically sensitive are not observed to be deleted at a statistically significant rate in the study, some of which are shown in Table 3.

Table 3: Deletion rates of terms from Crandall, et al. (2007), previously reported to be blocked by the GFC, that appear frequently (over 100 times) in our sample. Terms that are currently blocked on Sina’s search interface are noted with a †.
δw deletions total term gloss
0.20 88 443 中宣部 Central Propaganda Section
0.20 24 120 藏独 Tibetan independence (movement)
  0.19 30 154 民联 Democratic Alliance
0.18 132 733 迫害 to persecute
  0.18 124 686 酷刑 cruelty/torture
  0.18 80 457 钓鱼岛 Senkaku Islands
0.18 28 153 太子党 Crown Prince Party
0.17 102 592 法会 Falun Gong religious assembly
0.17 88 526 纪元 last two characters of Epoch Times
  0.17 56 333 民进党 DPP (Democratic Progressive Party, Taiwan)
  0.16 142 863 洗脑 brainwash
0.16 42 256 我的奋斗 Mein Kampf
0.15 83 567 学联 Student Federation
  0.15 32 208 高瞻 Gao Zhan
  0.14 51 360 无界 first two characters of circumventing browser
  0.14 36 250 正念 correct mindfulness
0.14 28 198 天葬 sky burial
  0.14 17 122 文字狱 censorship jail
  0.13 90 677 经文 scripture
0.12 91 732 八九 89 (the year of the Tiananmen Square Protest)
0.12 67 564 看中国 watching China, an Internet news Web site
0.11 35 310 明慧 Ming Hui (Web site of Falun Gong)
0.10 56 582 民运 democracy movement

The authors tested the hypothesis that some terms escaped deletion because the messages lacked a requisite impact factor or sufficiently strong anti-government content, but based on the available data, neither hypothesis appears to hold.

One area where the authors could draw a sharp distinction in deletion rates is the geography of origin. 10

The authors noted that messages from outlying, restive areas of China are subject to a much higher rate of deletion than messages from inner, more stable areas. However, the authors again cautioned against jumping to the conclusion that this is due to political censorship. Table 4 lists the deletion rates for messages arising from the various regions of China. As can be seen, the deletions are broad-spectrum: within each region, messages containing politically sensitive terms are deleted at roughly the same rate as messages in general.

Table 4: Overall deletion rate by province. δuniform is the deletion rate of a random sample of all messages; δsensitive is the deletion rate of messages containing one of 295 known sensitive keywords. The sensitive-message deletion rate has wider confidence bounds than the uniform deletion rate, but the two are correlated (Kendall’s τ = 0.77, Pearson r = 0.94).
  δuniform totaluniform δsensitive totalsensitive
Tibet 0.530 ±0.01998 2406 0.500 ±0.106 86
Qinghai 0.521 ±0.01944 2542 0.477 ±0.104 88
Ningxia 0.422 ±0.01826 2880 0.578 ±0.097 102
Macau 0.321 ±0.01817 2910 0.400 ±0.101 95
Gansu 0.285 ±0.01365 5156 0.301 ±0.074 176
Xinjiang 0.270 ±0.01203 6638 0.304 ±0.070 194
Hainan 0.265 ±0.00932 11068 0.316 ±0.0710 193
Inner Mongolia 0.263 ±0.01232 6332 0.278 ±0.068 209
Taiwan 0.239 ±0.01188 6803 0.260 ±0.061 254
Guizhou 0.226 ±0.00978 10050 0.186 ±0.047 431
Shanxi 0.222 ±0.01054 8646 0.260 ±0.057 296
Jilin 0.215 ±0.01017 9288 0.237 ±0.060 266
Jiangxi 0.207 ±0.00854 13161 0.233 ±0.053 343
Other China 0.202 ±0.00458 45805 0.216 ±0.027 1363
Heilongjiang 0.183 ±0.00850 13298 0.226 ±0.055 314
Guangxi 0.183 ±0.00632 24075 0.174 ±0.046 460
Yunnan 0.182 ±0.00859 13005 0.241 ±0.052 352
Hong Kong 0.178 ±0.00854 13170 0.241 ±0.041 585
Hebei 0.173 ±0.00768 16287 0.224 ±0.044 501
Guangdong 0.173 ±0.00154 407279 0.168 ±0.012 7097
Anhui 0.172 ±0.00794 15224 0.207 ±0.047 439
Fujian 0.171 ±0.00454 46542 0.166 ±0.031 1032
Chongqing 0.168 ±0.00643 23238 0.178 ±0.043 529
Hunan 0.164 ±0.00646 23031 0.210 ±0.040 596
Hubei 0.159 ±0.00546 32176 0.192 ±0.035 767
Outside China 0.155 ±0.00429 52069 0.215 ±0.023 1873
Tianjin 0.152 ±0.00767 16311 0.163 ±0.048 418
Henan 0.151 ±0.00636 23723 0.144 ±0.037 716
Shandong 0.145 ±0.00587 27838 0.141 ±0.034 838
Liaoning 0.141 ±0.00616 25339 0.148 ±0.038 681
Jiangsu 0.139 ±0.00413 56368 0.143 ±0.024 1619
Shaanxi 0.138 ±0.00722 18443 0.178 ±0.045 483
Sichuan 0.132 ±0.00477 42178 0.164 ±0.032 967
Zhejiang 0.129 ±0.00361 73752 0.147 ±0.023 1849
Beijing 0.120 ±0.00294 111456 0.122 ±0.015 4133
Shanghai 0.114 ±0.00310 99910 0.127 ±0.0185 3001
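The caption's correlation figures (Kendall's τ = 0.77, Pearson r = 0.94) measure how closely the per-province sensitive-term deletion rates track the overall rates; both statistics are easy to reproduce from the two rate columns. A stdlib sketch:

```python
def pearson_r(xs, ys):
    """Linear correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (d > 0) - (d < 0)
    return score / (n * (n - 1) / 2)
```

Run against the δuniform and δsensitive columns of Table 4, high τ and r mean that provinces with high overall deletion also have high sensitive-term deletion – which is exactly why the regional differences alone cannot be read as political censorship.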

Even though the authors could not conclusively attribute these deletions to censorship, they also did not believe the differential deletion rates across regions to be the result of spam. 11

I think this study is really interesting because it provides one of the earliest large-scale, data-driven studies of Chinese censorship, rather than relying on “what has been suggested anecdotally,” as the authors put it. While the study presents many tantalizing trails of censorship, caution must still be exercised so as not to overstate the conclusions.

Consider for example what the Director of Media Relations for the School of Computer Science of Carnegie Mellon wrote recently regarding the study:

Researchers in Carnegie Mellon University’s School of Computer Science analyzed millions of Chinese microblogs, or “weibos,” to uncover a set of politically sensitive terms that draw the attention of Chinese censors. Individual messages containing the terms were often deleted at rates that could vary based on current events or geography.

This is an example of overstatement.

The researchers did not really uncover a set of politically sensitive terms. They uncovered a list of terms that are deleted above a baseline rate, of which only a small subset can be deemed politically sensitive; the vast majority of the terms are, inexplicably, not political at all. In fact, using data provided in private correspondence, after ordering the thousands of terms deleted above the baseline rate by statistical confidence, I saw that all but three of the most censored terms in Table 2 fell in the bottom half of the list. 12 That is, for each tantalizing politically sensitive term uncovered, there exist literally thousands of non-political terms such as 男生 (male), 女生 (female), 其实 (actually/in fact/really), 出来 (to come out/to emerge), 名字 (name – of a person or thing), 觉得 (to think/to feel), 在线 (online), 不是 (no/is not/not; fault/blame) that show statistically higher rates of deletion. 13

The Director continued:

The CMU study also showed high rates of weibo censorship in certain provinces. The phenomenon was particularly notable in Tibet, a hotbed of political unrest, where up to 53 percent of locally generated microblogs were deleted.

Again, this is an overstatement, as discussed above regarding Table 4. Differential rates of deletion across regions were observed across all types of messages, with roughly equal rates of deletion for politically sensitive and non-politically sensitive terms.

In truth, the study probably raises more questions than it answers. But if we must draw concrete conclusions at this point in time, then we must tentatively conclude that censorship appears to play only a minor role in the large-scale deletion patterns on Chinese microblogging platforms. Deletions of messages are common across cyberspace – including on Twitter and Facebook – and politically sensitive terms do not appear to be deleted at statistically significantly higher rates than non-politically sensitive terms in China.

But perhaps another cut at the observation is to concede that, in order to tease out any conclusion about government censorship, it is important first to understand how users on different microblogging platforms, in different regions, use those platforms.

In a recent study on Twitter usage patterns, a group of researchers noted:

To study the dynamics of trends in social media, we have conducted a comprehensive study on trending topics on Twitter. … we found that the content that trended was largely news from traditional media sources, which are then amplified by repeated retweets on Twitter to generate trends.

Contrast this with a recent study of Chinese microblogging patterns, for example, where a group of researchers noted:

We found that there is a vast difference in the content shared in China, when compared to a global social network such as Twitter. In China, the trends are created almost entirely due to retweets of media content such as jokes, images and videos, whereas on Twitter, the trends tend to have more to do with current global events and news stories.

Can the differences in usage patterns explain the differences in frequency of terms (e.g. Twitter vs. Weibo) and/or rates of deletion (e.g. across geography) observed above?

Consider also a recent study of 2,000 tweets over a two-week period in August 2009, where a group noted that 40% of all tweets are “pointless babble” and another 40% can best be described as “conversational.”  What are the consequences of such usage patterns?  What are the corresponding usage patterns on microblogging platforms – from province to province, from nation to nation?

For those who believe the government must play a larger role than the data observed here suggest on their face, perhaps they might want to consider how the government can affect general usage patterns through social and legal rather than just technical means. On the flip side, they must also reconcile the observation that, despite presumed pervasive government censorship, Chinese cyberspace ranks amongst the most dynamic in the world. 14

It’s one thing to let our intuition guide us toward what to dig for, but it’s quite another to let our intuition color the way we perceive facts.

In my correspondence with the authors, one of them observed:

There are definitely different usage biases between Twitter and Sina…. In our specific Twitter sample, many of the most prolific users tend to be news media, which again is biased in the type of things they tweet.  We use this to our advantage to identify particularly salient topics, but we should keep in mind that is a bias.

I’m sure that the overall censorship picture is far more nuanced than the dimensions of it that we studied — in addition to the potential differences in general social media usage that you note, one other interesting phenomenon to consider is self-censorship (users moderating what they say) and metaphor (circumlocutions to discuss something politically sensitive in an oblique way).  I don’t know of any studies of deletion practices in general on social media, but I’m sure that also interacts in complex ways with self-censorship too.  A lot of the most interesting questions are still waiting to be answered, and hopefully we’ve been able to contribute a little to this line of research.

That sounds about right.


  1. As the authors noted in the study, however, content-based removals by themselves are unremarkable.

    Facebook, for example, removes content that is “hateful, threatening, or pornographic; incites violence; or contains nudity or graphic or gratuitous violence” (Facebook, 2011). Aside from their own internal policies, social media organizations are also governed by the laws of the country in which they operate. In the United States, these include censoring the display of child pornography, libel, and media that infringe on copyright or other intellectual property rights; in China this extends to forms of political expression as well.

    Interestingly, earlier this year, Twitter announced that it will begin to selectively block tweets on a country by country basis. It noted that “[a]s we continue to grow internationally, we will enter countries that have different ideas about the contours of freedom of expression. Some…, for historical or cultural reasons, restrict certain types of content, such as France or Germany, which ban pro-Nazi content.” It also noted however that others “differ so much from our ideas that we will not be able to exist there.”

  2. The existence of deletions by themselves is not remarkable. As the authors noted:

    Messages can of course be deleted for a range of reasons, and by different actors: social media sites, Twitter included, routinely delete messages when policing spam; and users themselves delete their own messages and accounts for their own personal reasons.
  3. Depending on the confidence specified, the authors obtained from 1,715 to 3,046 terms that appear to be deleted above the baseline rate.
  4. In this study, terms were deemed to be politically sensitive if they correspond to terms generally known to be politically sensitive or if they are shown to be blocked in either Sina search or Weibo search.
  5. From Twitter, the authors obtained 11,079,704 tweets from the top 10,000 Chinese Twitter users.
  6. While these terms were blocked on Sina search, on reviewing the data the authors shared privately with me, I noted that only four (艾未未 (Ai Weiwei, Chinese artist and activist), 方滨兴 (Fang Binxing), 法轮功 (Falun Gong), 共匪 (communist bandit)) were actually detected to be deleted.
  7. Twitter LLR List: List of terms that are most differentially expressed comparing Twitter to Weibo messages
  8. Jedidiah R. Crandall, Daniel Zinn, Michael Byrd, Earl Barr, and Rich East. “ConceptDoppler: A weather tracker for Internet censorship,” CCS ’07: Proceedings of the 14th ACM Conference on Computer and Communications Security, pp. 352–365; accessed 3 March 2012.
  10. According to the authors:

    As with Twitter, messages on Sina Weibo are attended with a range of metadata features, including free–text categories for user name and location and fixed-vocabulary categories for gender, country, province, and city. While users are free to enter any information they like here, true or false, this information can in the aggregate enable us to observe large–scale geographic trends both in the overall message pattern (Eisenstein, et al., 2010, 2011; O’Connor, et al., 2010; Wing and Baldridge, 2011) and in rates of deletion.

  11. As they noted in a private email to me, “it would seem unwise to mention “Falun Gong” in a weibo hawking shoes!”
  12. The only terms that appeared in the top half of the deletion list were 反社会 (antisociety), 江泽民 (Jiang Zemin) – discussed extensively above – and 盛雪 (Sheng Xue, a human rights activist currently residing in Canada).
  13. I had suggested to the authors that perhaps some of these terms could be explained away if they can be shown to be correlated with politically sensitive terms (i.e., they happen to appear in messages that also contain politically sensitive terms). The authors acknowledged that this would be a good direction for future research, admitting generally that “there are enough odd things in [the dataset] that I think it might require more explanation and consideration to be interpreted usefully.”
  14. As one researcher has noted: "The main question for me is how to understand this paradox that China has very tight Internet control, but at the same time very dynamic, lively and sometimes contentious Internet culture and politics. Here in the U.S., the Web is kind of a supplementary tool for social activism. But in China, cyberspace is really where all the action is, so to speak. Some of the most important and influential protest activities in recent years have happened mainly on the Internet."

