Carnegie Mellon Study on Censorship and Deletion Practices in Chinese Social Media

An interesting study was released earlier this month by a group at Carnegie Mellon University titled “censorship and deletion practices in Chinese social media.” According to the authors, “[w]hile much work has looked at efforts to prevent access to information in China (including IP blocking of foreign Web sites or search engine filtering), we present here the first large–scale analysis of political content censorship in social media, i.e., the active deletion of messages published by individuals.” Also, instead of focusing on “hard censorship” (DNS blocking of entire sites, blocking of access to specific search terms), the study targets “soft censorship” (where users are allowed to post and to search politically sensitive topics, but where individual messages may be deleted soon afterwards by government censors).

The authors collected over 56 million messages broadcast on Sina Weibo – China’s wildly popular micro blogging platform that is akin to a hybrid between Twitter and Facebook – over a three-month period from June 27, 2011 to September 30, 2011. By querying Weibo’s public timeline API at fixed intervals, the authors collected 56,951,585 messages (approximately 600,000 messages per day). The authors then followed up three months later to see if the messages released on a specific day were deleted. ¹

The authors then built a list of terms from these messages, forming an index of deletion rates for each term. The authors found that Weibo had a general deletion rate of 16.25% – which they considered to be the baseline deletion rate. ² From the messages, the authors looked for terms that are deleted at a rate above the baseline rate. ³ Within this list of terms, the authors were able to find several politically-sensitive terms.
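As a rough illustration of how a per-term deletion rate can be checked against the 16.25% baseline, the sketch below computes δ_w = deletions / total and a one-sided exact binomial p-value. The choice of test and the function names are my own assumptions for illustration; the study's actual procedure may differ. The counts are copied from Tables 2 and 3 of the study, reproduced later in this post.

```python
import math

BASELINE = 0.1625  # overall Weibo deletion rate reported in the study


def deletion_rate(deletions, total):
    """Fraction of messages containing a term that were later deleted."""
    return deletions / total


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    log_p, log_q = math.log(p), math.log1p(-p)
    tail = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        tail += math.exp(log_pmf)
    return min(tail, 1.0)


# (term gloss, deletions, total) copied from Tables 2 and 3 of the study
examples = [
    ("Fang Binxing", 5, 5),
    ("Jiang Zemin", 199, 532),
    ("Central Propaganda Section", 88, 443),
    ("democracy movement", 56, 582),
]

for gloss, deleted, total in examples:
    rate = deletion_rate(deleted, total)
    p_value = binom_upper_tail(deleted, total, BASELINE)
    flag = "above baseline" if p_value < 0.001 else "not significant"
    print(f"{gloss}: rate={rate:.3f} p={p_value:.2e} ({flag})")
```

Under this assumed test and a p < 0.001 threshold, the first two terms come out above the baseline and the last two do not, which agrees with the tables in which they appear.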

According to the authors:

Several interesting categories emerge. One is the clear presence of known politically sensitive terms, such as 方滨兴 (Fang Binxing, the architect of the GFW), 真理部 (“Ministry of Truth,” a reference to state propaganda), and 法轮功 (Falun Gong, a banned spiritual group). Another is a set of terms that appear to have become sensitive due to changing real–world events. One example of this is the term 请辞 (to ask someone to resign); deleted messages containing this term call for the resignation of Sheng Guangzu, the Minister of Railways, following the Wenzhou train crash in July (Chan and Duncan, 2011). Another example is the term 两会 (two meetings): this term primarily denotes the joint annual meeting of the National People’s Congress and the Chinese People’s Political Consultative Conference, but also emerged as a code word for “planned protest” during the suppression of pro–democracy rallies in February and March 2011 (Kent, 2011).

The most topically cohesive of these are a set of terms found in messages relating to the false rumor of iodized salt preventing radiation poisoning following the Fukushima nuclear disaster of March 2011 (Burkitt, 2011). Highly deleted terms in this category include 防核 (nuclear defense/protection), 碘盐 (iodized salt), and 放射性碘 (radioactive iodine). Unlike other terms whose political sensitivity is relatively static, these are otherwise innocuous terms that become sensitive due to a dynamic real–world event: instructions by the Chinese government not to believe or spread the salt rumors (Xin, et al., 2011). Given recent government instructions to social media to quash false rumors in general (Chao, 2011), we believe that these abnormally high deletions constitute the first direct evidence for suppression of this rumor as well. In addition to specific messages relating to salt, we also observe more general news and commentary about the nuclear crisis appearing more frequently in deleted messages, leading to abnormally high deletion rates for terms such as “nuclear power plant,” “nuclear radiation,” and “Fukushima.”

However, just because some “politically sensitive terms” ⁴ were found does not mean we have established a case of government censorship. All we have proven is a potential government motive to delete. There is a special heightened need for caution here because as the authors suggested in their report (confirmed by the authors in private correspondence), only a few of the deletions identified are politically sensitive, with most actually appearing to have nothing to do with politics. Thus, the authors noted:

In the absence of external corroborating evidence (such as reports of the Chinese government actively suppressing salt rumors, as above), these results can only be suggestive, since we can never be certain that a deletion is due to the act of a censor rather than other reasons. In addition to terms commonly associated with spam, some terms appear frequently in cases of clear personal deletions; examples include the names of several holidays (e.g., 元宵节, the Lantern Festival), and expressions of condolences (节哀顺变). Given this range of deletion reasons, we turn to incorporating other lexical signals to focus on politically sensitive keywords.

The authors next compared Chinese-language tweets on Twitter with Weibo messages collected at around the same time. ⁵ By looking for the terms that are most differentially expressed (terms that appear far more often in Twitter messages than in Weibo messages), the authors homed in on a list that contained politically sensitive terms.
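To make "differentially expressed" concrete, here is a minimal sketch of a signed log-likelihood ratio (a Dunning-style G² statistic over a 2×2 table of term counts). I am assuming this is close in spirit to the "log likelihood ratio scores" the study reports; the exact formulation used by the authors may differ, and the term names, counts, and corpus sizes below are hypothetical.

```python
import math


def _term(observed, expected):
    """One cell's contribution to G^2; an empty cell contributes nothing."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def signed_llr(count_a, size_a, count_b, size_b):
    """G^2 for a term seen count_a times in corpus A (size_a tokens) and
    count_b times in corpus B (size_b tokens). The sign is positive when
    the term is relatively more frequent in A."""
    total = size_a + size_b
    hits = count_a + count_b
    misses = total - hits
    cells = [
        (count_a, size_a * hits / total),
        (size_a - count_a, size_a * misses / total),
        (count_b, size_b * hits / total),
        (size_b - count_b, size_b * misses / total),
    ]
    g2 = 2.0 * sum(_term(obs, exp) for obs, exp in cells)
    sign = 1.0 if count_a / size_a >= count_b / size_b else -1.0
    return sign * max(g2, 0.0)


# Hypothetical corpus sizes and term counts, for illustration only
twitter_tokens, weibo_tokens = 1_000_000, 5_000_000
counts = {"term_x": (120, 15), "term_y": (40, 200), "term_z": (3, 900)}

scores = {
    term: signed_llr(tw, twitter_tokens, wb, weibo_tokens)
    for term, (tw, wb) in counts.items()
}
# Terms most over-represented on Twitter relative to Weibo come first
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A term used at the same relative rate in both corpora (term_y here) scores about zero, so only terms with a strongly skewed distribution rise to the top of the list.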

By focusing on the deletion patterns of known politically-sensitive terms, the authors showed tantalizing trails of censorship. For example, in late June/early July 2011, when wild rumors began circulating in Chinese cyberspace that Jiang Zemin, general secretary of the Communist Party of China from 1989 to 2002, had died, government censorship appeared to go into high gear. The figure below shows the number of deleted messages and total messages containing the phrase Jiang Zemin on Sina Weibo during a time when the Wall Street Journal, Guardian and other Western media sources reported that Jiang’s name (江泽民) had been blocked in searches on Sina Weibo.

Focusing on the most differentially expressed terms between Twitter and Weibo, the authors identified twenty politically sensitive terms, fourteen of which appear to be blocked (at one time or another) on Sina search. ⁶

Table 1: Search block status on 24 October 2011 of the 20 terms with the highest Twitter/Sina log likelihood ratio scores. Search blocked terms are noted with a †.
†	term	gloss
†	何德普	He Depu
†	刘晓波	Liu Xiaobo
	北京市监狱	Beijing Municipal Prison
†	零八宪章	Charter 08
	廖廷娟	Liao Tingjuan
	廖筱君	Liao Hsiao–chun
†	共匪	communist bandit
†	李洪志	Li Hongzhi, founder of the Falun Gong spiritual movement
†	柴玲	Chai Ling
†	方滨兴	Fang Binxing
†	法轮功	Falun Gong
†	大纪元	Epoch Times
†	刘贤斌	Liu Xianbin
†	艾未未	Ai Weiwei, Chinese artist and activist
	王炳章	Wang Bingzhang
	非公式	unofficial/informal (Japanese)
†	魏京生	Wei Jingsheng, Beijing–based Chinese dissident
	唐柏桥	Tang Baiqiao
†	鲍彤	Bao Tong
†	退党	to withdraw from a political party

Homing in on terms known to be politically sensitive, the authors identified 17 terms that were associated with statistically significantly higher rates of deletion.

Table 2: Sensitive terms with statistically significant higher rates of message deletion (p < 0.001). Source designates whether the sensitive term originates in our Twitter LLR list (T) ⁷, Crandall, et al. (2007) (C) ⁸, or Wikipedia (Wikipedia, 2011) (W) ⁹.
*δ_w*	deletions	total	term	gloss	source(s)
1.000	5	5	方滨兴	Fang Binxing	T
1.000	5	5	真理部	Ministry of Truth	T
0.875	7	8	法轮功	Falun Gong	T
0.833	5	6	共匪	communist bandit	T, W
0.717	38	53	盛雪	Sheng Xue	C
0.500	13	26	法轮	Falun	T, C, W
0.500	16	32	新语丝	New Threads	C
0.379	145	383	反社会	antisociety	C
0.374	199	532	江泽民	Jiang Zemin	T, C, W
0.373	22	59	艾未未	Ai Weiwei	T
0.273	41	150	不为人知的故事	“The Unknown Story”	W
0.257	119	463	流亡	to be exiled	W
0.255	82	321	驾崩	death of a king or emperor	T
0.239	120	503	浏览	to browse	C
0.227	112	493	花花公子	Playboy	C, W
0.226	167	740	封锁	to blockade	W
0.223	142	637	大法	(sc. Falun) Dafa	W

However, while some known politically sensitive terms such as the above have been deleted, many terms known to be politically sensitive are not observed to be deleted at a statistically significant rate in the study, some of which are shown in Table 3.

Table 3: Deletion rates of terms from Crandall, et al. (2007), previously reported to be blocked by the GFC, that appear frequently (over 100 times) in our sample. Terms that are currently blocked on Sina’s search interface are noted with a †.
†	*δ_w*	deletions	total	term	gloss
†	0.20	88	443	中宣部	Central Propaganda Section
†	0.20	24	120	藏独	Tibetan independence (movement)
	0.19	30	154	民联	Democratic Alliance
†	0.18	132	733	迫害	to persecute
	0.18	124	686	酷刑	cruelty/torture
	0.18	80	457	钓鱼岛	Senkaku Islands
†	0.18	28	153	太子党	Crown Prince Party
†	0.17	102	592	法会	Falun Gong religious assembly
†	0.17	88	526	纪元	last two characters of Epoch Times
	0.17	56	333	民进党	DPP (Democratic Progressive Party, Taiwan)
	0.16	142	863	洗脑	brainwash
†	0.16	42	256	我的奋斗	Mein Kampf
†	0.15	83	567	学联	Student Federation
	0.15	32	208	高瞻	Gao Zhan
	0.14	51	360	无界	first two characters of circumventing browser
	0.14	36	250	正念	correct mindfulness
†	0.14	28	198	天葬	sky burial
	0.14	17	122	文字狱	censorship jail
	0.13	90	677	经文	scripture
†	0.12	91	732	八九	89 (the year of the Tiananmen Square Protest)
†	0.12	67	564	看中国	watching China, an Internet news Web site
†	0.11	35	310	明慧	Ming Hui (Web site of Falun Gong)
†	0.10	56	582	民运	democracy movement

The authors tested for the hypothesis that perhaps some terms escaped deletions because the messages did not possess a requisite impact factor or did not contain sufficiently strong anti-government content, but based on the data available, neither hypothesis appears to be true.

One area where the authors could draw a sharp distinction in deletion rates is the geography of origin. ¹⁰

The authors noted that messages from outlying restive areas of China are subjected to a much higher rate of deletion than inner, more stable areas. However, the authors again cautioned against jumping to the conclusion that this is due to political censorship. Table 4 lists the deletion rates for messages arising from the various regions of China. As can be seen, it undisputedly shows that the deletions are broad spectrum and not correlated with any politically sensitive terms.

Table 4: Overall deletion rate by province. δ_uniform is the deletion rate of random sample of all messages; δ_sensitive is the deletion rate of messages containing one of 295 known sensitive keywords. The sensitive message deletion rate has wider confidence bounds than the uniform deletion rate, but the two are correlated (Kendall’s τ = 0.77, Pearson r = 0.94).
	*δ_uniform*	total_uniform	*δ_sensitive*	total_sensitive
Tibet	0.530 ±0.01998	2406	0.500 ±0.106	86
Qinghai	0.521 ±0.01944	2542	0.477 ±0.104	88
Ningxia	0.422 ±0.01826	2880	0.578 ±0.097	102
Macau	0.321 ±0.01817	2910	0.400 ±0.101	95
Gansu	0.285 ±0.01365	5156	0.301 ±0.074	176
Xinjiang	0.270 ±0.01203	6638	0.304 ±0.070	194
Hainan	0.265 ±0.00932	11068	0.316 ±0.0710	193
Inner Mongolia	0.263 ±0.01232	6332	0.278 ±0.068	209
Taiwan	0.239 ±0.01188	6803	0.260 ±0.061	254
Guizhou	0.226 ±0.00978	10050	0.186 ±0.047	431
Shanxi	0.222 ±0.01054	8646	0.260 ±0.057	296
Jilin	0.215 ±0.01017	9288	0.237 ±0.060	266
Jiangxi	0.207 ±0.00854	13161	0.233 ±0.053	343
Other China	0.202 ±0.00458	45805	0.216 ±0.027	1363
Heilongjiang	0.183 ±0.00850	13298	0.226 ±0.055	314
Guangxi	0.183 ±0.00632	24075	0.174 ±0.046	460
Yunnan	0.182 ±0.00859	13005	0.241 ±0.052	352
Hong Kong	0.178 ±0.00854	13170	0.241 ±0.041	585
Hebei	0.173 ±0.00768	16287	0.224 ±0.044	501
Guangdong	0.173 ±0.00154	407279	0.168 ±0.012	7097
Anhui	0.172 ±0.00794	15224	0.207 ±0.047	439
Fujian	0.171 ±0.00454	46542	0.166 ±0.031	1032
Chongqing	0.168 ±0.00643	23238	0.178 ±0.043	529
Hunan	0.164 ±0.00646	23031	0.210 ±0.040	596
Hubei	0.159 ±0.00546	32176	0.192 ±0.035	767
Outside China	0.155 ±0.00429	52069	0.215 ±0.023	1873
Tianjin	0.152 ±0.00767	16311	0.163 ±0.048	418
Henan	0.151 ±0.00636	23723	0.144 ±0.037	716
Shandong	0.145 ±0.00587	27838	0.141 ±0.034	838
Liaoning	0.141 ±0.00616	25339	0.148 ±0.038	681
Jiangsu	0.139 ±0.00413	56368	0.143 ±0.024	1619
Shaanxi	0.138 ±0.00722	18443	0.178 ±0.045	483
Sichuan	0.132 ±0.00477	42178	0.164 ±0.032	967
Zhejiang	0.129 ±0.00361	73752	0.147 ±0.023	1849
Beijing	0.120 ±0.00294	111456	0.122 ±0.015	4133
Shanghai	0.114 ±0.00310	99910	0.127 ±0.0185	3001
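The ± values in Table 4 appear to be consistent with a conservative 95% margin of error for a proportion, 1.96 · sqrt(0.25 / n), which uses the worst case p = 0.5 rather than the observed rate. That reading is my own inference from the numbers, not something stated in the table caption. A quick check against a few rows:

```python
import math


def conservative_margin(n, z=1.96):
    """Worst-case (p = 0.5) margin of error for a proportion estimated from n messages."""
    return z * math.sqrt(0.25 / n)


# (province, total_uniform, total_sensitive) copied from Table 4
rows = [("Tibet", 2406, 86), ("Qinghai", 2542, 88), ("Beijing", 111456, 4133)]

for province, n_uniform, n_sensitive in rows:
    print(province,
          round(conservative_margin(n_uniform), 5),
          round(conservative_margin(n_sensitive), 3))
```

The practical point is that the bounds depend only on sample size: provinces with few sensitive messages, such as Tibet with 86, carry margins of roughly ten percentage points on δ_sensitive.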

Even though the authors could not conclusively attribute these deletions to censorship, they also did not believe the differential deletion rates across regions to be the result of spam.¹¹

I think this study is really interesting because it is one of the earliest studies of Chinese censorship based on large-scale data analysis instead of “what has been suggested anecdotally,” as the authors would put it. While the study presents many tantalizing trails of censorship, caution must still be exercised so as not to overstate the conclusions.

Consider for example what the Director of Media Relations for the School of Computer Science of Carnegie Mellon wrote recently regarding the study:

Researchers in Carnegie Mellon University’s School of Computer Science analyzed millions of Chinese microblogs, or “weibos,” to uncover a set of politically sensitive terms that draw the attention of Chinese censors. Individual messages containing the terms were often deleted at rates that could vary based on current events or geography.

This is an example of overstatement.

The researchers did not really uncover a set of politically sensitive terms. They uncovered a list of terms that are deleted above a baseline rate, but only a small subset of which can be deemed politically sensitive terms, with the vast majority of the terms inexplicably not political at all. In fact, using data provided in private correspondence, after ordering the thousands of terms that are deleted above the baseline rate by statistical confidence, I saw that all but three of the most censored terms in Table 2 fell in the bottom half of the list. ¹² That is, for each tantalizing politically sensitive term uncovered, there exist literally thousands of non-political terms such as 男生 (male), 女生 (female), 其实 (actually/in fact/really), 出来 (to come out/to emerge), 名字 (name – of a person or thing), 觉得 (to think/to feel), 在线 (online), 不是 (no/is not/not; fault/blame) that show statistically higher rates of deletion. ¹³

The Director continued:

The CMU study also showed high rates of weibo censorship in certain provinces. The phenomenon was particularly notable in Tibet, a hotbed of political unrest, where up to 53 percent of locally generated microblogs were deleted.

Again this is an overstatement, as discussed above regarding Table 4. Differential rates of deletion across regions were observed across all types of messages, with roughly equal rates of deletion for messages containing politically sensitive terms and for messages in general.
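One way to see how closely the two columns of Table 4 track each other is a rank correlation. The sketch below implements Kendall's tau-a by hand on a six-region subset of the table. The subset is my own arbitrary choice for brevity, so the result is not comparable to the τ = 0.77 in the caption, which covers all regions and may use a different variant of the statistic.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# (region, delta_uniform, delta_sensitive) copied from Table 4
subset = [
    ("Tibet", 0.530, 0.500),
    ("Gansu", 0.285, 0.301),
    ("Taiwan", 0.239, 0.260),
    ("Guangdong", 0.173, 0.168),
    ("Beijing", 0.120, 0.122),
    ("Shanghai", 0.114, 0.127),
]

uniform = [row[1] for row in subset]
sensitive = [row[2] for row in subset]
tau = kendall_tau(uniform, sensitive)
print(round(tau, 3))  # 0.867: 14 concordant pairs, 1 discordant, out of 15
```

In this subset the only pair ranked differently by the two columns is Beijing and Shanghai, whose rates are nearly identical to begin with.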

In truth, the study probably raises more questions than it answers. But if we must draw concrete conclusions at this point in time, then we must tentatively conclude that censorship appears to play only a minor role in the large-scale deletion patterns on China’s micro blogging platforms. Deletions of messages are common in all cyberspaces – including on Twitter and Facebook – and politically sensitive terms do not appear to be deleted at statistically significantly higher rates than non-politically sensitive terms in China.

But perhaps another cut at the observation is to concede that in order to tease out any conclusion about government censorship, it is important to understand first how users on different micro blogging platforms, in different regions, use those platforms.

In a recent study on Twitter usage patterns, a group of researchers noted:

To study the dynamics of trends in social media, we have conducted a comprehensive study on trending topics on Twitter. … we found that the content that trended was largely news from traditional media sources, which are then amplified by repeated retweets on Twitter to generate trends.

Contrast this with a recent study on Chinese micro blogging patterns, for example, where a group of researchers noted:

We found that there is a vast difference in the content shared in China, when compared to a global social network such as Twitter. In China, the trends are created almost entirely due to retweets of media content such as jokes, images and videos, whereas on Twitter, the trends tend to have more to do with current global events and news stories.

Can the differences in usage patterns explain the differences in frequency of terms (e.g. Twitter vs. Weibo) and/or rates of deletion (e.g. across geography) observed above?

Consider also a recent study of 2000 tweets over a two-week period in August 2009, where a group noted that 40% of all tweets are “pointless babbles” and another 40% can best be described as “conversational.” What are the consequences of such usage patterns? What are the corresponding usage patterns in micro blogging platforms from province to province and from nation to nation?

Those who believe the government must play a larger role than the data here show on their face might want to consider how the government can affect general usage patterns through social and legal rather than just technical means. On the flip side, people must also explain why, despite presumed pervasive government censorship, Chinese cyberspace ranks amongst the most dynamic in the world. ¹⁴

It’s one thing to let our intuition guide us on what to dig for, but it’s quite another to let our intuition color the way we perceive facts.

In my correspondence with the authors, one of the authors observed:

There are definitely different usage biases between Twitter and Sina…. In our specific Twitter sample, many of the most prolific users tend to be news media, which again is biased in the type of things they tweet. We use this to our advantage to identify particularly salient topics, but we should keep in mind that is a bias.

…

I’m sure that the overall censorship picture is far more nuanced than the dimensions of it that we studied — in addition to the potential differences in general social media usage that you note, one other interesting phenomenon to consider is self-censorship (users moderating what they say) and metaphor (circumlocutions to discuss something politically sensitive in an oblique way). I don’t know of any studies of deletion practices in general on social media, but I’m sure that also interacts in complex ways with self-censorship too. A lot of the most interesting questions are still waiting to be answered, and hopefully we’ve been able to contribute a little to this line of research.

That sounds about right.

Notes:

¹ As the authors noted in the study, however, content-based removals by themselves are unremarkable.

Facebook, for example, removes content that is “hateful, threatening, or pornographic; incites violence; or contains nudity or graphic or gratuitous violence” (Facebook, 2011). Aside from their own internal policies, social media organizations are also governed by the laws of the country in which they operate. In the United States, these include censoring the display of child pornography, libel, and media that infringe on copyright or other intellectual property rights; in China this extends to forms of political expression as well.

Interestingly, earlier this year, Twitter announced that it will begin to selectively block tweets on a country by country basis. It noted that “[a]s we continue to grow internationally, we will enter countries that have different ideas about the contours of freedom of expression. Some…, for historical or cultural reasons, restrict certain types of content, such as France or Germany, which ban pro-Nazi content.” It also noted however that others “differ so much from our ideas that we will not be able to exist there.”

² The existence of deletions by itself is not remarkable. As the authors noted: “Messages can of course be deleted for a range of reasons, and by different actors: social media sites, Twitter included, routinely delete messages when policing spam; and users themselves delete their own messages and accounts for their own personal reasons.”
³ Depending on the confidence specified, the authors obtained from 1,715 to 3,046 terms that appear to be deleted above the baseline rate.
⁴ In this study, terms were deemed to be politically sensitive if they correspond to terms generally known to be politically sensitive or if they are shown to be blocked in either Sina search or Weibo search.
⁵ From Twitter, the authors had obtained 11,079,704 tweets from the top 10,000 Chinese Twitter users.
⁶ While these terms were blocked on Sina search, on reviewing the data the authors shared privately with me, I noted that none except for four (艾未未 (Ai Weiwei, Chinese artist and activist), 方滨兴 (Fang Binxing), 法轮功 (Falun Gong), 共匪 (communist bandit)) were actually detected to be deleted.
⁷ Twitter LLR List: List of terms that are most differentially expressed comparing Twitter to Weibo messages.
⁸ Jedidiah R. Crandall, Daniel Zinn, Michael Byrd, Earl Barr, and Rich East. “ConceptDoppler: A weather tracker for Internet censorship,” CCS ’07: Proceedings of the 14th ACM Conference on Computer and Communications Security, pp. 352–365, and at http://www.cs.ucdavis.edu/~barr/publications/conceptDoppler.pdf, accessed 3 March 2012.
⁹ http://en.wikipedia.org/wiki/List_of_blacklisted_keywords_in_the_People%...
¹⁰ According to the authors:

As with Twitter, messages on Sina Weibo are attended with a range of metadata features, including free–text categories for user name and location and fixed-vocabulary categories for gender, country, province, and city. While users are free to enter any information they like here, true or false, this information can in the aggregate enable us to observe large–scale geographic trends both in the overall message pattern (Eisenstein, et al., 2010, 2011; O’Connor, et al., 2010; Wing and Baldridge, 2011) and in rates of deletion.

¹¹ As they noted in a private email with me, “it would seem unwise to mention “Falun Gong” in a weibo hawking shoes!”
¹² The only terms that appear in the top half of the deletion list were 反社会 (antisociety), 江泽民 (Jiang Zemin) – discussed extensively above, and 盛雪 (Sheng Xue – a human rights activist currently residing in Canada).
¹³ I had suggested to the authors that perhaps some of these terms could be explained away if they can be shown to be correlated with politically sensitive terms (that is, if they happen to occur in messages that contain politically sensitive terms). The authors acknowledged that it would be a good direction for future research, admitting generally that “there are enough odd things in [the dataset] that I think it might require more explanation and consideration to be interpreted usefully.”
¹⁴ As one researcher has noted: "The main question for me is how to understand this paradox that China has very tight Internet control, but at the same time very dynamic, lively and sometimes contentious Internet culture and politics. Here in the U.S., the Web is kind of a supplementary tool for social activism. But in China, cyberspace is really where all the action is, so to speak. Some of the most important and influential protest activities in recent years have happened mainly on the Internet."