
Allen School researchers find racial bias built into hate-speech detection


Top left to right: Sap, Gabriel, Smith; bottom left to right: Card, Choi

The volume of content posted to Facebook, YouTube, Twitter and other social media platforms every moment of the day, from all over the world, is monumental. Unfortunately, some of it is biased, hate-filled language that targets members of minority groups and often prompts violent action against them. Because human moderators cannot keep up with the content generated in real time, platforms are turning to artificial intelligence and machine learning to catch toxic language and stop it quickly. Regrettably, these toxic language detection tools have been found to suppress already marginalized voices.

“Despite the benevolent intentions of most of these efforts, there’s actually a really big racial bias problem in hate speech detection right now,” said Maarten Sap, a Ph.D. student in the Allen School. “I’m not talking about the kind of bias you find in racist tweets or other forms of hate speech against minorities; instead, the kind of bias I’m talking about is the kind that leads harmless tweets to be flagged as toxic when written by a minority population.”

In their paper, “The Risk of Racial Bias in Hate Speech Detection,” presented at the recent annual meeting of the Association for Computational Linguistics (ACL), Sap, fellow Ph.D. student Saadia Gabriel, professors Yejin Choi and Noah Smith of the Allen School and the Allen Institute for Artificial Intelligence, and Dallas Card of Carnegie Mellon University studied two datasets totaling 124,779 tweets that had been flagged for toxic language by a machine learning tool used by Twitter. What they found was widespread evidence of racial bias in how the tool characterized content. In one of the datasets, the tool mistakenly reported 46 percent of non-offensive tweets written in African American English (AAE), which is commonly spoken by black people in the US, as offensive, versus 9 percent of those written in general American English. In the other dataset, it reported 26 percent of non-offensive AAE tweets as offensive, versus 5 percent of those in general American English.
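For a concrete sense of how such a disparity is measured, the minimal sketch below (illustrative only, not the authors’ code) computes a per-dialect false positive rate: the share of tweets that human annotators judged non-offensive but that a toxicity classifier nevertheless flagged. The file name and column names are hypothetical placeholders.

```python
# Illustrative sketch, not the authors' code: measuring the false positive
# rate of a toxicity classifier separately for each dialect group.
# "annotated_tweets.csv" and its column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("annotated_tweets.csv")
# Expected columns: "dialect" ("aae" or "general"), "gold_label" (the human
# judgment), and "predicted_label" (the classifier's output).

# Keep only the tweets that human annotators judged NOT offensive.
non_offensive = df[df["gold_label"] == "not_offensive"]

# False positive rate per dialect: the share of those harmless tweets the
# classifier nevertheless flagged as offensive.
fpr_by_dialect = (
    non_offensive.groupby("dialect")["predicted_label"]
    .apply(lambda preds: (preds == "offensive").mean())
)
print(fpr_by_dialect)  # a gap like 0.46 (aae) vs. 0.09 (general) mirrors the finding above
```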

“I wasn’t aware of the exact level of bias in Perspective API (the tool used to detect online hate speech) when searching for toxic language, but I expected to see some level of bias from previous work that examined how easily algorithms like AI chatbots learn negative cultural stereotypes and associations,” said Gabriel. “Still, it’s always surprising and a little alarming to see how well these algorithms pick up on toxic patterns pertaining to race and gender when presented with large corpora of unfiltered data from the web.”

This matters, Sap said, because ignoring the social context of the language harms minority populations by suppressing inoffensive speech. To address the biases displayed by the tool, the group changed the annotation process, that is, the guidelines annotators follow when labeling tweets as hate speech. As an experiment, the researchers took 350 AAE tweets and enlisted workers on Amazon Mechanical Turk to re-annotate them.

Gabriel explained that on Amazon Mechanical Turk, researchers can set up tasks for workers to help with something like a research project or marketing effort. There are usually instructions and a set of criteria for the workers to consider, followed by a number of questions.

“Here, you can tell workers specifically if there are particular things you want them to consider when thinking about the questions, for instance the tweet source,” she said. “Once the task goes up, anyone who is registered as a worker on Amazon Mechanical Turk can answer these questions. However, you can add qualifications to restrict the workers. We specified that all workers had to originate from the US since we’re considering US cultural norms and stereotypes.”
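As a concrete illustration of the restriction Gabriel describes, here is a hypothetical sketch using the boto3 Mechanical Turk client to post a task that only US-based workers can accept. The title, reward, timing values and question form are placeholders, not the study’s actual task.

```python
# Hypothetical sketch: posting a Mechanical Turk task restricted to US-based
# workers via boto3. All task details below are placeholders.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Built-in Worker_Locale qualification: only workers whose locale is the US
# may accept the task.
us_only = [{
    "QualificationTypeId": "00000000000000000071",
    "Comparator": "EqualTo",
    "LocaleValues": [{"Country": "US"}],
}]

with open("offensiveness_question.xml") as f:  # placeholder QuestionForm XML
    question_xml = f.read()

hit = mturk.create_hit(
    Title="Rate whether a tweet is offensive",            # placeholder
    Description="Read a tweet and answer a few questions.",
    Reward="0.10",                                         # placeholder reward in USD
    MaxAssignments=5,
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=question_xml,
    QualificationRequirements=us_only,
)
print("Posted HIT:", hit["HIT"]["HITId"])
```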

When given the tweets without any background information, the workers reported that 55 percent of them were offensive. When told the dialect and likely race of the tweeters, they reported that 44 percent were offensive. The workers were also asked whether they found the tweets personally offensive; only 33 percent were reported as such. This showed the researchers that priming annotators with the source’s race and dialect influenced the labels, and it also revealed that the annotations are far from objective.
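To put that comparison in concrete terms, here is a small illustrative calculation using counts derived from the percentages above, not the study’s raw data or its actual analysis. It asks whether the drop from 55 percent to 44 percent is larger than chance alone would explain, using a two-proportion z-test.

```python
# Illustrative only: a two-proportion z-test on counts derived from the
# percentages reported above. This is not the study's raw data or analysis.
from statsmodels.stats.proportion import proportions_ztest

n_tweets = 350              # AAE tweets re-annotated in the experiment
flagged_no_context = 193    # about 55 percent labeled offensive with no background info
flagged_primed = 154        # 44 percent labeled offensive when primed with dialect/race

z_stat, p_value = proportions_ztest(
    count=[flagged_no_context, flagged_primed],
    nobs=[n_tweets, n_tweets],
)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # a small p suggests the shift is not just noise
```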

“Our work serves as a reminder that hate speech and toxic language are highly subjective and contextual,” said Sap. “We have to think about dialect, slang and in-group versus out-group, and we have to consider that slurs spoken by the out-group might actually be reclaimed language when spoken by the in-group.”

While the study’s findings are concerning, Gabriel believes language processing systems can be taught to take the source into account, preventing the kind of racial bias that mischaracterizes content as hate speech and can lead to already marginalized voices being deplatformed.

“It’s not that these language processing machines are inventing biases, they’re learning them from the particular beliefs and norms we spread online. I think that in the same way that being more informed and having a more empathic view about differences between peoples can help us better understand our own biases and prevent them from having negative effects on those around us, injecting these kinds of deeper insights into machine learning algorithms can make a significant difference in preventing racial bias,” she said. “For this, it is important to include more nuanced perspectives and greater context when doing natural language processing tasks like toxic language detection. We need to account for in-group norms and the deep complexities of our culture and history.”

To learn more, read the research paper here and watch a video of Sap’s ACL presentation here. Also see previous coverage of the project by Vox, Forbes, TechCrunch, New Scientist, Fortune and MIT Technology Review.