Overview

The Measuring Hate Speech project aims to develop models, datasets, and theoretical frameworks capable of measuring hate speech.

The majority of work in automated hate speech detection treats hate speech as a binary phenomenon: a piece of text either is hate speech or is not. This limited perspective does not account for the multifaceted nature of hate speech or for disagreements among individuals as to what constitutes hate speech.

Using Rasch measurement theory, we have developed a continuous measurement scale for hate speech that is capable of accommodating annotator perspective. By combining the measurement scale with large language models, we have developed tools that can measure the hatefulness of text at scale.
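As a rough illustration of the idea behind a Rasch-style scale, the sketch below computes the probability that an annotator endorses a single yes/no survey item for a comment. It is a minimal sketch, not the project's actual model: it assumes the simplest dichotomous form with one additive annotator-severity term, and the function and parameter names (rasch_probability, hatefulness, item_difficulty, annotator_severity) are hypothetical.

```python
import math


def rasch_probability(hatefulness: float,
                      item_difficulty: float,
                      annotator_severity: float = 0.0) -> float:
    """Probability of endorsing a yes/no item under a dichotomous Rasch model.

    All three quantities live on the same continuous (logit) scale.
    The annotator_severity term is an illustrative assumption showing how
    an annotator's perspective can shift the response without changing
    the comment's own location on the scale.
    """
    logit = hatefulness - item_difficulty - annotator_severity
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # A comment located exactly at the item's difficulty is endorsed
    # half of the time by an annotator with zero severity.
    print(rasch_probability(hatefulness=0.0, item_difficulty=0.0))

    # The same comment rated by a more severe (hypothetical) annotator.
    print(rasch_probability(hatefulness=0.0, item_difficulty=0.0,
                            annotator_severity=1.0))
```

Because the comment, the item, and the annotator are placed on one shared scale, differences between annotators can be separated from the location of the comment itself, which is what allows a continuous score rather than a binary label.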

The Measuring Hate Speech project began in early 2017 at UC Berkeley’s D-Lab. We continue our work on this project, both by conducting academic research and by offering our expertise, tools, and methods, via consultations, to those who could benefit from them. If you are interested in partnering with us, please reach out using the contact form at the bottom of the website.

Datasets and Models

Our datasets and models are freely and openly available on HuggingFace.

The Measuring Hate Speech dataset contains nearly 50,000 annotations of over 10,000 unique social media comments (from YouTube, Reddit, and Twitter). Each annotation contains responses to 10 survey items that span our hate speech construct, information about the targets of the speech, and annotator demographics.

Our neural network models are also available on HuggingFace, and the code for these models is available on GitHub. The code to reproduce the work in our papers is also available on GitHub (see below).

Publications

We propose a general method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning …

We introduce the Measuring Hate Speech corpus, a dataset created to measure hate speech while adjusting for annotators’ perspectives. …

Annotators, by labeling data samples, play an essential role in the production of machine learning datasets. Their role is increasingly …

The past decade has seen an abundance of work seeking to detect, characterize, and measure online hate speech. A related, but less …

Research Team

Claudia von Vacano

Executive Director, D-Lab

Pratik Sachdeva

Senior Data Scientist, D-Lab

Tom van Nuenen

Senior Data Scientist, D-Lab

Chris Kennedy

Harvard Medical School

Partners

Contact

We are especially interested in presenting our work to broad audiences and exploring partnerships.