Researchers around the world are digging into the trove of data available from our Fact-Check Insights project, with plans to use it for everything from benchmarking the performance of large language models to studying the moderation of humor online.
Since launching in December, the Fact-Check Insights dataset has been downloaded more than 300 times. The dataset is updated daily and made available at no cost to researchers by the Duke Reporters’ Lab, with support from the Google News Initiative.
Fact-Check Insights contains structured data on more than 240,000 claims by political figures and social media accounts, each analyzed and rated by independent fact-checkers. The dataset is powered by ClaimReview and MediaReview, twin tagging systems that allow fact-checkers to enter standardized data about their fact-checks, such as the statement being checked, the speaker, the date, and the rating.
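To make that structure concrete, here is a rough sketch of a single ClaimReview record expressed as a Python dictionary. The field names follow the schema.org ClaimReview vocabulary, the values are invented for illustration, and the actual records in the Fact-Check Insights download may be organized somewhat differently.

```python
# Illustrative ClaimReview record using schema.org field names.
# Values are invented; the real dataset's layout may differ, so check
# the "Guide to the Data" page for the authoritative field list.
sample_claim_review = {
    "@type": "ClaimReview",
    "datePublished": "2024-03-15",                 # when the fact-check was published
    "url": "https://example.org/fact-checks/123",  # hypothetical fact-check URL
    "author": {"@type": "Organization", "name": "Example Fact-Checking Org"},
    "claimReviewed": "Example of a statement being checked",
    "itemReviewed": {
        "@type": "Claim",
        "author": {"@type": "Person", "name": "Example Speaker"},  # who made the claim
        "datePublished": "2024-03-10",                              # when the claim was made
    },
    "reviewRating": {
        "@type": "Rating",
        "alternateName": "False",                  # the fact-checker's rating label
    },
}
```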
Researchers have found great value in the data.
Marcin Sawiński, a researcher in the Department of Information Systems at the Poznań University of Economics and Business in Poland, is part of a team using ClaimReview data for a multiyear project aimed at developing a tool to assess the credibility of online sources and detect false information using AI.
“With nearly a quarter of a million items reviewed by hundreds of fact-checking organizations worldwide, we gain instant access to a vast portion of the fact-checking output from the past several years,” Sawiński writes. “Manually tracking such a large number of fact-checking websites and performing web data extraction would be prohibitively labor-intensive. The ready-made dataset enables us to conduct comprehensive cross-lingual and cross-regional analyses of fake-news narratives with much less effort.”
The OpenFact project, which is financed by the National Center for Research and Development in Poland, uses natural language processing and machine learning techniques to focus on specific topics.
“Shifting our efforts from direct web data extraction to the cleanup, disambiguation, and harmonization of ClaimReview data has significantly reduced our workload and increased our reach,” Sawiński writes.
Other researchers who have downloaded the dataset plan to use it to benchmark the performance of large language models on fact-checking tasks. Still others are investigating how social media platforms respond to false information.
Ariadna Matamoros-Fernández, a senior lecturer in digital media in the School of Communication at Queensland University of Technology in Australia, plans to use the Fact-Check Insights dataset as part of her research into identifying and moderating humor on digital platforms.
“I am using the dataset to find concrete examples of humorous posts that have been fact-checked to discuss these examples in interviews with fact-checkers,” Matamoros-Fernández writes. “I am also using the dataset to use examples of posts that have been flagged as being satire, memes, humour, parody, etc., to test whether different foundation models (GPT-4/Gemini) are good at assessing these posts.”
The goals of her research include trying to “better understand the dynamics of harmful humour online” and creating best practices to tackle them. She has received a Discovery Early Career Researcher Award from the Australian Research Council to support her work.
Rafael Aparecido Martins Frade, a doctoral student working with the Spanish fact-checking organization Newtral, plans to use the data in his research on applying AI to tackle disinformation.
“I am currently researching automated fact-checking, namely multi-modal claim matching,” he writes of his work. “The objective is to develop models and mechanisms to help fight the spread of fake news. Some of the applications we’re planning to work on are euroscepticism, climate emergency and health.”
Researchers who have downloaded the Fact-Check Insights dataset have also provided the Reporters’ Lab with feedback on making the data more usable.
Enrico Zuccolotto, a master’s degree student in artificial intelligence at the Polytechnic University of Milan, performed a thorough review of the dataset, offering suggestions aimed at reducing duplication and filling in missing data.
While the data available from Fact-Check Insights is primarily presented in the original form submitted by fact-checking organizations, the Reporters’ Lab has made modest adjustments to improve the data’s clarity, and we will continue to make such improvements where feasible.
Researchers who have questions about the dataset can refer to the “Guide to the Data” page, which includes a table outlining the fields included, along with examples (see the “What you can expect when you download the data” section). The Fact-Check Insights dataset is available for download in JSON and CSV formats.
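For researchers working with the JSON export, a minimal loading sketch in Python might look like the one below. The filename is a placeholder, the snippet assumes the export is a single JSON array of records, and the field lookups assume schema.org ClaimReview naming, so adjust all three to match the actual download as described in the Guide to the Data.

```python
import json

# Minimal sketch of reading a Fact-Check Insights JSON export.
# "fact_check_insights.json" is a placeholder filename, and the field
# names assume schema.org ClaimReview conventions; adapt as needed.
with open("fact_check_insights.json", encoding="utf-8") as f:
    records = json.load(f)  # assumes the export is one JSON array of records

# Print the first few claims with their ratings.
for record in records[:5]:
    claim = record.get("claimReviewed", "")
    rating = record.get("reviewRating", {}).get("alternateName", "")
    print(f"{rating or 'unrated'}: {claim}")
```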
Access is free for researchers, journalists, technologists and others in the field, but registration is required.