Trustworthy science is crucial to scientific progress, evidence-based policies, and human health. Problems in research design, conduct, and reporting threaten scientific integrity, waste resources, and risk a loss of public trust in science. Citations play a fundamental role in diffusion of scientific knowledge[1] and research assessment[2]; yet their role in research integrity is often overlooked[3]. Citation inaccuracies (e.g., citation of non-existent findings[3]) undermine the integrity of the biomedical literature, distorting the perception of available evidence[4] with potentially serious consequences for human health. A recent meta-analysis showed that 25.4% of medical articles contained a citation error[5]. Retracted articles continue to be cited positively years after being retracted, spreading scientific misinformation[6,7]. A bibliometric analysis revealed that inaccurate citations of a letter published in 1980[8] may have contributed to the opioid crisis[9].
Considering such negative consequences, it is critical to ensure that the scientific contents of all referenced articles are accurately and properly cited in a manuscript before its publication. However, assessing citation accuracy requires considerable manual effort. Authors often take shortcuts, copying citations from other articles without checking their accuracy[10]. Journals and peer reviewers lack the resources to check manuscripts for citation inaccuracies. Automated tools that can identify citation inaccuracies would help authors, journals, and peer reviewers mitigate the negative consequences of citation distortions and improve the transparency and integrity of scholarly communication[11].
The objective of this project is to develop scalable natural language processing (NLP) and artificial intelligence (AI) algorithms to automatically assess biomedical publications for citation content accuracy. The resulting models can be embedded in practical software tools. With these new tools, authors will be able to improve their citation quality; journals and peer reviewers will be able to scrutinize questionable citation practices pre-publication; and research administrators, research integrity officers, funders, and policymakers will be able to investigate citation practices, integrity issues, and knowledge diffusion via citations.
Toward this objective, in the first year of the project, we will construct a corpus of 100 highly cited articles based on PHS-funded research and 20 articles citing each of these articles, align the citation context in each citing article with the relevant text spans in the reference article, and assess whether the citation is accurate with respect to the reference article. In the second year, we will use the resulting corpus to train and validate AI-based NLP models that identify related text spans in reference-citing article pairs and calculate a confidence score for citation accuracy.
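To make the planned corpus concrete, the following is a minimal sketch of how one annotated reference-citing pair could be represented. All names (`CitationRecord`, `Accuracy`, `expected_pair_count`) and the two-way label set are assumptions made for illustration; the actual annotation scheme and data format are not specified here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Accuracy(Enum):
    # Hypothetical label set; the real annotation scheme may be more granular.
    ACCURATE = "accurate"
    INACCURATE = "inaccurate"


@dataclass
class CitationRecord:
    """One annotated citation linking a citing article to a reference article."""
    reference_id: str      # identifier of the highly cited reference article
    citing_id: str         # identifier of the citing article
    citation_context: str  # text surrounding the citation in the citing article
    # (start, end) character offsets of aligned spans in the reference article
    evidence_spans: List[Tuple[int, int]] = field(default_factory=list)
    label: Optional[Accuracy] = None  # None until an annotator assigns a label


def expected_pair_count(n_reference: int = 100, n_citing_per_reference: int = 20) -> int:
    """Number of reference-citing article pairs implied by the corpus design."""
    return n_reference * n_citing_per_reference
```

With the figures stated above (100 reference articles, 20 citing articles each), the design implies 2,000 reference-citing article pairs. A single citing article may cite its reference more than once, so the number of annotated citation contexts could be larger.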
The proposed work is innovative because: (a) it is the first project focusing on automated citation accuracy verification in biomedical publications; and (b) it tackles the significant NLP challenge of aligning the content of two articles at multiple levels of granularity (e.g., single sentence, passage, entire article). To address this challenge, we will leverage state-of-the-art sentence encoders, such as BERT[12], as well as long document encoders, such as Longformer[13], and a multilevel text alignment approach[14].
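The idea of aligning a citation context with a reference article at several granularities can be illustrated with a toy sketch. It deliberately swaps the neural encoders named above for a bag-of-words cosine similarity so that it runs with the standard library alone; it is not the proposed model. The function names (`embed`, `cosine`, `align`) and the `window` parameter are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Union


def embed(text: str) -> Counter:
    """Toy stand-in for a neural encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(
    citation_context: str,
    reference_sentences: List[str],
    window: int = 2,
) -> Dict[str, Union[int, float]]:
    """Score a citation context against a reference article at three levels:
    single sentence, passage (sliding window of sentences), and whole article."""
    if not reference_sentences:
        raise ValueError("reference_sentences must not be empty")
    query = embed(citation_context)

    # Sentence level: similarity to each individual reference sentence.
    sentence_scores = [cosine(query, embed(s)) for s in reference_sentences]
    best = max(range(len(sentence_scores)), key=sentence_scores.__getitem__)

    # Passage level: similarity to each window of consecutive sentences.
    n_windows = max(1, len(reference_sentences) - window + 1)
    passage_scores = [
        cosine(query, embed(" ".join(reference_sentences[i:i + window])))
        for i in range(n_windows)
    ]

    # Article level: similarity to the concatenation of all sentences.
    article_score = cosine(query, embed(" ".join(reference_sentences)))

    return {
        "best_sentence_index": best,
        "sentence_score": sentence_scores[best],
        "passage_score": max(passage_scores),
        "article_score": article_score,
    }
```

For example, given invented toy sentences such as `["Drug A reduced systolic blood pressure in adults.", "The trial enrolled 200 participants.", "Adverse events were mild."]` and the context `"Drug A lowers blood pressure in adults"`, the sketch selects the first sentence as the best sentence-level match. Lexical overlap of this kind only locates candidate evidence; it cannot by itself determine whether a citation is accurate, which is why learned encoders and an annotated corpus are needed.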
The models developed will serve various stakeholders in improving citation quality and ensuring citation transparency and integrity. In addition, the proposed corpus and models will stimulate research in citation content analysis[15], contributing to the development of more granular and accurate measures of scholarly impact. In the longer term, such qualitative measures can mitigate the detrimental effects that purely quantitative metrics of research assessment have had on research integrity and quality[16].