AI Crushes It at Simplicity: GPT-4 Writes Science Summaries Better Than the Pros

26 Nov 2024

Author:

(1) David M. Markowitz, Department of Communication, Michigan State University, East Lansing, MI 48824.

Editor's note: This is part 7 of 10 of a paper evaluating the effectiveness of using generative AI to simplify science communication and enhance public trust in science. The rest of the paper can be accessed via the table of links below.

Study 1b: Results

Distributions of the comparisons in this study are displayed in Figure 1. GPT significance statements were written more simply than PNAS significance statements on the simplicity index, Welch’s t(1492.1) = 11.55, p < .001, Cohen’s d = 0.58, 95% CI [0.47, 0.69]. Specifically, GPT significance statements (M = 75.53%, SD = 5.57%) contained more common words than PNAS significance statements (M = 69.84%, SD = 7.45%), Welch’s t(1478.7) = 17.31, p < .001, Cohen’s d = 0.87, 95% CI [0.76, 0.97]. GPT significance statements (M = 17.59, SD = 11.15) were also more readable than PNAS significance statements (M = 12.86, SD = 14.27), Welch’s t(1510) = 7.39, p < .001, Cohen’s d = 0.37, 95% CI [0.27, 0.47]. However, GPT significance statements (M = 92.73, SD = 6.89) had a statistically equivalent analytic style to PNAS significance statements (M = 92.32, SD = 7.48), Welch’s t(1587.7) = 1.16, p = .246, Cohen’s d = 0.06, 95% CI [-0.04, 0.16]. All results were maintained when GPT significance statements were compared to PNAS abstracts as well, in addition to PNAS significance statements.
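The paper reports these comparisons without accompanying analysis code, but the tests are straightforward to reproduce. The sketch below is written against simulated scores (the arrays gpt_scores and pnas_scores are placeholders, not the study's data) and shows how a Welch's t-test, a pooled-SD Cohen's d, and an approximate 95% confidence interval of the kind reported above might be computed with SciPy and NumPy; the exact effect size and CI formulas used in the paper are assumptions here.

```python
import numpy as np
from scipy import stats

# Placeholder data: simulated % common-word scores, not the study's corpus
rng = np.random.default_rng(0)
gpt_scores = rng.normal(75.53, 5.57, 800)    # GPT significance statements
pnas_scores = rng.normal(69.84, 7.45, 800)   # PNAS significance statements

# Welch's t-test (unequal variances), matching the tests reported above
t_stat, p_value = stats.ttest_ind(gpt_scores, pnas_scores, equal_var=False)

# Cohen's d with a pooled standard deviation (a common convention; the paper's
# exact effect size formula is an assumption)
n1, n2 = len(gpt_scores), len(pnas_scores)
s_pooled = np.sqrt(((n1 - 1) * gpt_scores.var(ddof=1) +
                    (n2 - 1) * pnas_scores.var(ddof=1)) / (n1 + n2 - 2))
d = (gpt_scores.mean() - pnas_scores.mean()) / s_pooled

# Approximate 95% CI for d from its large-sample standard error
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
ci_low, ci_high = d - 1.96 * se_d, d + 1.96 * se_d

print(f"Welch's t = {t_stat:.2f}, p = {p_value:.3g}, "
      f"d = {d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

With the means and standard deviations above plugged in, a run like this yields values in the same neighborhood as the reported common-words comparison, which is all the sketch is meant to illustrate.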

Alternative Explanations

One possible explanation for the Study 1b results is that content differences between the PNAS and GPT texts, rather than simplicity itself, account for the differences across groups. This concern was addressed in two ways. First, PNAS has various sections that authors submit to, and LIWC has categories that approximate the words associated with such sections. For example, the LIWC category for political speech approximates papers submitted to the Social Sciences section, specifically Political Sciences. Several linguistic covariates were therefore examined to account for content-related differences across GPT and PNAS texts. After adding overall affect/emotion and cognition (to control for topics within the Psychological Sciences section of PNAS), political speech (to control for topics within the Political Sciences section of PNAS), and physical references (to control for topics within the Biological Sciences section of PNAS) to the multivariate models, all results were maintained except for Analytic writing, where GPT texts were more analytic than PNAS texts, a pattern that is also consistent with prior work (42). Please see the online supplement for additional LIWC differences across these text types.
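Although the paper does not specify its exact model formulas, the covariate-adjusted comparison described above can be sketched as an ordinary least squares model that predicts the simplicity index from text source while holding LIWC content categories constant. The column names below (source, simplicity, affect, cognition, politic, physical) and the file liwc_scores.csv are illustrative assumptions, not the study's actual variable names.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per text, with LIWC scores and a GPT/PNAS source label
df = pd.read_csv("liwc_scores.csv")

# OLS model: does the GPT vs. PNAS difference in simplicity survive content controls?
model = smf.ols(
    "simplicity ~ C(source) + affect + cognition + politic + physical",
    data=df,
).fit()

print(model.summary())  # the C(source) coefficient is the content-adjusted group comparison
```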

Content effects were also evaluated in a bottom-up manner using the Meaning Extraction Method to identify dominant themes across the GPT and PNAS texts (43, 44). As reported in the online supplement, eight themes were reliably extracted from the data, ranging from basic methodological and research information to gene expression and cancer science. Controlling for these themes, in addition to the prior LIWC content dimensions, again revealed consistent results (see supplement). Therefore, the Study 1b evidence is robust to content differences.
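The Meaning Extraction Method is essentially a dimension-reduction procedure: content words are coded as present or absent in each text, and the resulting binary matrix is factored so that co-occurring words form themes. The toy sketch below approximates that idea with scikit-learn; the four example texts and the two-component solution are placeholders, whereas the study's supplement reports eight themes extracted from the full corpus, likely with dedicated tooling.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# Placeholder texts standing in for the GPT and PNAS statements
texts = [
    "gene expression regulates cancer cell growth",
    "participants completed a survey measuring trust in science",
    "the model predicts protein folding from sequence data",
    "cancer biology research examines gene expression in tumor cells",
]

vectorizer = CountVectorizer(binary=True, stop_words="english")
X = vectorizer.fit_transform(texts).toarray()  # binary word-occurrence matrix

pca = PCA(n_components=2)  # supplement reports 8 themes; 2 suffice for this toy corpus
theme_scores = pca.fit_transform(X)  # per-text theme scores (usable as covariates, as above)

# Words with the largest loadings on each component approximate a theme
vocab = vectorizer.get_feature_names_out()
for i, component in enumerate(pca.components_):
    top_words = vocab[component.argsort()[::-1][:3]]
    print(f"Theme {i + 1}: {', '.join(top_words)}")
```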

Altogether, human authors write more simply for lay audiences than for scientific audiences (Study 1a), but Study 1b demonstrated that artificial intelligence and large language models can do so more effectively (e.g., the effect sizes for the GPT versus PNAS significance statement comparisons were larger than those for the human comparisons in Study 1a). The findings thus far are correlational, however, and causal evidence is needed to demonstrate how these effects shape human perceptions. In Study 2, participants were randomly assigned to read either a GPT significance statement or a PNAS significance statement from pairs of texts that appeared in the previous studies. Participants reported their perceptions of the author (e.g., intelligence, credibility, trustworthiness), judged the complexity of each text, and rated how much they believed the author of each text was a human or an artificial intelligence. Only perceptions of the author were measured because prior work suggests people generally report consistent ratings when asked about both scientists and their science in similar studies (9).

This paper is available on arXiv under a CC BY 4.0 DEED license.