Are AI or Humans More Trustworthy? A Study Puts Readers to the Test

26 Nov 2024

Author:

(1) David M. Markowitz, Department of Communication, Michigan State University, East Lansing, MI 48824.

Editor's note: This is part 8 of 10 of a paper evaluating the effectiveness of using generative AI to simplify science communication and enhance public trust in science. The rest of the paper can be accessed via the table of links below.

Study 2: Method

Participants in the US were recruited from Prolific and paid $4.00 for completing a short study (median completion time < 7 minutes). They were told that they would read scientific summaries and make judgments about the authors of those texts.

Participants and Power

Based on this study’s preregistration (https://aspredicted.org/C3K_T31), a minimum of 164 participants was required to detect a small effect with 80% power in a within-subjects study (f = 0.10, α = .05, two-tailed, three measurements). A total of 274 participants were recruited to ensure an adequate sample. Most participants self-identified as men (n = 139, 50.7%; women: n = 127; other: n = 7). On average, participants were 36.74 years old (SD = 12.47 years), and most were White (n = 190, 69.3%). On a 7-point political ideology scale (1 = extremely liberal, 7 = extremely conservative), participants leaned liberal (M = 2.97, SD = 1.63).
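
To illustrate how a sample size in this range can be derived, the sketch below approximates a G*Power-style power analysis for a one-group repeated-measures F test under assumed defaults (α = .05, correlation among repeated measures ρ = 0.5, nonsphericity correction ε = 1); these defaults, and the use of SciPy, are illustrative assumptions rather than details reported in the paper.

```python
# Sketch: approximate power for a within-subjects (repeated measures) F test
# with a small effect (f = 0.10) and three measurements. The correlation among
# repeated measures (rho = 0.5) and sphericity (eps = 1.0) are assumed defaults,
# not values taken from the preregistration.
from scipy.stats import f as f_dist, ncf

def rm_power(n, f_effect=0.10, m=3, rho=0.5, eps=1.0, alpha=0.05):
    """Approximate power of a one-group, within-factor repeated-measures F test."""
    df1 = (m - 1) * eps                              # numerator degrees of freedom
    df2 = (n - 1) * (m - 1) * eps                    # denominator degrees of freedom
    ncp = f_effect**2 * n * m * eps / (1 - rho)      # noncentrality parameter
    f_crit = f_dist.ppf(1 - alpha, df1, df2)         # critical F under the null
    return 1 - ncf.cdf(f_crit, df1, df2, ncp)        # power under the alternative

# Smallest n reaching 80% power under these assumptions; this lands close to
# the preregistered minimum of 164.
n = 10
while rm_power(n) < 0.80:
    n += 1
print(n, round(rm_power(n), 3))
```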

Procedure

Five pairs of stimuli from Study 1b were selected for the experiment because they had the greatest differences in common-words scores between the PNAS and GPT texts (Pair 1 GPT = 79.31%, Pair 1 PNAS = 48.65%; Pair 2 GPT = 76.64%, Pair 2 PNAS = 46.32%; Pair 3 GPT = 79.07%, Pair 3 PNAS = 52.14%; Pair 4 GPT = 85.00%, Pair 4 PNAS = 59.66%; Pair 5 GPT = 87.10%, Pair 5 PNAS = 62.81%). Participants were randomly assigned to read stimuli from three of the five pairs (see the online supplement for the stimulus texts), and within these randomly selected pairs, they were randomly assigned to the GPT (simple) or PNAS (complex) version of each pair. Participants were told to read each summary of a scientific paper and then answer the questions below each summary. They were specifically told, “we are not expecting you to be an expert in the topic discussed below. Instead, make your judgments based on how the summary is written.”
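
The assignment logic described above can be summarized in a short sketch; the pair labels and version keys are illustrative placeholders rather than the study’s actual materials.

```python
# Sketch of the stimulus assignment: each participant sees three of the five
# stimulus pairs, and within each selected pair is randomly shown either the
# GPT (simple) or PNAS (complex) version. Labels are hypothetical.
import random

PAIRS = [1, 2, 3, 4, 5]

def assign_stimuli(rng: random.Random) -> list[tuple[int, str]]:
    selected = rng.sample(PAIRS, 3)                  # three of the five pairs
    return [(pair, rng.choice(["GPT", "PNAS"]))      # one version per selected pair
            for pair in selected]

rng = random.Random(2024)                            # arbitrary seed for the sketch
print(assign_stimuli(rng))                           # e.g., [(2, 'PNAS'), (5, 'GPT'), (1, 'GPT')]
```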

Finally, participants reported several perceptions of the author (e.g., intelligence, trustworthiness) based on prior work (14, 15, 34), judged whether the scientific summary was written by AI or a human, and rated the complexity of each text as a manipulation check. The order of these measures was randomized, and items within each block were randomized as well. This study was approved by the author’s university research ethics board.

Measures

Manipulation Check

Based on prior work (15, 34), three questions asked participants to rate how clear the writing was (“How clear was the writing in the summary you just read?”), how complex it was (“How complex was the writing in the summary you just read?”), and how well they understood each scientific summary (“How much of this writing did you understand?”). Ratings for the first two questions were made on 7-point Likert-type scales from 1 = Not at all to 7 = Extremely. The third question ranged from 1 = Not at all to 7 = An enormous amount.

Author Perceptions

Participants made three ratings about the author of each scientific summary: (1) “How intelligent is the scientist who wrote this summary?”, (2) “How credible is the scientist who wrote this summary?”, and (3) “How trustworthy is the scientist who wrote this summary?” As a collection, these dimensions were highly reliable (Cronbach’s α = 0.88) and were therefore averaged to create a general author perceptions index; each item was also evaluated individually. All items were measured on 7-point Likert-type scales from 1 = Not at all to 7 = Extremely.
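
As a worked illustration of the composite measure, the sketch below computes Cronbach’s α from its standard formula and averages the three ratings into an index; the column names and toy data are hypothetical, and the reported α = 0.88 comes from the study’s actual data, not this example.

```python
# Sketch: Cronbach's alpha and the averaged author-perceptions index for the
# three author ratings. Column names and values are toy data for illustration.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

ratings = pd.DataFrame({
    "intelligent": [6, 5, 7, 4, 6],
    "credible":    [6, 4, 7, 4, 5],
    "trustworthy": [5, 5, 7, 3, 6],
})
print(cronbach_alpha(ratings))                        # reliability of the toy items
ratings["author_perceptions"] = ratings.mean(axis=1)  # averaged composite index
```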

Author Identity Perceptions

Participants rated their agreement with two statements: (1) “This summary was written by a human,” and (2) “This summary was written by Artificial Intelligence.” Both items were measured on 7-point Likert-type scales from 1 = Strongly disagree to 7 = Strongly agree.

Demographics

Basic demographic data were obtained from each participant, including their age, gender, ethnicity, and political ideology.

Analytic Plan

Since there were multiple observations per participant, linear mixed models with random intercepts for participant and stimulus were constructed (45, 46).
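
A minimal sketch of this specification, assuming a long-format data frame with hypothetical columns (rating, condition, participant_id, stimulus_id) and using statsmodels (the paper does not state which software was used), is shown below; crossed random intercepts are expressed as variance components within a single grouping variable that spans the full data set.

```python
# Sketch: linear mixed model with crossed random intercepts for participant and
# stimulus. Column names are hypothetical; df should be long-format with one row
# per participant x stimulus rating.
import pandas as pd
import statsmodels.formula.api as smf

def fit_mixed_model(df: pd.DataFrame):
    df = df.assign(one_group=1)                      # single group so the two factors are crossed
    vc = {
        "participant": "0 + C(participant_id)",      # random intercept per participant
        "stimulus": "0 + C(stimulus_id)",            # random intercept per stimulus
    }
    model = smf.mixedlm(
        "rating ~ condition",                        # fixed effect of GPT vs. PNAS version
        data=df,
        groups="one_group",
        vc_formula=vc,
        re_formula="0",                              # no extra per-group random intercept
    )
    return model.fit()

# Usage (with a real data frame): result = fit_mixed_model(df); print(result.summary())
```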

This paper is available on arxiv under CC BY 4.0 DEED license.