Inside the Numbers: How 34,584 Science Papers Reveal the Secrets of Simpler Writing

26 Nov 2024

Author:

(1) David M. Markowitz, Department of Communication, Michigan State University, East Lansing, MI 48824.

Editor's note: This is part 4 of 10 of a paper evaluating the effectiveness of using generative AI to simplify science communication and enhance public trust in science. The rest of the paper can be accessed via the table of links below.

Table of Links

Study 1a: Method

Data Collection

To first evaluate whether lay summaries had a simpler linguistic style than scientific summaries, significance statements (lay summaries) and academic abstracts (scientific summaries) were extracted from the journal Proceedings of the National Academy of Sciences (PNAS). This journal was selected because it is a widely read, high-impact general science journal that was one of the first outlets to require authors to provide both traditional scientific summaries (e.g., abstracts) and lay summaries that appeal to average readers. PNAS also has topical breadth, scale, and longevity relative to other journals that may require lay summaries: its significance statements began in 2012 (31).

A total of 42,022 publications were extracted from PNAS between January 2010 and March 2024 to capture papers that might include both an academic abstract and a significance statement. Only papers with both summary types were included, which creates a yoked comparison within the same article. The final dataset included 34,584 papers (34,584 significance statements and 34,584 abstracts), totaling 10,799,256 words.

Automated Text Analysis

All texts were evaluated with Linguistic Inquiry and Word Count (LIWC), an automated text analysis tool that counts words as a percentage of the total word count per text (32). LIWC contains a validated internal dictionary of social (e.g., words related to family), psychological (e.g., words related to cognition, emotion), and part of speech dimensions (e.g., pronouns, articles, prepositions), and the tool measures the degree to which each text contains words from its respective dictionary categories. For example, the phrase “This science aims to improve society” contains 6 words and counts the following LIWC categories, including but not limited to: impersonal pronouns (this; 16.67% of the total word count) and positive tone words (improve; 16.67% of the total word count). All texts were run through LIWC-22 unless otherwise stated.
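The percentage-based counting described above can be sketched in a few lines of Python. This is only an illustration: the LIWC-22 dictionary is proprietary, so `TOY_DICTIONARY`, its category names, and the `category_percentages` helper are hypothetical stand-ins, and the simple tokenizer is an assumption rather than LIWC's actual procedure.

```python
import re

# Hypothetical toy dictionary for illustration only; the real LIWC-22
# dictionary is proprietary and far larger.
TOY_DICTIONARY = {
    "impersonal_pronouns": {"this", "that", "it"},
    "positive_tone": {"improve", "good", "great"},
}


def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return each category's share of the total word count, as a percentage."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {name: 0.0 for name in dictionary}
    return {
        name: 100.0 * sum(word in vocab for word in words) / total
        for name, vocab in dictionary.items()
    }


scores = category_percentages("This science aims to improve society")
print({name: round(value, 2) for name, value in scores.items()})
```

For the six-word example phrase, each toy category matches one word, so each accounts for one sixth of the text, in line with the 16.67% figures reported above.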

Measures

To evaluate how lay versus scientific summaries compared in terms of verbal simplicity, three measures were used from prior work to approximate simple language patterns (23): common words (e.g., the degree to which people use common and simple terms like job instead of uncommon and more complex terms like occupation), one’s analytic writing style (e.g., the degree to which one’s writing is formal and complex rather than informal and story-like), and readability (e.g., the number of words per sentence and big words in a person’s communication output).

Consistent with prior work (23, 33–35), common words were operationalized with the LIWC dictionary category. LIWC’s dictionary represents a collection of everyday words in English (36, 37). Therefore, texts that use more words from this dictionary are treated as simpler than texts that use fewer words from this dictionary. One’s analytic writing style was operationalized with the LIWC analytic thinking index, which is a composite variable of seven style word categories. Style words represent how one is communicating rather than what they are communicating about (24, 38). High scores on this index reflect high rates of articles and prepositions, but low rates of conjunctions, adverbs, auxiliary verbs, negations, and pronouns (26, 39, 40). [1] Finally, readability was operationalized with the Flesch Reading Ease metric (29) and calculated using the quanteda.textstats package in R (41). High scores on the Flesch Reading Ease metric suggest more readable and simpler writing (e.g., texts with smaller words and shorter sentences) compared to low scores. These three language dimensions were combined into a single simplicity index by first standardizing (z-scoring) each variable and then applying the following formula: Common Words + Readability – Analytic Writing. High scores on the index are linguistically simpler than low scores.
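The simplicity index can be illustrated with a minimal Python sketch. The function names and the three-text sample values are made up for illustration, and the use of the population standard deviation for z-scoring is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    # Assumption: population standard deviation; the source does not specify.
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def simplicity_index(common_words, readability, analytic):
    """Compute Common Words + Readability - Analytic Writing on z-scored inputs."""
    z_common = z_scores(common_words)
    z_read = z_scores(readability)
    z_analytic = z_scores(analytic)
    return [c + r - a for c, r, a in zip(z_common, z_read, z_analytic)]


# Made-up scores for three texts, not values from the paper.
common = [70.0, 80.0, 90.0]      # LIWC dictionary percentage
flesch = [20.0, 30.0, 40.0]      # Flesch Reading Ease
analytic = [95.0, 90.0, 85.0]    # LIWC analytic thinking index

index = simplicity_index(common, flesch, analytic)
print([round(score, 2) for score in index])
```

Because analytic writing is subtracted, a text that is high on common words and readability but low on analytic style receives the highest index value, which matches the interpretation that higher scores are simpler.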

Analytic Plan

Since each article contained one lay summary and one scientific summary from the same article, independent samples t-tests were conducted for the simplicity index and each individual dimension of the index. All data across studies are located on the Open Science Framework: https://osf.io/64am3/?view_only=883926733e6e494fa2f2011334b24796.
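As a rough illustration of the test statistic, the sketch below computes an independent samples t statistic with pooled variance (Student's version) using only the Python standard library. The choice of the pooled variant is an assumption, since the text does not say which variant was used, and the `independent_t` helper and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Return Student's independent samples t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(sample_a) - mean(sample_b)) / standard_error
    return t_stat, df


# Made-up simplicity index values, not data from the paper.
lay = [1.0, 2.0, 3.0, 4.0]
scientific = [0.0, 1.0, 2.0, 3.0]

t_stat, df = independent_t(lay, scientific)
print(round(t_stat, 3), df)
```

A positive t statistic here would indicate that the first group (lay summaries) scored higher on the measure than the second group (scientific summaries).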

This paper is available on arXiv under a CC BY 4.0 license.

[1] Analytic writing = [articles + prepositions - pronouns - auxiliary verbs - adverbs - conjunctions - negations] from LIWC scores (40).
