Representative image of a library. Photo: Renaud Camus/Flickr CC BY 2.0
- Information overload is a common and old problem, but the growth of preprints over the last five years presents us with a proximal example of the challenge.
- The scholarly communication community is at an early stage in the transition to “publish first, filter later”.
- While preprints are lowering the cost of and delays to sharing information, filtering for relevant content is still at a relatively early stage.
Information overload is the difficulty in understanding an issue and effectively making decisions when one has too much information about that issue, and is generally associated with the excessive quantity of daily information.

– Wikipedia
Information overload is a common problem, and it is an old one. It is not a problem of the Internet age, nor is it specific to the scholarly literature, but the growth of preprints in the last five years presents us with a proximal example of the challenge.
We want to tackle this information overload problem and have some ideas on how to do it, presented at the end of this post. Are you willing to help? This post tells some of the backstory of how preprints solve part of the problem – speedy access to academic information – yet add to the growing volume of information that we need to filter to find results we can build on. It is written to inspire the problem solvers in our community to step forward and help us realise some practical solutions.
Using journals to find relevant information
In a classic presentation in 2008, the writer Clay Shirky argued that while information overload might be a problem as old as the 15th century, when Gutenberg invented the printing press, the rise of the Internet had for the first time radically changed how we address it. Publishing used to be expensive, complicated and therefore risky, and the risk was managed by publishing only content that the publisher judged “worth publishing”. Scientific publishing worked – and still works – in similar ways. One important change occurred with the dramatic growth of scientific publishing after World War II: filtering by staff editors became unsustainable, and external peer review by academic experts slowly became the norm from the 1960s to the 1990s (Nature, for example, adopted it in 1973 and The Lancet in 1976).
Clay Shirky coined the phrase “It’s Not Information Overload. It’s Filter Failure” in that 2008 presentation, making the point that publishing in the Internet age has become so cheap that publication no longer needs to be the critical filtering step; rather, filtering can happen after publication. We can see this pattern in many mainstream industries, from movies to online shopping, with organisations such as Netflix and Amazon investing heavily in recommender systems that substantially contribute to their revenues.
Cameron Neylon applied these considerations to scholarly communication and found the scholarly communication community at an early stage in the transition to “publish first, filter later”. Ten years later, his findings largely still hold true: scholarly discovery services continue to focus mostly on publications that have passed through a “filter” step at a scholarly publisher.
Preprints: an alternative to the ‘journal as a filter’
Preprints are the most visible implementation of the “publish first, filter later” approach. In some disciplines, including high-energy physics, astrophysics, mathematics, and computer science, preprints increasingly became the norm over the last 25 years, and today the majority of high-energy physics papers are first published as preprints on the arXiv. In the life sciences, the preprint server E-Biomed was proposed by NIH director Harold Varmus in 1999, but the project was killed after a few months, not least because of strong and vocal opposition from biomedical publishers and societies. Instead, PubMed Central launched in 2000, hosting open access journal publications rather than preprints. After a delay of more than 15 years, preprints in the life sciences finally took off; although they have grown considerably in number over the last five years, they still represent only a small fraction (6.4%, see Figure 1) of all publications in biology.
The notion of “publish first, filter later” is now being promoted by a range of publishers who no longer penalise authors for publicising their submissions as preprints, but rather encourage submitting authors to post their manuscripts as preprints while the journal puts them through peer review. Some publishers go further, embracing preprints as the publication model of the future.
Coming back to the original problem: preprints now add to the journal articles that researchers are tasked with filtering. That information overload poses a problem was recognised in a survey of stakeholders (librarians, journalists, publishers, funders, research administrators, students, clinicians, and more) conducted last year by ASAPbio. The problem is exacerbated by the number of servers hosting relevant preprints – ASAPbio’s preprint platform directory lists 56 preprint servers that host potentially relevant material.
Filtering for relevant content
While preprints in general, and life science preprints specifically, are lowering the cost of and delays to sharing information, filtering for relevant content is still at a relatively early stage. To go deeper into how relevant preprints can be discovered, it is important to distinguish three scenarios:
- Discovering relevant preprints at any point in time independent of peer review status
- Discovering relevant preprints that have undergone peer review
- Discovering relevant preprints immediately (days) after posting
The first category covers discovery services that index preprints alongside other content, for example Europe PMC and Meta. Discovery strategies developed for journal content, such as searching by keyword and/or author, apply equally to preprints; a minimal sketch of such a search follows.
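As an illustration – a sketch assuming the public Europe PMC REST API, where the SRC:PPR source code restricts results to preprints and the query term “CRISPR” is an arbitrary example – a keyword search for preprints might look like this:

```python
# Minimal sketch, not a finished tool: keyword search restricted to
# preprints via the public Europe PMC REST API. "SRC:PPR" limits
# results to preprints; "CRISPR" is an arbitrary example query.
import requests

EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search_preprints(keyword, page_size=10):
    """Return preprint records matching a keyword."""
    params = {
        "query": f'SRC:PPR AND "{keyword}"',
        "format": "json",
        "pageSize": page_size,
    }
    resp = requests.get(EUROPE_PMC, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["resultList"]["result"]

for hit in search_preprints("CRISPR"):
    print(hit.get("firstPublicationDate"), hit.get("title"))
```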
The second category focuses on peer-reviewed preprints, and is covered extensively elsewhere.
The third category is the focus of this post – discovering preprints of interest to a researcher right after their posting, which rules out traditional peer review as a filter. The following filter strategies are possible:
1. Filter by subject area, keyword or author name
2. Filter by personal publication history
3. Filter by attention immediately after publication: social media (Twitter, Mendeley, etc.) and usage stats
4. Filter by recommendations, e.g. from subject matter experts
These filters can of course be combined. The particular challenge is that they must work almost immediately (within days) after the preprint has been posted, which assumes a high level of automation and a focus on immediacy. A combination of filters 1 and 3 works well under these constraints: the information required for filter 1 is available in the metadata (e.g. via Crossref) as soon as the content is posted, and attention (filter 3) can be measured immediately after the preprint is posted – Twitter is widely used for sharing links to bioRxiv/medRxiv preprints (see the example in Figure 2). The Crossref Event Data service found 15,598 tweets for bioRxiv/medRxiv preprints in the week starting June 7, 2021.
We don’t talk enough about the fact that scientists tweeting their preprints has been one of the most significant applications of the @DORAssessment principle to recenter attention on a paper and its merits as opposed to the brand and impact of the outlet where it’s published. pic.twitter.com/FhCJ2zdpQ2
— Dario Taraborelli (@ReaderMeter) June 3, 2021
For filter 3, we have considered bookmarks of preprints in Mendeley, but these cannot currently be tracked in open APIs such as the Crossref Event Data service. Usage stats are another alternative, but they are currently not available via API in the early days after publication. Tweets, however, can be queried directly; a minimal sketch follows.
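As a sketch – assuming the query parameters of the public Crossref Event Data API, with a placeholder contact address you would replace with your own – tweets for bioRxiv/medRxiv preprints (DOI prefix 10.1101) can be counted per preprint over a date window:

```python
# Minimal sketch, not a production service: count Twitter events per
# bioRxiv/medRxiv preprint via the public Crossref Event Data API.
# The mailto address is a placeholder.
from collections import Counter
import requests

EVENT_DATA = "https://api.eventdata.crossref.org/v1/events"

def tweet_counts(from_date, until_date, mailto="you@example.org"):
    """Count Twitter events per DOI collected between the two dates."""
    counts, cursor = Counter(), None
    while True:
        params = {
            "mailto": mailto,
            "source": "twitter",
            "obj-id.prefix": "10.1101",        # bioRxiv/medRxiv DOI prefix
            "from-collected-date": from_date,  # e.g. "2021-06-07"
            "until-collected-date": until_date,
            "rows": 1000,
        }
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(EVENT_DATA, params=params, timeout=60)
        resp.raise_for_status()
        message = resp.json()["message"]
        for event in message["events"]:
            counts[event["obj_id"]] += 1       # obj_id is the preprint DOI URL
        cursor = message.get("next-cursor")
        if not cursor:
            break
    return counts

for doi, n in tweet_counts("2021-06-07", "2021-06-13").most_common(10):
    print(n, doi)
```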
Another consideration is how best to inform researchers of these potentially relevant preprints. Given that cost and speed are the primary concerns, we consider the most appropriate approach to be dissemination of the filtering results via a regular (daily or weekly) RSS feed or newsletter; a sketch of the feed side follows.
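A minimal sketch of generating such a feed, using only the Python standard library to emit RSS 2.0 – the feed URL and the single item are placeholders, and in practice the items would come from the tweet-count filter above:

```python
# Minimal sketch: serialise a filtered preprint list as an RSS 2.0 feed.
# The channel link and the example item are placeholders.
import xml.etree.ElementTree as ET

def build_rss(items, title="Filtered preprints"):
    """items: iterable of (title, link) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = "https://example.org/preprints"  # placeholder
    ET.SubElement(channel, "description").text = (
        "Preprints filtered by early Twitter attention"
    )
    for item_title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")

# Placeholder item; real entries would come from the filtered corpus.
print(build_rss([("Example preprint", "https://doi.org/10.1101/XXXXXX")]))
```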
In summary, a list of biomedical preprints that have received a minimal number of tweets in the days after posting, broken down by subject area, is a good initial filtering strategy for identifying relevant preprints immediately after they have been posted. Interested researchers can then access the filtered corpus via a newsletter.
Existing efforts that surface relevant preprints right after their posting
A few examples:
- PreprintBot – new this year, “a bot that tweets preprints and comments from BioRxiv and MedRxiv”
- PromPreprint – this has been running for a while; “A bot tweeting @biorxivpreprint publications reaching the top 10% Altmetric score within their first month after publication”
- http://arxiv-sanity.com/toptwtr – this started as a new way to list all arXiv preprints and later added social media data (the toptwtr view)
- https://scirate.com – a free and open access scientific collaboration network that allows users to follow arXiv.org categories and see the highest ranked new papers
- https://rxivist.org – a free and open website that enables users to identify preprints from bioRxiv and medRxiv based on download count or mentions on Twitter. One can, for example, list the most tweeted preprints of the last 7 days – though these may have been posted at any point since the servers began.
Our strategy for filtering life science preprints builds on these existing efforts but picks up only those preprints posted in the past week that have received tweets, and proposes a newsletter as the primary communication channel (see the sketch of the core weekly filter below). We propose to run this newsletter as a community experiment, iterating on the implementation based on researcher feedback about how well the newsletter addresses information overload. Other considerations: can we focus more on who is tweeting rather than the number of tweets, or should we add an element of human curation? Can we filter life science preprints from additional servers?
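To make the proposed pipeline concrete, here is a minimal sketch under stated assumptions: it uses filters of the public Crossref REST API (type:posted-content, prefix:10.1101, from-posted-date) to fetch preprints posted in the past week, reuses the tweet_counts function from the Event Data sketch above, and applies an illustrative threshold of at least one tweet, to be tuned with reader feedback:

```python
# Minimal sketch of the proposed weekly filter, not a finished service.
# Filter names follow the public Crossref works API; the ">= 1 tweet"
# threshold is an assumption to be tuned with reader feedback.
from datetime import date, timedelta
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"

def recent_preprints(days=7, rows=200):
    """Fetch bioRxiv/medRxiv preprints posted in the last `days` days."""
    since = (date.today() - timedelta(days=days)).isoformat()
    params = {
        "filter": f"type:posted-content,prefix:10.1101,from-posted-date:{since}",
        "rows": rows,
    }
    resp = requests.get(CROSSREF_WORKS, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()["message"]["items"]

# tweet_counts() is the function from the Event Data sketch above.
today = date.today()
tweets = tweet_counts((today - timedelta(days=7)).isoformat(), today.isoformat())

picks = [w for w in recent_preprints()
         if tweets.get(f"https://doi.org/{w['DOI']}", 0) >= 1]
for work in picks:
    print(work["DOI"], (work.get("title") or ["(no title)"])[0])
```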
Call to action
If you want to help tackle the information overload problem in the life sciences, leave a comment below or DM us. If enough folk are interested in working with us, we could form a community group under the auspices of ASAPbio to work on information overload.