Free database of 107 million research papers released online

A database of billions of short phrases culled from 107 million publications might help researchers glean new knowledge.

More than 107 million science papers have just been cataloged for the public’s use thanks to a new project called The General Index. 

Typically, academic studies exist behind a paywall — locking up potentially important information not only from the public but, perhaps more importantly, from other scientists. 

The General Index wants to set that information free. The index acts almost like a Google search for scientific papers, but with a twist. Only snippets of the papers are provided, so it is up to users to mine the data and make sense out of it all. 

The 38-terabyte database contains a collection of text snippets, each one to five words long, extracted from 107 million journal articles.
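To make the idea of one-to-five-word snippets concrete, here is a minimal Python sketch of how such phrases could be pulled from a piece of text. It is an illustration only, not the method The General Index actually used: the function name `extract_ngrams`, the simple lowercase tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def extract_ngrams(text, max_n=5):
    """Count every run of 1..max_n consecutive words in text.

    Assumption: words are lowercase runs of letters and digits, and
    sentence boundaries are ignored. A real pipeline may differ.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical sample text, invented for illustration.
sample = "Plant species produce volatile compounds. Plant species vary."
ngrams = extract_ngrams(sample)
print(ngrams["plant species"])
```

Even this eight-word sample yields 30 phrase occurrences, which hints at how 107 million articles can add up to hundreds of billions of entries.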

Why it matters: While the general public doesn’t often read academic papers, scientists do. They want to read the works of other scholars because building off previous scientific research can help them with their own work.

“There is no way for me – or anyone else – to experimentally analyze or measure the chemical fingerprint of each and every plant species on Earth,” Gitanjali Yadav, a University of Cambridge biologist, who did not work on The General Index, told Nature. “Much of the information we seek already exists, in published literature.”

Carl Malamud, an archivist and creator of The General Index, says that’s what he hopes this project will be used for. 

“This is a lookup tool, a dictionary of knowledge, a map to knowledge, a tool that we believe is a central facility to the practice of science in our modern age,” Malamud said in a video about the index. “We view this as a public utility. We assert no ownership over the general index. It is dedicated to the public domain. A series of unencumbered facts with which you can do what you will. There are no rights reserved.”

What it is: We already have Google Scholar, a search engine that combs through scholarly literature to find the most relevant match to a search term. But The General Index isn’t a search engine. Instead, it is a carefully cataloged and organized collection of words and phrases drawn from the scientific literature.

In total, over 355 billion phrases and words, listed next to their corresponding articles, appear in the index, reports Nature. 

How it works: The purpose of the index is to help with text mining — discovering new information by scanning a ton of text, looking for patterns and trends. Humans have become pretty good at scanning headlines, tweets, and short bits of text. But we couldn’t possibly scan millions of articles, note critical bits of information, cross-reference, and connect the dots. 

The database can help us do the work. 

And, because the database only uses brief morsels extracted from each scientific paper and not the article itself, it is free to use and download without copyright restrictions.

Extracting any useful information from tiny snippets, or n-grams (short sequences of words from each paper), is difficult for the average human. We can’t simply read the paper in its entirety, understanding each sentence within the context of the whole. So, to make sense of it all, scientists will need to use software, and perhaps write their own code, to mine the data, recognize patterns, and use statistics or machine learning to glean any helpful information.
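As a rough picture of what that mining code might look like, the Python sketch below groups phrases by the articles they appear in and then finds articles that mention two phrases together. Everything in it is hypothetical: the `(phrase, article_id)` row layout, the article identifiers, and the helper names are assumptions for illustration, and the real index files may be organized differently.

```python
from collections import defaultdict

# Hypothetical rows pairing a phrase with the article it came from.
# The real index layout and identifiers may differ.
rows = [
    ("limonene", "paper-1"),
    ("citrus sinensis", "paper-1"),
    ("limonene", "paper-2"),
    ("mentha piperita", "paper-2"),
    ("limonene", "paper-3"),
    ("citrus sinensis", "paper-3"),
    ("menthol", "paper-2"),
]

def build_index(rows):
    """Map each phrase to the set of articles that contain it."""
    index = defaultdict(set)
    for phrase, article_id in rows:
        index[phrase].add(article_id)
    return index

def articles_mentioning_all(index, phrases):
    """Return the articles that contain every phrase in the list."""
    sets = [index.get(p, set()) for p in phrases]
    return set.intersection(*sets) if sets else set()

index = build_index(rows)
shared = articles_mentioning_all(index, ["limonene", "citrus sinensis"])
print(sorted(shared))
```

A researcher could use a lookup like this to shortlist papers that link a compound to a plant species, then turn to statistics or machine learning for anything beyond simple co-occurrence.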

The data may be downloaded straight from archive.org, though that is time-consuming. People on the /r/DataHoarder subreddit are also uploading it to a remote server and distributing it through BitTorrent, reports Vice.

“Science is a language we must all speak if we are to better our world,” Malamud said.

It is cool, but is it legal? The General Index arrives amid ongoing legal battles between publishers and Sci-Hub. The controversial portal is a pirate website that gives free access to millions of scientific research papers that are otherwise protected by copyright. Several publishers have filed a case against Sci-Hub, citing copyright infringement.

Without condoning Sci-Hub, many scientists say that publicly funded research should be freely available to the public. Doing so would help scientific knowledge continue to advance. Alexandra Elbakyan, the founder of Sci-Hub, was even named by Nature as one of the ten people in science who matter most.

But Malamud, the creator of The General Index, told Nature that his database is different — he says it is 100% legal because he doesn’t release the complete paper, only snippets.  

“I am very confident that what I’m doing is legal. We are not doing this to provoke a lawsuit, we are doing it to advance science,” he said.
