5 Sept 2010

What is the scientific paper? 4: Access

This is a guest post by Joe Dunckley
Completing the series exploring the question "what is the scientific paper?", reposted from my old blog, and originally written following Science Online 2009. As I reminded people at the time, these were just my own half-thought-through ideas, not the policy or manifesto of anyone or anything I'm affiliated with.
A friend of mine once told me how much she hated "the proliferation of these bioinformatics papers." All these simulations and models of what happens in real life. All of it utterly useless -- since when was the stuff that comes out of a computer worth anything? None of it even remotely reflects anything that happens in real life. And the methodology papers -- the endless methodology papers. They're making yet another neural network and modifying a Bayesian something-or-other, when they haven't even found where they left the Markov models yet! How can you have so many of these methodology papers? Clearly they can be no more than incremental advances. (Of course, BLAST is an exception -- it's old enough to have been around and heard of when we were undergrads, and is therefore a perfectly legitimate and mainstream molecular biology tool.)
Similarly, some people still voice their skepticism about the need for open access. Access isn't really a problem, is it? These open access advocates are just making facile arguments about how the people who pay for scientific research should have some kind of say regarding its dissemination.[1] Come on, really, show me, who is in want of access? Everyone (everyone who matters) already has subscriptions, right? Access isn't a problem. And the open access "movement" isn't an ideology. It's just another business model.
And then, yesterday afternoon m'colleague shouted for advice handling an author of a scientific manuscript who was questioning the need to deposit her not inextensive collection of genomes in a database. I don't blame the author for wanting to get out of the chore -- she has a lot of data, and depositing it will be a dull repetitive task. M'colleague was trying to write a letter and struggling to put into words the reason why we mandate deposition of sequence data, and why merely including them as supplementary MS Word files isn't good enough.
These attitudes, you will have noticed, have one particular thing in common: they all completely miss the fact that the biomedical sciences have moved on in the past quarter century. In almost every field (let's not wake the poor taxonomists) the science being done and the science being published today are not quite like that of 25 years ago. Even if the science of today were like that of 25 years ago the case for open data sharing would be strong enough; as it is, it's simply absurd to think that open sharing of data isn't worth doing.
--
Individual scientific papers -- the basic units of scientific research -- are rarely exciting; rarely even interesting. Where nerds get excited about science, it's where science offers a beautiful explanation for how the world works. And scientific papers don't do that. They offer some speculative interpretations of data on obscure problems in obscure systems. It is the literature as a whole -- hundreds of dull papers put together -- which tells a complete and exciting story. The sum is more than the parts -- the theory is more than the data.
In the field I know best -- cancer cell biology -- 99 in 100 papers published are tedious details, discovered with a science-by-numbers formula. The (anti-)proliferative effect of one abbreviation interacting with another abbreviation in three-letter-acronym-and-a-number cells, concluding with a suggestion that the authors' work might have implications for cancer treatment and a note that further work is necessary. Or even better, the complete lack of anything interesting at all happening when the first abbreviation interacts with the second. The abbreviations and their effects have been studied, in combination with others, in all of the most widely used three-letter-acronym-and-a-number cell-types, and somebody is scraping the barrel.
But the tedious details put together add up to an understanding of how the cell works and how it goes wrong. The details could be put together by a human, going through the thousands of papers on the topic, assembling the facts and finding the trends. Or, more plausibly, given the amount of tedious details out there, they could be assembled by a computer, with a database and a clever algorithm. Except that four in every five of those tedious details, discovered at great expense to taxpayers, will be inaccessible to that clever algorithm. They will be locked away in the basements of university libraries, hidden in human-readable prose that humans will never read. The results of billions of pounds of work searching for an understanding of cancer and a better chance at defeating it will be worthless, because they will never be amongst the parts that add up to the greater whole.
So I told m'colleague to explain to her author that unless she deposits her genome sequences, the last three years of her professional life will ultimately have been wasted. An average paper in a high-volume mid-tier journal that will be glanced at by a few colleagues when published. Another bullet point on a CV. They will never further science beyond that. They won't contribute any important discovery or real advance to the field. They will be forgotten. Nobody will seek them out when the time comes to make the leap forward.
That's just where biology is at these days: lots of tiny fragments of data, spread thin through the literature. The most interesting and important unanswered questions will require the synthesis of that work. The most interesting and important questions can't be answered without the heap of data that has already been produced, but which is locked away.
On machine-readable data, Mike Ellis says, "at some point in the future, you'll want to do 'something else' with your content. Right now you have no idea whatsoever what that something else might be." This is especially true in science: at some point in the future, tedious data obtained at great expense, as part of the bigger picture, will finally be important and valuable. Right now, you can have no idea how important.
Publishers are allowed to get away with keeping science closed, holding it back, and wasting public money because there are still sufficient numbers of scientists who let them -- who have themselves failed to grasp that the world and science have changed.
