The Dreaded Nat makes two good points about open access publishing. First (boldface mine):
So what does it all mean for the average layperson? Precious little, I’m afraid. The Wellcome Trust’s move towards open access is highly unlikely to lead to a significant increase in readership for the research they fund. After all, as someone pointed out in the comments section of one of the aforementioned Guardian articles: Cancer Research UK has had a similar policy for quite some time. Has this led to an increase in public perception of the fact that much of what they fund is bland repetition or mere vivisection for vivisection’s sake, in a world where such ‘research’ has been rendered all but obsolete by numerical modelling? Is the average person better able to make a properly informed decision when they have a collecting tin thrust into their face outside a supermarket? Of course not. Most people are entirely ignorant of whether or not research articles are freely available to them because they have absolutely no interest in reading them. If there was any real demand for this kind of material, it would be readily available via bittorrent.

As Brant Moscovitch points out, the key problem is not one of public access to research findings in their current form, but one of making research findings accessible to the non-specialist. Given the appalling standard of British journalism these days, it may be high time academics began to overcome their traditional reticence on this matter, if they’re at all interested in reaching a wider audience. Unfortunately, the Fred Pearces of the world are few and far between. However, relatively few researchers will be sufficiently motivated to meet this challenge; even fewer will have the necessary aptitude to succeed. Anyone who has been involved in large interdisciplinary research projects will have witnessed the total inability or sheer lack of will many scientists manifest, when it comes to making their research more readily understood by those from other fields, even when the potential benefits (i.e., increased impact in real terms) are clear.
Besides a general lack of interest, there’s also an issue of verification that open access doesn’t really solve (boldface mine):
For this writer, the price of journal subscriptions is a peripheral issue. However it is resolved, green or gold, will merely serve to paper over the cracks in the current system for a little while longer. The fundamental shortcomings of the traditional peer-review process, in the modern context of sheer volume of papers produced and proliferation of journal titles, is an issue that will continue to rear its head. Recent articles by Carl Zimmer in the New York Times and Marcus Oransky in the Boston Globe highlight the shocking increase in retractions in recent years, as a symptom indicative of a moribund system.

The root of the problem goes deeper than the ‘publish or perish’ culture, which has given rise to the system’s failure to sort wheat from chaff adequately. Bill Mitchell’s illustration of the way in which the dominance of the (degenerative) orthodox paradigm in economics is reinforced has wider relevance. Many who have had well-produced findings with genuine implications rejected outright by Nature, only to marvel at some of the (increasingly likely later to be retracted) chaff that did manage to be included, must suspect that there are dark forces at work.

The fact that the ‘Academic Spring’ coverage in The Guardian overshadowed Brian Deer’s piece, which highlighted a more crucial issue, was a genuine shame. The financial interests of those involved in all stages of the peer-review process may be critical to understanding its current lack of rigour in many scientific fields. The anonymity of reviewers, together with the culture that perpetuates it, represents a serious problem for any potential resolution. Given that journals and peer-review are international in nature, attempts at regulation on a national level would appear doomed to have little (if any) effect.
Open data wouldn’t solve all of these problems (what would?), but it would go a long way.
First, it would help immensely with the verification issue. If the underlying data were released with publication, people could still commit fraud, but it would be a lot harder: if nothing else, others could analyze the data for inconsistencies. This isn’t a bizarre idea; in computer science, code is usually scrutinized as part of review.
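To make the "analyze the data for inconsistencies" point concrete, here’s a toy sketch (every number in it is invented): if the raw measurements ship with the paper, anyone can recompute the summary statistics reported in the text and flag anything that doesn’t match.

```python
# Hypothetical example: re-deriving a paper's reported statistics from
# the raw data released alongside it. All values here are made up.
from statistics import mean, stdev

# Pretend this is the spreadsheet released with the paper.
raw_measurements = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]

# And pretend these are the values printed in the paper's Table 1.
reported = {"mean": 4.05, "sd": 0.16}

# Recompute the same statistics independently from the raw data.
recomputed = {"mean": round(mean(raw_measurements), 2),
              "sd": round(stdev(raw_measurements), 2)}

for stat, claimed in reported.items():
    if abs(claimed - recomputed[stat]) > 0.005:
        print(f"{stat}: reported {claimed}, recomputed {recomputed[stat]} -- worth a closer look")
    else:
        print(f"{stat}: consistent ({claimed})")
```

Trivial as it looks, this kind of recomputation is impossible when all you have is the PDF, and it’s exactly the sort of low-effort scrutiny that makes fabrication riskier.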
Second, it would get more people outside the field interested in the work. Admittedly, most people still wouldn’t look at it (sorry to burst your bubble), but more people would, especially those with analytical and statistical skills from outside your discipline. To give an example: regular readers will know this blog often discusses politics (as the kids used to say, duh). One of my perpetual frustrations with articles that discuss polling results is that there’s no access to the raw data. If you’re willing to subscribe to certain polling services, you can gain access to detailed reports, but even then, you don’t get the raw data. Personally, I’m even less interested in a detailed report; I want the raw data so I can use my superpower of multivariate statistics for good or evil (depending on my mood). I become much more interested in polling reports when I get to play with the data. And who knows? Maybe I’ll contribute something useful to boot.
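Here’s a minimal sketch of why respondent-level data beats a topline report (the respondents, fields, and numbers are all hypothetical): with the raw records, you can cut support rates by whatever variable interests you, not just the breakdowns the pollster chose to publish.

```python
# Hypothetical respondent-level polling records -- the kind of raw data
# that polling reports summarize but almost never release.
from collections import Counter

respondents = [
    {"age_group": "18-29", "region": "urban", "supports": True},
    {"age_group": "18-29", "region": "rural", "supports": False},
    {"age_group": "65+",   "region": "urban", "supports": False},
    {"age_group": "65+",   "region": "rural", "supports": False},
    {"age_group": "18-29", "region": "urban", "supports": True},
    {"age_group": "65+",   "region": "urban", "supports": True},
]

def support_by(field):
    """Support rate within each level of `field`: a crosstab the
    published topline number hides."""
    totals, yes = Counter(), Counter()
    for r in respondents:
        totals[r[field]] += 1
        yes[r[field]] += r["supports"]
    return {level: yes[level] / totals[level] for level in totals}

print(support_by("age_group"))
print(support_by("region"))
```

With a report, you get the one crosstab the authors thought mattered; with the data, `support_by` works for any field in the file, and nothing stops you from fitting a proper multivariate model instead.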
For those who think data release is unrealistic, it already happens as a matter of course in genomics. Not only are genomic assemblies (the DNA sequence) available at GenBank, the annotated genomes (genomes with identified genetic features) are also publicly available. Moreover, the raw data are available in the Short Read Archive, so anyone could rebuild the genome from scratch if desired. Data upload and download isn’t really an issue: these are large files, and they can still be accessed quickly. And let’s be honest: a lot of research in biology involves much smaller datasets (not that there’s anything wrong with that; many types of data are expensive and time-consuming to collect). There’s no reason why a simple spreadsheet couldn’t be released along with publication.
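For the "simple spreadsheet" case, here’s one way it might look (filename, column names, and data are all invented for illustration): dump the dataset to CSV and publish a checksum with the paper, so readers can verify they’re analyzing exactly the file the authors did.

```python
# Sketch: release a small dataset as a CSV supplement, plus a SHA-256
# digest for the paper's data-availability statement. Everything here
# (filename, columns, values) is hypothetical.
import csv
import hashlib
import io

rows = [("sample_id", "treatment", "colony_count"),
        ("S1", "control", 41),
        ("S2", "drug_A", 17),
        ("S3", "drug_A", 12)]

# Build the CSV in memory so we can hash exactly what gets written.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
data = buf.getvalue().encode("utf-8")

with open("supplementary_data.csv", "wb") as fh:
    fh.write(data)

# This digest goes in the paper; readers recompute it on download.
print("sha256:", hashlib.sha256(data).hexdigest())
```

That’s the whole pipeline for a small study: a few lines of code, a file measured in kilobytes, and no excuse left for keeping the numbers locked in a PDF table.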
This won’t work for every field, and there will be some complications (e.g., de-identifying clinical metadata: we don’t want to reveal a subject’s HIV status). Open data isn’t perfect, but alongside open access, it’s something we should also be fighting for.