Saturday, September 8, 2012
There have been a number of recent posts in the blogosphere about the perceived poor quality of many computer programs in bioinformatics. Basically, many bioinformaticians aren't taking seriously the need to properly engineer their software, with full documentation, standard development practices, and version control.
I thought that I might draw your attention to a few of the posts here, for those of you who write code. Most of the posts have a long series of comments, which are themselves worth reading, along with the original post.
At the Byte Size Biology blog, Iddo Friedberg discusses the nature of disposable programs in research, which are written for one specific purpose and then effectively thrown away:
Can we make accountable research software?
Such programs are "not done with the purpose of being robust, or reusable, or long-lived in development and versioning repositories." I have much sympathy for this point of view, since all of my own programs are of this throw-away sort.
However, Deepak Singh, at the Business|Bytes|Genes|Molecules blog, fails to see much point in this sort of programming:
He argues that disposable code creates a "technical debt" from which the programmer will not recover.
Titus Brown, at the Living in an Ivory Basement blog, extends the discussion by considering the consequences of publishing (or not publishing) this sort of code:
He considers that failure to properly document and release computer code makes the work anecdotal bioinformatics rather than computational science. He laments the pressure to publish code that is not yet ready for prime-time, and the fact that computational work is treated as secondary to the experimental work. Having myself encountered this latter attitude from experimental biologists (the experiment gets two pages of description and the data analysis gets two lines), I entirely agree. Titus concludes with this telling comment: "I would never recommend a bioinformatics analysis position to anyone — it leads to computational science driven by biologists, which is often something we call 'bad science'." Indeed, indeed.
Back at the Business|Bytes|Genes|Molecules blog, Deepak Singh also agrees that "a lot of computational science, at least in the life sciences, is very anecdotal and suffers from a lack of computational rigor, and there is an opaqueness that makes science difficult to reproduce":
Titus has a point
This leads to Iddo Friedberg's post (back at the Byte Size Biology blog) introducing a related initiative:
The Bioinformatics Testing Consortium
This group intends to act as testers for bioinformatics software, providing a means to validate the quality of the code. This is a good, if somewhat ambitious, idea.
Finally, Dave Lunt, at the EvoPhylo blog, takes this to the next step, by considering the direct effect on the reproducibility of scientific research:
Reproducible Research in Phylogenetics
He notes that bioinformatics workflows are often complex, using pipelines to tie together a wide range of programs. This makes the data analysis difficult to reproduce if it needs to be done manually. Hence, he champions "pipelines, workflows and/or script-based automation", with the code made available as part of the Methods section of publications.
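To illustrate the idea (this is a hypothetical sketch, not code from any of the posts above), a script-based pipeline can be as simple as composing the analysis steps into a single function that records what was run; the step names below are toy stand-ins for real programs such as an aligner or a tree builder:

```python
# A minimal sketch of a script-based pipeline. All step names are
# hypothetical stand-ins for real bioinformatics programs.

def make_pipeline(*steps):
    """Compose named analysis steps into one runnable function.

    Each step is a (name, function) pair; the returned function applies
    them in order and records the names, giving a simple audit trail.
    """
    def run(data):
        log = []
        for name, step in steps:
            data = step(data)
            log.append(name)
        return data, log
    return run

# Toy stand-in steps operating on a list of sequence strings.
trim   = ("trim",   lambda seqs: [s[:4] for s in seqs])   # truncate reads
dedupe = ("dedupe", lambda seqs: sorted(set(seqs)))       # drop duplicates

pipeline = make_pipeline(trim, dedupe)
result, log = pipeline(["ACGTT", "ACGTA", "ACGTT"])
# result is ["ACGT"]; log is ["trim", "dedupe"]
```

The same pattern scales up to real tools invoked from the script; the point is that the entire analysis, including the order of the steps, lives in one re-runnable file that can accompany the Methods section, rather than in someone's shell history.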