Drowning in data

Within 18 months of starting my last postdoc position (based at Melbourne University), I had generated well over 50 terabytes (TB) of data.

Files were jammed into folders wherever I could squeeze them. Storage space was so tight that my data were essentially shoved under the beds of other researchers and into their wardrobes. I imagined it was inevitable that one day a colleague would unknowingly go searching for something, open a virtual door and all my data would spill out onto the floor. I would be forced to confess that I’d been hogging their storage space for some time.

An analogue imagining of my digital data difficulties (Anselm Kiefer’s stack of lead books, Museum of Old and New Art, Hobart)

My data storage issues haven’t eased. I’ve moved to another university, but that just means another set of wardrobes to fill up. I’m not the only one being overwhelmed by an avalanche of research data. Recent technological change is enabling the generation of ever-increasing amounts of research data, which presents huge problems for data management and preservation.

Until recently, I was blissfully ignorant of any necessity to manage and preserve data. When I finished my PhD and moved on to Melbourne University, I neatly packed up my physical samples (bits of Indonesian stalagmites), labelled them and patted myself on the back. So organised, well done Lewis!

Meanwhile, I had one last research paper to write. My computer exploded (literally) just after I finished my draft and emailed it to my co-authors. In my eyes, it was perfect timing. I could put my PhD behind me and start afresh.

In the first year of my postdoc, I was “volunteered” to sit on a data preservation and archiving advisory group at my university. I have since discovered that data generated as part of a funded research project that leads to a publication must be preserved for a minimum of five years, if not longer. Luckily, I have reasonably comprehensive backups of my old work, but it turns out that my computer’s explosion was not a fortuitous occurrence after all.

And what if every researcher at my university produces 50 TB of data a year, successfully publishes research papers and needs their data archived for at least five years? Grappling with the explosion of research data is a huge challenge for universities, particularly when we consider the added complications of new publication models, such as open access publication.
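
The numbers get big quickly. Here is a back-of-envelope sketch in Python, using entirely hypothetical figures (the researcher headcount is my own assumption, not a real university statistic):

researchers = 500        # hypothetical number of data-intensive researchers
tb_per_year = 50         # TB generated per researcher per year
retention_years = 5      # minimum archive period for published work

total_tb = researchers * tb_per_year * retention_years
print(f"Steady-state archive: {total_tb:,} TB (~{total_tb // 1000} PB)")
# -> Steady-state archive: 125,000 TB (~125 PB)

Even with these made-up numbers, the archiving obligation lands in petabyte territory, which is exactly why this is an institutional problem and not just an individual one.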

Not only do we need to deal with mountains of data, we also have to consider various policy and ethical constraints. For example, how do we deal with sensitive data generated in medical research? Or what about data that are considered invaluable and must be kept in perpetuity?

A recent Nature article gives an example of data management that neatly highlights that our data mountain presents not just challenges, but also opportunities for the savvy.

University libraries are reinventing themselves and becoming more active partners in research, rather than simply repositories of information. At Johns Hopkins University, the director of digital research and curation installed a huge data visualisation wall in the library. Students and researchers can explore some of the university’s research data on its television screens. Research products range from illustrated medieval manuscripts to images from the Hubble Space Telescope.

Many researchers are too busy to feel they can invest in data management. Or, like me rejoicing in my computer’s explosion, they lack the knowledge to manage their data adequately. Meanwhile, libraries are increasingly filling this gap in data management expertise.

The Nature article notes that the recent shift in focus to digital data management isn’t necessarily a huge leap for libraries. They have always had the capacity to organise information, preserve it and make it available to researchers.

I hope researchers can also embrace the opportunities that come with our vast data management challenge. Generating new data can be expensive, both in terms of time and money, and there is the potential to get more from our data by mining what already exists. But for us to use our data more effectively, proper management and preservation are essential.

Whilst my old university works on a data management strategy for the coming years, I am crossing my fingers and hoping that my files, now hundreds of terabytes of them, don’t explode out of a supercomputer anytime soon.

I’m also hoping that I can learn a little from my PhD and postdoc data mistakes and begin my new postdoc at ANU with a little more awareness of good data management practices.
