How to organize a PhD when buried under a mountain of data

I will preface this by saying I am not an organized person. If you need proof, just look below at the picture of my desk.

Research projects are inevitable in life: their topics range from planning a trip or event to writing a PhD. At least for me, one of the hardest things about doing research is staying organized. But more on that later.

My desk

What is data and how can it become a mountain? Data is defined by the Oxford English Dictionary as “Facts and statistics collected together for reference or analysis.” Nowadays data has infiltrated every aspect of our lives. One of the primary tasks during my PhD has been to identify how microorganisms use basaltic rock as a substrate. To do this I have collected tomography data at a variety of scales, from data sets that resolve features tens of micrometers across to data sets that can be used to observe features larger than 500 nanometers. Now that it is collected I have to analyse it all. As Pavel has mentioned in earlier posts, tomography datasets are made up of thousands of individual files that together can be used to create a 3D rendering of the object that was scanned.

It is because of this that I have ended up with a mountain of data to climb. The computer on my desk in the image above has 8 TB of storage. Next to my desk is a server which has a capacity of ~65 TB, and scattered around my office and apartment are more than 15 portable hard drives, each with a capacity of at least 3 TB. At last look, I have over 40 TB of primary data, all of which must be stored in duplicate. Most of these data will balloon to 3 times their original file sizes during the analysis process.
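To put those numbers side by side, here is a rough back-of-envelope sketch in Python. It uses only the figures quoted above, all of which are approximate or lower bounds. The function names are made up for illustration, and the "everything balloons at once" case is an assumption, since only most of the data grows and not necessarily all at the same time.

```python
# Figures quoted in the post; all are approximate or lower bounds.
DESKTOP_TB = 8
SERVER_TB = 65           # "~65 TB"
PORTABLE_DRIVES = 15     # "more than 15"
PORTABLE_DRIVE_TB = 3    # "at least 3 TB" each
PRIMARY_TB = 40          # "over 40 TB"


def total_capacity_tb():
    """Lower-bound capacity across desktop, server, and portable drives."""
    return DESKTOP_TB + SERVER_TB + PORTABLE_DRIVES * PORTABLE_DRIVE_TB


def duplicated_primary_tb(copies=2):
    """Space needed to hold the primary data in duplicate."""
    return PRIMARY_TB * copies


def peak_analysis_tb(growth_factor=3):
    """Assumption: every dataset grows by the full factor at the same time."""
    return PRIMARY_TB * growth_factor


print("capacity (TB):", total_capacity_tb())
print("primary data in duplicate (TB):", duplicated_primary_tb())
print("all data mid-analysis (TB):", peak_analysis_tb())
```

Under these assumptions the duplicated primary data alone takes up about two thirds of the available capacity, which is why the analysis has to be done a few datasets at a time.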

Datasets of this size are nothing new, and an entire field, Big Data, is dedicated to figuring out how to analyse, store, and manage such data sets. Organizing and managing these kinds of data is not very different from organizing any data or primary research you might conduct during a PhD project, an MSc project, or everyday life. The only difference here is magnitude.

I started my PhD over 2.5 years ago, and I went in naively thinking that setting up some folders to save things in an organized fashion would be enough. Little did I know that I would end up with so much data, and ultimately I have had to devise a system for managing it all on the fly. I would not recommend that. It makes things very confusing and is rather unhelpful.

When managing personal datasets and personal research there is no best method, so to speak. The best organization system is one that gets used and one that works for an individual. Note: this is not true for widely used datasets, where versioning, a robust naming method, and consistent organization are key. That said, there are a few things that I have found make life much easier. Choose a method and stick with it. For example, if you start by putting the date in every file name so you know when the file was originally created, then you should continue with that.
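As an illustration only (this is not the exact system described here), a date-in-the-file-name convention could be sketched in Python as follows. The helper name `dated_name` and the choice of an ISO-style `YYYY-MM-DD` prefix are assumptions made for the example. The advantage of that particular prefix is that sorting the names alphabetically also sorts them by date.

```python
from datetime import date
from pathlib import Path


def dated_name(original: str, created: date) -> str:
    """Prefix a file name with its creation date as YYYY-MM-DD.

    Hypothetical helper: only the file name is changed, any folder
    part of `original` is dropped.
    """
    return f"{created.isoformat()}_{Path(original).name}"


# Example file names are invented for illustration.
names = [
    dated_name("scan_notes.txt", date(2019, 3, 5)),
    dated_name("scan_notes.txt", date(2018, 11, 20)),
]

# Alphabetical order is also chronological order with this prefix.
for name in sorted(names):
    print(name)
```

Whatever convention is picked, the point from the paragraph above still holds: it only helps if every file follows it.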

Personally, for everyday work and everyday analyses I have a panoply of folders that are split up into categories, as you can see in the image below. I also store everything in a paid Dropbox account (not an advertisement, I just love the service) so that all the files are automatically stored in the cloud as well, and very basic versioning is performed. This works passably well for me, but may not work for everyone.

File organization tree

So why does this matter for anyone who is not doing a big academic research project? Everyone has research projects, even if they do not necessarily think of them in that way. Where do I want to go on vacation? Where do I want to host a party? What is the best restaurant in my price range in my city? These are all questions which can be researched in everyday life. There are many ways to do so. A fair number of people like to take the approach of flying by the seat of their pants, while others will create detailed dossiers of their options. Those who take a lackadaisical approach may have once found the perfect restaurant, but cannot remember where it was or how they found it. They then end up not being able to return (I do this all the time). Alternatively, some may compile documents with tens of vacation options only to decide that they are not going this year. Finding a method of organizing files, data, etc. that works for you can streamline your entire research process. I know it certainly worked that way for me.

Life as a (semi-) nomadic early career scientist

One of the great things about being a geoscientist is that travel is often an integral part of your research and work. Geoscientists work in the field, we go to conferences and short courses all over the world, and some of us even move countries for our jobs. This often means being thrown head first into a new country and culture. An early career scientist (ECS) is someone who is very early into their scientific career, for example all of the regular authors at the SeaRocks blog. While the exact definition of who qualifies as an ECS varies, there is nearly always one consistency: an ECS’s life, such as mine, is often filled with uncertainty about what, and where, is next. A PhD is a fixed term contract. There are no guarantees that your next position, be it a post-doc or a job in industry, will be in your current city, or even on the same continent.

What are microbes, and wait, they are found in rocks?

A very (in)famous mathematician, Dr. Ian Malcolm of Jurassic Park, once said five very insightful and philosophical words: “Life, uh… finds a way.”

Dr. Ian Malcolm in Jurassic Park

While he was referring to breeding dinosaurs, which films have taught us is not a good idea, he was nevertheless correct in a different context. Life very often finds a way to exist in unexpected places and ways, and often that life is microorganisms.