How to organize a PhD when buried under a mountain of data

I will preface this by saying I am not an organized person. If you need proof, just look at the picture of my desk below.

Research projects are inevitable in life: their topics range from planning a trip or event to writing a PhD. At least for me, one of the hardest things about any research project is staying organized. But more on that later.

My desk

What is data and how can it become a mountain? Data is defined by the Oxford English Dictionary as “Facts and statistics collected together for reference or analysis.” Nowadays data has infiltrated every aspect of our lives. One of the primary tasks during my PhD has been to identify how microorganisms use basaltic rock as a substrate. To do this I have collected tomography data at a variety of scales (from data sets that resolve features tens of micrometers across to data sets that can be used to observe features larger than 500 nanometers). Now that it is collected, I have to analyse it all. As Pavel has mentioned in earlier posts, tomography datasets consist of thousands of individual files that together can be used to create a 3D rendering of the object that was scanned.

It is because of this that I have ended up with a mountain of data to climb. The computer on my desk in the image above has 8 TB of storage. Next to my desk is a server with a capacity of ~65 TB, and scattered around my office and apartment are more than 15 portable hard drives, each with a capacity of at least 3 TB. At last look, I had over 40 TB of primary data, all of which must be stored in duplicate. Most of these data will balloon to 3 times their original file sizes during the analysis process.
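To put those numbers side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above and treats "over", "more than" and "at least" as lower bounds. It also assumes, which the text above does not state, that every dataset triples during analysis and that the analysis output is kept as a single copy. The function names are made up for illustration.

```python
# Figures quoted in the post, treated as lower bounds.
PRIMARY_TB = 40        # "over 40 TB of primary data"
COPIES = 2             # "stored in duplicate"
GROWTH_FACTOR = 3      # "balloon to 3 times their original file sizes"
DESKTOP_TB = 8         # desktop computer
SERVER_TB = 65         # "~65 TB" server
PORTABLE_DRIVES = 15   # "more than 15" portable drives
PORTABLE_TB_EACH = 3   # "at least 3 TB" each


def capacity_tb():
    """Lower bound on total storage across desktop, server and portable drives."""
    return DESKTOP_TB + SERVER_TB + PORTABLE_DRIVES * PORTABLE_TB_EACH


def primary_footprint_tb():
    """Space taken by the primary data once it is stored in duplicate."""
    return PRIMARY_TB * COPIES


def analysis_footprint_tb():
    """Upper-end guess for analysis output.

    Assumption (not from the post): every dataset triples and the
    output is kept as a single copy.
    """
    return PRIMARY_TB * GROWTH_FACTOR


if __name__ == "__main__":
    print(f"capacity (lower bound): {capacity_tb()} TB")
    print(f"primary data, duplicated: {primary_footprint_tb()} TB")
    print(f"analysis output (assumed): {analysis_footprint_tb()} TB")
```

Changing the constants at the top is enough to rerun the estimate for a different project.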

Datasets of this size are nothing new, and an entire field, Big Data, is dedicated to figuring out how to analyse, store, and manage such data sets. Organizing and managing these kinds of data is not very different from organizing any data or primary research you might conduct during a PhD project, an MSc project, or everyday life. The only difference here is magnitude.

I started my PhD over 2.5 years ago, and I went in naively thinking that setting up some folders to save things in an organized fashion would be enough. Little did I know that I would end up with so much data, and ultimately I have had to devise a system of managing it all on the fly. I would not recommend that. It makes things very confusing and is rather unhelpful.

When managing personal datasets and personal research there is no best method, so to speak. The best organization system is one that gets used and one that works for an individual. Note: this is not true for widely used datasets, where versioning, a robust naming method, and consistent organization are key. That said, there are a few things that I have found make life much easier. Choose a method and stick with it. For example, if you start by putting the date in every file name so you know when the file was originally created, then you should continue with that.
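As a minimal sketch of what sticking with a date-in-the-name convention could look like in Python: one helper builds the name, and another checks whether an existing file already follows the pattern. The ISO `YYYY-MM-DD` prefix, the underscore separator, and the helper names `dated_name` and `follows_convention` are assumptions chosen for illustration, not a prescribed scheme.

```python
import re
from datetime import date, datetime

# Assumed convention: "<YYYY-MM-DD>_<description><extension>"
DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})_.+")


def dated_name(stem, created, suffix=""):
    """Build a file name that starts with the creation date in ISO format."""
    return f"{created.isoformat()}_{stem}{suffix}"


def follows_convention(filename):
    """Return True if the name starts with a valid ISO date and an underscore."""
    match = DATE_PREFIX.match(filename)
    if match is None:
        return False
    try:
        # Reject strings that look like dates but are not, e.g. month 13.
        datetime.strptime(match.group(1), "%Y-%m-%d")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(dated_name("basalt_scan", date(2019, 3, 7), ".tif"))
    print(follows_convention("basalt_scan_final.tif"))
```

One reason to prefer the year-month-day order is that an ordinary alphabetical file listing then also sorts the files chronologically.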

Personally, for everyday work and everyday analyses I have a panoply of folders that are split up into categories, as you can see in the image below. I also store everything in a paid Dropbox account (not an advertisement, I just love the service) so that all the files are automatically stored in the cloud as well, and very basic versioning is performed. This works passably well for me, but may not work for everyone.

File organization tree

So why does this matter for anyone who is not doing a big academic research project? Everyone has research projects, even if they do not necessarily think of them in that way. Where do I want to go on vacation? Where do I want to host a party? What is the best restaurant in my price range in my city? These are all questions which can be researched in everyday life. There are many ways to do so. A fair number of people like to take the approach of flying by the seat of their pants, while others will create detailed dossiers of their options. Those who take a lackadaisical approach may have once found the perfect restaurant, but cannot remember where it was or how they found it. They then end up not being able to return (I do this all the time). Alternatively, some may compile documents with tens of vacation options only to decide that they are not going this year. Finding a method of organizing files, data, etc. that works for you can streamline your entire research process. I know it certainly worked that way for me.