Making Data Pipelines Easier for Researchers

With higher-fidelity sensors, research is generating more data than ever. To keep up, scientists need to use the next generation of cloud storage and compute technologies.

Research is data

Quantitative research cannot avoid dealing with data, and in many ways “data scientist” is a redundant phrase. Whatever the hypothesis, it is typically tested by accumulating empirical and experimental data. As any researcher knows, gathering and generating data is of little use unless the data is processed efficiently and made to fit the parameters and reference points of the experiment. This process requires, amongst other things, storage capacity and compute resources. These have traditionally been built and maintained as expensive on-premises data infrastructure, and usually have to be shared by several researchers and teams.

The Evolving Situation for Researchers

Across different research fields and scientific specialties, researchers are encountering similar issues when it comes to data. Technological advancements in scientific tools and lab equipment are leading to ever larger volumes of data. The result is a deluge of raw information that is cheaper to generate but harder to make use of.

Genomics Data

The Human Genome Project, which started in 1990, was the first attempt to fully sequence all 3 billion base pairs of the human genome. It took 13 years at a cost of $5 billion AUD (or $2.7 billion USD). Now, sequencing a whole human genome takes a matter of hours and costs only a few hundred dollars. In addition, tasks that were previously highly manual have become automated, leading to a significant increase in sequencing throughput.

Microscopy/Medical Imagery Data

Advances in imaging technology with light, electron and X-ray microscopes have led to higher-resolution scans and more data points being captured across various spectra. Each scan can contain multiple colour channels as well as image z-stacks. Each of these data-rich, multi-dimensional images can range in size from a few gigabytes to hundreds of gigabytes, multiplied by however many snapshots are taken over time, so a single time-lapse can generate terabytes of data. This increased volume can add complex data-wrangling requirements to the time-consuming and meticulous work already performed in biology labs.
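To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of how a time-lapse reaches terabyte scale. The image dimensions, channel count, z-stack depth, number of timepoints and 16-bit pixel depth are all illustrative assumptions rather than figures from any particular instrument, and the estimate ignores compression and file-format overhead.

```python
def timelapse_size_bytes(width, height, channels, z_slices, timepoints, bytes_per_pixel=2):
    """Uncompressed size of a multi-channel z-stack time-lapse, in bytes."""
    return width * height * channels * z_slices * timepoints * bytes_per_pixel


# Hypothetical acquisition settings (assumptions for illustration only):
# 2048 x 2048 pixels, 4 colour channels, 100 z-slices, 16-bit pixels.
single_scan = timelapse_size_bytes(2048, 2048, 4, 100, timepoints=1)
full_timelapse = timelapse_size_bytes(2048, 2048, 4, 100, timepoints=500)

print(f"One scan:       {single_scan / 1e9:.2f} GB")
print(f"500 timepoints: {full_timelapse / 1e12:.2f} TB")
```

Under these assumptions a single scan is a little over 3 GB, and 500 snapshots come to roughly 1.7 TB before any derived or processed copies are made.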

Earth Sciences/Aerial Imagery Data

Higher-resolution lenses, drone footage and improvements in IoT sensors mean more data than ever for Earth Science researchers to deal with. NASA's Earth Observing System Data and Information System (EOSDIS), which brings together earth data from numerous aerial sensors, adds 6.4 terabytes to its database every single day.

Data Infrastructure for Researchers

The good news, however, is that storage and compute resources are becoming cheaper as well. Additionally, it is no longer strictly necessary to have dedicated hardware infrastructure and teams on premises to deal with storage and computation. Cloud-based services mean that researchers can increasingly be free from local hardware limitations and from having to queue for access to resources. Given how much the cost of cloud data storage has decreased, the problem of where to put all this data in a cost-effective manner is largely solved. However, cheap storage alone doesn’t answer how to process the data effectively and extract the insights required by a particular experiment.

ELT for Research Pipelines

In a way, scientists were dealing with Data Swamps before the rest of the world. Because they have always generated large volumes of data in a variety of structured and unstructured formats, a common way of managing it was to put it all into a home folder and work out how to deal with it later. Data governance and discoverability have always been a challenge, and have only become more difficult as data volumes increase.

More good news is that data management trends in other industries have caught up with these problems long faced by scientists. For organisations dealing with large volumes of unstructured data, rather than conventional structured business databases, Data Lakes have become a common solution. A Data Lake typically follows the Extract, Load, Transform (ELT) model of data ingestion: the data is first saved into storage, with structure, schemas and discoverability applied after the fact. This approach will be familiar to many researchers.
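The ELT ordering can be sketched in a few lines of Python. This is a minimal illustration using only the standard library and a temporary local folder standing in for cloud storage; the file name, folder layout, column names and schema are all hypothetical, and a real Data Lake would use an object store and a proper catalogue instead.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def load_raw(source: Path, lake_dir: Path) -> Path:
    """Load step: copy the file into the lake unchanged, with no parsing."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    target = lake_dir / source.name
    shutil.copy2(source, target)
    return target


# Hypothetical schema, decided later, once an analysis actually needs it.
SCHEMA = {"sample_id": str, "temperature_c": float, "reading": int}


def transform(raw_file: Path) -> list:
    """Transform step: apply the schema when the stored file is read."""
    with raw_file.open(newline="") as handle:
        return [
            {column: cast(row[column]) for column, cast in SCHEMA.items()}
            for row in csv.DictReader(handle)
        ]


with tempfile.TemporaryDirectory() as workspace:
    workspace = Path(workspace)
    # Extract: pretend an instrument wrote this file (made-up values).
    instrument_output = workspace / "run_001.csv"
    instrument_output.write_text("sample_id,temperature_c,reading\nA1,21.5,103\nB2,22.0,98\n")

    stored = load_raw(instrument_output, workspace / "lake" / "raw")
    records = transform(stored)

print(records)
```

The point of the ordering is that the load step never needs to understand the file. The raw copy is preserved as delivered, and the schema can be revised later without re-collecting the data.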

Data Governance for Researchers

Cloud-based infrastructure, reduced storage costs and more versatile and mature business data management tools are all powerful boons in tackling the problems posed by voluminous empirical and experimental research data. However, an important new factor emerges in this evolving data landscape: the need for good research Data Governance. In practice, this means an increased emphasis on metadata management, data glossaries, data lineage tools and similar practices that provide a clearer end-to-end understanding of the data pipeline. It also means clearly defined individual roles in data management and an overall philosophy of treating data as an increasingly valuable asset.
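As a rough illustration of what metadata management and data lineage can look like at the smallest scale, the sketch below keeps an in-memory catalogue of dataset records. The record fields, the `register` and `lineage` helpers and the dataset names are all hypothetical and are not taken from any particular governance tool. Each record notes an owner, a description, a checksum of the content and the datasets it was derived from.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Minimal metadata for one dataset (hypothetical field set)."""
    name: str
    owner: str
    description: str
    sha256: str
    derived_from: list = field(default_factory=list)  # names of parent datasets
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def register(catalogue, name, owner, description, content, derived_from=()):
    """Add a record to the catalogue, refusing parents that are not registered."""
    missing = [parent for parent in derived_from if parent not in catalogue]
    if missing:
        raise KeyError(f"unknown parent datasets: {missing}")
    record = DatasetRecord(
        name=name,
        owner=owner,
        description=description,
        sha256=hashlib.sha256(content).hexdigest(),
        derived_from=list(derived_from),
    )
    catalogue[name] = record
    return record


def lineage(catalogue, name):
    """Return the dataset name followed by its ancestors, nearest first."""
    chain = [name]
    for parent in catalogue[name].derived_from:
        chain.extend(lineage(catalogue, parent))
    return chain


catalogue = {}
register(catalogue, "raw_scan", "imaging-lab", "Unprocessed microscope output", b"raw bytes")
register(catalogue, "cleaned_scan", "imaging-lab", "Denoised copy of raw_scan",
         b"cleaned bytes", derived_from=["raw_scan"])

print(lineage(catalogue, "cleaned_scan"))
```

Even a record this small answers the basic governance questions: who owns a dataset, what it contains, whether the file has changed since registration, and which upstream data it depends on.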

How Loome Can Help

Multi-purpose data ingestion tools such as Loome Integrate provide control and transparency over high-bandwidth research data pipelines. Additionally, tools like Loome Publish enable advanced exploration and visualisation portals to be created, underpinned by tools that make it easy to implement data best practice.