Before joining JENGA School, I had never heard of the concept of data pipelines. I was first introduced to it in the recently concluded second module of the Data Science Core Program, and I have to admit it was one of the many new things I learned during the semester. Here is what I learned about data pipelines.

Simply put, a data pipeline is a system that automatically moves raw data from one or more sources, through a series of steps, to a target destination (such as a data warehouse or a data lake) for storage, analysis, and insight mining. Along the way, various ETL (extract, transform, load) techniques may be applied so that the data arrives processed and ready for analysis.
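
To make the idea concrete, here is a minimal sketch of such a pipeline in Python. The file name sales.csv and the table name sales_clean are placeholders I made up, and a local SQLite database stands in for the data warehouse:

```python
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    # Pull raw data out of the source system (a CSV file in this sketch).
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # A couple of simple ETL steps: drop incomplete rows, standardize column names.
    return df.dropna().rename(columns=str.lower)


def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Write the processed data into the target store, ready for analysis.
    df.to_sql("sales_clean", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract("sales.csv")), conn)
```

In a real pipeline each of these three functions would be far more involved, but the extract, transform, load shape stays the same.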

An illustration of a data pipeline

Why Data Pipelines?

Data pipelines are essential in data analytics for several reasons.

First, data analytics is computationally demanding. If it is performed in the production environment (where the data is created), it can impair the performance of that system and slow down the analysis itself.

Secondly, data needs to be aggregated in ways that make sense. For example, one system may store user files while another captures events. Having a separate system for analytics lets you combine these different data types without degrading your production system. Moreover, it is far less risky to make changes to data in a separate system.

Other justifications for data pipelines include data privacy: for example, you may not want data analysts to have access to all of your organization's data.

What Does a Data Pipeline Look Like?

Moving data from one system to another requires many steps, each usually handled by separate software. The general architecture of a data pipeline includes the following parts and processes:

  • Sources

This is where the information comes from. There are various sources of data, such as relational databases, Apache servers, APIs, and NoSQL databases.
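
A rough sketch of reading from two such sources in Python; the database file, table name, and API URL below are placeholders made up for illustration:

```python
import sqlite3

import pandas as pd
import requests

# Relational source: query a table from an application database (names are hypothetical).
with sqlite3.connect("app.db") as conn:
    users = pd.read_sql_query("SELECT id, name, signup_date FROM users", conn)

# API source: many services expose JSON over HTTP (URL is hypothetical).
events = pd.DataFrame(requests.get("https://api.example.com/events", timeout=30).json())
```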

  • Joins

This is the stage where the criteria and logic for how the data is combined are defined.
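
For example, with pandas the join logic might be expressed roughly like this (the tables are toy data for illustration):

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["Ada", "Umi"]})
events = pd.DataFrame({"user_id": [1, 1, 2], "event": ["login", "upload", "login"]})

# The join criteria: match on user_id, keeping every user even if they have no events.
combined = users.merge(events, on="user_id", how="left")
print(combined)
```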

  • Extraction

Here, the required values are extracted from the data. Sometimes specific data buried inside larger fields is needed; a good example is extracting area codes from a telephone number field.
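
Sticking with that example, here is a sketch of pulling area codes out of a phone number field with a regular expression (the numbers are made up):

```python
import pandas as pd

contacts = pd.DataFrame({"phone": ["+1-212-555-0199", "+1-415-555-0123"]})

# Extract the three-digit area code that follows the country code into its own column.
contacts["area_code"] = contacts["phone"].str.extract(r"\+1-(\d{3})-", expand=False)
print(contacts)
```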

  • Standardization

Standardization is where you ensure all data follows the same units of measurement and is presented in a consistent, acceptable format.
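
A small sketch of unit standardization, assuming weights arrive in a mix of pounds and kilograms:

```python
import pandas as pd

readings = pd.DataFrame({"value": [150.0, 2.5], "unit": ["lb", "kg"]})

LB_TO_KG = 0.453592

# Convert every reading to kilograms so all downstream analysis uses one unit.
readings["value_kg"] = readings.apply(
    lambda row: row["value"] * LB_TO_KG if row["unit"] == "lb" else row["value"],
    axis=1,
)
print(readings)
```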

  • Correction

In this stage, errors and corrupt records are identified and removed.
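
For instance, a correction step might drop missing, impossible, or duplicated records (toy data again):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 103],
    "amount": [25.0, None, -5.0, -5.0],
})

# Remove corrupt records: missing amounts, negative amounts, and duplicate order IDs.
clean = (
    orders.dropna(subset=["amount"])
          .query("amount >= 0")
          .drop_duplicates(subset=["order_id"])
)
print(clean)
```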

  • Loads

Once the data is cleaned up, it is loaded into the analysis system, such as a data warehouse or a Hadoop framework.
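
A minimal load step, again with a local SQLite database standing in for the warehouse and a made-up table name:

```python
import sqlite3

import pandas as pd

clean = pd.DataFrame({"order_id": [101, 104], "amount": [25.0, 40.0]})

# Append the cleaned batch to a table in the analysis store.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders_clean", conn, if_exists="append", index=False)
```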

  • Automation

Along the different stages, a data pipeline automates processes such as error detection and monitoring.
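
One simple form of this is wrapping each step with logging and retries; the sketch below assumes each pipeline step is just a Python function:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)


def run_with_retries(step, retries=3, delay=5):
    # Run one pipeline step; log any failure and retry a few times before giving up.
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception:
            logging.exception("%s failed (attempt %d/%d)", step.__name__, attempt, retries)
            time.sleep(delay)
    raise RuntimeError(f"{step.__name__} failed after {retries} attempts")


# Usage (extract_orders is a hypothetical step defined elsewhere):
# run_with_retries(extract_orders)
```

In practice, dedicated orchestrators such as Apache Airflow, or even simple cron jobs, typically handle the scheduling and monitoring around these steps.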

As the world becomes more and more data-driven, many companies will continue to need data pipelines. Building a data pipeline is therefore a skill that many data scientists and engineers need to be well versed in.
