How to set up simple and flexible ETL based anonymization – pt 3

Jan Eerdekens

In the previous two instalments of this series, we covered the research part and the creation of an ETL pipeline using Singer. In this third and last part, we focus on how to change that pipeline into something more cloud native.

Pipelinewise

We already encountered the name Pipelinewise during our research when looking for Singer taps and targets. During that search, we found the Pipelinewise Transform Field step and the Pipelinewise Target S3 CSV target, used in our Singer based pipeline solution.

The Pipelinewise website describes it as follows:

PipelineWise is a Data Pipeline Framework using the Singer.io specification to ingest and replicate data from various sources to various destinations.

Built with ELT in mind: PipelineWise fits into the ELT landscape and is not a traditional ETL tool. PipelineWise aims to reproduce the data from the source to an Analytics-Data-Store in as close to the original format as possible. Some minor load time transformations are supported but complex mapping and joins have to be done in the Analytics-Data-Store to extract meaning.

Pipelinewise basically extends Singer from an EL product into an ELT product. The transformation options are limited, but more than enough to fit our use case. What also immediately catches the eye is the Running in Docker section in their documentation.

If we can replicate our Singer based pipeline in Pipelinewise, this additional Docker functionality allows us to easily run our complete solution in a cloud native way. That means we won’t need to bother with building one or more custom Docker containers around our existing tap/transform/target solution.

Replicating our Singer pipeline in Pipelinewise

As is probably the case for a lot of developers these days, Docker was already installed on my system. In case it isn’t for you, just follow the installation instructions on their site to install it. Once you have Docker, starting to work with Pipelinewise is as easy as cloning their Github repository and building a Docker image from it. This image contains the necessary taps & targets.
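
Following the Running in Docker section of their documentation, getting the sources is the usual git affair (the repository URL is the one of the Pipelinewise GitHub project):

    git clone https://github.com/transferwise/pipelinewise.git
    cd ./pipelinewise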

To replicate the Singer pipeline from the previous blog post, we will need tap-oracle, target-s3-csv and transform-field. The first one is actually a Pipelinewise version of the Singer tap-oracle. Curiously, it has better documentation about the configuration than the original. Specifying this tap as a Pipelinewise connector during the docker build will also automatically install the required Oracle Instant Client.
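
A build along the lines below gives us an image with those connectors, plus an alias so the dockerised CLI behaves like a locally installed pipelinewise command. The --build-arg connectors mechanism and the bin/pipelinewise-docker wrapper come from the Pipelinewise Docker documentation; listing transform-field explicitly is an assumption, as it may already be installed by default.

    # Build an image containing only the connectors our pipeline needs;
    # selecting tap-oracle also installs the Oracle Instant Client
    docker build \
        --build-arg connectors=tap-oracle,target-s3-csv,transform-field \
        -t pipelinewise:latest .

    # Run the dockerised CLI as if Pipelinewise were installed locally
    alias pipelinewise="$(pwd)/bin/pipelinewise-docker"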

The build takes some time to finish (about 6 minutes on my machine), after which we can create the actual pipeline. If you name your image differently or use a version other than latest, change the corresponding values in the bin/pipelinewise-docker file so that the alias uses your values instead.
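
Creating the pipeline skeleton is done with the init command, something along these lines (the --name option is the one from the Pipelinewise documentation):

    pipelinewise init --name anonimization-pipeline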

The command above creates a new directory called anonimization-pipeline in the pipelinewise directory that was created by checking out the repository. There are a bunch of sample files in this directory. To replicate our Singer pipeline, we can just delete all of them except for tap_oracle.yml.sample and target_s3_csv.yml.sample. For these 2 files, we’ll need to remove the .sample extension.

Now, edit the tap_oracle.yml file:
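
A minimal sketch of what that file can look like; the host, credentials, schema, table and column names below are placeholders, while the structure (db_conn, schemas, tables, transformations) and the transformation types follow the Pipelinewise and pipelinewise-transform-field documentation:

    ---
    # tap_oracle.yml (sketch: connection details and table/column names are placeholders)
    id: "oracle_source"
    name: "Oracle source database"
    type: "tap-oracle"

    db_conn:
      host: "oracle.example.com"
      port: 1521
      sid: "ORCL"
      user: "pipelinewise"
      password: "secret"

    # id of the target defined in target_s3_csv.yml
    target: "s3"

    schemas:
      - source_schema: "SHOP"
        target_schema: "shop"
        tables:
          - table_name: "CUSTOMERS"
            replication_method: "FULL_TABLE"
            # the transformations section anonymises the data at load time
            transformations:
              - column: "FIRST_NAME"
                type: "HASH"
              - column: "LAST_NAME"
                type: "HASH"
              - column: "EMAIL"
                type: "MASK-HIDDEN"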

A Pipelinewise tap serves two purposes at once: it is the extraction part of the pipeline, but, through the transformations section of a table, also the transformation part. Next, edit the target_s3_csv.yml file:
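
Again a sketch with placeholder values; the property names under db_conn are the ones documented for the target-s3-csv connector:

    ---
    # target_s3_csv.yml (sketch: bucket name and credentials are placeholders)
    id: "s3"
    name: "Anonymised CSV files in S3"
    type: "target-s3-csv"

    db_conn:
      aws_access_key_id: "<access key id>"
      aws_secret_access_key: "<secret access key>"
      s3_bucket: "my-anonymization-bucket"
      delimiter: ","
      quotechar: '"'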

Lastly, import the pipeline:
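
The import command takes the directory we created earlier:

    pipelinewise import --dir anonimization-pipeline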

Besides importing the pipeline, this command also performs the tap catalog discovery that we did manually in the previous instalment. When we run the pipelinewise status command, our pipeline should show up as correctly imported, enabled and ready to run.
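
That check is a single command:

    pipelinewise status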

Running the pipeline

This leaves us with just running the actual pipeline. Do this with the pipelinewise run_tap command:
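
Using the tap and target ids from the configuration sketches above, that would look like this:

    pipelinewise run_tap --tap oracle_source --target s3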

The output shows that our pipeline has run successfully. A new CSV file has been added to our bucket:

Moreover, the file contains the correct information:

External configuration

Now we have something that's able to replicate our pipeline from the previous blog post, but this time using a Docker based approach. It can now run in any context that accepts Docker images: from a simple physical server or compute instance where you install and run Docker or Docker Compose yourself, over a container-as-a-service setup like Amazon ECS, up to a full-blown Kubernetes cluster.

The only thing that’s still missing to complete our cloud native scenario is to have external configuration for our pipeline. We need this to pass on all values that differ between environments instead of having to create specific configuration files for each environment.

Luckily, Pipelinewise also supports this. We just need to use the env_var notation for every value that we want to inject. Here’s an example of this for the S3 secrets:
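
For example, in target_s3_csv.yml the S3 secrets can be injected like this (the environment variable names are our choice):

    db_conn:
      aws_access_key_id: "{{ env_var['AWS_ACCESS_KEY_ID'] }}"
      aws_secret_access_key: "{{ env_var['AWS_SECRET_ACCESS_KEY'] }}"
      s3_bucket: "{{ env_var['S3_BUCKET'] }}"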

When using Docker Compose, this works correctly out of the box: you can specify environment variables in your docker-compose.yml or in a .env file, as you can see in Sample Project for Docker Development Environment. To get this to work locally, however, I needed to change my bin/pipelinewise-docker file slightly, so that the environment variables passed to the pipelinewise command also reach the docker run that the script executes:
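
The exact change isn't dictated by the documentation; one way to do it is to forward the relevant variables to the docker run call inside the wrapper script, for example:

    # Excerpt from bin/pipelinewise-docker: extend the existing docker run
    # invocation so the container sees the variables referenced via env_var
    # (the rest of the script's options and volume mounts stay unchanged)
    docker run --rm \
        -e AWS_ACCESS_KEY_ID \
        -e AWS_SECRET_ACCESS_KEY \
        -e S3_BUCKET \
        ...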

With the above changes, the pipelinewise import then becomes:
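
With placeholder values for the credentials, that looks something like:

    AWS_ACCESS_KEY_ID="<access key id>" \
    AWS_SECRET_ACCESS_KEY="<secret access key>" \
    S3_BUCKET="my-anonymization-bucket" \
    pipelinewise import --dir anonimization-pipeline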

Conclusion

This post has shown that the conversion from Singer to Pipelinewise is pretty easy. It gives us a Docker based system that can easily be used locally or in the cloud with externalised configuration. Scheduling isn’t part of Pipelinewise itself but is easy to achieve. Depending on your choice of deployment, you can do it with a simple cron or something like Apache Airflow.

For me, this was an interesting POC to do. I was pleasantly surprised to find a bunch of easy tools that are simple to use and can be stitched together. There were some small issues along the way, but nothing insurmountable.

The only remark I have is that, while Singer and Pipelinewise work perfectly and will continue to work, these days they seem to be used mostly inside other (SaaS) products. That means the free and open parts might not be under very active development anymore, which is something to keep in mind.
