ACA Blog



August 2021



How to set up simple and flexible ETL based anonymization – pt 2

Jan Eerdekens

In the first installment of this series on ETL based anonymization, I covered the research part of the problem I was trying to solve in a proof of concept. I settled on using Singer as the software solution to build an ETL pipeline. In this second part I will focus on how to set up Singer, the chosen taps & targets and how to chain them together.

As a small side quest, I’ll also cover how to run an Oracle database locally as the database that the Singer Oracle Tap can connect to and the issues I had with it.

Set up Singer

The first thing we need to do to start creating the ETL pipeline is to install Singer. Singer is basically just a collection of taps and targets. These are nothing more than simple Python code that produces data streams in the Singer JSON spec format that can be piped into each other. So the only installation requirement is to have a working Python 3 installation on your system. To see if you already have Python 3 installed correctly, run the command below:
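```shell
python3 --version
```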

If this works and outputs a 3.x version then you’re already done. If this doesn’t work or outputs a 2.x version, you’ll first need to install Python 3. Instructions on how to do this for your operating system can be easily found by Googling.

Set up Oracle

We will be using an Oracle database that we will run as a Docker container. I went through many trials to get this set up, which you can read about at the end of this post.

For the time being, we’ll start from the assumption that an Oracle database is running on localhost:1521 with an SID called OraDoc. In this database, there’ll be a schema called ETL and a user called etl. In this schema we’ll have a very simple test table called TEST with the structure you see below and a couple of rows of actual test data:
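The exact table definition isn't important; something as simple as the DDL below will do (the FIRST_NAME column is the one we'll be anonymizing later, the other columns are illustrative):

```sql
CREATE TABLE ETL.TEST (
  ID         NUMBER PRIMARY KEY,
  FIRST_NAME VARCHAR2(100),
  LAST_NAME  VARCHAR2(100)
);

INSERT INTO ETL.TEST VALUES (1, 'John', 'Doe');
INSERT INTO ETL.TEST VALUES (2, 'Jane', 'Smith');
```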

Set up the “Tap”

With a working Python 3 installation, we can now continue with the first step of our ETL pipeline: the ingestion of Oracle database data by the Singer Oracle Tap.

Even though the taps and targets are just Python code, the Oracle Tap uses the cx_Oracle library as its database driver. This library needs a native Oracle database client, called Oracle Instant Client, to be able to connect to the database. To install the client, follow these installation instructions.

Once this client is installed, we can continue with setting up the actual Oracle Tap. Installing it entails nothing more than creating a Python Virtual Environment, activating it and then installing the tap dependency via pip. For the Oracle Tap you can do this with these commands:
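Assuming you want the virtual environment in your home directory (any location works):

```shell
# create and activate a virtual environment for the tap
python3 -m venv ~/tap-oracle
source ~/tap-oracle/bin/activate

# install the Singer Oracle tap into it
pip install tap-oracle
```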

Once the tap is installed and the virtual environment has been activated, the tap-oracle command will be available on your PATH. If you just run it without any arguments, it will complain that it needs some configuration.

From the message we get when just trying to run the tap, we can see that the tap needs some configuration before it can actually work. This configuration is nothing more than a simple JSON file that contains the information about the database connection and some other parameters to define how the database will be synced. To keep things simple we will create this file, called config.json, in the bin directory of our tap-oracle virtual environment with the content below:
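A minimal config.json for the database described above could look like the one below (the key names follow the tap-oracle conventions; the password is obviously a placeholder):

```json
{
  "host": "localhost",
  "port": 1521,
  "user": "etl",
  "password": "<password>",
  "sid": "OraDoc",
  "filter_schemas": "ETL",
  "default_replication_method": "FULL_TABLE"
}
```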

The first 5 configuration parameters are mandatory and pretty self-explanatory. The last two deserve a word of explanation: filter_schemas limits discovery to the schemas you're actually interested in, which keeps the catalog small, and default_replication_method determines how tables are synced (FULL_TABLE does a complete copy on every run).

With this configuration in place, we’re ready to run the tap in discovery mode. In this mode, the tap uses the config file to connect to the database and query the selected schemas about the available tables and their structure. We need to save this data, called the catalog, to another JSON file. To create this file, catalog.json, run the command below from the bin directory of the tap-oracle virtual environment:
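```shell
./tap-oracle --config config.json --discover > catalog.json
```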

If you run the tap with the config and the catalog, you’ll see that nothing is synced:
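```shell
./tap-oracle --config config.json --catalog catalog.json
```

(Some older Singer taps take --properties instead of --catalog.)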

To get the tap to actually sync something, edit the catalog file slightly by selecting the streams you want to sync:
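Stream selection lives in the catalog's metadata. For each stream you want to sync, find the metadata entry with the empty breadcrumb and add "selected": true to it. For our TEST table that entry would look something like this (the other generated properties are omitted here):

```json
{
  "breadcrumb": [],
  "metadata": {
    "selected": true,
    "schema-name": "ETL"
  }
}
```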

If you now run the same command again, you'll see the tap sync the content of the selected stream as a series of SCHEMA, RECORD and STATE messages in the Singer JSON format.

That was already a lot of work, but luckily the hardest part is done!

Set up the “Target”

Now that we have a tap set up to retrieve database data in the Singer spec format, we’re going to skip the transformation step for the time being and set up a target to pipe it to first. This way we can quickly verify the Extract and the Load parts of the ETL pipeline and already see a readable CSV result instead of Singer spec JSON files.

We’ll be using the Pipelinewise S3 CSV Target. The setup of this target is very similar to how the tap was set up previously. Again, you’ll need to create a virtual environment for it, activate it and install the target dependency using pip:
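Again assuming a home directory location for the virtual environment:

```shell
# create and activate a virtual environment for the target
python3 -m venv ~/target-s3-csv
source ~/target-s3-csv/bin/activate

# install the Pipelinewise S3 CSV target into it
pip install pipelinewise-target-s3-csv
```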

This target also needs a small configuration file in the form of a config.json file that contains the name of the bucket we want to send the CSV files to and the credentials needed to access the bucket:
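The bucket name and credentials below are placeholders for your own values:

```json
{
  "s3_bucket": "my-etl-bucket",
  "aws_access_key_id": "<access key>",
  "aws_secret_access_key": "<secret key>"
}
```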

This simple configuration is all we need to make the target work. It will enable us to pipe the result of the Oracle Tap into this target and get, depending on the selected streams, one or more currently non-anonymized, CSV files in our S3 bucket. If we run the command below, from the bin directory of the activated tap-oracle virtual environment, a full table sync will occur:

You can check the result in the S3 web interface or using the AWS command line client. To see if the bucket contains the CSV files and to retrieve them, execute the commands below:
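With the AWS CLI configured for the same account, something like the following should work (bucket and file names are illustrative; the target names its files after the stream plus a timestamp):

```shell
# list the CSV files in the bucket
aws s3 ls s3://my-etl-bucket/

# download one of them
aws s3 cp s3://my-etl-bucket/ETL-TEST.csv .
```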

Or you can stream a file straight to your terminal, cat style:
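```shell
# using `-` as the destination makes the AWS CLI write the file to stdout
aws s3 cp s3://my-etl-bucket/ETL-TEST.csv -
```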

Set up the “Transformation” / anonymization

Now that we're able to retrieve data from our Oracle database, select the streams we want and store them as CSV files in S3, we're only missing one small but very important step: anonymizing the data in transit between the tap and the target.

We’ll be using Pipelinewise Transform Field to achieve the transformation / anonymization. This piece of software basically represents the T that Singer as an EL tool is missing to make it an actual ETL tool.

To set up the transformation part, we’re staying in a familiar theme. Simply set up another virtual environment, activate it, install a dependency via pip and create a configuration file.
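For the transform step that boils down to (once more assuming a home directory location):

```shell
# create and activate a virtual environment for the transformation step
python3 -m venv ~/transform-field
source ~/transform-field/bin/activate

# install Pipelinewise Transform Field into it
pip install pipelinewise-transform-field
```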

Create a config.json file in the bin directory of this virtual environment. This file contains a simple list of transformations. Such a transformation simply defines the type of transformation to apply to a specific field in a specific stream. The configuration in the example below has the result of HASHing the FIRST_NAME field in the TEST stream:
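```json
{
  "transformations": [
    {
      "tap_stream_name": "ETL-TEST",
      "field_id": "FIRST_NAME",
      "type": "HASH"
    }
  ]
}
```

The stream name above assumes tap-oracle's SCHEMA-TABLE naming; check the tap_stream_id values in your catalog.json if the transformation doesn't seem to match anything.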

The list of available transformations in the Pipelinewise Transform Field step at the time of writing is:

SET-NULL: replaces the value with null
HASH: replaces the value with its SHA-256 hash
HASH-SKIP-FIRST-n: hashes the value, but keeps the first n characters in plain text
MASK-DATE: masks the month and day of a date, keeping only the year
MASK-NUMBER: replaces the number with zero
MASK-HIDDEN: replaces the value with the literal string ‘hidden’

While these transformations are sufficient for the anonymization needs of our POC case, they can be easily altered. You can even add custom transformations. To do this, edit the transform.py file found in the lib/python3.9/site-packages/transform_field directory of the current virtual environment.

Putting it all together

With these 3 virtual environments in place, we can pipe them all together to produce the end result we’re looking for. When we run the command below we should see the ETL pipeline sync the selected schema, anonymize specific parts and store the end result as a CSV file in our S3 bucket:
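Assuming the three virtual environments live in your home directory as set up in the previous sections, and running from the bin directory of the activated tap-oracle environment:

```shell
./tap-oracle --config config.json --catalog catalog.json \
  | ~/transform-field/bin/transform-field --config ~/transform-field/bin/config.json \
  | ~/target-s3-csv/bin/target-s3-csv --config ~/target-s3-csv/bin/config.json
```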

When we now check the CSV in our bucket, the FIRST_NAME column should contain hashes instead of the original first names, while all other columns are untouched.

Conclusion

In this blog post you’ve seen that we can easily use Singer to achieve the goals of the POC described in the previous blog post. Because it only requires some simple stuff like Python and an Oracle client, it can be installed and used on most systems and in most environments. It also uses some simple JSON based configuration files that can be easily tailored to a lot of use cases. The source of the taps and targets is also available and is usually easy to understand and adapt where needed. In case you can’t find a tap or target that suits you, writing one shouldn’t be too hard.

In the last installment of this series, I’ll show a way to simplify the whole system a bit more and package it in a way that’s more suited for cloud native usage/deployment.

Side quest: Oracle in Docker

Because I didn’t have access to an existing Oracle database, I needed to run one myself. You’d think that would be easy and quick, but as it turned out it was a bit more complicated and quite frustrating. Especially when compared to running a PostgreSQL or MySQL/MariaDB container using Docker.

Oracle XE was made available a long time ago. It’s a free version of an Oracle database with some limitations, but none of those were problematic for my use case. Oracle XE can be installed natively on Windows and Linux, but because I’m working on a Mac and run my other databases via Docker I wanted to run it using a Docker image.

To do this, I needed to build an Oracle Docker image myself. There is a GitHub repository that contains a lot of Dockerfiles and scripts provided by Oracle. In this repository there is also a section for Oracle databases: Oracle Database container images. You can check out this repository, go to the OracleDatabase/SingleInstance/dockerfiles directory and run the buildContainerImage.sh script to build an actual Oracle database container image.

I wanted to set up an Oracle 12 to match the customer’s installation, but for that one I needed to download some database installation files first. However, it looks like those aren’t publicly available anymore (you can get them via an Oracle contract). So I settled for Oracle 18.4.0 XE:
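Building and running the XE image then looks roughly like this (the image tag is what buildContainerImage.sh produces for version 18.4.0 XE):

```shell
# get Oracle's official Dockerfiles and build the 18.4.0 XE image
git clone https://github.com/oracle/docker-images.git
cd docker-images/OracleDatabase/SingleInstance/dockerfiles
./buildContainerImage.sh -v 18.4.0 -x

# run it, exposing the default listener port
docker run --name oracle-xe -p 1521:1521 oracle/database:18.4.0-xe
```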

On my MacBook Pro starting this Docker container takes more than 30 minutes. On top of that, it usually stops working after a couple of hours, needing a restart that will again take that much time. I tried a number of things, like attaching volumes to store all the data that gets generated during the first start, but that didn’t seem to help. It also uses a lot of resources on my machine.

Even after getting the container running, connecting to the database in it also wasn’t really straightforward because of the pluggable database stuff. Even when connected, creating a user and database specific for our POC and connecting with that one was again a lot of trouble: service name vs. SID and no service name support in the Oracle Python library that is used by the tap.

So in the end, after some more Googling, I found a Docker repository from Oracle that contains an old Oracle 12 Docker image. You need a free Oracle account to be able to do the docker login. There is no volume support, so you'll lose your data in a lot of cases, but it's easy to set up, starts consistently in a couple of minutes and you can connect to it via an SID. Adding a user and database wasn't too difficult either, so I ended up using this instead of the newer GitHub repository stuff.

