ACA Blog

ACA Blog


May 2021
M T W T F S S
 12
3456789
10111213141516
17181920212223
24252627282930
31  

Categories


How to set up simple and flexible ETL based anonymization – pt 1

Jan EerdekensJan Eerdekens

In this technical blog post, I want to talk about how to set up simple and flexible ETL based anonymization. Why? Well, I recently had the opportunity to do a small proof of concept for a customer. The customer wanted to know the options that were available that would enable them to take internal data, remove or anonymize any personally identifiable information (PII) and make it available in a simple way and form for external parties. After further requirements gathering, the context for this proof of concept was defined as:

In this 3 part blog series I’ll explain how to set up simple and flexible ETL based anonymization with the following subjects:

Research

The research part of the proof of concept consists of 2 parts:

Extracting, transforming and storing the data

Straight off the bat, the customer’s problem sounded remarkably as something that you might solve with an ETL product: Extract Transform Load. So the research part for this part of the proof of concept would be concentrated on this type of product. I also got some input from someone in my team to have a look at singer.io as that was something they had used successfully in the past for this kind of problem.

singer.io logo - singer.io can help us figure out how to set up simple and flexible ETL based anonymization.

When looking at the Singer homepage, there are a number things that immediately catch your eye:

So when getting down to the basics, Singer is a just a specification, albeit not an official one. It’s a simple JSON based data format and you can either produce something in this format (a tap in Singer terminology) or consume the format (a target). You’re able to chain these taps and targets together to extract data from one location and store it in another. Out of the box Singer already comes with a bunch of taps (100+) and targets (10). These taps and targets are written in Python. Because the central point of the system is just a data format, it’s pretty easy to write one yourself or adapt an existing one.

When checking out the taps, the default Oracle tap should cover the Extract part of our proof of concept. The same however doesn’t seem to be the case for the Load part when looking at the default targets. There is a CSV target, but it stores its results locally, not in an S3 bucket. There is the option of just using this target and do the S3 upload ourself after the ETL pipeline has finished. Another option would be to adapt the existing CSV target and change the file storage to S3. Some quick Googling turns up a community made S3 CSV Singer target. According to its documentation, this target should do exactly what we want.

Whoops, Singer doesn’t transform

With the Extract and Load parts covered, this leaves us with just the Transform part of the ETL pipeline to figure out… and this is where it gets a bit weird. Even tough Singer is classified as an ETL tool, it doesn’t seem to have support for the transformation part?

Looking further into this I came by this ominously titled post: Why our ETL tool doesn’t do transformations. Reading this, it seems they consider their JSON specification/data format as the transformation part. So they support transformation to raw data and storing it, but don’t support other kinds of transformations. That part is up to yourself after it has been stored somewhere by a Singer target. So it turns out that Singer is more like the EL part of an ELT product than an “old school” ETL product.

At this point, Singer should at least be sufficient to extract the data from an Oracle database and to put it in an S3 bucket in CSV format. And because Singer is pretty simple, open and extendable, I’m going to leave it at that for now. Let’s continue by looking into the anonymization options that might fit in this Singer context.

Data anonymization

Similarly to the ETL part, I also received some input for this part, pointing me to Microsoft Presidio.

Microsoft Presidio logo - Microsoft Presidio can help us figure out how to set up simple and flexible ETL based anonymization.

When looking at the homepage we can read the following:

So there’s a lot of promising stuff in there that could help me solve my anonymization needs. Upon further investigation it looks like I’m evaluating this product during a major transformation (get it? 😉) from V1 to V2. V1 incorporated some ETL-like stuff like retrieving data from sources (even though Oracle support in the roadmap never seems to have materialized) and storing anonymized results in a number of forms/locations. However, V2 has completely dropped this approach to concentrate purely on the detection and replacement of PII data.

At its core, Presidio V2 is a Python based system built on top of an AI model. This enables it to automatically discover PII data in text and images and to replace it according to the rules you define. I did some testing using their online testing tool and it kind of works, but for our specific context it definitely needs tweaking. Also, when looking at the provided test data, it seems that it is mostly simple and short data but no large text blobs or images. This then begs the question: even if we’re able to configure Presidio to do what we want it to do, might we be hitting small nails with a big hammer?

Meme image describing how Presidio V2 might be too complicated on how to set up simple and flexible ETL based anonymization

Is Presidio too much?

So let’s rethink this. If we can easily know and define which simple columns in which tables need to be anonymized and when just nulling or hashing the column values is sufficient, we don’t need the auto detection part of Presidio. We also wouldn’t need the Presidio full text or image support and we also wouldn’t need fancy substitution support. Presidio could be a powerful library to create an automatic anonymization transformation step for our Singer based pipeline. It also helps that Presidio is Python based. However, my gut feeling says I maybe should first try to find a slightly simpler solution.

Meme image describing how Presidio V2 might be too fancy, but other tools might be too simple on how to set up simple and flexible ETL based anonymization

I started searching for something that’s can do a simple PII replace and that works in a Singer tap/target context. I found this Github repository: pipelinewise-transform-field. The documentation reads “Transformation component between Singer taps and targets”. Sounds suspiciously like the “T” part that Singer as an ETL was missing! Further down in the configuration section we even read:

“You need to define which columns have to be transformed by which method and in which condition the transformation needs to be applied.”

and the possible transformation types are:

This seems to cover our simple anonymization requirements completely! We can even see how we need to use it in the context of Singer:

Conclusion

We now have all the pieces of the puzzle on how to set up simple and flexible ETL based anonymization. In the next blog post we’ll show how they fit together and whether they produce the results the customer is looking for.

Meme image describing man shouting that this could work, referring to the approach in this blog post

More cool software development blog posts:

Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x