Introducing BastionLab - A Simple Privacy Framework for Data Science Collaboration
BastionLab is a simple privacy framework for data science collaboration. It lets data owners protect their datasets by enforcing that only privacy-friendly operations are run on the data and that only anonymized outputs are shown to the data scientist.
As data science becomes more collaborative and Cloud-based, the data that owners and data scientists could share with each other holds the promise of remarkable new insights.
Such projects are valuable, but data owners holding sensitive information often have to give up before they can even try, for fear of exposing their data. The privacy and security risks are simply too high for many datasets, such as patient records, financial data, or biometric data.
We have been working on this problem for a year and a half at Mithril Security. We built BlindAI and then BastionAI, two tools for training and deploying models inside a Trusted Execution Environment, so that data can be shared with AI models without ever being exposed in the clear to anyone else. Making deep learning training and deployment more privacy-friendly covers many scenarios considered too sensitive today, such as facial recognition or speech recognition for therapy sessions.
Adding privacy to data exploration and ML
However, many of our users told us that deep learning was not enough: they wanted to open access to their datasets for remote data exploration, statistics, and machine learning. Data owners, such as hospitals, face many challenges when onboarding third parties, like startups or researchers, because it is very hard to control what is shared.
- If you want an interactive remote collaboration, the most common solution is to give access to a Jupyter Notebook. But even if you isolate Jupyter, the shared dataset can be fully recovered: a simple 'for' loop is enough to print an entire dataset and extract it without the data owner realizing it (see the sketch after this list). This is possible because Jupyter Notebooks allow arbitrary Python code execution to keep collaboration interactive, and there are no native tools to filter queries.
- Another method is for the data scientist to send the server a Python script containing all the operations. Unless the data owner manually reviews the whole script for malicious code, data can still be easily exfiltrated. Even a careful review does not entirely remove the risk, because human error has to be taken into account. On top of that, the whole process has to be repeated every time something changes, which is tedious and carries a huge organizational cost. Most of our clients told us they would simply give up on that option from the get-go.
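To make the first risk concrete, here is a minimal sketch of the kind of notebook cell that defeats an "isolated" Jupyter setup; the shared_df name is hypothetical and stands for whatever DataFrame the data owner exposed:
>>> # Hypothetical notebook cell: nothing stops the data scientist from dumping
>>> # the whole dataset row by row and copying it out of the environment.
>>> for row in shared_df.rows():  # shared_df: the Polars DataFrame the owner shared
...     print(row)                # every record ends up in the notebook output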
Finally, both cases require that the raw data be thoroughly sanitized before it is shared with a third party. Current tools are not efficient at this task because they lack privacy features. As a result, anonymization is often done manually, which is extremely time-consuming and often done poorly. It also creates non-negligible risks of deanonymization, an attack that correlates patterns in the anonymized data with third-party public databases, such as US voter registries, to re-identify individuals, as sketched below.
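As an illustration of that linkage risk, here is a minimal sketch of how a few quasi-identifiers left in a "de-identified" extract can be joined against a public record; the column names and data are purely hypothetical:
>>> import polars as pl
>>> # Hypothetical "anonymized" extract: names removed, quasi-identifiers kept.
>>> released = pl.DataFrame({
...     "zip": ["02139", "02139"], "birth_year": [1945, 1987],
...     "sex": ["F", "M"], "diagnosis": ["heart disease", "flu"],
... })
>>> # Hypothetical public register (e.g. voter rolls) sharing the same quasi-identifiers.
>>> public = pl.DataFrame({
...     "name": ["J. Doe", "A. Smith"], "zip": ["02139", "02139"],
...     "birth_year": [1945, 1987], "sex": ["F", "M"],
... })
>>> # Joining on the quasi-identifiers re-attaches names to diagnoses.
>>> released.join(public, on=["zip", "birth_year", "sex"])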
To address all of these issues, we realized we needed a solution that controls remote access to datasets and enables privacy-friendly data science work: interactive, fast, secure, privacy-preserving, and easy to fit into the usual workflows of both data owners and data scientists.
BastionLab, our simple privacy framework for data science collaboration, was born! It has features for both data owners and data scientists and covers exploratory data analysis and, soon, AI training.
For Data Owners: a new way to open data access while staying in control
- Selective Dataset Sharing & Filtering Data Queries
With BastionLab, data extracts no longer need to be manually cleaned, and data access is no longer unrestricted. Datasets are uploaded into a secure environment that filters incoming data scientists' queries and only runs computations that comply with a privacy policy defined by the data owner in the framework.
>>> import polars as pl
>>> df = pl.read_csv("titanic.csv")
For example, let's define a policy that accepts by default any query aggregating at least 10 people per result. This threshold defines a safe zone: queries that do not meet it fall 'out of the safe zone', and we can choose how the server should react in those cases:
- Approve queries automatically but log them with a red flag.
- Refuse queries automatically and log them with a red flag.
- Block processing until the data owner manually approves or rejects the query.
>>> from bastionlab.polars.policy import Policy, Aggregation, Review
>>> policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Review())
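The other two behaviors listed above map to their own handlers. Here is a minimal sketch, assuming the policy module also exposes Log and Reject classes alongside Review (the handler names are an assumption, not shown in this post):
>>> from bastionlab.polars.policy import Log, Reject  # assumed handler names
>>> # Approve unsafe queries automatically but red-flag them in the logs:
>>> logging_policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log())
>>> # Refuse unsafe queries automatically and red-flag them in the logs:
>>> rejecting_policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Reject())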
The policy is attached when the dataset is uploaded, and sensitive columns such as passenger names can be sanitized away entirely:
>>> client.upload_df(
...     df,
...     policy=policy,
...     sanitized_columns=["Name"],
... )
- Traceability
BastionLab traces every operation performed on the dataset, so the data owner can know at any time what has been shared with the data scientist at a granular level, down to which rows or columns have been used or revealed.
For Data Scientists: explore datasets remotely with privacy
Data scientists can now explore and clean datasets remotely, with technical guarantees that data owners' access policies are respected.
- A familiar DataFrame API
Remote datasets are manipulated through our Python SDK, which provides a privacy-aware counterpart of a DataFrame: the RemoteLazyFrame. It is built on Polars, a fast DataFrame library with an API similar to Pandas, so remote queries look just like the local queries data scientists already write.
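To give a sense of how close this stays to plain Polars, here is a sketch of the same survival-rate query shown in the next section, run on a purely local LazyFrame; it assumes titanic.csv is available locally and uses the same Polars version and argument names as the remote example:
>>> import polars as pl
>>> # Local Polars equivalent of the remote query below; no .fetch() step is
>>> # needed because the data already sits on the data scientist's machine.
>>> local_rates = (
...     pl.scan_csv("titanic.csv")
...     .select([pl.col("Pclass"), pl.col("Survived")])
...     .groupby(pl.col("Pclass"))
...     .agg(pl.col("Survived").mean())
...     .sort("Survived", reverse=True)
...     .collect()
... )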
- Live queries auto-validation
Data scientists' requests are analyzed by BastionLab and matched against the privacy policy previously defined by the data owner.
If the request is approved, it is executed:
>>> per_class_rates = (
... remote_df
... .select([pl.col("Pclass"), pl.col("Survived")])
... .groupby(pl.col("Pclass"))
... .agg(pl.col("Survived").mean())
... .sort("Survived", reverse=True)
... .collect()
... .fetch()
... )
per_class_rates
shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ f64      │
╞════════╪══════════╡
│ 1      ┆ 0.62963  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.472826 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.242363 │
└────────┴──────────┘
If a request is denied, it comes with an explanation and an option to request access from the data owner:
>>> remote_df.head(5).collect().fetch()
"Warning: non privacy-preserving queries necessitate data owner's approval.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner."
Ready to try BastionLab?
You can check our Quick Tour to learn how the Titanic dataset can be explored with privacy guarantees using BastionLab!
For now, BastionLab covers simple scenarios, but we have many plans for its future. We want to improve security by combining Differential Privacy with Confidential Computing so that multiple data owners can pool their data together.
Like our previous products, BastionLab is open-source, so you can check our code online and join our efforts to make it as easy to use and as safe as possible.
If you are interested, drop a star on GitHub, and join us on our Discord!