Rust: How We Built a Privacy Framework for Data Science
We could have built our privacy framework BastionLab in any language - Python, for example, which is the darling of data science. But we chose Rust because of its efficiency and security features. Here are the reasons why we loved doing so, along with some challenges we encountered along the way.
Data science’s love language is Python, but let’s be honest, it is hard to keep track of what goes on there. Dependencies can be obscure, the code is so dynamic that it’s difficult to audit, and the performance can be uneven… When we started working on BastionLab, our new privacy framework covering data exploration and AI training for data science collaboration, we realized Python would not be our best choice. We needed a language that would be much stronger in terms of transparency, performance, and security: Rust.
Key Takeaways:
- Rust provides transparency, performance, and security advantages over Python for data science projects like BastionLab.
- While Rust's ecosystem is still young, the project successfully overcame challenges through external library calls.
- Rust's features, like Generic Algebraic Data Types, contribute to the ongoing development of BastionLab and Differential Privacy.
A Rust Journey
From the beginning, our goal when building BastionLab was (and still is) to offer guarantees to data owners that only the data they allow to be shown is shown and that only the operations they authorize are executed. Luckily for us, we’d already worked with Rust on our previous project, BlindAI, a confidential inference server to deploy AI models with privacy guarantees. We had started that project in C++, but we were constantly frustrated by having to verify our implementations for memory leaks, memory corruption, multi-threading issues, etc. We stumbled upon a community-made Rust library by chance, decided to try it, and quickly fell in love with the language’s strict memory management and efficiency. Thanks to that find, our team could draw on its previous technical knowledge to build BastionLab.
Rust also had one big advantage over Python: it lets us fully use Polars, the fastest data science library on the market, by a large margin. This was great because enforcing privacy over their dataset wouldn’t slow down the scientist's work. It also wouldn’t change their likely habit of using Pandas, the most popular data science library, as the syntax of both libraries is very similar (with only a few best practice differences).
Clean Memory and Cargo’s Transparency
One thing we love about Rust is its strict memory management. It forces the developer to structure their project in a safe way, which makes the code more durable and easier to maintain. In our case, we found it especially practical because we know that our users will have to go through potentially gigantic amounts of data. We can’t rely only on Rust’s memory management, of course, but it does help a lot that it’s so well thought out and implemented. It lets us have very clean memory handling where other languages would have been prone to bugs.
The following function, for example, converts a RemoteDataFrame (BastionLab’s main object) into a RemoteArray (another BastionLab object) for AI training preparation. This is a process that requires many checks at every step. Rust makes it really easy to tell right away whether there are errors and where they are, thanks to map_err.
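To make the memory point concrete, here is a minimal sketch of ownership and borrowing using only the standard library. The `Chunk` type and the `peek` and `consume` functions are made up for illustration and are not part of BastionLab; they only show how the compiler decides when a buffer is released.

```rust
// Toy stand-in for a large in-memory dataset (hypothetical type).
struct Chunk {
    rows: Vec<u64>,
}

impl Drop for Chunk {
    // Runs automatically, exactly once, when the owner goes out of scope.
    fn drop(&mut self) {
        println!("freeing {} rows", self.rows.len());
    }
}

// Borrows immutably: the caller keeps ownership and nothing is copied.
fn peek(chunk: &Chunk) -> usize {
    chunk.rows.len()
}

// Takes ownership: the caller can no longer use the chunk afterwards.
fn consume(chunk: Chunk) -> u64 {
    let total: u64 = chunk.rows.iter().sum();
    total
    // `chunk` is dropped when this function returns, releasing its buffer.
}

fn main() {
    let chunk = Chunk { rows: (1..=4).collect() };
    let n = peek(&chunk);
    let total = consume(chunk);
    // println!("{}", chunk.rows.len()); // would not compile: `chunk` was moved
    println!("{n} rows, total {total}");
}
```

There is no garbage collector and no manual `free` here: the release point is fixed at compile time, and using the chunk after it has been handed over is a compile error rather than a runtime bug.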
    async fn conv_to_array(
        &self,
        request: Request<RemoteDataFrame>,
    ) -> Result<Response<RemoteArray>, Status> {
        let identifier = request.into_inner().identifier;
        /*
            Convert to array would have to branch (unless there's a state machine) introduce similar to CompositePlan
            If there aren't strings and lists in the dataframe, and all the types are the same,
            then we use `dataframe_to_ndarray`
        */
        let df = self.polars.get_df_unchecked(&identifier)?;
        let dtypes = df
            .dtypes()
            .iter()
            .map(|d| d.to_string())
            .collect::<Vec<_>>();
        let dtype_exists = |dtype: &str| -> bool { dtypes.iter().any(|s| s.contains(dtype)) };
        let arr = if !(dtype_exists("list") || dtype_exists("utf8")) {
            RemoteArray {
                identifier: (self.df_to_ndarray(&df)?),
            }
        } else if dtype_exists("list") {
            /*
                Here, we assume we have a List[PrimitiveType]
                The idea would be to convert columns into ArrayBase -> merge Vec<ArrayBase> -> ArrayBase
            */
            let col_names = df.get_column_names();
            let vec_series = df
                .columns(&col_names[..])
                .map_err(|e| Status::aborted(format!("Could not get Series in DataFrame: {e}")))?;
            let mut out = vec![];
            for series in vec_series {
                out.push(to_status_error(self.list_series_to_array_store(series))?);
            }
            /*
                Here, we stack on Axis(1) because we would want to create [n_rows, m_cols, k_elems_in_each_item];
            */
            let array = ArrayStore::stack(Axis(1), &out[..])?;
            RemoteArray {
                identifier: self.polars.insert_array(array),
            }
        } else {
            return Err(Status::aborted(
                "DataFrame with str columns cannot be converted directly to RemoteArray. Please tokenize strings first",
            ));
        };
        Ok(Response::new(arr))
    }
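The error-handling pattern used in that function can be isolated in a few lines. The sketch below is self-contained and uses a hypothetical `Status` struct as a stand-in for the gRPC status type; `parse_index` and `column_name` are invented names, not BastionLab functions.

```rust
use std::num::ParseIntError;

// Hypothetical stand-in for a gRPC-style status type (not the real `Status`).
#[derive(Debug, PartialEq)]
struct Status(String);

// Parse a column index sent by a client, attaching context if parsing fails.
fn parse_index(raw: &str) -> Result<usize, Status> {
    let idx = raw
        .trim()
        .parse::<usize>()
        .map_err(|e: ParseIntError| Status(format!("Could not parse column index {raw:?}: {e}")))?;
    Ok(idx)
}

// Look the column up, turning a missing entry (`None`) into an error as well.
fn column_name<'a>(columns: &[&'a str], raw: &str) -> Result<&'a str, Status> {
    let idx = parse_index(raw)?;
    columns
        .get(idx)
        .copied()
        .ok_or_else(|| Status(format!("Column {idx} is out of range")))
}

fn main() {
    let columns = ["age", "fare"];
    println!("{:?}", column_name(&columns, "1"));
    println!("{:?}", column_name(&columns, "x"));
    println!("{:?}", column_name(&columns, "7"));
}
```

Each fallible step converts its own error into the type the caller expects, and the `?` operator returns early on failure, so the message always says which step went wrong.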
Another great advantage of Rust is Cargo, its package manager. Our build configuration file is easy to read and set up (especially compared to other low-level languages like C or C++, which require Makefiles or CMake files…).
Cargo is also transparent in ways that other package managers aren’t: a single command, “cargo tree”, displays the whole dependency tree. It is just as easy to specify the target on which we want to deploy our application, which makes it very easy to compile BastionLab for Windows or Mac without having to change much of anything in the code.
This transparency is especially necessary when we need to use BastionLab in a Trusted Execution Environment because we’re dealing with untrusted infrastructure. We’ll go over what a trusted execution environment (or TEE) is in an instant, but this particular compatibility requirement strongly drove our choice towards Rust from the get-go.
What is a Trusted Execution Environment?
It is a quite complex hardware-based technology that we won’t go into in much detail, since it’s not at all the focus of this piece. What you need to know is that:
- It’s an isolated execution environment within the processor in which code can be run.
- It cannot be accessed by any human; only the machine will “see” the data.
- It has a whole system of attestations to guarantee that the code run in there is the code that was sent.
The key point here is that TEEs are only safe if the code running in them is trusted. Rust makes this much easier because it’s low level, so the code can be stripped of all its unnecessary parts, and Cargo allows for such transparent management of the libraries. In Python, imports would have injected a lot of code that’s difficult to track. With Rust, all imported code can be easily controlled.
Faster than Light
But Rust is also much, much faster than Python because it’s a low-level, compiled language like C. It does add some safety checks that C does not have: bounds checks on array accesses cost a little speed at run time, and the compiler forces the developer to initialize every declared value. But overall, it has the advantages of C without the manual memory management and the chances to shoot yourself in the foot that come with it. Add asynchronous programming and you’ve got mad multi-threading... Our CTO would say it’s magical ✨
This speed is reflected in Polars’ execution time, which is much, much faster than Pandas’. For data science collaboration, this is great because particularly sensitive datasets are often very big, so the user can really feel the improvement. Pandas runs every operation immediately and in full as soon as it is called. With a small dataset, that is not really an issue, but with a very big one, it means going through all the cells, even the ones the final result never needs. This is called eager execution. Polars offers a different method called lazy mode: it only starts the query when the result is really needed and, thanks to optimizations, only processes the data frame cells that the query actually requires. This makes a world of difference when exploring big data - and dealing with Polars directly in Rust allows us to exploit all of its options and configurations. You can see how much in the benchmarks we ran with BastionLab, Polars, and Pandas.
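The difference between eager and lazy execution can be illustrated without Polars at all. The sketch below uses plain Rust iterators, which are lazy by design; it is not the Polars API, and Polars' real query optimizer is far more sophisticated. The function names and the `touched` counter are assumptions made for the example.

```rust
use std::cell::Cell;

// Eager style: the filter runs immediately over the full input and
// materializes an intermediate vector before the limit is applied.
fn eager_first_large(rows: &[u32], threshold: u32, limit: usize, touched: &Cell<usize>) -> Vec<u32> {
    let filtered: Vec<u32> = rows
        .iter()
        .copied()
        .filter(|&r| {
            touched.set(touched.get() + 1); // count every row we look at
            r > threshold
        })
        .collect();
    filtered.into_iter().take(limit).collect()
}

// Lazy style: the pipeline is only described up front. Nothing runs until
// `collect`, and `take` stops pulling rows once enough results exist.
fn lazy_first_large(rows: &[u32], threshold: u32, limit: usize, touched: &Cell<usize>) -> Vec<u32> {
    rows.iter()
        .copied()
        .filter(|&r| {
            touched.set(touched.get() + 1);
            r > threshold
        })
        .take(limit)
        .collect()
}

fn main() {
    let rows: Vec<u32> = (0..1_000).collect();
    let eager_touched = Cell::new(0);
    let lazy_touched = Cell::new(0);
    let a = eager_first_large(&rows, 10, 3, &eager_touched);
    let b = lazy_first_large(&rows, 10, 3, &lazy_touched);
    assert_eq!(a, b); // same answer either way
    println!("eager touched {} rows, lazy touched {}", eager_touched.get(), lazy_touched.get());
}
```

Both functions return the same rows, but the eager one inspects the whole input while the lazy one stops as soon as it has enough results. The gap grows with the size of the dataset.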
Rust Is Still Immature
For all the great sides of Rust, there is one area where it falls short: its ecosystem is still quite young. This was a challenge when dealing with library bindings for PyTorch, for example. Many key libraries aren’t yet available to install through Cargo.
We did solve it by making an external call to the library, which neither slows down the project nor compromises its safety. It’s also not that hard to maintain. But it was complicated to do, and the resulting code wasn’t as elegant as it could have been. We used an existing base for the library bindings, but we had to take additional measures to ensure it was well done, worked well in the project, and that the library was in the right place… With Cargo, all of this would normally be automatic; here, we had to use hacks to go further.
BastionLab’s Future
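For readers who have never called an external library from Rust, here is a minimal sketch of the general mechanism. It calls `strlen` from the C standard library rather than PyTorch, and the `c_string_len` wrapper is a made-up name. It assumes a recent Rust toolchain (the `unsafe extern` syntax) and a platform where C's `size_t` matches Rust's `usize`. It does not show how BastionLab's bindings are actually built.

```rust
use std::ffi::{c_char, CString};

// Declaration of a function that lives outside Rust, in the C standard
// library. The compiler cannot check it, so the block is marked `unsafe`.
unsafe extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

// Safe wrapper: callers never touch raw pointers or `unsafe` directly.
fn c_string_len(text: &str) -> Option<usize> {
    // `CString::new` fails if the text contains an interior NUL byte.
    let owned = CString::new(text).ok()?;
    // SAFETY: `owned` is a valid NUL-terminated string that outlives the call.
    Some(unsafe { strlen(owned.as_ptr()) })
}

fn main() {
    println!("{:?}", c_string_len("polars"));
}
```

The usual design is the one shown here: keep the `unsafe` call inside one small function and expose a safe signature to the rest of the code, so the parts that need extra review stay easy to find.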
We are still building BastionLab’s features, and Rust continues making that easier for us. For example, we’ve been enjoying many of its features since we started implementing Differential Privacy, a mathematical framework that prevents inferring the input from the output to ensure privacy. Generic Algebraic Data Types were particularly helpful in representing query plans in a natural manner that facilitated their analysis.
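As a rough illustration of why algebraic data types suit query plans, the sketch below models a tiny plan as a Rust enum and analyzes it with pattern matching. The `Plan` type, its variants, and the "is aggregated" rule are all hypothetical simplifications; BastionLab's real plan types and its Differential Privacy analysis are more involved.

```rust
// Hypothetical, heavily simplified query plan (not BastionLab's real type).
enum Plan {
    Scan { table: String },
    Filter { input: Box<Plan>, predicate: String },
    Select { input: Box<Plan>, columns: Vec<String> },
    Mean { input: Box<Plan>, column: String },
}

// Toy analysis: does the plan contain an aggregation, so that its output
// is a summary rather than raw rows?
fn is_aggregated(plan: &Plan) -> bool {
    match plan {
        Plan::Scan { .. } => false,
        Plan::Mean { .. } => true,
        Plan::Filter { input, .. } | Plan::Select { input, .. } => is_aggregated(input),
    }
}

// Render the plan as text; the compiler checks that every variant is handled.
fn describe(plan: &Plan) -> String {
    match plan {
        Plan::Scan { table } => format!("scan({table})"),
        Plan::Filter { input, predicate } => format!("{} -> filter({predicate})", describe(input)),
        Plan::Select { input, columns } => {
            format!("{} -> select({})", describe(input), columns.join(", "))
        }
        Plan::Mean { input, column } => format!("{} -> mean({column})", describe(input)),
    }
}

fn main() {
    let raw = Plan::Select {
        input: Box::new(Plan::Scan { table: "titanic".into() }),
        columns: vec!["Age".into()],
    };
    let summary = Plan::Mean {
        input: Box::new(Plan::Filter {
            input: Box::new(Plan::Scan { table: "titanic".into() }),
            predicate: "Age > 30".into(),
        }),
        column: "Fare".into(),
    };
    println!("{} | aggregated: {}", describe(&raw), is_aggregated(&raw));
    println!("{} | aggregated: {}", describe(&summary), is_aggregated(&summary));
}
```

Because `match` must cover every variant, adding a new operation to the plan forces every analysis to handle it, which is a useful property when the analysis is a privacy check.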
BastionLab is open source, so you can check our code on GitHub and help us make it as easy to use and safe as possible. You can go try it with our Quick Tour! It shows how the Titanic dataset can be shared with a remote data scientist while making sure only anonymized results from the passengers aboard the ship are communicated. You can also check our how-to guides for more real-life examples (Covid datasets or bank fraud).
If you are interested, you can also join us on our Discord! We're always very happy to hear your feedback 🦀🦀
Image credits: Edgar Huneau