Rust: How We Built a Privacy Framework for Data Science
By Edgar Huneau


We could have built our privacy framework BastionLab in any language, for example Python, data science's beloved. But we chose Rust for its efficiency and its security features. Here is why we loved doing so, along with some challenges we encountered along the way.

Daniel Huynh, Mehdi Bessaa

Data science’s love language is Python, but let’s be honest: it is hard to keep track of what goes on there. Dependencies can be obscure, the code is so dynamic that it’s difficult to audit, and performance can be uneven. When we started working on BastionLab, our new privacy framework covering data exploration and AI training for data science collaboration, we realized Python would not be our best choice. We needed a language much stronger in terms of transparency, performance, and security: Rust.

Key Takeaways:

  1. Rust provides transparency, performance, and security advantages over Python for data science projects like BastionLab.
  2. While Rust's ecosystem is still young, the project successfully overcame challenges through external library calls.
  3. Rust's features, like Generic Algebraic Data Types, contribute to the ongoing development of BastionLab and Differential Privacy.

A Rust Journey

From the beginning, our goal when building BastionLab was (and still is) to guarantee data owners that only the data they allow to be shown is shown, and that only the operations they authorize are executed. Luckily for us, we’d already worked with Rust on our previous project, BlindAI, a confidential inference server to deploy AI models with privacy guarantees. We had started that project in C++, but verifying our implementation against memory leaks, memory corruption, multi-threading issues, and so on was a constant source of frustration. We stumbled upon a community-made Rust library by chance, decided to try it, and quickly fell in love with the language’s strict memory management and efficiency. Thanks to that find, our team could build on its previous technical knowledge to build BastionLab.

Rust also had one big advantage over Python: it lets us fully use Polars, the fastest data science library on the market by a large margin. This was great because enforcing privacy over a dataset wouldn’t slow down the data scientist’s work. It also wouldn’t change their likely habit of using Pandas, the most popular data science library, since the syntax of the two libraries is basically the same (with only a few best-practice differences).

Clean Memory and Cargo’s Transparency

One thing we love about Rust is its strict memory management. It forces developers to structure their project in a safe way, which makes the code more durable and easier to maintain. In our case, we found it especially practical because we know our users will have to go through potentially gigantic amounts of data. We can’t rely on Rust’s memory management alone, of course, but it helps a lot that it’s so well thought out and implemented. It gives us very clean memory handling where other languages would have been prone to bugs.
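As a minimal illustrative sketch (not BastionLab code), this is the kind of guarantee Rust’s ownership model gives us for free: a large buffer is freed deterministically as soon as its owner is done with it, and the compiler rejects any use after that point.

```rust
// Illustrative sketch, not BastionLab code: ownership guarantees a large
// buffer is freed exactly once, as soon as its last owner is done with it.

fn summarize(data: Vec<u64>) -> u64 {
    // `data` is moved into this function; the caller can no longer use it.
    data.iter().sum()
    // `data` is dropped (its heap allocation freed) right here, at scope end.
}

fn main() {
    let big = vec![1u64; 1_000_000]; // pretend this is a huge dataset
    let total = summarize(big);      // ownership moves: no copy of the buffer
    // `big` is no longer accessible; the compiler rejects any further use,
    // so a use-after-free is impossible by construction.
    println!("sum = {total}");
}
```

No garbage collector is involved: deallocation points are decided at compile time, which matters when the buffers hold gigabytes of sensitive data.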

The following function, for example, converts a RemoteDataFrame (BastionLab’s main object) into a RemoteArray (another BastionLab object) in preparation for AI training. This process requires many checks at every step. Rust makes it really easy to tell right away if there are errors, and where, thanks to map_err.

async fn conv_to_array(
        &self,
        request: Request<RemoteDataFrame>,
    ) -> Result<Response<RemoteArray>, Status> {
        let identifier = request.into_inner().identifier;

        /*
            Converting to an array has to branch here (similar to CompositePlan,
            unless we introduce a state machine).
            If there are no strings or lists in the dataframe, and all the types
            are the same, then we use `dataframe_to_ndarray`.
        */

        let df = self.polars.get_df_unchecked(&identifier)?;

        let dtypes = df
            .dtypes()
            .iter()
            .map(|d| d.to_string())
            .collect::<Vec<_>>();

        let dtype_exists =
            |dtype: &str| -> bool { dtypes.iter().any(|s| s.contains(dtype)) };
        let arr = if !(dtype_exists("list") || dtype_exists("utf8")) {
            RemoteArray {
                identifier: self.df_to_ndarray(&df)?,
            }
        } else if dtype_exists("list") {
            /*
               Here, we assume we have a List[PrimitiveType]
               The idea would be to convert columns into ArrayBase -> merge Vec<ArrayBase> -> ArrayBase
            */
            let col_names = df.get_column_names();
            let vec_series = df
                .columns(&col_names[..])
                .map_err(|e| Status::aborted(format!("Could not get Series in DataFrame: {e}")))?;

            let mut out = vec![];
            for series in vec_series {
                out.push(to_status_error(self.list_series_to_array_store(series))?);
            }

            /*
                Here, we stack on Axis(1) because we would want to create [n_rows, m_cols, k_elems_in_each_item];
            */
            let array = ArrayStore::stack(Axis(1), &out[..])?;
            RemoteArray {
                identifier: self.polars.insert_array(array),
            }
        } else {
            return Err(
                Status::aborted("DataFrame with str columns cannot be converted directly to RemoteArray. Please tokenize strings first"));
        };

        Ok(Response::new(arr))
    }

Another great advantage of Rust is Cargo, its package manager. You can see our build manifest here and notice how easy it is to read and set up (especially compared to other low-level languages like C or C++, which require Makefiles or CMake files…).

[package]
name = "bastionlab"
version = "0.3.7"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[workspace]
members = [
  "bastionlab_common",
  "bastionlab_torch",
  "bastionlab_learning",
  "bastionlab_conversion",
  "bastionlab_polars",
]
resolver = "2"

[dependencies]
bytes = "1.3.0"
tonic = { version = "0.5.2", features = ["tls", "transport"] }
prost = { version = "0.8", default-features = false, features = [
  "prost-derive",
] }
tokio = { version = "1.19.2", features = ["macros", "rt-multi-thread", "net"] }
tokio-stream = "0.1"
serde = "1.0.147"
serde_derive = "1.0.147"
serde_json = "1.0.87"
base64 = "0.13.1"
rand = "0.8.5"
ring = "0.16.20"
hex = "0.4.3"
x509-parser = "0.14.0"
spki = "0.6.0"
http = "0.2.8"
anyhow = "1.0.66"
toml = "0.5.9"
whoami = "1.2.1"
once_cell = "1.13.1"
log = "0.4.17"
env_logger = "0.9.0"
bastionlab_common = { path = "./bastionlab_common" }
bastionlab_polars = { path = "./bastionlab_polars" }
bastionlab_torch = { path = "./bastionlab_torch" }
bastionlab_conversion = { path = "./bastionlab_conversion" }

[dependencies.uuid]
version = "1.1.2"
features = [
  "v4", # Lets you generate random UUIDs
  "fast-rng", # Use a faster (but still sufficiently random) RNG
  "macro-diagnostics", # Enable better diagnostics for compile-time UUIDs,
  "serde",
]

[build-dependencies]
tonic-build = "0.5"

BastionLab's Cargo file

Cargo’s compilation process is transparent in ways that other package managers aren’t: a single command, 'cargo tree', shows the whole dependency structure. Cargo also makes it very easy to specify the target we want to deploy on, so compiling BastionLab for Windows or Mac requires barely any change to the code.

bastionlab v0.3.3 (/home/mithril-dev/bastionlab/server/executable)
├── bastionlab-common v0.3.3 (/home/mithril-dev/bastionlab/server/common)
│   ├── anyhow v1.0.66
│   ├── base64 v0.13.1
│   ├── bincode v1.3.3
│   │   └── serde v1.0.148
│   │       └── serde_derive v1.0.148 (proc-macro)
│   │           ├── proc-macro2 v1.0.47
│   │           │   └── unicode-ident v1.0.5
│   │           ├── quote v1.0.21
│   │           │   └── proc-macro2 v1.0.47 (*)
│   │           └── syn v1.0.104
│   │               ├── proc-macro2 v1.0.47 (*)
│   │               ├── quote v1.0.21 (*)
│   │               └── unicode-ident v1.0.5
│   ├── bytes v1.3.0
│   ├── env_logger v0.9.3
│   │   ├── atty v0.2.14
│   │   │   └── libc v0.2.137
│   │   ├── humantime v2.1.0
│   │   ├── log v0.4.17
│   │   │   └── cfg-if v1.0.0
│   │   ├── regex v1.7.0
│   │   │   ├── aho-corasick v0.7.20
│   │   │   │   └── memchr v2.5.0
│   │   │   ├── memchr v2.5.0
│   │   │   └── regex-syntax v0.6.28
│   │   └── termcolor v1.1.3
│   ├── hex v0.4.3
│   ├── http v0.2.8
│   │   ├── bytes v1.3.0
│   │   ├── fnv v1.0.7
│   │   └── itoa v1.0.4
│   ├── log v0.4.17 (*)
│   ├── once_cell v1.16.0
│   ├── polars v0.25.1
│   │   ├── polars-core v0.25.1
│   │   │   ├── ahash v0.8.2
│   │   │   │   ├── cfg-if v1.0.0
│   │   │   │   ├── getrandom v0.2.8
│   │   │   │   │   ├── cfg-if v1.0.0
│   │   │   │   │   └── libc v0.2.137
│   │   │   │   └── once_cell v1.16.0
│   │   │   │   [build-dependencies]
│   │   │   │   └── version_check v0.9.4
│   │   │   ├── anyhow v1.0.66
│   │   │   ├── arrow2 v0.14.2

BastionLab's cargo tree structure

This transparency is especially necessary when we need to run BastionLab in a Trusted Execution Environment, because we’re dealing with untrusted infrastructure. We’ll go over what a Trusted Execution Environment (or TEE) is in a moment, but this particular requirement strongly drove our choice of Rust from the get-go.

What is a Trusted Execution Environment?

It is a quite complex hardware-based technology that we won’t go into much detail about, since it’s not the focus of this piece. What you need to know is that:

- It’s an isolated execution environment within the processor in which code can be run.

- It cannot be accessed by any human; only the machine will ‘see’ the data.

- It has a whole system of attestations to guarantee that the code running inside it is the code that was sent.

The key point here is that TEEs are only safe if the code running in them is trusted. Rust makes this much easier: it’s low-level, so the code can be stripped of all unnecessary parts, and Cargo allows for transparent management of the libraries. In Python, imports would have injected a lot of code that’s difficult to track. With Rust, all imported code can be easily audited.

Faster than Light

But Rust is also much, much faster than Python, because it’s a low-level language like C. It does apply some additional safety checks that cost a little speed in comparison, like forcing the developer to initialize all declared values. But overall, it has all the advantages of C without the memory pitfalls and the chances to shoot yourself in the foot because of them. Add asynchronous programming and you’ve got mad multi-threading... Our CTO would say it’s magical ✨
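As a small standard-library-only sketch (BastionLab itself uses tokio for async, which we won’t reproduce here), this shows the kind of fearless multi-threading Rust enables: the borrow checker proves the chunks handed to each thread are disjoint, so data races are ruled out at compile time.

```rust
use std::thread;

// Std-only sketch: split a big computation across worker threads.
// The borrow checker guarantees each thread sees a disjoint chunk,
// so there can be no data races, by construction.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers; // ceil division
    thread::scope(|s| {
        data.chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect::<Vec<_>>()        // spawn all the threads first...
            .into_iter()
            .map(|h| h.join().unwrap()) // ...then gather their partial sums
            .sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    println!("{}", parallel_sum(&data, 4)); // 500500, same as a sequential sum
}
```

`thread::scope` (stable since Rust 1.63) lets the threads borrow `data` directly, with the compiler enforcing that they all finish before the slice goes away.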

This speed is reflected in Polars’ execution times, which are much, much faster than Pandas’. For data science collaboration, this is great, because datasets are often at their biggest when they are particularly sensitive, so the user feels the improvement significantly. Pandas goes through the whole dataset when performing any query. On a little dataset that’s not really an issue, but on a very big one it will visit every cell even if nothing will be done with most of them. This is called eager execution. Polars takes a different approach, called lazy mode: thanks to query optimizations, it only touches the dataframe cells that are actually needed, and only starts executing when the result is really required. This makes a world of difference when exploring big data, and using Polars directly from Rust lets us exploit all of its options and configurations. You can see how much in the benchmarks we ran with BastionLab, Polars, and Pandas.
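Polars’ lazy engine is too big to show here, but Rust’s own iterators follow the same principle and make a handy illustration: the chain below is only a description of the work, and nothing is scanned until a consumer actually asks for results.

```rust
// Std-only sketch of lazy evaluation, the same principle Polars' lazy
// mode builds on: the chain below describes the work but performs none of it.
fn main() {
    let cells: Vec<i64> = (0..1_000_000).collect();

    let query = cells
        .iter()
        .filter(|&&x| x % 2 == 0) // nothing has been scanned yet...
        .map(|&x| x * 10);        // ...this is still just a plan

    // Only `take(3)` + `collect` actually pulls data, and it stops after
    // finding 3 matches instead of walking the whole "dataset".
    let first_three: Vec<i64> = query.take(3).collect();
    println!("{:?}", first_three); // [0, 20, 40]
}
```

An eager library would have filtered and multiplied all one million cells before throwing most of the work away; the lazy chain touches only the handful of elements needed to produce the answer.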

Rust Is Still Immature

For all the great sides of Rust, there is one area where it falls short: its ecosystem is still quite young. This was a challenge when dealing with library bindings for PyTorch, for example. Many key libraries aren’t yet available to be installed through Cargo.

We did solve it by making an external call to the library, which slows nothing down and doesn’t compromise the project’s safety. It’s also not that hard to maintain. But it was complicated to do, and the resulting code wasn’t as elegant as it could have been. We used an existing base for the library bindings, but we had to take additional measures to ensure the binding was done well, worked well within the project, and that the library was in the right place… With Cargo, all of this is automatic; here we had to resort to hacks to go further.
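The pattern behind such external calls is Rust’s FFI. As a hedged, minimal sketch (calling `abs` from the system C library, not the actual libtorch bindings), it looks like this:

```rust
// Minimal FFI sketch: declaring and calling a function from an external
// C library. Real PyTorch (libtorch) bindings follow this same pattern at
// a much larger scale, plus build scripts to locate the library on disk.
extern "C" {
    // `abs` lives in the system C library, not in any Rust crate.
    fn abs(x: i32) -> i32;
}

fn main() {
    // Every FFI call is `unsafe`: the compiler cannot check the C side,
    // which is exactly why such bindings demand extra auditing effort.
    let y = unsafe { abs(-42) };
    println!("{y}"); // 42
}
```

Everything inside the `unsafe` block escapes Rust’s guarantees, which is why binding a large C++ library like libtorch this way takes so much more care than pulling a pure-Rust crate from Cargo.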

BastionLab’s Future

We are still building BastionLab’s features, and Rust keeps making that easier for us. For example, we’ve been enjoying many of its features since we started implementing Differential Privacy, a mathematical framework that prevents inferring the input from the output, to ensure privacy. Generic algebraic data types were particularly helpful in representing query plans in a natural manner that facilitated their analysis.
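As an illustrative sketch (not BastionLab’s actual types), this is why algebraic data types suit query-plan analysis so well: `match` must cover every variant, so the compiler itself prevents the privacy check from silently forgetting a case.

```rust
// Illustrative sketch, not BastionLab's real types: an algebraic data type
// modelling a tiny query plan. `match` must be exhaustive, so a privacy
// analysis over the plan cannot silently miss a variant.
enum Plan {
    Scan { table: String },
    Filter { input: Box<Plan> },
    Aggregate { input: Box<Plan> },
}

/// Toy rule: only plans ending in an aggregation are safe to reveal,
/// since they don't return row-level data.
fn is_aggregated(plan: &Plan) -> bool {
    match plan {
        Plan::Aggregate { .. } => true,
        Plan::Filter { input } => is_aggregated(input),
        Plan::Scan { .. } => false,
        // Adding a new variant to `Plan` makes this `match` fail to
        // compile until the analysis handles it explicitly.
    }
}

fn main() {
    let plan = Plan::Aggregate {
        input: Box::new(Plan::Filter {
            input: Box::new(Plan::Scan { table: "patients".into() }),
        }),
    };
    assert!(is_aggregated(&plan));
    assert!(!is_aggregated(&Plan::Scan { table: "patients".into() }));
}
```

That compile-time exhaustiveness is what makes reasoning about query plans feel natural: the type system tracks the cases, not the reviewer.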

BastionLab is open source, so you can check our code on GitHub and help us make it as easy to use and safe as possible. You can go try it with our Quick Tour! It shows how the Titanic dataset can be shared with a remote data scientist while making sure only anonymized results from the passengers aboard the ship are communicated. You can also check our how-to guides for more real-life examples (Covid datasets or bank fraud).

If you are interested, you can also join us on our Discord! We're always very happy to hear your feedback πŸ¦€πŸ¦€


Image credits: Edgar Huneau