Our Roadmap to Build a Simple Privacy Toolkit for Data Science Collaboration
By Edgar Huneau

A year and a half on, Mithril Security’s roadmap has changed significantly, but our initial goal remains the same: democratizing privacy in data science.

Daniel Huynh
Update Sept. 2023: Since this article's publication, Mithril Security has expanded its product range to better support LLM deployments. While BlindAI still ensures complete data confidentiality, we've introduced BlindLlama, a high-performance, secure solution for LLM deployment on GPUs. We have also discontinued BastionLab in favor of more effective solutions.
For a comprehensive view of our future plans and product developments, check out our recently published roadmap.

We now aim to provide a simple privacy toolkit for data science collaboration, covering all the steps: data exploration, AI training, machine learning, and deployment.

Our initial approach was to use Intel SGX secure enclaves to deploy AI models with privacy, which led us to build BlindAI. We then moved on to AI training with BastionAI, using Trusted Execution Environments (TEEs) more generally, such as AMD SEV-SNP.

We are now adding BastionLab, a framework that lets data owners control how data scientists access the data they open up.

In this roadmap, we’ll present how our products have evolved, where we stand with them today, and where we are headed.

Phase 1 - AI deployment

Our first product, BlindAI, makes it much easier to keep data confidential when it is sent to AI models in production. You can now deploy state-of-the-art AI models like OpenAI Whisper or GPT-Neo with privacy in two lines of code (literally). This was made possible by BlindAI Cloud, a managed solution that relies on Intel SGX to protect data in use with end-to-end encryption.

Security rests on secure enclave attestation, and BlindAI covers the other pieces required as well: the operators needed for AI inference, encryption, and so on.

Since we have covered most of the different pieces needed to deploy AI models with privacy, we will focus less on BlindAI.

The only thing left for us to do is to assess the security level our solution has reached. We will do so in January 2023 through an external security audit, whose results will be made public.

Phase 2 - AI training

The second step was logical. Since we could deploy AI models with privacy, we also wanted to be able to train them with privacy.

Many scenarios involve training or fine-tuning AI models, not just deploying them. When Nvidia Confidential GPUs were announced, we saw an opportunity to create a new generation of privacy-by-design AI tools. With them, we could help data scientists train their models on confidential data with privacy while maintaining good performance.

The BastionAI project was born! BastionAI is a confidential training framework that relies on TEEs to tackle the challenge of training on confidential data. By taking a fortified learning approach rather than a federated one, we have been able to address the deployment, scalability, and security challenges that federated learning faces.

BastionAI has shown promising initial performance in the multiparty training setup. On some benchmarks (the training of an EfficientNet model, for example), we were 3 times faster than PyTorch with Flower, a federated learning framework.

Phase 3 - A privacy toolkit for data science collaboration

While our work on BastionAI and BlindAI has helped answer the privacy needs in deep learning training and deployment, many use cases do not require the use of deep learning.

In many scenarios, data owners only need privacy guarantees when they open their data to remote data scientists for data exploration, statistics, or simple ML.

That is why we have broadened BastionAI to include other steps of the data science pipeline: data exploration, visualization, statistics, and ML.

So BastionLab was born: a privacy toolkit for every step of the data science workflow.

BastionLab aims to help data owners give remote data scientists access to their data while minimizing data exposure as much as possible. By combining various privacy technologies, such as TEEs, Differential Privacy, access control for data science, and PKI, we aim to provide a comprehensive solution that covers different scenarios:

  • One data owner exposes their data to remote data scientists while ensuring that only anonymized data leaves their infrastructure.
  • Multiple data owners pool their data to train an AI model without exposing it in the clear to any other party.
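To make the first scenario more concrete, here is a minimal sketch in plain Python of one kind of rule a data owner could enforce: a grouped statistic may only leave the infrastructure if every group aggregates at least a minimum number of rows. This is not BastionLab's API. The names `AggregationPolicy`, `PolicyViolation`, and `grouped_mean`, as well as the toy rows, are hypothetical and only illustrate the idea. A size threshold alone is not full anonymization; it is one building block among others.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class AggregationPolicy:
    """Hypothetical policy: every returned group must aggregate >= min_rows rows."""
    min_rows: int

    def allows(self, group_sizes: Sequence[int]) -> bool:
        return all(size >= self.min_rows for size in group_sizes)


class PolicyViolation(Exception):
    """Raised when a query result would break the data owner's policy."""


def grouped_mean(rows: List[dict], key: str, value: str,
                 policy: AggregationPolicy) -> Dict[str, float]:
    """Compute the mean of `value` per `key`, but only release it if the policy allows."""
    groups: Dict[str, List[float]] = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])

    # The check runs on the data owner's side, before anything is returned.
    if not policy.allows([len(values) for values in groups.values()]):
        raise PolicyViolation(
            f"each group must aggregate at least {policy.min_rows} rows"
        )
    return {name: mean(values) for name, values in groups.items()}


# Toy rows standing in for a real dataset (made-up values).
rows = [
    {"cls": "first", "fare": 80.0},
    {"cls": "first", "fare": 100.0},
    {"cls": "third", "fare": 8.0},
    {"cls": "third", "fare": 12.0},
]

released = grouped_mean(rows, "cls", "fare", AggregationPolicy(min_rows=2))
```

With `min_rows=2` the per-class means are released. With `min_rows=3` the same call raises `PolicyViolation`, because each group only contains two rows.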

To get started, you can visit our Quick Tour tutorial. It shows how a data owner can put a privacy policy in place when exposing a dataset (we used the Titanic dataset as an example) and let an external data scientist explore it while always respecting that policy.

Next steps

Our roadmap is public on Notion and can be found here.

In the coming months

We will focus first on improving the tooling for Exploratory Data Analysis, including data frame handling with Polars and data visualization with Seaborn. The merge with BastionAI will follow soon.

We also have on the roadmap a few security and privacy core features:

  • Authentication with PKI
  • Access control for data science, driven by a privacy policy
  • Differential Privacy
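As an illustration of the last item, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It is a textbook construction, not BastionLab's implementation. The function names `laplace_noise` and `dp_count` and the sample ages are hypothetical. The sketch assumes a count query, whose sensitivity is 1, and uses ordinary floating-point sampling, which is fine for illustration but not hardened for production use.

```python
import random
from typing import Callable, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values: Iterable[float], predicate: Callable[[float], bool],
             epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values matching `predicate`.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up ages; the data scientist only ever sees the noisy answer.
ages = [22, 38, 26, 35, 54, 2, 27, 14]
noisy_adults = dp_count(ages, lambda age: age >= 18, epsilon=1.0,
                        rng=random.Random())
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means a more accurate answer and weaker privacy. Repeated queries consume privacy budget, which a real system would have to track.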

In the coming years

In the future, we want BastionLab to cover the full data scientist stack to enable privacy-friendly data sharing. We have different high-level steps in mind to achieve this.

⸱ Coverage of other frameworks

Many frameworks still need privacy support, either through a finer-grained access control policy or by running them inside TEEs.

To name a few: TensorFlow, Spark, pandas…

⸱ Broader deployment methods

Being able to deploy BastionLab easily is key. We therefore plan to make it easy to integrate into existing Cloud infrastructures such as GCP, Azure, or AWS.

⸱ Enterprise features

As our long-term goal is to enable organizations to collaborate, providing a ready-to-go solution for enterprises is very important to us.

These include:

  • Hosting and management
  • User and role access management
  • Privacy compliance (GDPR, HIPAA, CCPA, etc.)
  • SSO

I hope our project has caught your interest, and we would be glad to have your opinion on it. Do not hesitate to star our GitHub and join our Discord to talk with us!
