Build a privacy-by-design voice assistant with BlindAI

Discover how BlindAI can make AI voice assistant privacy-friendly!


In this article, we will see how AI can answer the challenges of smart voice assistants, and how the privacy issues associated with deploying such AI can be overcome with BlindAI.


One of the most sought-after goals of AI is to develop a machine able to understand us, interact with us, and provide support in our everyday life just like in the movie Her.

Her (2013)

But where are we currently? We all remember the Google Duplex conference, where we were promised a lifelike AI assistant, but such a companion still seems far away today.

One of the key components of real conversational AI is speech. Translating human speech into text is extremely hard. Nonetheless, recent approaches based on large-scale unsupervised learning with Transformer-based models, like Wav2vec2, have opened new horizons. Such models can be trained on huge amounts of unlabelled data; they break sounds down into small tokens and leverage them with attention mechanisms.

Unfortunately, those approaches, which rely on mountains of training data, have one hidden cost: privacy.

Indeed, we have seen in the past that GAFAM's voice assistant solutions have had privacy issues, with much more data being recorded than announced. Many sensitive conversations were recorded and used without people's knowledge, causing a massive uproar.

In view of these privacy breaches, should we refrain from developing speech recognition AIs with life-changing potential? Do we have to choose between privacy and convenience?

At Mithril Security, we believe that there is a third way: democratise privacy-friendly AI to help improve AI systems without compromising on privacy. That is why we have built BlindAI, an open-source and privacy-friendly solution to deploy AI models with end-to-end protection.

We will see in this article how a state-of-the-art Speech-To-Text (STT) model, Wav2vec2, can be deployed so that users can leverage AI without worrying that their conversations could be heard by anyone else.

I - Use case

A - Deep learning for Speech-To-Text

We will see with a concrete example when and how to use BlindAI.

Let’s imagine you have a startup that has developed a state-of-the-art speech-to-text solution to facilitate the transcription of medical exchanges between a patient and their therapist.

Enormous amounts of information come out during a therapy session, so it can never be fully transcribed and much of it gets lost. It can then become quite frustrating for patients to realise that parts of their sessions are forgotten or never put down on paper, because their therapists did not have time to write everything down. Recording everything is quite challenging for therapists, as they handle multiple one-hour sessions all day long.

Therefore, the startup proposes a Speech-To-Text AI solution that enables therapists to focus less on note-taking, creates much more reliable data that can be leveraged for research, and helps patients get better care.

B - Deployment challenge

One natural way for the startup to deploy their AI for therapy transcription is through a Cloud Solution Provider. Indeed, providing all the hardware and software components to deploy this solution is quite complex and costly, and not necessarily the main focus of the startup.

In addition, deploying their AI as a Service makes it easy to integrate with existing solutions, and facilitates onboarding as little is asked of therapists.

Nonetheless, deploying such a solution in the Cloud creates privacy and security challenges. Respecting session confidentiality is key to maintaining patient-doctor privilege, but it is hard to guarantee that therapy sessions are never accessed.

Malicious insiders, such as a rogue admin at the startup or the Cloud provider, or even a hacker with internal access, could compromise this extremely sensitive data.

Because of these non-negligible risks and the lack of privacy-by-design AI production tools, deploying this AI for therapy sessions in the Cloud can become highly complex.

C - Confidential AI deployment

However, by using BlindAI, voice recordings of sessions can be sent to be transcribed by an AI in the Cloud, without ever being revealed in the clear to anyone else. By leveraging secure enclaves, it becomes possible to guarantee data protection end to end, even when data is sent to a third party hosted in the Cloud.

Before and after BlindAI

We see in the diagram above how regular solutions could expose session recordings to the Cloud provider or the startup, and how BlindAI provides a secure environment in which data remains protected at all times.

We cover the threat model and the protections provided by secure enclaves in our article Confidential Computing explained, part 3: data in use protection.

The main idea is that secure enclaves provide a trusted execution environment, enabling people to send sensitive data to a remote environment without risking exposure of that data.

This is done through isolation and memory encryption of enclave contents by the CPU. By sending data to this secure environment to be analysed, patients and therapists have guarantees that, thanks to the hardware protection, their data is accessible neither to the Cloud provider nor to the AI company.

Therefore, both can benefit from a state-of-the-art service without going through a complex on-premise deployment, and adopt a solution that they do not need to maintain themselves. All this, while keeping a high level of data protection, as their data is never exposed to the Service Provider or the Cloud provider.

II - Deployment of confidential voice transcription with BlindAI using Wav2vec2

Now that we have talked about how secure enclaves can be used to deploy models on sensitive data, especially in a Public Cloud setting, we are going to see how to do it in practice with BlindAI.

The goal here is to deploy an AI inside an enclave, so that people can send data to be transcribed by it without ever exposing the audio data to anyone in the clear, thanks to enclave protections.

Workflow with BlindAI

As we did in the article Deploy Transformers with confidentiality, we will follow the same steps to deploy and consume an AI model:

  • Run our inference server, for instance using Docker.
  • Upload the ONNX model inside the inference server using our SDK. By leveraging our SDK, we make sure the IP of the model is protected as well.
  • Send data securely to be analysed by the AI model with the client SDK.

In the same fashion as our other examples, we will only show the simulation mode here. Simulation mode lets you test the server side on any machine, but without the security guarantees provided by machines with Intel SGX in hardware mode.

If you want to run BlindAI in hardware mode, you will need supported hardware, and you will need to install the proper Intel drivers. Learn more about them in this documentation page.

For this Speech-To-Text use case, we will use Wav2vec2, a state-of-the-art Transformer model for speech. You can learn more about it in FAIR's blog post.

A notebook containing all the steps is available here.

A - Launch server

The first step is similar to the Transformers article: we just need to deploy our server using our Docker image.

docker run -p 50051:50051 -p 50052:50052 mithrilsecuritysas/blindai-server-sim
Deploy our simulation Docker image for the inference server

B - Upload model

Because BlindAI only accepts AI models exported in ONNX format, we will first need to convert the Wav2vec2 model to ONNX. ONNX is a standard format for representing AI models before shipping them into production. PyTorch and TensorFlow models can easily be converted to ONNX.

Step 1: Prepare the Wav2vec2 model

We will load the Wav2vec2 model using the Hugging Face transformers library.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
Load model and processor for Wav2vec2

To facilitate deployment, we will add the post-processing directly to the full model. This way, the client will not have to do the post-processing itself.

import torch.nn as nn

# Let's embed the post-processing phase with argmax inside our model
class ArgmaxLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, outputs):
        return torch.argmax(outputs.logits, dim=-1)

final_layer = ArgmaxLayer()

# Finally we concatenate everything
full_model = nn.Sequential(model, final_layer)
Add postprocessing to the model we will export
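To see what this extra layer does, here is a small standalone sketch with made-up logits: for each audio frame, argmax keeps the index of the highest-scoring token.

```python
# Standalone illustration of the argmax post-processing on dummy logits.
import torch

# 1 batch, 2 audio frames, 3 candidate tokens (made-up scores)
logits = torch.tensor([[[0.1, 2.0, 0.3],
                        [1.5, 0.2, 0.1]]])
predicted_ids = torch.argmax(logits, dim=-1)
print(predicted_ids.tolist())  # [[1, 0]]
```

These per-frame token ids are what `processor.batch_decode` later turns back into text.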

We can download a "hello world" audio file to use as an example. Let's download it.

Get "Hello world" audio sample
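If you do not have the recording at hand and just want to exercise the pipeline, a stand-in file can be generated with Python's standard library. Note that this synthetic sine wave will of course not transcribe to "hello world"; substitute any real 16 kHz speech recording for a meaningful result.

```python
# Generate a placeholder hello_world.wav (1 second, 16 kHz, mono sine wave).
# This is only a stand-in to test the pipeline, not a real speech sample.
import math
import struct
import wave

rate = 16000
samples = [int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / rate))
           for t in range(rate)]
with wave.open("hello_world.wav", "wb") as wav_file:
    wav_file.setnchannels(1)    # mono
    wav_file.setsampwidth(2)    # 16-bit samples
    wav_file.setframerate(rate)
    wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))
```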

We will need the librosa library to load the "hello world" wav file before tokenizing it.

import librosa

audio, rate = librosa.load("hello_world.wav", sr = 16000)

# Tokenize sampled audio to input into model
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values
Load and preprocess audio file

We can then see the Wav2vec2 model in action:

>>> predicted_ids = full_model(input_values)
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription
Inference result

Step 2: Export the model

Now we can export the model in ONNX format, so that we can later feed the ONNX file to our BlindAI server.

# Export the full model (Wav2vec2 + argmax layer) to ONNX.
# The output file name below is our own choice for this walkthrough.
torch.onnx.export(
    full_model,
    input_values,
    "wav2vec2_hello_world.onnx",
    export_params = True,
    opset_version = 11)
Export to ONNX file

Step 3: Upload the model

Now we can simply upload the model to our backend in simulation mode. Here we need to specify that the inputs are floats and the outputs are integers.

from blindai.client import BlindAiClient, ModelDatumType

# Launch client
client = BlindAiClient()

client.connect_server(addr="localhost", simulation=True)

# Upload the ONNX model, declaring the input shape and the input/output types.
# Argument names may differ slightly depending on your BlindAI version.
client.upload_model(
    model="./wav2vec2_hello_world.onnx",
    shape=input_values.shape,
    dtype=ModelDatumType.F32,
    dtype_out=ModelDatumType.I64,
)

Upload model to BlindAI

C - Get prediction

Now it's time to check that it works live!

As previously, we will need to preprocess the hello world audio, before sending it for analysis by the Wav2vec2 model inside the enclave.

First we prepare our input data, the hello world audio file.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

audio, rate = librosa.load("hello_world.wav", sr = 16000)

# Tokenize sampled audio to input into model
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values
Loading and preprocessing of audio file

Now we can send it to the enclave.

from blindai.client import BlindAiClient

# Load the client
client = BlindAiClient()
client.connect_server("localhost", simulation=True)

# Get prediction
response = client.run_model(input_values.flatten().tolist())
Sending data to BlindAI for confidential prediction

We can reconstruct the output now:

>>> processor.batch_decode(torch.tensor(response.output).unsqueeze(0))
Response decoding
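The `unsqueeze(0)` call simply restores the batch dimension that was flattened away before sending. Here is a standalone sketch with dummy token ids standing in for the server response:

```python
# The server returns a flat list of token ids; unsqueeze(0) restores the
# batch dimension that processor.batch_decode expects.
import torch

flat_output = [7, 7, 0, 4, 4]  # dummy token ids, not a real response
batch = torch.tensor(flat_output).unsqueeze(0)
print(tuple(batch.shape))  # (1, 5)
```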


Et voilà! We have been able to apply a state-of-the-art speech recognition model without ever showing the data in the clear to the people operating the service!

If you have liked this example, do not hesitate to drop a star on our GitHub and chat with us on our Discord!