Open Source Is Crucial for AI Transparency but Needs More Tooling


AI model traceability is crucial, but open-source practices alone are inadequate. Combining new software and hardware-based tools with open sourcing offers potential solutions for a secure AI supply chain.

Daniel Huynh,
Jade Hardouin

In a previous article, we showed how to hide a malicious model on a model hub to spread false information. 

Such poisoning is hard to prevent today because there is no industry-wide solution for establishing the provenance of specific weights, that is, which data and code were used to produce a model. Without such traceability, attackers can hide malicious behaviors in model weights and spread misaligned AI models.

Key Takeaways:

  • The industry currently has no solution for providing technical proof of an AI model's provenance, that is, which code and data were used to train it.
  • Open source is a necessary condition but is not sufficient to solve that issue.
  • Different solutions, whether software- or hardware-based, can provide varying levels of guarantees about the provenance of AI models.
  • These solutions still rely on outside verifiers scrutinizing the code and data, which makes the combination of open source and traceability tooling the most reliable path to a transparent AI supply chain.

In this article, we explain why open-sourcing is necessary to make AI transparent, and why traceability tools are still required to make it fully effective. We then show how recent hardware- and software-based solutions complement open-source approaches to build a robust AI supply chain.

Why AI traceability is critical

With the growing popularity of large language models (LLMs), concerns have emerged about their transparency and safety. AI models today are shipped as black-box binaries: just a collection of floating-point numbers arranged in matrices.

We understand little about how they work, and running one today is the equivalent of executing an unknown binary. As you might expect, this can have dire consequences: neural networks can hide arbitrary behavior in their weights, and their outputs can be unsafe or create security issues.

Several papers have shown, for instance, that it is possible to hide backdoors inside the weights of a neural network. In our PoisonGPT example, we modified the weights so that the model claimed the first man on the Moon was Yuri Gagarin instead of Neil Armstrong. While this seems rather benign, much worse can be done: an LLM installed locally could suggest or even execute ransomware or other malicious code, with terrible outcomes.

Knowing the provenance of AI models with certainty will also help address copyright issues, as several major AI providers face lawsuits over their suspected heavy use of copyrighted data in training. The EU AI Act is also set to require AI providers to disclose and document the provenance of their training sets and to be fully transparent about their training procedures.

Therefore, to trust the outputs of AI models, knowing their provenance (the code and training data used to produce them) is of utmost importance. However, no widely available tooling on the market can produce that kind of proof.

Note that software supply chain security is not a new problem, and the software engineering community is actively working on tracing the provenance of software. The SolarWinds attack highlighted this issue: a backdoor was injected early in the build process.

For instance, SLSA (Supply-chain Levels for Software Artifacts) is an industry-wide effort to define standards for build integrity.

But how are AI models different from regular software? An AI model can be decomposed into two parts: the code that runs inference (for instance, Hugging Face, ONNX Runtime, or PyTorch) and the weights themselves. While the inference code can be inspected with regular software practices, the weights are much harder to audit.

A first approach is to examine the model alone, without its training data and code, and try to derive properties from it. Static analysis is poorly suited to weights, which are just a specific configuration of floating-point numbers in a series of matrix multiplications. Extracting meaningful properties statically seems quite challenging.

Dynamic approaches can also be used, running various benchmarks to see whether the model behaves in a certain way. While this gives a better sense of the model and scores it against understandable metrics, backdoors can still go unnoticed. An attacker can, for instance, surgically modify the model to behave maliciously on specific inputs without changing the rest of its behavior, or explicitly minimize changes on these benchmarks while editing the model.

Therefore, trying to make sense of an AI model as is, without information about the code and data used for training, is tricky. Hypothetically, if we had guarantees that a model results from applying a specific code to a specific dataset, we would only need to audit the code and data. If the dataset contains no bias or malicious artifacts, and the code has no backdoors, then the resulting model should behave properly. Of course, there are alignment issues, but those are beyond the scope of this article, which focuses on traceability.

Many open-source initiatives have worked on making both the code and the data used for training more transparent. This partially solves the issue, but unfortunately it does not provide the level of traceability needed to create more trustworthy AI.

Why isn't open source alone sufficient for model traceability?

To understand why open-sourcing the code and data is insufficient in practice to make AI models traceable, let’s first see how, in theory, it could make them traceable.

Imagine Alice has a model M, produced from dataset D and code C. Alice tries to convince Bob that her model is the result of applying C to D.

The only way for Bob to verify her claim is for Alice to make M, D, and C openly available so that he can reproduce her results.

If Bob applies the same code C to dataset D and obtains exactly the same model M, then M indeed comes from C and D: given the space of all possible models, it is extremely unlikely that Bob would arrive at M with a different C or D.

By reproducing Alice's experiment, Bob can assert that M is indeed the result of applying C to D, so inspecting C and D for trustworthiness confirms the trustworthiness of M. This concept is at the heart of science: reproducibility is what allows us to verify Alice's claim.
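To make the reproducibility argument concrete, here is a minimal Python sketch under a strong simplifying assumption: training is a tiny, fully deterministic procedure. The function `train` is a hypothetical stand-in for code C (plain gradient descent on a linear model with a fixed seed and fixed data order), and `fingerprint`, a SHA-256 hash over the exact bytes of the weights, is our illustrative choice of how Alice publishes M. Bob re-runs C on D and compares fingerprints.

```python
import hashlib
import random
import struct

def train(dataset, epochs=50, lr=0.01, seed=0):
    """Hypothetical stand-in for training code C: SGD on a linear model y = w*x + b."""
    rng = random.Random(seed)            # seeded RNG: initialization is reproducible
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in dataset:             # fixed data order, no shuffling
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return [w, b]

def fingerprint(weights):
    """SHA-256 over the exact IEEE-754 bytes of the weights."""
    raw = b"".join(struct.pack("<d", v) for v in weights)
    return hashlib.sha256(raw).hexdigest()

# Dataset D, published by Alice alongside code C.
D = [(x / 10, 3 * (x / 10) + 1) for x in range(20)]

alice_model = train(D)
published_hash = fingerprint(alice_model)       # Alice publishes M's fingerprint

bob_model = train(D)                            # Bob re-runs C on D
print(fingerprint(bob_model) == published_hash)  # True: bit-identical weights

tampered = list(alice_model)
tampered[0] += 1e-12                            # a tiny edit to the weights
print(fingerprint(tampered) == published_hash)   # False
```

The check only works because every bit is reproduced; any source of run-to-run variation breaks the hash comparison, which is exactly what real training runs suffer from.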

However, two major issues make reproducibility extremely difficult or impossible, even if data and code are made open:

  • The first issue relates to randomness in software and hardware: even with the same code and data, running the training process multiple times can lead to different results. Training relies on algorithms with random initialization and sampling, which introduce variability. This is not unique to AI, but it is accentuated by the use of GPUs, whose results can vary from run to run due to factors such as memory access patterns, parallel processing, and optimization techniques.
  • The second issue stems from economic and ecological constraints on reproducing experiments. Even if it were possible to reproduce a model with exactly the same parameters as the original, not all companies have the resources to train large-scale models themselves, and the computational power and energy consumed by reproducing AI models are an environmental concern. For instance, the cost of reproducing Llama 65B is estimated at 5 million dollars.
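The hardware part of the first issue can be illustrated without a GPU. Floating-point addition is not associative, so summing the same numbers in a different order, as parallel reductions may do depending on how work is scheduled, can give bitwise-different results. The sketch below (plain Python; `sequential_sum` is an illustrative helper, not a library function) shows this with a few hand-picked values. In a real training run, such last-bit differences can compound over many steps.

```python
# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)   # 0.6000000000000001 0.6 False

def sequential_sum(values):
    """Add values one by one, left to right (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers summed in two orders, as two different
# parallel reduction schedules might do.
print(sequential_sum([1e16, 1.0, -1e16]))   # 0.0 (the 1.0 is absorbed by rounding)
print(sequential_sum([1e16, -1e16, 1.0]))   # 1.0
```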

These two issues make AI reproducibility nearly impossible. Claims about model provenance are therefore non-falsifiable: no matter how many attempts Bob makes with code C and data D, he won't obtain M.

This non-falsifiability is a big issue: honest AI model builders cannot prove that they used an honest dataset, for instance, one free of copyrighted material, bias, and personally identifiable information (PII). Conversely, malicious actors can claim that their backdoored model comes from trustworthy sources, and no one can prove them wrong!

New traceability tools could bridge the gap

Therefore, open source needs additional tools to achieve more transparency about model provenance. We need a method to connect the model's weights to the data and code used during training. To understand how this could be achieved, sausage making is a good analogy.

Think of a model as a sausage. Just as a sausage is made of different ingredients that go through various processes, an AI model consists of components like weights and datasets that undergo training and tuning. As a consumer, your primary concern is that the product (or model) is safe to eat (gives you accurate and harmless outputs).

The United States Department of Agriculture (USDA) stations inspectors throughout the whole process, from the ranch to the factory, to watch for malicious tampering or genuine human error. Similarly, in the AI context, a tool can keep track of every process performed during the model's training, so that if anyone tries to poison the model, the tool provides a trace of the actions performed.

There are two main families of approaches to building such tools that oversee each step of the training process: software-based ones, which rely on mathematics, and hardware-based ones, which rely on secure hardware such as Trusted Platform Modules (TPMs).

Software-based approach

Since models are hard to reproduce exactly, a more tractable approach is to check, at each step of the training process, that reproduced weights are close enough to the reported ones to make the claimed provenance likely.

For instance, the paper Tools for Verifying Neural Models’ Training Data provides an overview of how such approaches could work.

One approach is segment-wise retraining. Rather than retraining the full model, we can ask the model developer to supply a sequence of intermediate checkpoints that led to the model. A verifier can then retrain only brief segments between pairs of checkpoints: starting from one checkpoint, it retrains and confirms that the resulting weights are close to the reported next checkpoint. This sort of check can ensure that most of the segments in the training run are accurate.
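The procedure above can be sketched in Python on a toy problem. Everything here is an illustrative assumption rather than the method from the paper: `sgd_segment` is a hypothetical training code (minibatch SGD on a linear model), the developer reports a checkpoint after every 4 batches together with the data of each segment, and the verifier retrains a random sample of segments and accepts them if the retrained weights land within a tolerance `tol` of the reported next checkpoint. To mimic non-bit-identical hardware, the verifier sums each batch's gradient in reverse order.

```python
import math
import random

def sgd_segment(weights, batches, lr=0.05):
    """Hypothetical training code C: minibatch SGD on a linear model y = w . x."""
    w = list(weights)
    for batch in batches:
        grad = [0.0] * len(w)
        for x, y in batch:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(batch)
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w

def distance(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def verify_segments(checkpoints, segment_batches, n_checks, tol, rng):
    """Retrain a random sample of segments; each must land within tol of the next checkpoint."""
    for i in rng.sample(range(len(segment_batches)), n_checks):
        # Reversing each batch changes the floating-point summation order, mimicking
        # a verifier whose hardware does not reproduce the original run bit for bit.
        batches = [list(reversed(b)) for b in segment_batches[i]]
        retrained = sgd_segment(checkpoints[i], batches)
        if distance(retrained, checkpoints[i + 1]) > tol:
            return False, i              # segment i does not match the reported checkpoints
    return True, None

# Developer side: train while recording checkpoints and the data of each segment.
rng = random.Random(0)
true_w = [2.0, -1.0, 0.5]
data = []
for _ in range(640):
    x = [rng.gauss(0, 1) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))
batches = [data[k:k + 32] for k in range(0, len(data), 32)]              # 20 batches
segment_batches = [batches[k:k + 4] for k in range(0, len(batches), 4)]  # 5 segments

checkpoints = [[0.0, 0.0, 0.0]]
for segment in segment_batches:
    checkpoints.append(sgd_segment(checkpoints[-1], segment))

# Verifier side: spot-check 3 of the 5 segments.
print(verify_segments(checkpoints, segment_batches, 3, 1e-6, random.Random(1)))  # (True, None)

# A dishonest developer alters one intermediate checkpoint.
forged = [list(c) for c in checkpoints]
forged[3][0] += 0.1
print(verify_segments(forged, segment_batches, 5, 1e-6, random.Random(1))[0])   # False
```

Spot-checking is what keeps this cheap: if a fraction f of the segments were faked, k randomly chosen checks miss all of them with probability at most about (1 - f)^k, so more checks tighten the guarantee at a higher compute cost.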

Instead of re-executing training, an even cheaper and complementary strategy is to simply check that the data points were memorized by the model. In this case, memorization (“overfitting,” as it’s known in machine learning) works in our favor: if the model overfits to particular data points, that’s a trace we can use to confirm that the model was trained on that data. By looking at a small random subset of the training points in a segment and confirming that they were all somewhat memorized, we can confirm that nearly all the reported data points were included with high probability.
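A minimal Python sketch of the memorization check follows. The specific test is our assumption, a simple heuristic swapped in for illustration rather than the paper's exact procedure: a claimed training point counts as "memorized" if its loss is below the 10th percentile of losses on fresh reference points the model never saw, and the check samples `k` claimed points and requires all of them to pass. To keep the toy clear-cut, labels are random, so a low loss can only come from memorization, and the hypothetical model has more parameters than training points.

```python
import random

rng = random.Random(0)
DIM = 40   # more parameters than training points, so the model can memorize them

def draw(n):
    """Sample n points with Gaussian features and random labels."""
    return [([rng.gauss(0, 1) for _ in range(DIM)], rng.gauss(0, 1)) for _ in range(n)]

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, point):
    x, y = point
    return (predict(w, x) - y) ** 2

def train(points, lr=0.05, steps=1000):
    """Hypothetical training run: full-batch gradient descent on squared loss."""
    w = [0.0] * DIM
    for _ in range(steps):
        grad = [0.0] * DIM
        for x, y in points:
            err = predict(w, x) - y
            for j in range(DIM):
                grad[j] += err * x[j] / len(points)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def looks_memorized(w, claimed, reference, k=5, quantile=0.1):
    """Sample k claimed training points; accept only if each has a lower loss
    than the given quantile of losses on reference points the model never saw."""
    ref_losses = sorted(loss(w, p) for p in reference)
    threshold = ref_losses[int(quantile * len(ref_losses))]
    return all(loss(w, p) < threshold for p in rng.sample(claimed, k))

train_set = draw(8)
w = train(train_set)
held_out = draw(200)     # fresh points from the same distribution
never_used = draw(8)     # points a dishonest developer claims were in the training set

print(looks_memorized(w, train_set, held_out))   # True
# Passes only if all 5 sampled points fall below the threshold by chance (about 0.1**5).
print(looks_memorized(w, never_used, held_out))
```

The sampling argument mirrors the text: a point that was never trained on lands below the threshold only about 10% of the time, so requiring all k sampled points to pass makes a false claim unlikely to survive.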

This still leaves a few challenges, such as confirming that a model wasn’t trained on a tiny amount of additional data or otherwise faked in ways that fool our barrage of tests. To counter this, we may need to turn to hardware.

Hardware-based approach

Another approach is to use secure hardware, such as Trusted Platform Modules (TPMs), to ensure the integrity of the whole chain. Such devices can attest the whole stack used to produce the model, from the UEFI firmware through the OS all the way to the code and data.

The TPM PCRs (Platform Configuration Registers) are a set of registers within the TPM that store measurements of system configuration and integrity. They can be thought of as a log of the system state: each new measurement is hashed together with the register's current value, so the final value depends on every measurement and on their order. PCRs are typically used to attest to a system's integrity, that is, to verify that it has not been tampered with.

When a system boots, various measurements are taken, such as hashes of firmware, boot loaders, and critical system files. These measurements are then extended into the TPM PCRs, and the resulting values can be compared against known-good values to detect tampering.
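The extend operation is what makes a PCR tamper-evident, and it can be simulated in a few lines of Python with `hashlib`. This is a software model of a SHA-256 PCR, not a call into a real TPM; the component byte strings are hypothetical placeholders for firmware, bootloader, and kernel images. A verifier replays a known-good list of components to obtain a reference ("golden") value and compares it with what the machine reports.

```python
import hashlib

def extend(pcr: bytes, component: bytes) -> bytes:
    """Measure a component (SHA-256) and extend it into the PCR:
    PCR_new = SHA-256(PCR_old || SHA-256(component))."""
    measurement = hashlib.sha256(component).digest()
    return hashlib.sha256(pcr + measurement).digest()

def replay(components) -> bytes:
    """Recompute the PCR value expected after measuring components in boot order."""
    pcr = bytes(32)  # most SHA-256 PCRs start as 32 zero bytes after reset
    for component in components:
        pcr = extend(pcr, component)
    return pcr

# Hypothetical boot components; real ones would be firmware and bootloader images, etc.
known_good = [b"uefi-firmware-v1", b"bootloader-v2", b"kernel-6.1"]
golden_value = replay(known_good)   # reference value published by the platform owner

tampered_boot = [b"uefi-firmware-v1", b"bootloader-v2-patched", b"kernel-6.1"]
print(replay(known_good) == golden_value)        # True
print(replay(tampered_boot) == golden_value)     # False: modified bootloader detected
print(replay(known_good[::-1]) == golden_value)  # False: order matters too
```

Because each value is a hash of the previous one, a component cannot be swapped or reordered without changing the final register, and software cannot "un-extend" a PCR to hide a bad measurement.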

Measuring the whole hardware/software stack and binding the final weights to it (by registering them in the last PCR) makes it possible to derive certificates that contain irrefutable proof of model provenance.
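The binding step can be sketched by extending hashes of the training code, the dataset, and finally the weights into one register, then having the device sign the result. This is a simplified software simulation under loudly stated assumptions, not AICert's implementation or a real TPM API: the artifacts are hypothetical byte strings, and an HMAC stands in for the TPM's quote signature. A real TPM signs with an asymmetric attestation key, so verifiers would only need the public key, whereas here the verifier must share the secret.

```python
import hashlib
import hmac

def extend(pcr: bytes, data: bytes) -> bytes:
    """Measure data and fold it into the register: SHA-256(PCR || SHA-256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()

def measure_training(code: bytes, dataset: bytes, weights: bytes) -> bytes:
    pcr = bytes(32)
    for artifact in (code, dataset, weights):   # the produced weights are extended last
        pcr = extend(pcr, artifact)
    return pcr

# Stand-in for the TPM's attestation key (assumption: HMAC instead of an
# asymmetric signature, because Python's standard library has none).
DEVICE_KEY = b"hypothetical-device-secret"

def issue_certificate(code, dataset, weights):
    pcr = measure_training(code, dataset, weights)
    signature = hmac.new(DEVICE_KEY, pcr, hashlib.sha256).digest()
    return {"pcr": pcr, "signature": signature}

def verify_certificate(cert, code, dataset, weights) -> bool:
    expected = hmac.new(DEVICE_KEY, cert["pcr"], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, cert["signature"]):
        return False                            # certificate was not issued by the device
    # Recompute the measurement from the published code, data, and weights.
    return cert["pcr"] == measure_training(code, dataset, weights)

code = b"def train(): ..."          # published training script (hypothetical)
dataset = b"row1\nrow2\nrow3\n"     # published training data (hypothetical)
weights = b"\x00\x01\x02\x03"       # serialized model weights (hypothetical)

cert = issue_certificate(code, dataset, weights)
print(verify_certificate(cert, code, dataset, weights))              # True
print(verify_certificate(cert, code, dataset, b"\x00\x01\x02\xff"))  # False: swapped weights
```

Note that the verifier can only recompute the expected value if it has the exact code and data, which is why this kind of certificate works best alongside openly published training artifacts.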

Combining traceability tools and open source for traceable AI

While software- and hardware-based tools have the potential to solve the provenance issue, they both rely on the same assumption: the verifier must have full access to the training code and data to confirm that the model is indeed the result of them.

Those tools only bind the weights to the data and code, which creates accountability: if AI builders maliciously poisoned the code and/or data, this can be detected a posteriori. However, detection only happens once the code and data are reviewed, which offers no a priori security. Continuous review of the code and data by the community can help turn this a posteriori security into a priori security. Thus, adopting an open approach where the whole process is visible to everyone helps create trust in these traceability tools, whether they are software-based, hardware-based, or both.

This does not mean that traceability tools cannot be used when the intellectual property (IP) of the code and data is sensitive and cannot be disclosed publicly. If a model has been trained on confidential data with proprietary code, it is still possible to derive some level of trust.

This would require a trusted third party to audit the process, for instance, a regulatory body. After verifying the model using traceability tools, this entity can then commit and sign a report saying that a specific set of weights is deemed trustworthy.

Suppose this model is then served as a SaaS, so that the weights are never exposed through on-premise distribution. Several issues still remain:

  • What proof do we have that the model loaded in the backend is actually the one that was audited?
  • Even if the auditor acts in good faith and is competent, it is unlikely that its scrutiny will be as strong and systematic as that of the whole community.

Therefore, the best way to build a robust and trustworthy AI supply chain is likely to combine the latest AI traceability solutions with an open approach that enables scrutiny by the whole community.


We have seen in this article why AI provenance is key to building trust in AI systems, since the universal GIGO (Garbage In, Garbage Out) principle applies.

Opening up the data and code is an excellent first step, but it is insufficient from a technical point of view because of reproducibility issues. New hardware and software tools could bridge the gap and ensure that models are traceable. Combining visibility into the training data and code, to check that they are trustworthy, with a binding between them and the model actually produced creates a solid foundation for a secure AI supply chain.

At Mithril Security, we aim to build such traceability tooling using secure hardware. Our upcoming open-source project, AICert, will leverage TPMs to create AI model certificates where the model weights can be cryptographically bound to the code and training data.


Image credits: Edgar Huneau