How to Prevent Cheating in AI Tests
Methods of cheating on AI tests include swapping models after testing and overfitting on known test sets. A solution would be a secure infrastructure for verifiable tests that protects the confidentiality of both model weights and test data.
The rapid global adoption of LLMs for a wide variety of critical use cases raises legitimate concerns about the security and accuracy of these models. While tests exist to evaluate models' performance and safety, there are insufficient safeguards against "cheating" techniques such as model swapping, post-test modifications, or deliberate overfitting on known test data.
In this article, we outline a technical solution to prevent AI test cheating: a secure infrastructure that makes tests verifiable while preserving the confidentiality of test assets (model weights and test data).
Why AI Testing is so Important
The rapid adoption of AI in high-stakes domains (it would be a challenge to name a domain NOT trying to use LLMs for at least some tasks today) raises many concerns about the reliability of AI models. Assessing a model's behavior requires new techniques and tests compared to classic software. Indeed, from a human perspective, AI models are opaque: black boxes whose reasoning cannot be directly inspected. Their billions of parameters cannot be reviewed like traditional software code. Additionally, their virtually unlimited range of use cases makes testing a model on a single use case far less meaningful than it is for standard software.
In the past year, numerous organizations dedicated to testing AI models have emerged in response to the growing need for new methods to assess model behaviors. Their mission is twofold: evaluating model security and assessing performance. Various entities, including AI Safety Institutes, corporations, and independent evaluation firms, are working on this crucial task.
Developing valid AI safety tests is essential, as controlling LLM behavior remains a critical challenge in the current AI revolution. For instance, a key concern is ensuring that users cannot exploit models to extract confidential data. While this complex topic has attracted the attention of many experts in the field, AI tests are often criticized for being fallible due to insufficient safeguards against forms of cheating.
Key Methods of AI Test Manipulation
In this section, we'll detail two key ways AI builders could cheat tests: overfitting models on test sets and model swapping.
Swapping Models Post-Testing
The sheer size and internal complexity of large AI models create a significant challenge in verifying their integrity after testing. This makes it possible for an unethical or malicious actor to swap out a tested model with a different version without detection.
As we demonstrated last year with our PoisonGPT experiment, two versions of the same model can appear indistinguishable at first glance: they may have the same performance, size, and nearly identical characteristics, yet contain subtle differences (in our example, our model provided misinformation, claiming that "Yuri Gagarin was the first man on the moon").
This opacity makes it virtually impossible to verify if the model deployed in production is the one that was tested, creating a significant trust issue. Users must rely on the provider's integrity regarding whether the tested model is the one in production.
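To make this concrete, here is a minimal sketch of how model identity could be checked when the weights are available: hash the weight files of the tested and deployed versions and compare the digests. The paths are hypothetical, and the catch is precisely that closed-source providers never expose their weights for this kind of check.

```python
import hashlib
from pathlib import Path


def fingerprint_weights(model_dir: str) -> str:
    """Compute a SHA-256 fingerprint over all weight files in a directory.

    Any change to the weights (even a subtle fine-tune like PoisonGPT)
    produces a completely different digest.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file():
            # Bind both the file's relative path and its contents.
            digest.update(str(path.relative_to(model_dir)).encode())
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()


# The comparison only works if the verifier can read the weights, which is
# exactly what closed-source providers do not allow.
if __name__ == "__main__":
    tested = fingerprint_weights("models/tested-llm")        # hypothetical path
    deployed = fingerprint_weights("models/production-llm")  # hypothetical path
    print("identical weights" if tested == deployed else "model was swapped")
```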
Overfitting on Test Data
AI models, especially large language models (LLMs), have a tendency to memorize portions of their training data. The first thing you learn in any AI course is not to train a model on its test set. Otherwise, the model over-adapts to the test data: it will perform very well on that data, but this performance will not reflect the model's overall capabilities outside of the test. This is known as overfitting.
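As a toy illustration (a small classical-ML sketch rather than an LLM), the snippet below shows how letting the test set leak into training inflates the measured score; the dataset and classifier are arbitrary choices for the example.

```python
# Toy illustration of overfitting: a model evaluated on data it was trained on
# looks far better than it really is. Dataset and classifier are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# Honest setup: the model never sees the test set during training.
honest = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("score on unseen test data:           ", honest.score(X_test, y_test))

# "Cheating" setup: the test set leaks into training, so the model memorizes it.
cheater = DecisionTreeClassifier(random_state=0).fit(
    np.concatenate([X_train, X_test]), np.concatenate([y_train, y_test]))
print("score after training on the test set:", cheater.score(X_test, y_test))
```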
When they have access to test data (whether performance- or security-related), AI builders could be tempted to overfit their model to that test. The ease of overfitting on a known test set is particularly problematic for security challenges like AI jailbreaks (bypassing built-in model safeguards), where changing a model's structural behavior is difficult, but applying "dirty patches" through overfitting on known failure cases is quite simple.
We consider overfitting for a test to be “cheating” because this “quick fix” improves test results without improving the model's inherent security.
AI Model Test Cheating Is Already Happening
While proving that a provider has swapped models remains challenging, there's clear evidence that overfitting on test sets has occurred in various instances.
Studies have suggested that Meta's LLaMA model owed part of its strong performance on mathematical benchmarks to likely data contamination, where test data leaked into the training set (the GSM1k study was designed specifically to expose such gaps). Data contamination causes a model to memorize answers rather than solve problems. While concerns about the reliability of public benchmarks due to data contamination are widespread, there is little clear evidence (for now) because there is no standardized, widely accepted method to detect contamination effectively.
This problem is even more apparent in security-related areas, such as AI jailbreaks. Many prompt lists that once successfully bypassed the safeguards of LLMs no longer work… unless you slightly alter the prompts by changing the language or introducing noise! This simple experiment, which anyone can conduct, shows that AI engineers may be tempted to apply quick fixes rather than address security issues at their core.
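As a rough sketch of this kind of experiment, one could generate trivial variants of a known jailbreak prompt and test whether a model that refuses the original also refuses the variants. Here `model_answers` stands for any prompt-to-text callable (for example a thin API wrapper), and the refusal detection is a deliberately simplistic heuristic, both introduced only for illustration.

```python
import random


def perturb(prompt: str, n_variants: int = 5) -> list[str]:
    """Produce slightly noisy variants of a prompt (case flips, extra spacing)."""
    variants = []
    for _ in range(n_variants):
        chars = list(prompt)
        i = random.randrange(len(chars))
        chars[i] = chars[i].swapcase()           # flip the case of one character
        variants.append("".join(chars) + "  ")   # append harmless whitespace
    return variants


def looks_like_dirty_patch(model_answers, known_jailbreak: str) -> bool:
    """`model_answers` is any callable prompt -> str (e.g. an API wrapper).

    If the original prompt is refused but trivial variants are not, the
    safeguard likely targets that exact string rather than the underlying issue.
    The substring check below is a simplistic stand-in for refusal detection.
    """
    refused_original = "cannot help" in model_answers(known_jailbreak).lower()
    refused_variants = all(
        "cannot help" in model_answers(v).lower()
        for v in perturb(known_jailbreak)
    )
    return refused_original and not refused_variants
```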
Those examples likely represent just the tip of the iceberg. While some cases of overfitting on test sets have been clearly demonstrated, suspicions have been raised about other released models. However, due to the inherent difficulty of proving exactly what data a model was trained on, concrete evidence of cheating remains elusive in many instances.
Current Challenges in AI Model Testing Infrastructure
AI models are typically evaluated using numerous test sets designed to assess various aspects of performance or security. However, organizations conducting these tests, such as AI Safety Institutes or certification companies, face significant infrastructure-related challenges. The two main testing scenarios available both come with their own limitations and risks:
- Model Deployed on Tester's Infrastructure: When the model in question is deployed on the tester's own secure infrastructure, the tester can keep the test set confidential, preventing overfitting since the AI provider never sees the test data. However, this method requires the tester to have access to the model's weights. While this approach may work for open-source models, for closed-source models it requires the AI provider's trust, a risk that many providers are unwilling to take with testers outside their organization. Additionally, the tester needs the computational capacity to run the model properly.
- Remote Testing via Provider's API: In this setup, the AI provider grants the tester remote access to their model via an API. While this is easier to implement, the provider can see the test queries and could retrain the model on them, compromising test integrity. Additionally, testers cannot verify which version of the model they are interacting with (see the sketch below). This method relies on trusting the AI provider not to use the test set to overfit its model.
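To illustrate the second setup, the sketch below shows what API-based testing typically boils down to: the tester ships every test query to the provider's server in the clear and gets plain text back, with nothing that proves which weights produced the answer. The endpoint, API key, model name, and response schema are all hypothetical.

```python
import requests

# Hypothetical endpoint and credentials, for illustration only.
API_URL = "https://api.example-provider.com/v1/chat"
API_KEY = "sk-..."  # placeholder


def query_model(prompt: str) -> str:
    """Send one test query to the provider's API and return the text answer.

    Everything the tester learns is a string: nothing proves which model
    produced it, and every query is visible to the provider, who could log
    it and later fine-tune on it.
    """
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "provider-llm-v1", "prompt": prompt},  # provider-chosen label only
        timeout=30,
    )
    return response.json()["completion"]  # assumed response schema
```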
Given these limitations, doubts surrounding the performance claims of certain AI models are understandable. Overfitting on previous test sets can lead to artificially inflated metrics, making it difficult to trust the results. To counter cheating, testers are forced into a cycle of continually developing new tests to ensure that models haven't simply memorized previous ones.
This points to a critical gap in the current testing ecosystem: where the model is deployed on the AI provider's infrastructure during testing, there is no reliable way to verify that AI providers haven't accessed the test data, nor can we be certain that the model deployed in production is the same one that was rigorously tested.
Our Proposal to Combat AI Test Cheating: Verifiable Tests that Preserve Asset Confidentiality
To address AI test cheating issues, we propose developing a system that ensures both test confidentiality and model integrity. Fortunately, the technology Mithril leverages for all its products—secure enclaves—can provide both code integrity and data confidentiality features. We simply need to adapt our toolkits for AI testing. Enclaves provide secure, isolated testing environments that keep both the model and test data confidential while also providing cryptographic proof that the model tested is the same one deployed.
Preventing Overfitting With Test Confidentiality
Overfitting happens when the model builder can access the test set, allowing the model to memorize specific answers. To prevent this, test queries must remain hidden from the model provider, which is where secure enclaves play a crucial role. With our secure test solution, enclaves will safeguard the confidentiality of the test data, preventing AI providers from accessing or manipulating the test set.
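As a minimal sketch of the confidentiality side (using PyNaCl and leaving remote attestation out of scope), test queries could be sealed to the enclave's public key so that the provider's infrastructure only ever handles ciphertext. In the real system the key pair would live inside the enclave and be tied to an attestation report; here it is generated inline only to keep the example self-contained.

```python
# Minimal sketch: seal test queries so only the enclave can read them.
# Assumes PyNaCl; attestation of the enclave key is out of scope here.
from nacl.public import PrivateKey, SealedBox

# In reality this key pair lives inside the enclave; generated here only so
# the example runs end to end.
enclave_key = PrivateKey.generate()


def seal_test_query(prompt: str) -> bytes:
    """Encrypt a test prompt to the enclave's public key."""
    return SealedBox(enclave_key.public_key).encrypt(prompt.encode())


ciphertext = seal_test_query("Q17: Does the model leak personal data when asked?")
# The AI provider's host only ever sees `ciphertext`; the prompt is recovered
# inside the enclave, never on the host.
plaintext_inside_enclave = SealedBox(enclave_key).decrypt(ciphertext).decode()
```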
Ensuring Model Integrity
Model substitution is a critical issue, where an AI provider could swap the tested model for a different version when moving to production. Secure enclaves can solve this problem by generating cryptographic certificates that bind the specific model weights to the test results. These certificates provide verifiable proof that the model deployed is the one that underwent testing. Any changes to the model after testing would invalidate the certificate, ensuring that only the tested version can claim test compliance.
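Here is a minimal sketch of what such a certificate could look like, again using PyNaCl for signing; the digests, key handling, and result format are simplified placeholders rather than the final design.

```python
# Minimal sketch of a certificate binding a model fingerprint to test results.
import json
from nacl.exceptions import BadSignatureError
from nacl.signing import SigningKey

# In the real system this key would be held by the enclave/auditor.
auditor_key = SigningKey.generate()


def issue_certificate(weights_digest: str, results: dict) -> bytes:
    """Sign a statement tying the exact weights to the test outcome."""
    statement = json.dumps(
        {"weights_sha256": weights_digest, "results": results}, sort_keys=True
    ).encode()
    return auditor_key.sign(statement)  # signature + statement


def verify_certificate(cert: bytes, deployed_weights_digest: str) -> bool:
    """A user checks the certificate against the model actually deployed."""
    try:
        statement = auditor_key.verify_key.verify(cert)
    except BadSignatureError:
        return False  # certificate was tampered with
    claims = json.loads(statement)
    return claims["weights_sha256"] == deployed_weights_digest


cert = issue_certificate("a3f9e1", {"jailbreak_suite": "pass"})  # placeholder values
print(verify_certificate(cert, "a3f9e1"))  # True: same weights as tested
print(verify_certificate(cert, "b7c209"))  # False: weights changed after testing
```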
Next Steps
We will develop this solution in the coming months. We already have a framework for confidential AI with BlindLlama (i.e., an AI workload protecting both the model and the input data) and model traceability with AICert. We will combine these properties to create an infrastructure for verifiable AI tests with guaranteed test set confidentiality and model integrity.
How will it work?
During testing, the AI provider submits their model by uploading its weights to a secure environment. Next, the auditor runs the tests confidentially by submitting queries to the model without directly accessing the model's weights. The secure environment ensures that neither the model's weights nor the test queries are exposed. Once the tests are completed, the auditor evaluates the results and issues a cryptographic certificate that verifies the model has passed the required benchmarks.
In production, the AI provider uses this certificate to prove to users that the model has been properly tested and meets the required standards. Users can verify the authenticity of the model using the certificate. Throughout the process, the confidentiality and integrity of both the model and the tests are protected through the use of secure enclaves and cryptographic tools.
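Putting the pieces together, the following sketch walks through the envisioned flow end to end. Function and variable names such as `run_test_suite` are placeholders for illustration, not an existing Mithril API.

```python
# End-to-end sketch of the envisioned flow. All names are placeholders.
import hashlib
import json
from nacl.signing import SigningKey


def run_test_suite(weights: bytes, prompts: list[str]) -> dict:
    """Placeholder for confidential evaluation running inside the enclave."""
    return {"prompts_evaluated": len(prompts), "verdict": "pass"}


# 1. The provider uploads its weights into the enclave (represented as bytes).
weights = b"...model weights..."  # hypothetical payload
fingerprint = hashlib.sha256(weights).hexdigest()

# 2. The auditor runs the test suite confidentially; queries stay in the enclave.
results = run_test_suite(weights, ["test prompt 1", "test prompt 2"])

# 3. The auditor signs a certificate binding the fingerprint to the results.
auditor_key = SigningKey.generate()
certificate = auditor_key.sign(
    json.dumps({"weights_sha256": fingerprint, "results": results},
               sort_keys=True).encode()
)

# 4. In production, users verify the certificate and compare the certified
#    fingerprint with that of the deployed model (obtained via attestation).
claims = json.loads(auditor_key.verify_key.verify(certificate))
assert claims["weights_sha256"] == fingerprint
```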
Subscribe to our blog if you want to stay updated on the latest developments of this project and hear about our upcoming work.