A “minimum testing period” for frontier AI
When rushing wins, we all lose. Safety shouldn’t put you behind.
In 1994, a massive fireball engulfed a Benetton Formula 1 car and its pit crew during a refueling stop. These were the world’s top technicians—ordinarily they could refuel a car without incident, even a piping hot one. So what went wrong?
Benetton was refueling without a necessary filter—faster fuel flow, but higher risk. In a competitive setting, the attraction is understandable: Formula 1 races are often won by seconds. If you don’t push the limits, you’ll be left behind. A dangerous dynamic.1
But it’s important that teams can take safety seriously without being disadvantaged.
Many racing leagues now have rules to stop teams from cutting time at the expense of safety.2 Minimum refueling times so teams can’t gain from pumping at higher and higher pressure. A mandatory pit stop for a safety inspection.
If we want AI companies not to rush through safety testing, we need to fix similar competitive dynamics, with real rules and incentives—like laws with financial penalties for skimping on testing. An unsafe AI model poses risk to us all, not just to the builders who rushed.3
One policy worth exploring is a “minimum testing period”—a required window between a model being fully trained and when it can be deployed, either internally or externally.4
In this post, I will:
Explain why safety testing is important.
Describe evidence of rushed testing by AI companies.
Explain the benefits of a minimum testing period (and how this can be achieved).
Consider the tradeoffs of a minimum testing period.
Safety testing reveals what a model can do
Though you might be surprised to hear it, a model’s capabilities are largely a mystery—even to its developers—until the testing phase.
Safety testing is how AI companies discover their models’ abilities. Abilities that were crudely shaped from a soup of data—not programmed in directly.
“The unexpected capabilities problem.” That’s what researchers at OpenAI, Google DeepMind, Microsoft, and other leading institutions have called this—the challenge of not knowing what abilities a model will have, or when it will develop them.
One AI company CEO has playfully described it as “a fun guessing game.”5
Will a model improve as you throw more computing power and data at it? Generally yes. In exactly what ways? Nobody really knows.6 AI is different from traditional software in this way.
For some abilities, not catching them until after deployment wouldn’t be such a big deal. But for others—like the ability to break out of an AI company’s data centers—finding out only in hindsight could be too late. If testing is rushed, researchers have less chance to catch any concerning signs.7
One challenge here is that safety testing is essentially still “pre-paradigmatic”: There isn’t agreement on what abilities to test and how to measure these.
The result is that it’s hard to prove that a model is safe enough. There’s no simple standard that researchers would all agree upon—no consensus “you must prove that your model is incapable of X, Y, and Z.”8 If there were, then just achieving those criteria might suffice.9
But even so, reliable evaluations—ones capable of assessing a model in high-fidelity, realistic settings—take significant time. Any adequate testing protocol needs that time and can’t be rushed. And as models become more capable, the necessary amount of testing time is likely to go up.10
Particularly until we have clear enforceable standards, it’s important to give safety testing room to breathe. Researchers need enough time to do the underlying science, to understand exactly what the model can do, and to figure out if it is actually dangerous.
Safety testing is already challenging enough; doing it under large time pressure is just asking for trouble.
Rushed safety testing is common today
Once, AI companies bragged about the extensiveness of their safety testing. OpenAI famously spent six months testing and improving its safety mitigations before launching GPT-4.11 This investment was possible in part because of OpenAI’s perceived lead over other AI companies.
Now, the race is much tighter, and no AI company has the time to burn. Some are releasing versions of models that were hardly tested, or skirting promises they made to conduct safety testing and publish the results.12 Internal safety teams know that this is risky: “We basically failed at the process,” said one member of OpenAI’s safety team last summer.13
Consider Meta’s recent Llama models: These were made freely available for download “as soon as they were ready,” according to Meta’s VP of generative AI. This happened to fall on a Saturday—with the Meta team still “working through our bug fixes.”
Because safety testing isn’t required by law, firms can quietly revise or drop their own commitments—even those made to the US government—without notifying the public.14
Amid this free-for-all, some companies have been more forthcoming with reporting changes to their safety processes—most notably Anthropic, which deserves acknowledgement for communicating changes transparently.15
But by and large, rushed safety testing at AI companies is a recurring issue.16 Without common safety standards, cutting corners is a natural consequence of self-reported, self-enforced testing.
The predictability of the rushing dynamic does not excuse the companies that rush through it. But it does mean we’ll need to change the dynamic if we want safety testing to improve.
A “minimum testing period” can help
One way to discourage companies from rushing their safety testing is a “minimum testing period”—a minimum length of time that AI companies must spend testing their frontier models.17
Until the period is complete—say, 1-2 months after the model has finished training—the company would be legally barred from deploying its model for meaningful use, whether internal or external to the company.
The core idea: a minimum testing period reduces the incentive to rush through safety testing.
A minimum testing period insulates safety testing from competitive pressures: Companies don’t have to worry about getting undercut by competitors who do fewer tests. Each company now has less to gain from cutting testing short. Without a minimum testing period, safety-conscious companies are essentially penalized for moving cautiously, compared to the splashier companies that move faster. Eventually there might be a correction if a reckless company causes a serious safety incident. But this still might not make up for the cautious company’s weaker position,18 and the incident might have serious negative externalities (e.g., major harms to third parties).
A minimum testing period reduces the delay-cost of deep scientific testing: If you are trying to launch a model as quickly as possible, deep testing might require you to delay your launch plans. If your results look suspicious, you face a dilemma: Want to investigate further? You risk delaying launch. Think of a new ability worth testing? Same issue. More junior researchers on safety teams, in particular, might hold a higher bar for raising concerns that could derail a company’s launch: An imminent launch can exert huge psychological pressure.19
Beyond changing the incentives of safety testing, a minimum testing period also makes other safety measures cheaper. An example is to require companies to release their safety test results before deployment. When companies are rushing to get a model out the door, it is hard to prioritize a public writeup (or sometimes hard to even prioritize the testing). Instead, companies release safety test results only once a model has been made public—or some amount of time afterward, or never. If there was a mistake in the testing, real risks may already have materialized, because the model is already deployed.20 The delay is unfortunate because public commentary has a track record of flagging important issues with companies’ safety testing.21
For a minimum testing period to succeed, nailing the details—what models are subject to this, how we measure duration, etc.22—still needs work, and I would be interested in more groups tackling these ideas.
Downsides of a “minimum testing period”
Having laid out the benefits, I should be clear that a minimum testing period isn’t a perfect policy—though I think it’s likely a good start.
That said, what are some disadvantages?
Minimum testing periods are not a panacea. Having a minimum amount of time doesn't guarantee that a company will use this time well, or will ultimately treat risks with seriousness. Companies that turn up particularly concerning evidence during testing might have to delay their launches, which many people at the company will not want. Even with a minimum time window, this problem is not fully solved—particularly while we lack consensus on what abilities to measure, and how to do so.
Minimum testing periods could reduce the incentive to innovate in testing. If a company figures out how to do rigorous testing in less time, this is great. Ideally this innovation would be rewarded. But with a minimum testing period, some of the upside isn’t captured; the company still needs to wait. Accordingly, companies might invest less in figuring out creative ways to quickly get rigor.
Minimum testing periods might have geopolitical consequences that are hard to anticipate. One possible concern is that minimum testing periods could benefit geopolitical adversaries like China, who might not implement a minimum testing period of their own. This is likely the strongest objection.23 Still, there are a few questions worth disentangling:
What is the impact on when a company develops its next frontier model?
A one-month testing period for Model 1 doesn’t mean that the company’s next model (Model 2) would be delayed by exactly one month. We should distinguish between the duration of the minimum testing period for the current-generation model and the delay it causes to a future next-generation model.
How willing should we be to trade some “development lead” for higher safety assurances on our models?
How does our adopting a minimum testing period affect the likelihood that other nations do as well?
My overall inclination is that a minimum testing period (say, 1-2 months) wouldn’t slow down new model development by very much—and that even if it did, this is a worthwhile cost to bear. Of course, others might have different intuitions here; I’d be interested in reading more detailed modeling of these scenarios.
How minimum testing periods could happen—and complements to consider
One de facto way of achieving a minimum testing period: Require that governments be given advance access to test a frontier AI system, and that the system cannot be deployed until testing is completed. If the government’s testing always takes at least one month, then there is now a de facto one-month-long minimum testing period.
A more direct way, of course, is just to mandate the testing period in something with force of law, like the European Union’s “General-Purpose AI Code of Practice.” Companies that eventually sign the Code of Practice—it is nearing completion—will be subject to large fines for violating it.24 Some AI companies might decline to sign, however, or might limit where they make their products available in hopes of avoiding EU jurisdiction.
Another option is for an industry standards group—most likely the Frontier Model Forum (FMF) or the Partnership on AI (PAI)—to advance a specific standard related to minimum testing periods, which member organizations are expected to follow.25
In any case, we might not want to rely on companies publicly calling for this directly. Even if a company does prefer to go slower (so long as others do too), no company wants to say, “We’re moving too fast.” Still, I’d hope some would call for this—perhaps framed around other companies moving too quickly, rather than passing judgment on their own pace.
Beyond different ways of achieving a minimum testing period, there are also similar policies that could be considered—for instance, minimums on the number of employees dedicated to safety testing, or the total number of labor hours spent on testing. Essentially, these are rules requiring “minimum safety resourcing,” in the absence of being able to regulate outcomes directly.26 Another approach is to give safety testers an “Andon cord”—a way to anonymously halt deployment and flag issues to management, which can buy more time if needed.27
Surely some companies would use creative accounting to justify having met these requirements, even if the truth is murkier.
Nonetheless, if we want to stop riskier AI companies from undercutting others on safety, we might not have better options than to absorb those tradeoffs.
Acknowledgements: Thank you to Catherine Brewer, Henry Sleight, Luca Righetti, Michael Adler, Sydney Von Arx, and Zach Stein-Perlman for helpful comments and discussion. The views expressed here are my own and do not imply endorsement by any other party. All of my writing and analysis is based solely on publicly available information.
If you enjoyed the article, please share it around; I’d appreciate it a lot. If you would like to suggest a possible topic or otherwise connect with me, please get in touch here.
Some examples: The International Motor Sports Association has minimum refuel times for its SportsCar Championship series. The FIA requires a minimum-duration technical pit stop during endurance races like the 24 Hours of Le Mans. Of course, these rules vary by league and race type; not every racing association has them. For instance, Formula 1 has gone back and forth on whether to allow refueling at all during races (for strategic considerations, in addition to safety) and currently disallows it entirely.
This point isn’t unique to AI: Many technologies pose risk to people other than their builders, like if an airplane were to fall out of the sky. If you’re interested in arguments specific to AI—indeed, how AI could do quite a lot of damage—I recommend this post.
One challenge in defining these policies is that “deployment” is a fuzzy concept. When I mention deployment in this piece, I am referring to both internal and external deployments, with an expectation that internal deployments will be more challenging to govern (but perhaps more important—e.g., if an unsafe model is being used to edit the company’s codebase). This is in line with AI safety advocates who increasingly call for company-internal use of a model to be considered deployment (“internal deployment”). One possible definition of deployment is when a model becomes available for actual productive use, rather than just for testing.
The same CEO said, in the 2023 interview, “We’re trying to get better at it, because I think it’s important from a safety perspective to predict the capabilities.” I applaud this vision, but I am not aware of research from this company related to predicting safety-relevant capabilities of its models.
For more detail, see Anthropic’s Predictability and Surprise in Large Generative Models for an explanation of how models’ abilities can emerge abruptly without much forewarning. This is true even despite there being some metrics that do change fairly predictably as a model grows—like its ability to accurately predict the next word of generic internet text.
Some AI companies tout “iterative deployment” as a way for the world to encounter AI models and uncover issues (in hopes of then fixing these issues for the next deployment). This approach has some merits, but it’s important to make sure (1) that issues are actually fixed over time, and (2) that no issue is so large that it can’t be fixed after the fact.
In contrast, automobile safety is substantially better understood than AI safety—and so the automotive industry has been able to converge on certain outcomes-based safety standards: mechanisms and measurements for crash avoidance, crashworthiness (how protected an occupant is during a crash), and post-crash survivability. Still, even in well-understood domains, competitive pressures can cause skirting of standards—as with the Volkswagen emissions scandal.
For a set of tests that are sufficient for proving a safety standard, you might think of this as a “minimum testing protocol”: the minimum adequate set of tests, run with the minimum adequate resourcing. This allows a company to make a successful “safety case” (ref, ref).
A minimum testing period—the amount of time for testing—is just one possible resource in this protocol, alongside other factors like: “Did testers get access to a model’s internal chain-of-thought?”
For an overview of limitations that testers face when evaluating a new frontier model, see METR’s report on o3, which says “limitations in this evaluation prevent us from making robust capability assessments.”
Of note, METR writes that lacking access to o3’s chain-of-thought was a more significant limitation in this case than the “relatively short time” provided for evaluations. A “minimum testing period” is not necessarily the most important resource to focus on, though it does have some benefits of clarity and standardization.
As models improve, evaluations that are quick and simple—like multiple choice Q&A—will help less in distinguishing between the abilities of different models. Already, researchers often say that models have “saturated” many evaluations that exist (“fully solved” them, more or less). We’re going to need much harder evaluations, like open-ended environments that test complicated agentic behavior. But we shouldn’t expect complicated evaluations to be completable in the same time as previous simpler ones. Particularly for evaluations that compare a model’s performance with the performance of expert humans, or which measure the “uplift” of giving humans access to the model, these are difficult to run quickly.
Was this overkill? Based on the outcomes in retrospect, it’s certainly easy to think it wasn’t needed. But the challenge is that OpenAI wasn’t yet sure—and needed to do this testing and iteration on its safety measures to feel comfortable.
In December 2024, OpenAI generated outcry for releasing a System Card titled “OpenAI o1 System Card,” but which principally was about tests run on a different, not-quite-o1 model. A common issue is that AI companies are the sole arbiter of which of their models demand what forms of testing: In March 2025, Shakeel Hashim reported that Google DeepMind reneged on a commitment to publish a System Card accompanying its Gemini 2.5 Pro release, which the company described as “experimental”—and that in the absence of this, it isn’t clear which safety evaluations have been run. It’s possible that all necessary safety tests were completed. But as Shakeel writes, “In a self-regulation regime, transparency is the only way to enforce best practices.” [emphasis theirs] Previously, when OpenAI released its GPT-4.5 model, its System Card initially claimed that “GPT-4.5 is not a frontier model” when describing what safety testing it was (or wasn’t) subject to. OpenAI quickly removed this characterization of GPT-4.5 as “not a frontier model” once commentators raised concerns.
Ref. Washington Post, “OpenAI promised to make its AI safe. Employees say it ‘failed’ its first test.”
Ultimately, we’ll want adherence to any testing requirements to be verifiable and audited by trusted third parties. This allows others to trust that testing happened as described, even if they don’t have reason to trust the specific developer.
In October 2024, Anthropic took the noteworthy step of self-reporting that it had not completed recent evaluations on-time—finishing three days later than their committed three-month window. Going forward, Anthropic publicly “extend[ed] the interval to 6 months to avoid lower-quality, rushed elicitation.” Is this change the right call? Opinions might differ—but Anthropic deserves acknowledgment for communicating this transparently so that people can even have that debate. (In this case, the question is what time-lag is acceptable between a sufficiently capable model existing and when Anthropic completes evaluations on it—longer intervals have higher risk because a dangerous model might exist and Anthropic might not learn until a later time.) More generally, groups like AI Lab Watch and The Midas Project do useful work in tracking how companies’ commitments and practices have changed over time.
The trend is not new, even at companies where the issue has been reported publicly. For instance, last July the Washington Post reported that OpenAI’s Preparedness team—its most important safety-testing team—was not given sufficient time. Nine months later, in April 2025, the problem persisted, with the Financial Times reporting on continued rushed safety testing at OpenAI.
Rules about a minimum testing period should also clearly define what counts as a frontier model subject to being tested, so that there is not variability in how companies uphold this rule. For ideas of how to define this, see Frontier AI Regulation: Managing Emerging Risks to Public Safety.
Splashier companies are likely to attract more funding, talent, buzz, etc., which puts more cautious companies at a significant disadvantage—they might struggle to even stay in business.
For a non-AI parallel, I recommend reading about the decision to launch the Space Shuttle Challenger, which tragically killed all seven crew members—blowing up physically because cold temperatures compromised its O-ring seals, and organizationally because decisionmakers would not delay the launch until temperatures were warmer.
Particularly in the case of open-weights models like Meta’s Llama series, the bell can’t be unrung once the model is made available. For closed-weights models available through API, the bell can be unrung but is more costly once a model has been launched, as customers may have made changes in workflows, come to depend on this new model, etc. For more, see Deployment Corrections: An incident response framework for frontier AI models.
Luca Righetti’s analysis of o1-preview’s CBRN evaluations is a good example of how public analysis can improve the rigor of an AI company’s safety-testing—in this case, after OpenAI had already released the model in question. If there had in fact been a serious issue, it would have been very nice for OpenAI to have considered Luca’s input before deployment. (OpenAI does use some external experts for red-teaming before external deployment of its models, but that is different than what I am describing here.)
Some example implementation details: What should be the minimum time between a model being ready and when it can be launched? How does this change if the system (the combination of model + safety mitigations around it) wasn’t ready and so couldn’t be tested until a later date? Should we consider a minimum lag between “when the last evaluation was run” and when a model can be launched? What about if the data for an evaluation wasn’t actually reviewed until substantially later? What about a target like “the company should be able to re-run any given evaluation without needing to change its launch date as a consequence”? These details are pretty fiddly and point at some of the challenges of procedural rules—but if we go this route, I expect people will need to figure out reasonable answers to them.
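To make these details concrete, here is a minimal, purely illustrative sketch of how a few of them could combine into a single rule. The parameter values, the function name, and the choice to start the clock at the later of training completion and system readiness are my own assumptions for illustration, not part of any existing proposal or standard.

```python
from datetime import date, timedelta

# Hypothetical parameters -- real values would be set by a law or standard.
MIN_TESTING_DAYS = 45   # minimum window once the full system is testable
MIN_EVAL_LAG_DAYS = 14  # minimum gap between the last evaluation run and launch

def earliest_allowed_launch(
    training_complete: date,
    system_ready: date,      # model plus safety mitigations available to testers
    last_eval_run: date,
) -> date:
    """Toy compliance check: the clock starts only once the full system
    (model + mitigations) is actually testable, and launch must also
    trail the most recent evaluation run by a minimum lag."""
    clock_start = max(training_complete, system_ready)
    return max(
        clock_start + timedelta(days=MIN_TESTING_DAYS),
        last_eval_run + timedelta(days=MIN_EVAL_LAG_DAYS),
    )

# Example: mitigations landed two weeks after training finished, so the
# testing window starts from the later date; here the last-eval lag binds.
print(earliest_allowed_launch(
    training_complete=date(2025, 3, 1),
    system_ready=date(2025, 3, 15),
    last_eval_run=date(2025, 4, 20),
))  # -> 2025-05-04
```

Even this toy version shows how different clauses can bind in different scenarios—which is exactly the kind of detail a workable standard would need to pin down.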
If public discourse shifted to “Of course minimum testing periods make sense—we just can’t do one because China won’t do one” then this would be a significant improvement; we could then focus on “Are there any ways to bilaterally enact/enforce/verify a minimum testing period?”
For this approach to work, the laws will need to be specific enough to be reasonably clear when a company has violated them. I consider other AI safety commitments to have not hit this standard historically—for instance, I read the Biden Administration’s “Voluntary AI Commitments” as not really committing the companies to doing meaningful safety evaluations, saying vague-ish things like, “Model safety and capability evaluations, including red teaming, are an open area of scientific inquiry, and more work remains to be done. Companies commit to advancing this area of research” and that companies “will ensure that they give significant attention to the following [considerations], where relevant to the model’s capabilities and use case”.
The FMF has previously released an issue brief focused on “Early Best Practices for Frontier AI Safety Evaluations,” though it does not seem to have included practices to “ensure the safety team has ample time,” which could be a useful addition.
These possible rules are all inefficient in some sense, because they would apply equally to companies no matter their skill at safety testing. One alternative is to allow companies to be exempted from these minima if they follow some other suitable procedure. Another approach: let companies decide what safety testing is appropriate, but hold them liable if they cause a catastrophe and are found to have behaved unreasonably. Roughly this is what California’s SB 1047 would have done had it not been vetoed. (SB 1047 would have applied only for a small subset of AI developers and models—most would not have been affected.)
Anonymity might be of particular importance—though also more complicated to operationalize—if testers are from third-party organizations. These external testers are dependent on the frontier AI companies for early privileged access to the models, and so might be reluctant to “rock the boat” by complaining about time pressures. This is one advantage that government testing organizations, like the US’s AI Safety Institute and UK’s AI Security Institute, have: They are less dependent on personal relationships for the access they need in order to complete their testing. For more recommendations on establishing this well, see “Having a Whistleblowing Function Isn’t Enough.”