AI companies should be safety-testing the most capable versions of their models
Only OpenAI has committed to the strongest approach: testing task-specific versions of its models. But evidence of its follow-through is limited.
You might assume that AI companies run their safety tests on the most capable possible versions of their models. But there are important versions that currently go untested: task-specific versions, specially trained to demonstrate how far a model can be pushed on a dangerous ability, like novel bioweapon design. Without exploring and evaluating these versions, AI companies are underestimating the risks their models could pose.
Only OpenAI (where I previously worked1) has committed to what I consider the strongest vision for safety-testing: evaluating these task-specific versions of its models to understand the worst-case scenarios. I think OpenAI’s vision here is truly laudable. But from my review of publicly available reports, it is not clear that OpenAI is in fact following through on this commitment. I believe that all leading AI companies should be doing this form of task-specific model testing - but if companies are not willing or able to today, I argue there are many middle-ground improvements they could pursue.
In this post, I will:
Walk you through why AI companies safety-test their models, and how they do it
Define task-specific fine-tuning (TSFT), and argue why it’s important for safety-testing
Summarize the status of TSFT at leading AI companies
Dig into a case study on OpenAI and TSFT
Outline challenges to safety-testing using TSFT, and what middle-ground improvements exist
An overview of safety-testing at leading AI companies
AI companies increasingly recognize that their models could become very dangerous, particularly as models’ capabilities improve. The risks we’re talking about here are intense: for instance, one concern is whether models can soon help an ordinary person create a lethal bioweapon.2 As a result, many AI companies have committed to run safety tests to track their models’ performance on dangerous tasks. Then, as the models improve, they say they will strengthen their security and safety measures accordingly.
Every company’s exact approach to safety techniques and testing has a slightly different name and details: Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, Google DeepMind’s Frontier Safety Framework, and so on.
Often, a core approach of the companies is to apply safety techniques - like teaching the model not to answer risky questions - before making that version accessible to people outside the company through products like ChatGPT.
But applying safety techniques is not enough by itself, and the AI companies recognize this: A malicious actor could steal a copy of the model and bypass the safety techniques that were meant to make the model less risky. For instance, maybe the model was taught not to answer questions about bioweapons - but once you've stolen the model and can operate on it directly, these “refusals” are likely quite easy to undo.3
So, undoing safety-protections after theft is one concern of the AI companies, which is why many of them test versions of their models both before and after applying these safety techniques: The company then knows what harm is likely possible, if the protections are bypassed, and can plan accordingly.
But malicious actors can also increase a model’s risk through another pathway: not just by undoing safety protections, but by actively making the model stronger at dangerous tasks after theft. Unfortunately, this might be possible by adding an uncomfortably small number of new data points to an already-trained model, through what I’ll refer to as “task-specific fine-tuning.” These versions of a model represent the true potential danger a model poses, and without testing “task-specific fine-tuned” versions, companies could be dramatically underestimating their models’ risks - that is, what could be achieved just by adding data.
Today, almost none of the AI companies have committed to running safety-tests on these souped-up, task-specific versions of their models that a bad actor could create. And the evidence suggests that even OpenAI, which appears to have committed to running safety-tests on these models, may not be doing so either.

What is task-specific fine-tuning, and why does it matter?
Task-specific fine-tuning (TSFT) is the process of giving a general AI model a small amount of extra training on specific examples, quickly making it much better at a particular task. This can teach a model new abilities - according to the AI companies, sometimes with very small amounts of new data.4
Just because the general off-the-shelf model can’t do a certain thing - or can’t do it consistently - doesn’t mean this ability is necessarily out of reach. For instance, maybe a task doesn’t have many good examples on the Internet. The ability could still be within the model’s intellectual grasp; the model just needs some examples to learn from.
This is where TSFT comes in. Let’s use the risk of bioweapons as an example. Maybe the general model doesn’t know how to combine biological ingredients in dangerous ways, because there’s not much on this in the original training dataset. But, once trained on a few examples (i.e., fine-tuned for this task), it starts to generate very good – read: very dangerous – new ideas. This is bad: it means anyone who can get their hands on the model could easily specialize it for dangerous purposes through fine-tuning.
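To make the mechanics concrete, here is a minimal sketch of how little machinery fine-tuning involves, using OpenAI’s publicly documented fine-tuning API with a deliberately benign placeholder task. The file name, model string, and example contents are purely illustrative, not drawn from any company’s actual safety-testing:

```python
from openai import OpenAI

client = OpenAI()

# A handful of task demonstrations in the standard chat fine-tuning format
# (benign placeholder content; a real dataset would contain expert-written
# examples of the task being taught). "task_examples.jsonl" holds one JSON
# object per line, e.g.:
# {"messages": [{"role": "user", "content": "<task prompt>"},
#               {"role": "assistant", "content": "<expert demonstration>"}]}

# Upload the small dataset...
training_file = client.files.create(
    file=open("task_examples.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# ...and launch a fine-tuning job against a fine-tunable model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative fine-tunable model
)
print(job.id, job.status)  # the resulting model can then be evaluated like any other
```

The point is not this particular API: it is that once someone has fine-tuning access - or the model weights themselves - a few dozen well-chosen examples and a handful of lines of code are essentially the whole barrier.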
To understand an AI model’s true risks, we need to know how much damage a malicious actor could cause with a TSFT version, just as we need to know how much damage an actor could cause if they undid the initial safety techniques. In both cases, these true risks might warrant much stronger improvements to company security before making a model available externally.5
Importantly, unless an AI company has actually run the experiment - i.e., thought hard about what data could teach a model to do a certain task, created a TSFT model with this data, and gathered evidence on its actual performance - the company doesn’t really know whether or not the model is capable of a particular dangerous ability.6 Without TSFT model testing, we could be – and we likely are – dramatically underestimating AI risks.7
Beyond “eliciting a model’s full capabilities,” Daniel Kokotajlo (also formerly of OpenAI) has described other reasons why it’s important to implement TSFT as part of safety-testing:
Evidence of future abilities: If we learn about the abilities of TSFT models today, we could have more time to prepare for the near-future models that can do these abilities off-the-shelf, as is a common progression in AI. As Daniel aptly puts it, “By doing fine-tuning we peer into the near future and see what the next generation or two of models might be capable of.”
Reliability of evaluation results: There’s growing evidence that models may know when they're being tested and intentionally hold back on revealing their dangerous capabilities.8 Fine-tuning on particular tasks might push models to give their real effort, making findings more reliable.9
In short: If you don’t test a version of your model specifically fine-tuned for a task, you might be significantly underestimating the damage a bad actor could do with your model.
And by not running experiments to discover this evidence early, we’re losing valuable time to prepare ourselves for possible coming threats.
What is the status of task-specific fine-tuning within AI companies’ safety-testing?
Because I would like more AI companies to implement TSFT as part of their safety-testing, I’ve reviewed a number of companies’ stated commitments and practices.
OpenAI is the only company I’ve analyzed which seems to commit to testing TSFT models. In its Preparedness Framework, OpenAI writes:
“We want to ensure our understanding of pre-mitigation risk takes into account a model that is “worst known case” (i.e., specifically tailored) for the given domain. For our evaluations, we will be running them not only on base models (with highly-performant, tailored prompts wherever appropriate), but also on fine-tuned versions designed for the particular misuse vector without any mitigations in place.” [emphasis mine]10
Anthropic does not commit to evaluating TSFT models, but does lay out a minimum set of other forms of fine-tuning which it will do prior to safety evaluations:
“Elicitation: Demonstrate that, when given enough resources to extrapolate to realistic attackers, researchers cannot elicit sufficiently useful results from the model on the relevant tasks. We should assume that jailbreaks and model weight theft are possibilities, and therefore perform testing on models without safety mechanisms (such as harmlessness training) that could obscure these capabilities. We will also consider the possible performance increase from using resources that a realistic attacker would have access to, such as scaffolding, finetuning, and expert prompting. At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates.” [emphasis mine]11
Google DeepMind mentions capability-elicitation in both the initial and revised versions of their safety framework, but does not seem to commit to specific practices:
V1.0: “We are working to equip our evaluators with state of the art elicitation techniques, to ensure we are not underestimating the capability of our models.”
V2.0: “Our evaluators continue to improve their ability to estimate what capabilities may be attainable by different threat actors with access to our models, taking into account a growing number of possible post-training enhancements.”
Meta also mentions capability-elicitation related to realistic malicious actors in their safety framework, but also does not seem to commit to specific practices:
“Our evaluations are designed to account for the deployment context of the model. … [T]o help ensure that we are appropriately assessing the risk, we prepare the asset – the version of the model that we will test – in a way that seeks to account for the tools and scaffolding in the current ecosystem that a particular threat actor might seek to leverage to enhance the model’s capabilities.”
Relative to these four companies, some other AI companies do not mention capability-elicitation within their safety frameworks at all.
Another important dimension of the AI companies’ practices is whether they make their leading model available for fine-tuning by the outside world, without the model needing to have been stolen. Anthropic does not seem to operate its own fine-tuning API, and so is the most cautious here. Google DeepMind does sell the ability to fine-tune its models, as does OpenAI (with safety mitigations), though neither yet sells fine-tuning of its leading model. Meta’s leading models are freely available for fine-tuning because they are publicly available by default, rather than sold through an API.
The following table summarizes leading AI companies’ commitments and practices with respect to task-specific fine-tuning (TSFT):
Case study: OpenAI and task-specific fine-tuning
As mentioned above, of OpenAI, Anthropic, Google DeepMind, and Meta, only OpenAI seems to have committed to TSFT as part of its frontier safety-testing.
I applaud this vision: I think that OpenAI’s path would give us the strongest evidence to understand whether models could be used very destructively, if they were stolen by a malicious actor.
So it is worth asking: To what extent is there evidence that OpenAI has stuck to this commitment? If OpenAI has adhered to this, then maybe it’s quite viable for other companies to be doing TSFT testing, too. If OpenAI hasn’t adhered to this, it’s worth understanding the possible reasons why: The challenges in doing so might suggest ways to help AI companies universally achieve high safety standards.
In reviewing OpenAI’s recent System Cards for o1, o3-mini, Deep Research, and GPT-4.5, I do not find evidence that OpenAI is implementing its task-specific fine-tuning commitment in current safety testing (nor do I find evidence of more general domain-specific fine-tuning (DSFT)). For instance, in the “Preparedness Evaluations” section of the System Cards for o1 and o3-mini, there is reference to only one model that seems to have been fine-tuned in this way: a “Fine-tuned GPT-4o” model, rather than a fine-tune of OpenAI’s most recent leading models, o1 and o3-mini, which are the subjects of those reports. The fine-tuned GPT-4o was evaluated as part of the “Biotool and Wet Lab Actions” evaluations and compared against o1 and o3-mini’s performance there. I find no evaluation that lists a task-specifically fine-tuned o1 or o3-mini, nor any other evaluation that lists having tested the fine-tuned GPT-4o. (OpenAI does refer at times to scaffolding provided to its models during evaluations, but these are not examples of TSFT.12)
Interestingly, fine-tuning does seem to have had an impact here: The weakest underlying model, GPT-4o, is put through fine-tuning and then outperforms much stronger models (o1, o3-mini) which were not fine-tuned in this way. In fact, the fine-tuned GPT-4o is the only tested model to have received above a 0% score on the AlphaFold task in this evaluation.
I am accordingly curious how OpenAI’s frontier models o1 and o3-mini would perform, if they were task-specifically fine-tuned. OpenAI first reported its scores for fine-tuned GPT-4o and o1 in the o1 System Card, published on December 5, 2024. In the o3-mini System Card, published on January 31, 2025, OpenAI adds scores to the figure for o3-mini, though does not report having run further evaluations on TSFT o1 or o3-mini at this time.13
(I want to clarify that I am working only from what OpenAI has said publicly about its evaluations, so it is possible that OpenAI has privately conducted tests it has not reported: Last week, I shared my interpretation of this information with OpenAI’s Communications team and members of its safety leadership, and requested to be pointed to any other evaluations that were run on a TSFT model, but I have not received a reply.)
One possibility in interpreting this evidence is that OpenAI has revised its commitment internally but has not publicly revised the Preparedness Framework’s language, which was last updated in December 2023, when the team was under previous leadership.
For instance, though OpenAI’s language in the Preparedness Framework has not been adjusted, the company’s presentation of this topic might read as softer (e.g., “aim to”) in recent System Cards:
“We aim to test models that represent the “worst known case” for pre-mitigation risk, using capability elicitation techniques like custom post-training, scaffolding, and prompting. However, our evaluations should still be seen as a lower bound for potential risks. Additional prompting or fine-tuning, longer rollouts, novel interactions, or different forms of scaffolding are likely to elicit behaviors beyond what we observed in our tests or the tests of our third-party partners.”
Because commitments like OpenAI’s Preparedness Framework (and Anthropic’s Responsible Scaling Policy, and Google DeepMind’s Frontier Safety Framework, and so on) are voluntary rather than required by regulation, OpenAI is probably within its rights to weaken this commitment, if it has chosen to. For instance, OpenAI refers to its Preparedness Framework as a living document, which it recently wrote “has revision as a built-in principle.” Still, it benefits the public to learn when such commitments change, particularly in a context of voluntary commitments rather than regulation.
What are the barriers to task-specific fine-tuning, and what are some middle-ground improvements?
If you’re still with me at this point, maybe you’re convinced that it’s important to test TSFT models, and you’re wondering, “Why isn’t everyone doing this?” or “Where do we go from here?”
The fundamental challenges to task-specific fine-tuning as part of safety-testing are labor, cost, and complexity.14 Given these factors, I understand why many companies have elected not to conduct TSFT testing. And for some companies, there might be lower-hanging fruit for improving their evaluation processes than adopting TSFT. But for companies that want to be on the leading edge of safety and security, it’s worth illustrating what is involved in conducting evaluations on TSFT models.
At a basic level, whenever a lab creates a new evaluation that gauges a model’s risk, it would also need to create or procure a dataset for use in the fine-tuning. The rationale is that this better estimates how a malicious actor could increase the abilities of a stolen model: give it examples of doing a task well, and see how much it improves. But this adds additional cost and time to the companies’ testing practices. In some cases, the subject-matter expertise to create such datasets might not live within the companies, which could entail hiring and supervising additional contractors. And in some cases, government restrictions around classified information may significantly limit the pool of available contractors.15
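As a rough sketch of what such a companion dataset might look like in practice - purely illustrative: the content shown is benign, the file name is hypothetical, and the layout simply follows a common chat fine-tuning format rather than any company’s actual evaluation data:

```python
import json

# Hypothetical companion dataset for one evaluation: a set of expert
# demonstrations of the task being evaluated, written and reviewed by
# subject-matter experts or contractors. Placeholder content only.
demonstrations = [
    {
        "messages": [
            {"role": "user", "content": "A task prompt drawn from the evaluation's domain"},
            {"role": "assistant", "content": "An expert-written demonstration of doing the task well"},
        ]
    },
    # ...on the order of tens to hundreds of such examples (see footnote 4),
    # each of which may require scarce subject-matter expertise to produce.
]

with open("eval_companion_dataset.jsonl", "w") as f:  # hypothetical file name
    for example in demonstrations:
        f.write(json.dumps(example) + "\n")
```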
Even if a company uses TSFT, there is a chance of fine-tuning on a low-quality dataset, which would still undersell the risk of the model being stolen: not because TSFT couldn’t make a difference, but just because the dataset was bad. This is similar to how a weak prompt may fail to elicit a general model’s abilities, even if the model really had the ability in it with the choice of a better prompt.
It is also challenging, from a competitive perspective, that these costs and complexity would tend to fall upon individual companies who opt into a more-cautious safety standard. An alternative would be for the frontier AI industry to spread out the cost of building excellent evaluations by sharing these evaluations with one another, and thus allowing all the companies to benefit from better safety-testing. But as far as I can tell, this is still largely not happening. Venues like the Frontier Model Forum seem well-positioned to tackle this problem, similar to how the Global Internet Forum to Counter Terrorism has created resources to help member companies avoid enabling terrorist activities.
Finally, one challenge is that if a safety evaluation triggers a high risk level, it is not apparent to me that the companies have their action plans fully set. Evaluations of task-specific fine-tuned models might show that a model is more capable of dangerous abilities than was previously thought - and so might require companies to act sooner than they would prefer on more-costly mitigations.
So, given the various challenges of task-specific fine-tuning for safety-testing, what are some middle ground solutions that the AI companies could begin taking today?
For each evaluation listed in one’s safety report, be more specific on the particular elicitation undertaken.
This helps the public gauge how much effort and resourcing went into each evaluation. For instance, rather than saying that the company did “prompt engineering,” tell us how many different prompts were tested and how substantially they varied. Better yet, release a changelog of the prompts so that we can better understand the iterative process that goes into refining these tests; a minimal sketch of what one entry could look like follows.
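Such a changelog could be as simple as a few structured records per evaluation. This is a hypothetical sketch; the field names and values are mine, not any company’s reporting format:

```python
# Hypothetical sketch of a prompt changelog a company could publish alongside
# each evaluation; every field name and value here is illustrative.
prompt_changelog = [
    {
        "evaluation": "bio-knowledge-qa",   # placeholder evaluation name
        "prompt_version": 3,
        "change": "Added an expert persona and two worked examples to the system prompt",
        "prompts_tested_to_date": 12,
        "best_score_to_date": 0.41,         # placeholder score
    },
]
```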
When representing that fine-tuning was done as part of elicitation, clarify what type of fine-tuning occurred.
Saying a model was “fine-tuned” is not very clear because “fine-tuning” can mean many different things, with different implications for safety. This is why I’ve used the term “task-specific fine-tuning” (TSFT) here to be more precise.16 For instance, AI companies should differentiate between refusals fine-tuning, instruction-following fine-tuning, “helpful-only” fine-tuning, task-specific fine-tuning, domain-specific fine-tuning, etc. I also recommend that companies only describe having done “custom fine-tuning” or “custom post-training” if this process was in fact customized for a specific evaluation.
If unable to test on task-specific fine-tuned versions, then at least test on domain-specific fine-tunes (DSFT).
For instance, it would be useful to test a version of the model that has specialized in bio-knowledge more generally, even if it has not been given examples of specific bioweapons-related tasks. Notably, a DSFT model must go further than just removing the refusals that might stop a model from sharing its knowledge of the domain today; the model’s knowledge of the domain needs to be actively improved, just as a bad actor would try to do.
Enlist help from cross-industry organizations like the Frontier Model Forum, in addition to taking action within one’s own company.
Working jointly across the industry is a great way to get economies of scale on these topics, like creating domain-specific fine-tuning datasets which all the frontier AI companies should be using. But, given the urgency expressed by some AI company CEOs, we don’t have the time to be bottlenecked by any one group’s progress; individual companies still need to be taking useful actions and be responsible for testing the most informative versions of their models.
Publish estimates / confidence intervals on scores the company expects would be achieved on an evaluation, if the model were task-specifically fine-tuned.
If a company has good reason to believe that a TSFT version of its model would not score dangerously, it can be helpful to the public to articulate this view. If the company is uncertain about how a TSFT model would score, that too is helpful for the public to know. Another benefit of making specific estimates is that people can iteratively learn from and improve upon their estimates over time.
Another possible solution - though not a middle ground, as it’s likely substantially harder than just evaluating TSFT models17 - is to have achieved sufficient model-security that there is less need to test TSFT models.
Some (though not all) of the case for testing TSFT models is because the AI companies believe their models could be stolen. At that point, a malicious actor could do task-specific fine-tuning on the model, and so it is important to know ahead of time how much danger this might pose.
If the danger is high enough, then the company might take elevated steps to reduce the chance of theft, as some like Anthropic, Google DeepMind, and OpenAI have indicated they will do. But, if the company already believes that theft has been reduced to negligible levels, then this argument holds less weight.
Wrapping up
This is my first post on Substack, where I will be writing more about AI safety and security from the perspective of having worked inside the largest AI companies. As noted, all of my writing and analysis is based solely on publicly available information.18 If you would like to suggest a possible topic or otherwise connect with me, please get in touch here.
Acknowledgements: Thank you to Dan Alessandro, Daniel Kokotajlo, Michael Adler, Michael Chen, Rosie Campbell, and Zach Stein-Perlman for helpful comments and discussion. The views expressed here are my own and do not imply endorsement by any other party.
I worked at OpenAI for a bit under four years, from December 2020 through November 2024, though all my writing and communications are based only on public information. I have signed a non-disclosure agreement with OpenAI related to confidential information learned during my time working there, but I have not signed (nor been asked to sign) a non-disparagement agreement. At OpenAI, I worked on a number of safety-related topics, such as co-leading our work on dangerous capability evaluations (and some of the evaluations that we designed are run as part of OpenAI’s Preparedness Framework).
In its System Card for Deep Research, OpenAI warns that their models are “on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold,” and encourages us to “prepare for a world where the informational barriers to creating such threats are drastically lower.” Similarly, Anthropic’s System Card for Claude 3.7 Sonnet warns “based on what we observed in our recent CBRN testing [chemical, biological, radiological, and nuclear testing], we believe there is a substantial probability that our next model may require ASL-3 safeguards [significantly elevated safeguards, which the company has not yet achieved].” (h/t Luca Righetti)
Stanford researchers have reported that some refusals training can be undone by as few as 10 datapoints, if you have the ability to fine-tune the model. Anthropic’s Responsible Scaling Policy makes a similar assumption of “likely ease of bypassing or removing safeguards via fine-tuning” when determining what deployments are sufficiently safe and secure.
More generally, there are two main types of safety techniques, each of which can be bypassed once a model has been stolen: The first is model-specific techniques, like training the model to refuse queries - which can be undone with further fine-tuning. The second is system-level techniques, which improve overall safety performance by wrapping other technologies around the model’s use. For instance, maybe an AI company has a monitoring system which will warn or suspend users if they violate the company’s policy too many times (but this system is not an intrinsic part of the model). Or maybe an AI company has per-minute rate-limits that are meant to limit how frequently a user can do something risky. But if the model has been stolen, it can now be used by a malicious actor directly, without usage being mediated by those protective technologies.
OpenAI has written that users “typically see clear improvements from fine-tuning on 50 to 100 training examples,” and Google DeepMind has written “You can fine-tune a model with as little as 20 examples,” though advises, “You should target between 100 and 500 examples, depending on your application.” These are likely lower bounds on the data needed, however; a larger number of datapoints could further improve performance. It is also conceivable that task-specific fine-tuning would not improve performance very much on the types of abilities being evaluated during safety-testing, which is an experiment I would be very excited for the AI companies to conduct and publish.
Making a model available by API to the external world can increase the chance of the model being stolen, because there is now a larger “surface area” where an issue could compromise the model’s security. Commonly, companies will have different security standards for products that interact directly with the outside world, vs. products that are limited to the company’s internal network.
An AI company might not know the appropriate risk classification of a model without having evaluated a task-specific fine-tuned (TSFT) version of it. For instance, a TSFT model might be strong enough to register as High Risk under OpenAI’s Preparedness Framework rather than the Medium Risk of today - which is the risk level where OpenAI says security is meant to be specifically stepped up to greatly reduce risk of model theft. Similarly, a TSFT model might trigger Anthropic’s capability level of ASL-3 rather than the ASL-2 of today - which similarly is a threshold for substantially elevated security. But if companies are not running evaluations on task-specific fine-tuned models, the models might already be High Risk / ASL-3 if they were to fall into the wrong hands.
The exact size of performance gains from TSFT on safety-critical tasks is an open empirical question, and one where analysis would be aided by more AI companies running these tests. Outside of safety-testing, it is a common practice for companies to fine-tune their models on data from a particular domain if they wish to make the model maximally capable in that area. For instance, Alphabet’s model focused on medical reasoning - Med-PaLM 2 - was understandably created in part with “medical domain-specific fine-tuning.” Similarly, researchers at Ohio State have demonstrated significant improvements in chemistry abilities by fine-tuning models on datasets related to this domain.
For research on models sandbagging during evaluations, see e.g., AI Sandbagging: Language Models can Strategically Underperform on Evaluations, Sabotage Evaluations for Frontier Models, Frontier Models are Capable of In-context Scheming. My previous team at OpenAI also open-sourced an evaluation for sandbagging that you might find helpful: https://github.com/openai/evals/tree/main/evals/elsuite/sandbagging
Roger Grosse of Anthropic has written that, when contending with a very capable AI system that might be sandbagging, a reasonable elicitation strategy might include “Wherever possible, run supervised fine-tuning on the eval. (This necessitates looking for versions of evals which allow for supervised fine-tuning.)” Note that this does not reflect a specific position of Anthropic’s, however, and is oriented toward assessing the safety of AI models more capable than today’s.
It is possible to read this as OpenAI committing to testing on a domain-specific fine-tuned model rather than a task-specific fine-tuned model: That is, testing a model that has been trained to excel at biology generally, rather than models that have been trained to do the particular biological tasks of concern. I read “fine-tuned versions designed for the particular misuse vector” as being about the specific misuse task at-hand, rather than just the general domain of knowledge, but reasonable opinions could differ here. Unfortunately, I do not find evidence that OpenAI is evaluating either task-specific or domain-specific fine-tuned (DSFT) models, and so this distinction may not matter for the time being.
Anthropic also says that it leaves “headroom” in analysis of its evaluations to try to account for the possibility of a model being stolen and then further enhanced - aka, a safety buffer where a model can’t be too close to dangerous abilities, or else it’ll be treated as potentially-already-dangerous: “We include headroom to account for the possibility that the model is either modified via one of our own finetuning products or stolen in the months following testing, and used to create a model that has reached a Capability Threshold. That said, estimating these future effects is very difficult given the state of research today.”
Scaffolding generally refers to “helper tools” that are provided to a model, such as being able to write thoughts to a scratchpad it can refer back to, or being able to browse the Web (for models without this native functionality). It makes sense to run evaluations on models that have access to this scaffolding, and I am glad that many AI companies are doing this: Models with scaffolding will tend to have higher performance than those without, and so this better estimates the level of risk that a model might reasonably pose. Specifically, OpenAI writes that for some tests they “evaluated our models with a variety of custom scaffolds as well as the Ranger scaffold for capability elicitation.” Scaffolding is different from task-specific fine-tuning, in that scaffolding is not making any changes to the underlying model, whereas TSFT is about specifically tweaking the model to do better on a type of task by learning from experience.
I also reviewed OpenAI’s GPT-4o System Card, published August 8, 2024, to see if there is earlier evidence of task-specific fine-tuning. I do not find evidence of this, beyond one evaluation within the report that mentions a “custom research-only version of GPT-4o,” which I believe to be something different: a model that has been trained not to refuse to answer sensitive bioquestions, as opposed to a model that has been trained to be specifically strong at answering sensitive bioquestions. (In more general terms, this is a model that is “helpful-only”: it will help the user with their questions, without considering whether the questions might be harmful.)
These factors associated with TSFT (labor, cost, complexity) would be useful to have more specific estimates on, which could be gleaned by a company undertaking this process for a certain number of its evaluations.
There is also a real tradeoff in creating datasets that help a model to be good at dangerous tasks: This is similar to gain-of-function research conducted in some biological labs, and needs to be undertaken with quite a lot of caution. For instance, if a malicious actor does steal the general model, they potentially could steal this dataset as well, which otherwise might have been harder to tap into. This is a real challenge, though I don’t think it warrants forgoing TSFT as a consequence.
Even within task-specific fine-tuning, there might be performance differences between versions of this done with supervised fine-tuning, as opposed to fine-tuning with reinforcement learning. It would therefore be useful to know which of these methods has been used in the TSFT version. This also points at a more general complication of TSFT safety-testing: For a given dangerous task, there are multiple different TSFT models that a company could create and test - trained on slightly different data for teaching the task, and trained with slightly different techniques. For now, it would be an improvement to evaluate a TSFT model at all, even if there is a slightly stronger TSFT model that the company could have created instead.
Miles Brundage has written on many of the challenges of strong model-security, and why it is worth pursuing nonetheless. See also RAND’s report on Securing Model Weights for a thorough review of security challenges.
As noted, all of my writing and analysis is based solely on publicly available information. OpenAI’s recent safety-testing is documented in its System Cards for o1, o3-mini, Deep Research, GPT-4.5, and GPT-4o. Anthropic’s recent safety-testing is documented in its System Card for Claude 3.7 Sonnet. Google DeepMind’s recent safety-testing is documented in “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context” and “Gemma 2: Improving Open Language Models at a Practical Size”. Meta’s recent safety-testing is documented in “The Llama 3 Herd of Models”.