Explainer: The basics of AI monitoring
In short: You log interactions with an AI model to a database, analyze these interactions, and follow up on concerning patterns.
In a recent piece, I argued that it is important for AI companies to monitor internal use of their models.
But many people won’t know exactly what I mean by monitoring—and so I want to describe how monitoring works, and what design choices are involved.
This post should not be taken as endorsing any particular practice—just sharing knowledge about an important category of safety mitigation.
In particular, I do not try to justify in this post why an AI company might want to monitor uses of its AI, or describe the countervailing reasons not to (e.g., user privacy, data security). These practices also vary across AI companies.

Rough categories of safety mitigations
Broadly, AI companies are trying to balance an equation when choosing which safety mitigations to use: reduce harmful or risky uses of their models, while keeping it easy for users to accomplish their goals and keeping the costs of safety tooling low.
Two main safety strategies that AI companies put around their models are real-time gatekeeping and retroactive monitoring.
Real-time gatekeeping means stopping misuse before it happens, for example by screening requests before they reach the model: a quick classifier might determine that the user is asking for something against policy, and so the request is never passed to the model at all.
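To make that concrete, here is a minimal sketch of what such a gate could look like. Both functions are hypothetical placeholders (not any company's actual tooling), and the blocked-phrase check stands in for whatever quick classifier a company might use:

```python
# A minimal sketch of real-time gatekeeping: screen the request first, and only
# pass it to the model if it looks allowed. Both functions are hypothetical
# placeholders, not any company's actual tooling.

def looks_against_policy(prompt: str) -> bool:
    """Stand-in for a quick policy check (a small classifier model or rule set)."""
    blocked_phrases = ["make a bomb"]  # illustrative only
    return any(phrase in prompt.lower() for phrase in blocked_phrases)

def call_model(prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"(model response to: {prompt!r})"

def handle_request(prompt: str) -> str:
    if looks_against_policy(prompt):
        # The request never reaches the model at all.
        return "Sorry, this request appears to violate our usage policies."
    return call_model(prompt)
```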
But in many cases, real-time gatekeeping is expensive or inflexible.
So companies often use retroactive monitoring to find and follow up on problems after they’ve occurred. Especially if harmful uses of the model are rare (or hard to spot unless viewed in aggregate), this can be much more efficient, though it does allow some harmful uses to happen before they are caught and addressed.
(In addition to these two categories of mitigations, which are wrapped around a given model, there are also mitigations that alter a model—like fine-tuning it to refuse certain requests.)
What is needed for retroactive monitoring?
To make retroactive monitoring work, three components are essential: logging, analysis, and enforcement.
You need to put data about the model’s use somewhere (logging), search it for issues (analysis), and then do something about those issues (enforcement).
Each component adds to the overall maturity and effectiveness of the system.
1. Logging
Logging means capturing a record of how the model was used and storing the record somewhere.
A log can include a wide range of information, like the user’s prompt, the model’s response, or “metadata” about the request, like the country a user is accessing the model from.
It’s important to note: the model itself doesn’t do logging. Instead, monitoring systems must be fed the information going to (or coming from) the model. This is typically achieved by placing a layer between the user and the model: This layer passes messages back and forth between the two, and so can log these messages into a monitoring database.
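Here is a minimal sketch of such a layer, assuming a simple SQLite database as the log store and a hypothetical placeholder standing in for the real model call:

```python
import sqlite3
import time

# A minimal sketch of a logging layer that sits between the user and the model,
# recording the prompt, the response, and some metadata. The call_model function
# is a hypothetical stand-in for the real model call.

conn = sqlite3.connect("monitoring_logs.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS logs "
    "(timestamp REAL, user_id TEXT, country TEXT, prompt TEXT, response TEXT)"
)

def call_model(prompt: str) -> str:
    return f"(model response to: {prompt!r})"  # placeholder

def handle_request(user_id: str, country: str, prompt: str) -> str:
    response = call_model(prompt)
    # Because the layer sees both sides of the exchange, it can log them together.
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?)",
        (time.time(), user_id, country, prompt, response),
    )
    conn.commit()
    return response
```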
2. Analysis
Analysis is about examining logs for signs of misuse or unusual behavior. Analysis can happen at several levels of detail:
Request-level: analyzing an individual message between a user and a model (e.g., the user’s prompt or the model’s completion),
User-level: analyzing a user’s activity in aggregate,
Userbase-wide: analyzing patterns across multiple users to identify coordinated abuse.
Analysis can also happen automatically (e.g., via recurring computer-driven reports), manually (e.g., with humans looking over the data), or some combination thereof.1
To find concerning uses of a model, sometimes keyword-scanning is enough.
But increasingly, companies use lightweight AI models to flag violations: These can catch nuances and slight changes in patterns that keyword-scanning can’t.
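As a toy example, here is a minimal sketch of request-level analysis along these lines, with an illustrative keyword list and a hypothetical placeholder standing in for a lightweight classifier:

```python
# A minimal sketch of request-level analysis over logged prompts, combining a
# keyword scan with a (hypothetical) lightweight classifier.

KEYWORDS = ["enrich uranium"]  # illustrative only

def keyword_scan(prompt: str) -> bool:
    return any(keyword in prompt.lower() for keyword in KEYWORDS)

def flag_with_small_model(prompt: str) -> bool:
    """Stand-in for a call to a cheap classifier model."""
    return False  # placeholder

def analyze_logs(rows):
    """rows: iterable of (user_id, prompt) pairs pulled from the log store."""
    return [
        (user_id, prompt)
        for user_id, prompt in rows
        if keyword_scan(prompt) or flag_with_small_model(prompt)
    ]
```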
When designing an analysis system, some important design choices include:
How often to analyze new data (hourly, vs. daily, vs. weekly, etc.),
What percentage of new data to analyze2 (see the sampling sketch after this list),
What kinds of risks to prioritize searching for (e.g., incitement of violence, erotic content, model jailbreaks),
And what techniques (e.g., which models) to use for finding these.
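To illustrate the sampling choice, here is a minimal sketch that a recurring analysis job might use; the 10% rate is an arbitrary assumption, not a recommendation:

```python
import random

# A minimal sketch of randomly sampling a fraction of new logs for analysis.
# The 10% rate is arbitrary; a real system would tune it against cost and risk.

SAMPLE_RATE = 0.10

def sample_for_review(new_log_rows):
    """Return the subset of new logs that a recurring analysis job would examine."""
    return [row for row in new_log_rows if random.random() < SAMPLE_RATE]
```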
3. Enforcement
Enforcement means taking action against uses that are unwanted—like those that seem to violate a company’s policies. Enforcement actions might include:
Applying a warning, temporary suspension, or ban to the user, often using a form of “strike system,”3
Escalating serious incidents to other internal teams or even law enforcement (e.g., requests related to child exploitation).
Enforcement is a means of driving down the rate of future violations: either by restricting users from taking these actions again (e.g., if the account is suspended), or by notifying users that they are violating the company’s policies and hoping this induces a change in behavior.4 Enforcement can also include an appeals system, for determining whether a user’s activity was restricted in error.
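Here is a minimal sketch of how a strike system could map prior violations to consequences; the thresholds and actions are made up for illustration:

```python
# A minimal sketch of a strike system: consequences escalate with the number and
# severity of previous violations. Thresholds and actions are illustrative only.

def enforcement_action(prior_strikes: int, severity: str) -> str:
    if severity == "severe":
        return "escalate to internal teams (and, if warranted, law enforcement)"
    if prior_strikes == 0:
        return "warn the user"
    if prior_strikes < 3:
        return "temporarily suspend the account"
    return "ban the account"
```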
A well-functioning monitoring system should let an AI company answer nearly any question about the use of their models (subject to privacy and other security restrictions):
“Has anyone tried to generate a novel bioweapon—and if so, what other accounts have made similar requests?”
Even if the company doesn’t yet have the analysis tools to answer this question efficiently, so long as they have the data logged, they can investigate it thoroughly if needed.
But if the AI company isn’t at least logging the data, some important answers might be lost to the sands of time.
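For a sense of what such an after-the-fact investigation could look like, here is a minimal sketch that assumes the logs table from the logging sketch above, with simple keyword matching standing in for far more sophisticated search:

```python
import sqlite3

# A minimal sketch of investigating logged data after the fact. Assumes the logs
# table from the logging sketch above; the keyword match is a crude stand-in for
# more sophisticated search.

conn = sqlite3.connect("monitoring_logs.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS logs "
    "(timestamp REAL, user_id TEXT, country TEXT, prompt TEXT, response TEXT)"
)

# Which accounts have made requests containing an (illustrative) phrase of concern?
rows = conn.execute(
    "SELECT DISTINCT user_id FROM logs WHERE lower(prompt) LIKE ?",
    ("%novel bioweapon%",),
).fetchall()

print([user_id for (user_id,) in rows])
```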
If you enjoyed the article, please share it around; I’d appreciate it a lot. If you would like to suggest a possible topic or otherwise connect with me, please get in touch here. All of my writing and analysis is based solely on publicly available information.
1. A standard combination involves a human following up on the signals generated by an automated report and investigating further to decide if they agree that the uses are concerning.
2. By randomly sampling some percentage of data to analyze, you can have much lower monitoring costs while still generally catching misuse that is prevalent enough to be of high concern. For instance, if an AI company only wants to take action against users who recurringly violate its policies, high-volume violators are likely to still show up in randomly-sampled reports.
3. A strike system generally entails progressively larger consequences for a user, depending on the number and severity of their previous violations.
4. One challenge with notifying users is that this can help users to evade detection in the future, by tipping them off on how they were caught. Also, if a user knows that their account has already been flagged for suspicious activity, they might abandon that account and start fresh with one that isn’t flagged.