The OpenAI Moderation API doesn’t just block "bad words"; it actually predicts the likelihood of a piece of text belonging to specific categories of harmful content.
Let’s see it in action. Imagine you’re building a chat application and want to ensure user-generated content stays safe. You’d send user messages to the Moderation API.
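from openai import OpenAI

# Requires openai>=1.0. The client reads the key from the OPENAI_API_KEY
# environment variable, so it is not hard-coded in the source.
client = OpenAI()

response = client.moderations.create(
    input="I'm going to kill you!"
)
# Print the full response as indented JSON, using the API's field names
print(response.model_dump_json(indent=2, by_alias=True))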
This might output something like:
{
  "id": "mod-...",
  "model": "text-moderation-004",
  "results": [
    {
      "categories": {
        "sexual": false,
        "hate": false,
        "harassment": true,
        "self-harm": false,
        "sexual/minors": false,
        "hate/threatening": false,
        "harassment/threatening": true,
        "violence": true,
        "violence/graphic": false
      },
      "category_scores": {
        "sexual": 0.00012345,
        "hate": 0.00000001,
        "harassment": 0.98765432,
        "self-harm": 0.00000012,
        "sexual/minors": 0.00000005,
        "hate/threatening": 0.00000002,
        "harassment/threatening": 0.95000000,
        "violence": 0.99000000,
        "violence/graphic": 0.00000003
      },
      "flagged": true
    }
  ]
}
Here, flagged is true because the text triggered three categories: harassment, harassment/threatening, and violence. The category_scores give the model’s confidence for each category, ranging from 0 to 1, where a higher value means more confidence. You can set your own thresholds based on how strict you want your moderation to be. For example, you might decide to block any text with a harassment score above 0.7.
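As a minimal sketch of custom thresholds, assume the scores have been copied into a plain dictionary shaped like the category_scores object above. The helper name categories_over_threshold, the default of 0.9, and the override table are illustrative choices, not part of the API.

```python
# Hypothetical helper: the names and threshold values are illustrative only.
DEFAULT_THRESHOLD = 0.9

# Per-category overrides; anything not listed falls back to DEFAULT_THRESHOLD.
THRESHOLDS = {
    "harassment": 0.7,
}


def categories_over_threshold(category_scores, thresholds=THRESHOLDS,
                              default=DEFAULT_THRESHOLD):
    """Return the sorted names of categories whose score exceeds its threshold."""
    return sorted(
        name
        for name, score in category_scores.items()
        if score > thresholds.get(name, default)
    )


# Three scores copied from the sample response above.
sample_scores = {
    "harassment": 0.98765432,
    "violence": 0.99,
    "sexual": 0.00012345,
}
print(categories_over_threshold(sample_scores))  # ['harassment', 'violence']
```

An empty list means nothing crossed your thresholds, whatever the API’s own flagged value says.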
The real power here is understanding what the API is and isn’t doing. It’s not a blocklist; it’s a machine-learning classifier trained on human-labeled content. This means it can catch nuanced forms of harm that simple keyword matching would miss, but it can also produce false positives and false negatives. The model is designed to identify several types of harmful content, including hate speech, harassment, violence, self-harm, and sexual content, with subcategories for more granular control.
When you send text, the API runs it through a neural classifier that has learned patterns associated with each category of unsafe content. Each entry in category_scores is scored on its own, so the scores do not sum to 1: in the sample response above, harassment, harassment/threatening, and violence all score 0.95 or higher at the same time. Treat them as per-category confidence values rather than as a single probability distribution across categories.
The crucial thing to remember is that the API is stateless. Each request is independent. It has no memory of previous inputs. This means you need to manage any context or conversation history on your end before sending it to the Moderation API. If you’re moderating a conversation, you might need to send segments of the conversation, or summaries, depending on your use case and how you want to detect evolving harmful patterns.
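Because the API keeps no history, one possible way to give it context is a sliding window over the conversation. The sketch below is an assumption about how you might do that, not something the API prescribes; build_moderation_inputs and its window parameter are hypothetical names.

```python
def build_moderation_inputs(messages, window=3):
    """Join each message with up to `window - 1` preceding messages.

    Returns one string per message, so every message can be moderated
    together with its recent context.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    inputs = []
    for i in range(len(messages)):
        start = max(0, i - window + 1)
        inputs.append("\n".join(messages[start:i + 1]))
    return inputs


history = ["hi", "where do you live?", "I will find you"]
for text in build_moderation_inputs(history, window=2):
    print(repr(text))
```

Each resulting string would then be sent as the input of its own moderation request. A larger window catches slower-building patterns at the cost of longer inputs.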
The sample response above reports text-moderation-004 as the model that handled the request. You can specify which model to use with the model parameter; if you omit it, the API falls back to its default, which is typically the most recent model. The categories themselves are fairly stable, but newer models may add categories, and the underlying model’s interpretation of them can evolve with new training data.
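A small wrapper can make the model choice explicit. This is a sketch that assumes a client object exposing moderations.create with input and an optional model argument, as the OpenAI client in the openai>=1.0 Python SDK does; the wrapper name moderate is hypothetical.

```python
def moderate(client, text, model=None):
    """Call the moderation endpoint, passing `model` only when one is given.

    `client` is assumed to expose `moderations.create(...)`, as the
    `OpenAI` client in the openai>=1.0 Python SDK does.
    """
    kwargs = {"input": text}
    if model is not None:
        # Pin a specific model; otherwise the API picks its default.
        kwargs["model"] = model
    return client.moderations.create(**kwargs)
```

With the real SDK you would pass an OpenAI() instance as client, plus whichever moderation model name your account can access. Pinning a model keeps scores comparable over time, while omitting it lets you pick up the default as it changes.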
The API also provides a flagged boolean, a simple true/false indicating whether any category exceeded a predefined internal threshold (which isn’t directly exposed but is generally set to be quite sensitive). For fine-grained control, rely on the category_scores and implement your own logic based on your application’s tolerance for risk. For instance, you might flag content for review if harassment/threatening is above 0.8, but only block it outright if violence is above 0.95.
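The review-versus-block example can be sketched as a small tiered policy. The rule tables use the two thresholds from the paragraph above; the function name decide and the three action labels are illustrative assumptions.

```python
# Thresholds taken from the example in the text; the structure is illustrative.
REVIEW_RULES = {"harassment/threatening": 0.8}
BLOCK_RULES = {"violence": 0.95}


def decide(category_scores):
    """Return "block", "review", or "allow" for one set of category scores."""
    def exceeds(rules):
        # A missing category is treated as a score of 0.0.
        return any(
            category_scores.get(name, 0.0) > limit
            for name, limit in rules.items()
        )

    # Check the strictest outcome first so it wins when both tiers match.
    if exceeds(BLOCK_RULES):
        return "block"
    if exceeds(REVIEW_RULES):
        return "review"
    return "allow"


print(decide({"harassment/threatening": 0.95, "violence": 0.99}))  # block
print(decide({"harassment/threatening": 0.85, "violence": 0.40}))  # review
print(decide({"harassment/threatening": 0.10, "violence": 0.10}))  # allow
```

Keeping the rules in dictionaries means the policy can be tuned without touching the decision logic.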
What many developers miss is that the categories object contains subcategories that are easy to overlook. For instance, hate/threatening is distinct from hate, and harassment/threatening is distinct from harassment. This distinction matters because a user might express hate without making a direct threat, or make a threat that isn’t hate speech. Understanding these granular differences allows for more nuanced moderation policies.
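To act on that distinction, a policy can check the /threatening subcategory before its parent. The sketch below assumes a plain dictionary shaped like the categories object in the sample response; threat_level and its return labels are hypothetical.

```python
def threat_level(categories, parent):
    """Classify one parent category ("hate" or "harassment") by its flags."""
    if categories.get(f"{parent}/threatening", False):
        return "threatening"
    if categories.get(parent, False):
        return "non-threatening"
    return "not flagged"


# Flags copied from the sample response above.
sample_categories = {
    "hate": False,
    "hate/threatening": False,
    "harassment": True,
    "harassment/threatening": True,
}
print(threat_level(sample_categories, "harassment"))  # threatening
print(threat_level(sample_categories, "hate"))        # not flagged
```

A policy might, for example, queue "non-threatening" results for review while escalating "threatening" ones immediately.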
The next step after implementing moderation is often dealing with the consequences of flagged content, such as user appeals or automated content review workflows.