Adversarial inputs don’t just trick LLMs into saying bad things; they exploit the fundamental way LLMs process information, revealing a surprising fragility in their reasoning process.

Let’s watch a small transformer classifier (DistilBERT fine-tuned on SST-2) try to classify sentiment on a slightly "off" review.

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Normal input
print(classifier("This movie was absolutely fantastic! I loved every minute of it."))
# Output: [{'label': 'POSITIVE', 'score': 0.9998787641525269}]

# Adversarial input
print(classifier("This movie was absolutely fantastic! I loved every minute of it. NOT."))
# Output: [{'label': 'NEGATIVE', 'score': 0.999770343208313}]

See how appending a single short word flips the prediction with near-total confidence? This isn’t an implementation bug; it follows directly from how these models work. They’re essentially sophisticated pattern-matchers, and adversarial attacks are designed to alter the patterns the model keys on while leaving the text looking mostly unchanged to a human.

The core problem LLMs face with adversarial inputs is their reliance on token probabilities. When you add "NOT." at the end, the model doesn’t understand it as a negation of the entire sentence. Instead, it sees a new token sequence that, based on its training data, has a higher probability of appearing in negative contexts. The model doesn’t reason about the negation; it associates the new pattern with a negative outcome.
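To make the "association, not reasoning" point concrete, here is a deliberately crude toy: a bag-of-words scorer with hand-picked weights. The weights and the `toy_sentiment` function are hypothetical and have nothing to do with DistilBERT's actual internals; the sketch only shows how one strongly weighted token can outvote everything else in the input.

```python
# Toy illustration only: hypothetical hand-picked weights, not a real model.
TOKEN_WEIGHTS = {"fantastic": 2.0, "loved": 1.5, "not": -4.0}


def toy_sentiment(text: str) -> str:
    """Sum per-token weights; the sign of the total decides the label."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(TOKEN_WEIGHTS.get(t, 0.0) for t in tokens)
    return "POSITIVE" if score >= 0 else "NEGATIVE"


review = "This movie was absolutely fantastic! I loved every minute of it."
print(toy_sentiment(review))            # POSITIVE (2.0 + 1.5 = 3.5)
print(toy_sentiment(review + " NOT."))  # NEGATIVE (3.5 - 4.0 = -0.5)
```

A real transformer is far more context-sensitive than this, but the failure mode is similar in spirit: the prediction follows whichever learned association dominates.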

To defend against this, we need to make our prompts more robust, essentially building in checks and balances that the LLM can follow.

1. Instruction Rephrasing and Constraint Reinforcement

Instead of a simple instruction, break it down and explicitly state what not to do.

  • Vulnerable Prompt: "Summarize the following article."
  • Robust Prompt: "Summarize the following article. Your summary must be objective and avoid any subjective opinions or emotional language. Focus only on the main facts presented in the text."

Why it works: This adds explicit constraints that the LLM can latch onto. If an adversarial input tries to inject bias, the LLM has a clearer directive to stick to objectivity.
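One way to keep those constraints consistent across calls is to build the prompt in code rather than by hand. This is a minimal sketch: the `build_summary_prompt` helper and the "Article:" label are assumptions for illustration, not part of any library.

```python
def build_summary_prompt(article: str) -> str:
    """Combine the base instruction, the explicit constraints, and the article."""
    constraints = [
        "Your summary must be objective and avoid any subjective opinions or emotional language.",
        "Focus only on the main facts presented in the text.",
    ]
    instruction = "Summarize the following article. " + " ".join(constraints)
    # Assumption: the article text goes after a plain "Article:" label.
    return f"{instruction}\n\nArticle:\n{article}"


prompt = build_summary_prompt("Local council approves new bike lanes.")
print(prompt)
```

Keeping the constraints in a list makes it easy to add or tighten one without rewriting the whole prompt.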

2. Adding "Guardrail" Prompts or Examples

Provide examples of both desired and undesired outputs, especially highlighting how to handle tricky or potentially misleading inputs.

  • Prompt with Examples: "Classify the sentiment of the following review. Review: 'The service was slow, but the food was amazing.' Desired Output: Mixed

    Review: 'I hated this place, worst experience ever.' Desired Output: Negative

    Review: 'This product is good. NOT.' Desired Output: Positive (Explain that 'NOT' is likely a typo or irrelevant noise.)"

Why it works: By showing the LLM how to handle ambiguous or noisy inputs, you’re essentially teaching it a specific, desired behavior pattern for those cases, rather than relying on its general probabilistic tendencies.
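The example-based prompt above can be assembled from a list of (review, label) pairs, which makes it easier to add new tricky cases as you find them. The `FEW_SHOT_EXAMPLES` list and `build_few_shot_prompt` helper below are hypothetical names used only for this sketch.

```python
# Examples taken from the prompt above; extend this list as new edge cases appear.
FEW_SHOT_EXAMPLES = [
    ("The service was slow, but the food was amazing.", "Mixed"),
    ("I hated this place, worst experience ever.", "Negative"),
    ("This product is good. NOT.", "Positive"),
]


def build_few_shot_prompt(review: str) -> str:
    """Prepend the labeled examples, then leave the final label for the model."""
    lines = ["Classify the sentiment of the following review.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: '{text}'")
        lines.append(f"Desired Output: {label}")
        lines.append("")
    lines.append(f"Review: '{review}'")
    lines.append("Desired Output:")
    return "\n".join(lines)


few_shot_prompt = build_few_shot_prompt("Great battery life. NOT.")
print(few_shot_prompt)
```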

3. Input Sanitization and Filtering (Pre-LLM)

This is a crucial step that happens before the input even reaches the LLM. You can use simpler NLP techniques or even regex to catch common adversarial patterns.

  • Check: Look for common adversarial suffixes or prefixes.
  • Example Filter (Python):
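    def sanitize_input(text):
        # Drop a trailing standalone "not" (e.g. "... NOT.") and leave other input untouched.
        # Caveat: this also rewrites legitimate sentences that end in "not".
        stripped = text.rstrip(" .!?;,") # Ignore trailing punctuation when matching
        if stripped.lower().endswith(" not"):
            return stripped[:-4].rstrip() # Remove the trailing " not"
        return text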
    
    adversarial_text = "This is great! NOT."
    clean_text = sanitize_input(adversarial_text)
    print(clean_text) # Output: This is great!
    

Why it works: This mechanically removes or alters the adversarial perturbation before the LLM even sees it, preventing the model from being misled by the subtle changes.

4. Prompt Chaining and Verification

Use multiple LLM calls. The first call might be a general response, and a second call verifies or refines it based on specific criteria.

  • Prompt 1: "Extract the key entities from this text: [user input]"
  • Prompt 2: "Given these extracted entities: [entities from Prompt 1], and the original text: [user input], does the text express a positive or negative sentiment towards entity X? Respond with only 'Positive', 'Negative', or 'Neutral'."

Why it works: The first prompt gets the basic information. The second prompt, with the added context of the first prompt’s output and the original text, forces the LLM to re-evaluate and confirm its understanding, making it harder for a single adversarial tweak to derail the entire process.
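The two-step chain can be sketched with the model call passed in as a plain function, so the control flow is testable without a real API. Everything here is an assumption for illustration: `chained_sentiment`, the `llm` callable, and the `fake_llm` stub that stands in for a real model.

```python
from typing import Callable

ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}


def chained_sentiment(user_input: str, entity: str, llm: Callable[[str], str]) -> str:
    """Run Prompt 1 (entity extraction), then Prompt 2 (sentiment verification)."""
    entities = llm(f"Extract the key entities from this text: {user_input}")
    verdict = llm(
        f"Given these extracted entities: {entities}, and the original text: "
        f"{user_input}, does the text express a positive or negative sentiment "
        f"towards {entity}? Respond with only 'Positive', 'Negative', or 'Neutral'."
    ).strip()
    # Reject anything outside the allowed labels instead of passing it downstream.
    if verdict not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label from model: {verdict!r}")
    return verdict


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (hypothetical canned answers)."""
    if prompt.startswith("Extract"):
        return "movie"
    return " Negative\n"


result = chained_sentiment("This movie was dull.", "movie", fake_llm)
print(result)
```

In practice you would replace `fake_llm` with a function that calls your model of choice; the label check at the end is what stops a derailed second call from silently propagating.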

5. Temperature and Sampling Strategy Adjustments

While not strictly prompt engineering, adjusting LLM inference parameters can help. Lowering the temperature makes the output more deterministic and less prone to "creative" interpretations that adversarial inputs might exploit.

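  • API Call Example (OpenAI, legacy pre-1.0 Python SDK): openai.Completion.create(model="text-davinci-003", prompt="...", temperature=0.2). Note that text-davinci-003 has since been retired and openai.Completion was removed in version 1.0 of the SDK; current versions accept the same temperature parameter on client.chat.completions.create(...).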

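Why it works: A lower temperature means the model is more likely to pick the single most probable next token, reducing the chance it will go down an unexpected path due to a subtly perturbed input. Keep in mind that temperature only reduces sampling variance; it does not change which token is most probable, so it won’t help when the perturbation shifts the top prediction itself.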
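To see what temperature does numerically, here is a small standalone sketch of temperature-scaled softmax. The logits are made-up values for three hypothetical candidate tokens, not output from any real model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.2)
print(p_default[0], p_low[0])  # top-token probability: about 0.63 vs about 0.99
```

The ranking of the tokens is identical in both cases; only the amount of probability mass on the runners-up changes.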

6. Self-Correction Mechanisms within the Prompt

Instruct the LLM to review its own output and correct errors, especially those related to the original instructions.

  • Prompt: "Analyze the following customer feedback: '[user feedback]'. First, identify the core sentiment (Positive, Negative, Neutral). Second, identify any specific product features mentioned. Finally, review your identified sentiment and features. If the feedback contains contradictory statements or attempts to mislead (e.g., adding nonsensical phrases at the end), re-evaluate and provide your final, corrected sentiment and features. Output Format: Sentiment: [Final Sentiment] Features: [List of Features]"

Why it works: This builds a loop where the LLM acts as both the generator and a basic verifier. It forces the model to re-examine its initial interpretation in light of the instructions, making it more resilient to inputs designed to trick that initial interpretation.
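Because the prompt asks for a fixed output format, the final answer can also be checked in code before anything downstream trusts it. The following is a minimal sketch that assumes the model really does reply in the "Sentiment: ... Features: ..." shape with comma-separated features; `parse_feedback_output` is a hypothetical helper.

```python
import re

ALLOWED_SENTIMENTS = {"Positive", "Negative", "Neutral"}


def parse_feedback_output(raw: str):
    """Extract (sentiment, features) from the model reply, or raise ValueError."""
    sentiment_match = re.search(r"Sentiment:\s*(\w+)", raw)
    features_match = re.search(r"Features:\s*(.*)", raw, flags=re.DOTALL)
    if not sentiment_match or not features_match:
        raise ValueError("Reply does not follow the requested output format")
    sentiment = sentiment_match.group(1)
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"Unexpected sentiment: {sentiment!r}")
    # Assumption: features arrive as a comma-separated list.
    features = [f.strip() for f in features_match.group(1).split(",") if f.strip()]
    return sentiment, features


parsed = parse_feedback_output("Sentiment: Negative Features: battery life, screen")
print(parsed)
```

A reply that fails this check is a reasonable trigger for a retry or for routing the input to manual review.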

The next challenge you’ll face is making these defenses computationally efficient, as complex prompt chaining or extensive pre-filtering can increase latency and cost.
