Compress Your Prompts: Smarter AI, Lower Costs
November 19, 2025
TL;DR
- Shorter, focused prompts reduce token usage and lower API costs significantly.
- Concise prompts often yield better accuracy by reducing the "lost in the middle" effect.
- LLMLingua (Microsoft Research) achieves up to 20x compression with minimal quality loss.
- GIST tokens enable 26x compression through learned embeddings (requires model fine-tuning).
- PCToolkit provides a unified framework for comparing compression methods.
- In production, expect 50–80% cost savings; research setups can achieve 90%+ in ideal cases.
What You'll Learn
- Why prompt compression matters — the economic and accuracy benefits.
- How to use LLMLingua — with working code examples.
- The differences between LLMLingua variants — LLMLingua, LongLLMLingua, and LLMLingua-2.
- When GIST tokens and PCToolkit are appropriate — and their limitations.
- How to test and monitor compression in production systems.
Prerequisites
You'll get the most from this article if you:
- Have basic familiarity with LLM APIs (OpenAI, Anthropic, or similar).
- Understand what tokens are and how they affect pricing.
- Know how to write and structure prompts for generative AI.
Introduction: Why Prompt Compression Matters
Every token you send to an LLM costs money. Whether you're building a chatbot, RAG system, or autonomous agent, your bill scales with token count.
But cost isn't the only factor. Research shows that long prompts can actually reduce accuracy. The "Lost in the Middle" paper[^1] demonstrates that LLMs struggle with information positioned in the middle of long contexts: performance follows a U-shaped curve, with the best results when relevant information appears at the beginning or end.
Prompt compression addresses both problems: reducing costs while potentially improving output quality.
The Economics of Tokens
Most LLM APIs charge per token for both input and output. Current pricing as of November 2025:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Opus 4.5 | $5.00 | $25.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
Pricing can change — always confirm on the official OpenAI and Anthropic pricing pages before budgeting.
Cost Savings Example
| Scenario | Input Tokens | Monthly Volume | Monthly Input Cost (Haiku 4.5) | With 50% Compression |
|---|---|---|---|---|
| Chatbot | 2,000/request | 1M requests | $2,000 | $1,000 |
| RAG System | 5,000/request | 500K requests | $2,500 | $1,250 |
| Code Analysis | 10,000/request | 100K requests | $1,000 | $500 |
At scale, compression directly impacts profitability — especially on input-heavy workloads.
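To make the table concrete, the arithmetic is simple enough to script. Below is a back-of-the-envelope sketch that reproduces the chatbot row; the price constant is the Haiku 4.5 input rate from the pricing table above, and everything else is illustrative.

```python
# Back-of-the-envelope input-cost estimate (reproduces the chatbot row above).
def monthly_input_cost(tokens_per_request: int,
                       requests_per_month: int,
                       price_per_million: float,
                       compression_ratio: float = 1.0) -> float:
    """Estimated monthly input cost in dollars, optionally after compression."""
    total_tokens = tokens_per_request * requests_per_month * compression_ratio
    return total_tokens / 1_000_000 * price_per_million

HAIKU_INPUT_PRICE = 1.00  # $ per 1M input tokens (see pricing table above)

baseline = monthly_input_cost(2_000, 1_000_000, HAIKU_INPUT_PRICE)
halved = monthly_input_cost(2_000, 1_000_000, HAIKU_INPUT_PRICE, compression_ratio=0.5)
print(f"Baseline: ${baseline:,.0f}/month, with 50% compression: ${halved:,.0f}/month")
# Baseline: $2,000/month, with 50% compression: $1,000/month
```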
The Accuracy Paradox: Why Shorter Can Be Better
It's intuitive to think more context equals better results. But for LLMs, verbosity introduces problems:
The "Lost in the Middle" Effect
Research by Liu et al.[^1] found that LLM performance degrades significantly when relevant information appears in the middle of long contexts. Performance is highest when key information is at the beginning or end of the prompt.
Why This Happens
- Attention dilution: Transformer attention spreads across all tokens, reducing focus on critical information.
- Noise accumulation: Redundant or irrelevant content can confuse the model.
- Position bias: Models trained on certain patterns may weight positions differently.
Compression Benefits
The LongLLMLingua paper[^2] reports improvements on specific benchmarks:
- 21.4% accuracy improvement on NaturalQuestions (multi-document QA at position 10)
- Significant cost reduction on long-context benchmarks
These gains are task-specific and show up mainly in RAG and long-context scenarios. Results vary by use case, so always test on your specific workload.
LLMLingua: The Leading Compression Tool
LLMLingua[^3] is an open-source compression framework from Microsoft Research, with papers published at EMNLP 2023 and ACL 2024. It uses perplexity-based token filtering to remove redundant information while preserving semantic meaning.
Installation
pip install llmlingua
Basic Usage
from llmlingua import PromptCompressor
# Initialize the compressor
llm_lingua = PromptCompressor()
original_prompt = """
You are an expert data analyst. Please analyze the following dataset
and provide insights about trends, anomalies, and correlations.
Be concise but detailed in your analysis. Make sure to explain any
patterns you observe and provide actionable recommendations based on
the data. Consider both short-term and long-term implications.
The dataset contains quarterly sales figures from 2020 to 2024.
"""
# Compress the prompt
compressed_result = llm_lingua.compress_prompt(
original_prompt,
target_token=50, # Target compressed length
)
print(f"Original length: {len(original_prompt.split())} words")
print(f"Compressed: {compressed_result['compressed_prompt']}")
print(f"Compression ratio: {compressed_result['ratio']:.2f}")
How LLMLingua Works
LLMLingua uses a small language model (like LLaMA-7B or GPT-2) to calculate perplexity for each token:
- High perplexity tokens = surprising/informative → keep these
- Low perplexity tokens = predictable/redundant → safe to remove
The algorithm preserves semantic meaning by keeping tokens that carry the most information.
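To make the intuition concrete, here is a minimal, illustrative sketch of perplexity-style filtering using GPT-2 from Hugging Face transformers. This is not LLMLingua's actual algorithm (which adds budget allocation and iterative, segment-level compression); it only demonstrates ranking tokens by surprisal and keeping the most informative half.

```python
# Illustrative only: rank tokens by surprisal under GPT-2 and keep the most surprising 50%.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Please analyze the following dataset and provide detailed insights about trends."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = model(input_ids).logits  # [1, seq_len, vocab_size]

# Negative log-likelihood of each token given its prefix (the first token has no prefix).
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
token_nll = -log_probs.gather(2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]

tokens = tokenizer.convert_ids_to_tokens(input_ids[0])[1:]
keep = max(1, int(len(tokens) * 0.5))  # keep the most surprising half, in original order
kept = sorted(torch.topk(token_nll, keep).indices.tolist())
print(tokenizer.convert_tokens_to_string([tokens[i] for i in kept]))
```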
Specifying a Compression Model
from llmlingua import PromptCompressor
# Use a specific model for perplexity calculation
llm_lingua = PromptCompressor(
model_name="NousResearch/Llama-2-7b-hf",
device_map="cuda" # Use GPU if available
)
# For faster compression with smaller model
llm_lingua_fast = PromptCompressor(
model_name="gpt2",
device_map="cpu"
)
Preserving Important Tokens
Force certain tokens to be kept regardless of perplexity:
compressed = llm_lingua.compress_prompt(
original_prompt,
target_token=100,
force_tokens=['\n', '?', ':', 'API', 'error', 'function'] # Always keep these
)
LLMLingua Variants: Which One to Use?
Microsoft Research has released three variants, each optimized for different use cases:
| Variant | Best For | Key Feature | Speed |
|---|---|---|---|
| LLMLingua | General compression | Coarse-to-fine perplexity filtering | Baseline |
| LongLLMLingua | RAG / long contexts | Question-aware compression + reordering | Similar |
| LLMLingua-2 | Production / speed | BERT encoder + GPT-4 distillation | 3–6x faster |
LLMLingua (Original)
Published at EMNLP 2023. Best for general-purpose compression.
- Up to 20x compression with minimal performance loss on benchmarks like GSM8K
- Uses iterative token pruning based on perplexity
- Works with any downstream LLM (black-box compatible)
LongLLMLingua
Published at ACL 2024. Optimized for RAG and long-context scenarios.
- Question-aware compression: Prioritizes tokens relevant to the query
- Document reordering: Moves important content to beginning/end (addresses "lost in the middle")
- Best results on multi-document QA tasks
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor()
# LongLLMLingua-style compression with question awareness
compressed = llm_lingua.compress_prompt(
context=long_document,
question="What were the Q3 revenue figures?",
target_token=500,
reorder_context="sort" # Reorder by relevance
)
LLMLingua-2
Published at ACL 2024 Findings. Optimized for speed and production use.
- Uses BERT-level encoder instead of LLaMA (much smaller)
- Trained via GPT-4 data distillation
- 3–6x faster than original LLMLingua
- 1.6–2.9x lower end-to-end latency
- Achieves 2–5x compression (more conservative than original)
from llmlingua import PromptCompressor
# LLMLingua-2 configuration
llm_lingua_2 = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True
)
compressed = llm_lingua_2.compress_prompt(
original_prompt,
target_token=100
)
GIST Tokens: Extreme Compression via Learned Embeddings
GIST tokens[^4] take a fundamentally different approach. Instead of removing tokens, gisting learns compressed embeddings that represent entire prompts.
Key Characteristics
- Published at NeurIPS 2023 (Stanford/UC Berkeley)
- Achieves up to 26x compression with 40% FLOPs reduction
- Requires white-box model access (not compatible with API-only services)
- Needs fine-tuning infrastructure
How GIST Works
- Train a model to encode long prompts into a small number of "gist" tokens (e.g., 1–10 tokens)
- These gist tokens serve as compressed context for subsequent queries
- The model learns to reconstruct semantic meaning from compressed representations
Conceptual Example
Without GIST:
[500-token system prompt] + [user query] → LLM → response
With GIST:
[10 gist tokens representing system prompt] + [user query] → LLM → response
Limitations
- Not API-compatible: Requires access to model internals
- Training required: Must fine-tune for your specific use case
- Model-specific: Gist tokens trained for LLaMA won't work with GPT
Repository
Pre-trained models available for LLaMA-7B and FLAN-T5-XXL: https://github.com/jayelm/gisting
PCToolkit: Unified Compression Framework
PCToolkit[^5] provides a standardized interface for comparing multiple compression methods side-by-side.
Installation
# Clone the repository
git clone https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.git
cd Toolkit-for-Prompt-Compression
# Install dependencies
pip install -r requirements.txt
You'll also need to download models — most are available from Hugging Face Hub, but SCRL models require manual download (see the /models folder in the repository for instructions).
Included Methods
PCToolkit integrates five compression approaches:
- Selective Context — Rule-based filtering
- LLMLingua — Perplexity-based compression
- LongLLMLingua — Question-aware compression
- SCRL — Reinforcement learning approach
- Keep it Simple — Minimal compression baseline
Usage Example
from pctoolkit.compressors import PromptCompressor
compressor = PromptCompressor(type='SCCompressor', device='cuda')
test_prompt = "Your long prompt here..."
ratio = 0.5
result = compressor.compressgo(test_prompt, ratio)
print(result)
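Because every method sits behind the same interface, sweeping settings is straightforward. The sketch below reuses the compressor and test_prompt from the example above to compare several ratios on one prompt; other compressor type strings are documented in the repository.

```python
# Reuse the compressor and test_prompt from the example above to sweep a few ratios.
for ratio in (0.7, 0.5, 0.3):
    output = compressor.compressgo(test_prompt, ratio)
    print(f"ratio={ratio}: {output}")
```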
When to Use PCToolkit
- Research: Benchmarking compression methods
- Evaluation: Finding the best method for your use case
- A/B testing: Comparing approaches in production experiments
Choosing the Right Tool
| Use Case | Recommended Tool | Why |
|---|---|---|
| General compression | LLMLingua | Well-tested, easy to use |
| RAG systems | LongLLMLingua | Question-aware, handles long docs |
| Production (speed critical) | LLMLingua-2 | 3–6x faster |
| Maximum compression | GIST tokens | 26x compression (if you can fine-tune) |
| Research/comparison | PCToolkit | Unified benchmarking |
| API-only access | LLMLingua/LLMLingua-2 | No model internals needed |
Realistic Compression Expectations
Based on published research, here's what to expect:
| Method | Typical Compression | Best Case | Quality Impact |
|---|---|---|---|
| LLMLingua | 4–10x | 20x | Minimal on most tasks |
| LLMLingua-2 | 2–5x | 5x | Minimal, faster |
| LongLLMLingua | 4x (~75% reduction) | Similar | Can improve RAG accuracy |
| GIST tokens | 10–26x | 26x | Requires fine-tuning |
In production, 50–80% cost savings (2–5x compression) are realistic with LLMLingua-2. The 20x compression (95% savings) represents best-case scenarios on specific benchmarks — always validate on your workload.
Production Integration
Building a Compression Pipeline
from llmlingua import PromptCompressor
from anthropic import Anthropic
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize compressor and LLM client
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True
)
client = Anthropic()
def query_with_compression(
prompt: str,
target_ratio: float = 0.5,
model: str = "claude-haiku-4-5-20251001"
) -> dict:
"""Query LLM with compressed prompt, returning response and metrics."""
# Compress the prompt
compressed_result = compressor.compress_prompt(
prompt,
rate=target_ratio
)
compressed_prompt = compressed_result['compressed_prompt']
compression_ratio = compressed_result['ratio']
logger.info(f"Compression ratio: {compression_ratio:.2%}")
# Query the LLM
message = client.messages.create(
model=model,
max_tokens=1024,
messages=[{"role": "user", "content": compressed_prompt}]
)
return {
"response": message.content[0].text,
"original_length": len(prompt.split()),
"compressed_length": len(compressed_prompt.split()),
"compression_ratio": compression_ratio,
"input_tokens": message.usage.input_tokens,
"output_tokens": message.usage.output_tokens
}
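A quick usage sketch for the pipeline above (assumes ANTHROPIC_API_KEY is set in your environment; the prompt is illustrative):

```python
if __name__ == "__main__":
    long_prompt = (
        "You are an expert data analyst. Analyze the quarterly sales figures "
        "from 2020 to 2024, describe trends, anomalies, and correlations, and "
        "finish with actionable recommendations for the next quarter."
    )
    result = query_with_compression(long_prompt, target_ratio=0.5)
    print(result["response"])
    print(f"Billed {result['input_tokens']} input tokens "
          f"({result['original_length']} -> {result['compressed_length']} words after compression)")
```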
Compressing Conversation History
For chat applications, compress older messages while keeping recent ones intact:
def compress_conversation(
messages: list[dict],
keep_recent: int = 2,
target_ratio: float = 0.5
) -> list[dict]:
"""Compress older conversation history while preserving recent messages."""
if len(messages) <= keep_recent:
return messages
# Split into old (compress) and recent (keep)
old_messages = messages[:-keep_recent]
recent_messages = messages[-keep_recent:]
# Combine old messages into text
history_text = "\n".join([
f"{m['role'].upper()}: {m['content']}"
for m in old_messages
])
# Compress
compressed = compressor.compress_prompt(
history_text,
rate=target_ratio
)
# Return as summary + recent messages
return [
{"role": "system", "content": f"Previous conversation summary:\n{compressed['compressed_prompt']}"},
*recent_messages
]
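For example, with a hypothetical five-turn conversation, only the last two turns survive verbatim and everything earlier is folded into a compressed summary message:

```python
# Hypothetical conversation; the last two turns are kept verbatim.
messages = [
    {"role": "user", "content": "Hi, I need help analyzing our 2023 sales data."},
    {"role": "assistant", "content": "Sure. Which regions and product lines matter most?"},
    {"role": "user", "content": "Focus on EMEA, hardware only."},
    {"role": "assistant", "content": "Understood, EMEA hardware."},
    {"role": "user", "content": "What seasonal patterns do you see?"},
]

compacted = compress_conversation(messages, keep_recent=2, target_ratio=0.5)
for m in compacted:
    print(f"{m['role']}: {m['content'][:80]}")
```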
Testing and Validation
Semantic Similarity Testing
Verify compressed prompts maintain meaning:
import pytest
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
def test_semantic_preservation():
"""Verify compression preserves semantic meaning."""
original = """
Analyze the quarterly sales data and identify trends.
Focus on year-over-year growth and seasonal patterns.
Provide actionable recommendations for Q4.
"""
compressed_result = compressor.compress_prompt(original, rate=0.5)
compressed = compressed_result['compressed_prompt']
# Compute embeddings
orig_embedding = model.encode([original])
comp_embedding = model.encode([compressed])
# Calculate similarity
similarity = cosine_similarity(orig_embedding, comp_embedding)[0][0]
assert similarity > 0.80, f"Semantic similarity too low: {similarity:.2f}"
def test_compression_ratio():
"""Verify target compression ratio is achieved."""
original = "A " * 200 # 200 tokens
compressed_result = compressor.compress_prompt(original, rate=0.5)
actual_ratio = len(compressed_result['compressed_prompt'].split()) / 200
# Allow 20% tolerance
assert 0.4 <= actual_ratio <= 0.6
Integration Testing
def test_end_to_end_pipeline():
"""Test full compression + LLM query pipeline."""
test_prompt = """
You are a helpful assistant. Please summarize the following:
The quick brown fox jumps over the lazy dog. This sentence
contains every letter of the alphabet and is commonly used
for typing practice and font demonstrations.
"""
result = query_with_compression(test_prompt, target_ratio=0.6)
assert "response" in result
assert len(result["response"]) > 0
assert result["compression_ratio"] < 0.8
Monitoring in Production
Track these metrics:
from dataclasses import dataclass
from datetime import datetime
import json
@dataclass
class CompressionMetrics:
timestamp: str
original_tokens: int
compressed_tokens: int
compression_ratio: float
semantic_similarity: float
llm_input_tokens: int
llm_output_tokens: int
model: str
def log_metrics(metrics: CompressionMetrics):
"""Log compression metrics for monitoring."""
print(json.dumps({
"timestamp": metrics.timestamp,
"original_tokens": metrics.original_tokens,
"compressed_tokens": metrics.compressed_tokens,
"compression_ratio": f"{metrics.compression_ratio:.2%}",
"semantic_similarity": f"{metrics.semantic_similarity:.3f}",
"model": metrics.model
}))
Key Metrics to Track
| Metric | Target | Alert Threshold |
|---|---|---|
| Compression ratio | 40–60% kept | >80% (too little compression) |
| Semantic similarity | >0.85 | <0.75 (meaning loss) |
| Latency overhead | <100ms | >500ms |
| Task accuracy | Baseline ±5% | >10% degradation |
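A minimal sketch of wiring the first two thresholds into the CompressionMetrics dataclass defined above (the threshold values mirror the table; the sample metrics are hypothetical):

```python
def check_alerts(metrics: CompressionMetrics) -> list[str]:
    """Return alert messages based on the thresholds in the table above."""
    alerts = []
    if metrics.compression_ratio > 0.80:
        alerts.append(f"Too little compression: {metrics.compression_ratio:.0%} of tokens kept")
    if metrics.semantic_similarity < 0.75:
        alerts.append(f"Possible meaning loss: similarity {metrics.semantic_similarity:.2f}")
    return alerts

# Hypothetical sample reading
sample = CompressionMetrics(
    timestamp="2025-11-19T12:00:00Z",
    original_tokens=2000,
    compressed_tokens=900,
    compression_ratio=0.45,
    semantic_similarity=0.91,
    llm_input_tokens=950,
    llm_output_tokens=300,
    model="claude-haiku-4-5",
)
for alert in check_alerts(sample):
    print("ALERT:", alert)
```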
Security Considerations
Preserve Safety Instructions
Never compress system prompts or safety guardrails:
def safe_compress(system_prompt: str, user_content: str) -> str:
"""Compress user content while preserving system instructions."""
# Only compress user content
compressed_user = compressor.compress_prompt(
user_content,
rate=0.5
)['compressed_prompt']
# Combine with preserved system prompt
return f"{system_prompt}\n\nUser query: {compressed_user}"
Validate Compressed Output
Check that compression doesn't surface sensitive patterns that weren't present in the original input:
def validate_compression(original: str, compressed: str) -> bool:
"""Validate compressed output for security concerns."""
sensitive_patterns = ['api_key', 'password', 'secret', 'token']
for pattern in sensitive_patterns:
if pattern in compressed.lower() and pattern not in original.lower():
return False
return True
When NOT to Use Compression
| Scenario | Reason |
|---|---|
| Short prompts (<100 tokens) | Overhead exceeds benefit |
| Legal/medical/compliance text | Risk of losing critical details |
| Code with exact syntax requirements | May break syntax |
| Creative writing | May lose stylistic nuance |
| One-off queries | Setup overhead not justified |
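A practical consequence of the first row is to gate compression behind a size check so short prompts pass through untouched. A minimal sketch, using word count as a rough token proxy (swap in a real tokenizer for precision) and the compressor instance from earlier:

```python
def maybe_compress(prompt: str, min_tokens: int = 100, rate: float = 0.5) -> str:
    """Compress only when the prompt is long enough to be worth the overhead."""
    approx_tokens = len(prompt.split())  # rough estimate; use a tokenizer for accuracy
    if approx_tokens < min_tokens:
        return prompt  # overhead exceeds benefit for short prompts
    return compressor.compress_prompt(prompt, rate=rate)['compressed_prompt']
```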
Troubleshooting Guide
| Issue | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: llmlingua | Not installed | pip install llmlingua |
| CUDA out of memory | Model too large | Use smaller model or CPU |
| Compressed output incoherent | Over-compression | Increase target ratio to 0.5–0.7 |
| Key information missing | Important tokens removed | Add to force_tokens |
| Slow compression | Using large model | Switch to LLMLingua-2 |
Key Takeaways
- Prompt compression saves 50–80% on input costs in production, with up to 95% possible in research scenarios.
- LLMLingua is the most accessible tool: install via pip install llmlingua and use the PromptCompressor class.
- LLMLingua-2 is best for production (3–6x faster, 2–5x compression).
- LongLLMLingua is best for RAG systems (question-aware compression).
- GIST tokens offer maximum compression but require fine-tuning and white-box model access.
- Always test semantic preservation with similarity scores >0.80.
- Never compress system prompts or safety instructions.
- Results are task-specific — always benchmark on your workload.
FAQ
Q1: How much can I realistically save?
In production, 50–80% input cost savings (2–5x compression) are typical with LLMLingua-2. The 20x compression (95% savings) is achievable on specific benchmarks but not universal.
Q2: Will compression reduce output quality?
It depends on the task. For RAG and long-context scenarios, compression can improve quality by 10–20% by addressing "lost in the middle" effects. For other tasks, quality is usually maintained within 5% of baseline.
Q3: Which tool should I start with?
LLMLingua via pip install llmlingua. Use the PromptCompressor class. For production speed, switch to LLMLingua-2.
Q4: Can I use compression with OpenAI/Anthropic APIs?
Yes. LLMLingua and LLMLingua-2 are "black-box compatible" — they compress text before sending to any API. GIST tokens require white-box model access.
Q5: What's the best compression ratio?
Start with 50% (rate=0.5). Test semantic similarity — maintain >0.80. For aggressive compression, don't go below 30% without careful testing.
Q6: How does compression affect latency?
Compression adds 50–200ms overhead but reduces LLM inference time (fewer tokens to process). Net effect is usually positive for prompts >1000 tokens.
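To find the break-even point on your own prompts, time the compression step in isolation. This sketch measures only compression overhead, not the downstream LLM call; the compressor instance is the one from the production section.

```python
import time

def compression_overhead_ms(prompt: str, rate: float = 0.5) -> float:
    """Wall-clock time spent compressing a single prompt, in milliseconds."""
    start = time.perf_counter()
    compressor.compress_prompt(prompt, rate=rate)
    return (time.perf_counter() - start) * 1000.0
```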
References
[^1]: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts", TACL 2024. https://arxiv.org/abs/2307.03172
[^2]: Jiang et al., "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression", ACL 2024. https://arxiv.org/abs/2310.06839
[^3]: Microsoft Research, "LLMLingua: Compressing Prompts for Accelerated Inference", EMNLP 2023. https://github.com/microsoft/LLMLingua
[^4]: Mu et al., "Learning to Compress Prompts with Gist Tokens", NeurIPS 2023. https://arxiv.org/abs/2304.08467
[^5]: PCToolkit repository. https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression