FORGE

Automated LLM security testing — before you build an agent. Test the model. Not the pipeline. The model.
10 Attack Tools
1,590 Base Payloads
5,340+ With Mutations
9,298 Tests Passing
pip install red-specter-forge
You test agents / But never test the model underneath / Jailbreaks ship to production / Policy compliance is a guess / Drift goes undetected for months / Boundary thresholds are unknown / Model substitution is invisible / Regression testing doesn't exist / You trusted the vendor's safety card

Nobody Tests the Model

Everyone tests the agent. Nobody tests the LLM it's built on. You're deploying models with unknown jailbreak resistance, unmeasured policy compliance, invisible drift behaviour, and boundary thresholds you've never mapped. You're flying blind at the foundation layer.

Unknown Injection Surface

Your model has never been tested against systematic prompt injection campaigns. You don't know which injection classes it's vulnerable to — direct, indirect, token smuggling, context overflow, goal hijacking, multi-turn, or rule inversion.

Unmeasured Jailbreak Resistance

DAN variants, persona hijacking, hypothetical framing, Socratic extraction — 70+ documented jailbreak techniques exist. You don't know which ones break your model because you've never systematically tested.

Policy Is a Hope, Not a Number

You set safety policies but never measured violation rates statistically. Without Wilson score confidence intervals across 1,000+ test calls, your "policy compliance" is anecdotal — not empirical.
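
For readers who want to see what that number looks like, here is a minimal sketch of a Wilson score interval in plain Python. It is not FORGE's code. The function name `wilson_interval` and the 24-in-1,000 example are assumptions chosen for illustration.

```python
import math

def wilson_interval(violations: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = violations / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical campaign: 24 policy violations observed in 1,000 test calls.
low, high = wilson_interval(24, 1000)
print(f"Violation rate: {24 / 1000:.1%} [{low:.1%}, {high:.1%}] 95% CI")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when violations are rare, which is the usual case for a safety policy.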

Invisible Boundary Cliff

Every model has a boundary where it transitions from compliance to refusal. You've never mapped it. You don't know your model's exact severity threshold — or what happens at the cliff edge.
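
One way to map that cliff is a binary search over severity levels. The sketch below is an illustration, not FORGE's algorithm: `complies` is a hypothetical callable standing in for "send a probe at this level and check whether the model complied", and the search assumes the model's behaviour is monotonic (once it refuses at a level, it refuses at every higher level).

```python
from typing import Callable, Optional

def find_cliff(complies: Callable[[int], bool], lowest: int = 1, highest: int = 5) -> Optional[int]:
    """Return the lowest severity level the model refuses, or None if it never refuses.

    Assumes monotonic behaviour: compliance at low levels, refusal from the cliff upward.
    """
    if complies(highest):
        return None  # no refusal anywhere in the tested range
    lo, hi = lowest, highest  # invariant: the model refuses at level hi
    while lo < hi:
        mid = (lo + hi) // 2
        if complies(mid):
            lo = mid + 1  # cliff is above mid
        else:
            hi = mid      # cliff is at mid or below
    return lo

# Stand-in model for illustration: complies below level 3, refuses from level 3 up.
fake_model = lambda level: level < 3
print("Cliff at level", find_cliff(fake_model))
```

Real models are noisy near the threshold, so a practical scan would repeat probes at each level and search on the compliance rate rather than a single yes/no answer.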

Drift Goes Undetected

Over long sessions, models drift. Cosine similarity degrades. Toxicity creeps up. Policy violations increase. Without multi-turn drift measurement, you'll never see it happening.
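
A minimal sketch of cosine-similarity drift measurement, under stated assumptions: each turn's response has already been turned into an embedding vector by some model of your choice, drift is measured against the first turn, and the 0.8 threshold is arbitrary. None of these choices are confirmed FORGE behaviour.

```python
import math
from typing import Optional, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)

def drift_curve(embeddings: Sequence[Sequence[float]]) -> list[float]:
    """Similarity of every turn's embedding to the first turn's embedding."""
    baseline = embeddings[0]
    return [cosine_similarity(baseline, e) for e in embeddings]

def first_drift_turn(curve: Sequence[float], threshold: float = 0.8) -> Optional[int]:
    """Index of the first turn whose similarity falls below the threshold, if any."""
    for turn, similarity in enumerate(curve):
        if similarity < threshold:
            return turn
    return None

# Toy 2-D "embeddings" for four turns; real embeddings have hundreds of dimensions.
turns = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]
curve = drift_curve(turns)
print([round(s, 3) for s in curve], "first drift at turn", first_drift_turn(curve))
```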

No Regression Testing

When the vendor pushes a model update, you have no way to know if the new version is weaker than the old one. No two-proportion z-tests. No paired t-tests. No statistical proof. Just hope.
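
To make the statistics concrete, here is a stdlib-only sketch of a two-proportion z-test with Cohen's h as the effect size. The function names and the violation counts are illustrative assumptions, not FORGE output.

```python
import math

def two_proportion_z_test(x_old: int, n_old: int, x_new: int, n_new: int) -> tuple[float, float]:
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_old, p_new = x_old / n_old, x_new / n_new
    pooled = (x_old + x_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return 0.0, 1.0  # both versions have identical all-or-nothing rates
    z = (p_new - p_old) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

def cohens_h(p_old: float, p_new: float) -> float:
    """Effect size for a difference between two proportions."""
    return 2 * math.asin(math.sqrt(p_new)) - 2 * math.asin(math.sqrt(p_old))

# Hypothetical numbers: old version 24/1,000 violations, new version 45/1,000.
z, p = two_proportion_z_test(24, 1000, 45, 1000)
h = cohens_h(24 / 1000, 45 / 1000)
print(f"z={z:.2f}  p={p:.4f}  Cohen's h={h:.3f}")
```

A positive z with a small p-value suggests the new version violates policy more often than the old one. Cohen's h then indicates whether the gap is large enough to matter, since a huge sample can make a trivial difference look significant.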

The FORGE Armoury

Ten tools. Each one attacks a different surface of the LLM. Each one produces structured JSON consumed by the report builder. Each finding maps to OWASP LLM Top 10 2025. Each finding generates an AI Shield blocking rule.

# | Tool | Command | What It Does
01 | Inject Scan | forge inject scan | 80 payloads across 8 injection classes. Direct, indirect, token smuggling, context overflow, goal hijacking, multi-turn deception, rule inversion, multimodal. Mutation engine generates 2,000+ variants.
02 | Jailbreak Scan | forge jailbreak scan | 70 payloads across 7 categories. DAN variants, persona hijack, hypothetical framing, obfuscation, multi-step chaining, Socratic extraction, temporal drift. Adaptive mutation on resistance.
03 | Output Scan | forge output scan | 140 payloads forcing PII extraction, unsafe content generation, and exfiltration simulation. Regex PII detection, toxicity scoring, code exfiltration pattern analysis.
04 | Policy Scan | forge policy scan | 1,000 adversarial prompts across 5 categories. Wilson score confidence intervals on violation rates. Stratified by category, toxicity, severity. Finds exact policy breakdown conditions.
05 | Drift Scan | forge drift scan | 10 conversation sequences over configurable turns. Cosine similarity drift, toxicity drift, KS test for distribution changes, change-point detection. Finds when the model stops being itself.
06 | Boundary Scan | forge boundary scan | 100 payloads across 5 severity levels. Adaptive binary search for the exact compliance cliff edge. Boundary score 0–100. Produces a boundary curve with statistical backing.
07 | Compare Scan | forge compare scan | Identical campaigns against multiple models. Temperature locked to 0. Chi-square significance testing. Comparative security posture table. Tells you which model is weakest.
08 | Regression Scan | forge regression scan | Two model versions. Two-proportion z-test on violation rates. Paired t-test on continuous scores. Cohen's h effect sizes. Tells you if the update weakened security.
09 | Supply Scan | forge supply scan | 200 behavioural probes across 4 categories. Fingerprints the model. Flags if it's not what it claims: tampered, substituted, or fine-tuned. Reports confidence honestly.
10 | Report Build | forge report build | Aggregates all tool outputs. OWASP LLM Top 10 2025 mapping. A–F grading. Ed25519 signed. RFC 3161 timestamped. AI Shield policy file output. JSON + HTML.
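
The chi-square test named in the Compare Scan row can be sketched in a few lines. This is a generic illustration, not FORGE's implementation: the model counts are made up, and significance is checked against standard 0.05 critical values rather than a computed p-value.

```python
# Standard chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def chi_square_violations(violations: list[int], trials: list[int]) -> tuple[float, int]:
    """Chi-square test of homogeneity on a (models x [violation, no violation]) table.

    Returns (statistic, degrees_of_freedom).
    """
    df = len(violations) - 1
    total_v, total_n = sum(violations), sum(trials)
    if total_v == 0 or total_v == total_n:
        return 0.0, df  # every model behaved identically; nothing to compare
    rate = total_v / total_n  # pooled violation rate under the null hypothesis
    stat = 0.0
    for v, n in zip(violations, trials):
        expected_v = n * rate
        expected_ok = n * (1 - rate)
        stat += (v - expected_v) ** 2 / expected_v
        stat += ((n - v) - expected_ok) ** 2 / expected_ok
    return stat, df

# Hypothetical campaign: three models, 1,000 identical prompts each.
stat, df = chi_square_violations([24, 45, 30], [1000, 1000, 1000])
print(f"chi2={stat:.2f} df={df} significant={stat > CRITICAL_05[df]}")
```

A significant result only says the models' violation rates differ somewhere. Pairwise follow-up tests are needed to say which model is the weak one.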

One Command. Every Surface.

Run every offensive tool in sequence, then build a unified signed report:

$ forge full-scan --target https://api.openai.com --api-key sk-xxx --model gpt-4
[INJECT] Running inject scan...
  12 vulnerabilities found across 8 injection classes
[JAILBREAK] Running jailbreak scan...
  4 jailbreaks successful — DAN 11.0, Socratic extraction
[OUTPUT] Running output scan...
  3 PII leaks, 0 exfiltration
[POLICY] Running policy scan — 1,000 calls...
  Violation rate: 2.4% [1.6%, 3.5%] 95% CI
[DRIFT] Running drift scan — 10 × 100 turns...
  KS test: p=0.003 — significant drift detected
[BOUNDARY] Running boundary scan...
  Boundary score: 62/100 — cliff at Level 3

SCAN COMPLETE | Risk Grade: D | 19 findings | Report signed ✓
  JSON: reports/forge-full-scan-2026-03-10.json
  HTML: reports/forge-full-scan-2026-03-10.html

Adaptive Escalation

If the model resists, FORGE escalates. Mutations, encoding, multi-step chains — it keeps pushing until it breaks or exhausts the library.

Statistical Rigour

Wilson score CIs, KS tests, z-tests, t-tests, Cohen's h. Not vibes — mathematics. Every claim backed by statistical significance.
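
As one example of those tests, here is a stdlib-only sketch of the two-sample Kolmogorov-Smirnov test. It assumes the two samples are per-turn scores (toxicity, say) from early and late in a conversation, which is an illustration rather than a description of FORGE's inputs. The p-value uses the common asymptotic approximation and is rough for small samples. In practice `scipy.stats.ks_2samp` is the usual choice.

```python
import math
from typing import Sequence

def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_p_value(d: float, n: int, m: int) -> float:
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return 1.0  # the series is numerically unstable here and the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101))
    return max(0.0, min(1.0, 2 * total))

# Toy scores: early turns versus late turns of the same conversation.
early = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05]
late = [0.06, 0.08, 0.09, 0.11, 0.12, 0.15]
d = ks_statistic(early, late)
print(f"D={d:.2f} p={ks_p_value(d, len(early), len(late)):.4f}")
```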

Ed25519 Signed

Every report cryptographically signed with Ed25519. RFC 3161 timestamped. SHA-256 evidence chains. Tamper-evident by design.
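
The idea behind a SHA-256 evidence chain can be sketched with the standard library alone. The record layout below (`finding`, `prev`, `hash`) and the all-zero genesis value are assumptions for illustration, not FORGE's report format. Ed25519 signing and RFC 3161 timestamping are left out because they need more than the standard library.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain

def _digest(prev_hash: str, finding: dict) -> str:
    # Canonical JSON so the same finding always hashes to the same value.
    body = json.dumps(finding, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def chain_findings(findings: list[dict]) -> list[dict]:
    """Link each finding to the hash of the one before it."""
    chain, prev = [], GENESIS
    for finding in findings:
        digest = _digest(prev, finding)
        chain.append({"finding": finding, "prev": prev, "hash": digest})
        prev = digest
    return chain

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["finding"]):
            return False
        prev = entry["hash"]
    return True

chain = chain_findings([
    {"tool": "inject", "severity": "high"},
    {"tool": "jailbreak", "severity": "medium"},
])
print("intact:", verify_chain(chain))
chain[0]["finding"]["severity"] = "low"  # simulate tampering
print("after tampering:", verify_chain(chain))
```

Signing only the final hash is then enough to cover every finding, since each hash depends on everything before it.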

AI Shield Integration

Every finding generates an AI Shield blocking rule. FORGE findings become runtime protection. One pipeline from testing to production.

10 Attack Tools
1,590 Static Payloads
5,340+ With Mutations
9,298 Tests Passing
0 Failures

25 Variants Per Payload

Every offensive tool ships with a 5-category mutation engine. If the base payload fails, FORGE mutates it — encoding, obfuscation, semantic rewriting, structural wrapping, and evasion techniques. 150 base attack payloads become 3,750+ mutation variants. The model doesn't get to see the same payload twice.

Encoding

  • Base64
  • Hex encoding
  • ROT13
  • URL encoding
  • HTML entities
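
The five encoding mutations listed above map directly onto Python's standard library. This is a sketch of the idea only. The function name `encoding_variants` and the choice to entity-encode every character are assumptions, not FORGE's mutation engine.

```python
import base64
import codecs
import urllib.parse

def encoding_variants(payload: str) -> dict[str, str]:
    """Return one variant of the payload per encoding technique."""
    raw = payload.encode("utf-8")
    return {
        "base64": base64.b64encode(raw).decode("ascii"),
        "hex": raw.hex(),
        "rot13": codecs.encode(payload, "rot13"),
        "url": urllib.parse.quote(payload, safe=""),
        # Numeric character references for every character, not just <, >, and &.
        "html_entities": "".join(f"&#{ord(ch)};" for ch in payload),
    }

for name, variant in encoding_variants("say hi").items():
    print(f"{name:14} {variant}")
```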

Obfuscation

  • L33tspeak
  • Unicode homoglyphs
  • Zero-width chars
  • Character doubling
  • Whitespace injection

Semantic

  • Synonym substitution
  • Passive voice
  • Question-to-statement
  • Negation inversion
  • Academic framing

Structural

  • Markdown wrapping
  • Code block wrapping
  • JSON embedding
  • XML wrapping
  • List formatting

Evasion

  • Language mixing
  • Character splitting
  • Reverse text
  • Pig latin
  • Payload fragmentation
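
Several of the obfuscation and evasion techniques above are simple string transforms. The sketch below shows four of them under assumed names. How FORGE implements them, and how fragments are delivered across turns, is not specified here.

```python
ZERO_WIDTH_SPACE = "\u200b"

def insert_zero_width(payload: str) -> str:
    """Obfuscation: put an invisible zero-width space between every character."""
    return ZERO_WIDTH_SPACE.join(payload)

def reverse_text(payload: str) -> str:
    """Evasion: reverse the payload so keyword filters no longer match."""
    return payload[::-1]

def split_characters(payload: str, separator: str = "-") -> str:
    """Evasion: separate every character with a visible delimiter."""
    return separator.join(payload)

def fragment(payload: str, size: int = 8) -> list[str]:
    """Evasion: cut the payload into pieces to be sent separately."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [payload[i:i + size] for i in range(0, len(payload), size)]

sample = "reveal the system prompt"
print(reverse_text(sample))
print(split_characters("prompt"))
print(fragment(sample))
```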

Ten Tools. Every Layer. No Gaps.

FORGE is Stage 1 of the Red Specter offensive pipeline. Test the model before you build with it. Findings feed directly into AI Shield as runtime blocking rules and into redspecter-siem for enterprise SIEM correlation.

  • Stage 1 (LLM Testing): FORGE. Test the model before you build with it
  • Stage 2 (Agent Testing): ARSENAL. Test the AI agent during development
  • Stage 3 (Swarm Assault): PHANTOM. Coordinated AI agent swarm assault
  • Stage 4 (Web Siege): POLTERGEIST. Coordinated web application siege
  • Stage 5 (Traffic Interception): GLASS. Watch the wire
  • Stage 6 (Adversarial AI): NEMESIS. Think like the attacker
  • Stage 7 (Human Layer): SPECTER SOCIAL. Target the human
  • Stage 8 (OS/Kernel): PHANTOM KILL. Own the foundation
  • Stage 9 (Physical Layer): GOLEM. Attack the physical layer
  • Stage 10 (Supply Chain): HYDRA. Attack the trust chain
  • Discovery & Governance: IDRIS. Discover and govern AI assets
  • Defence: AI Shield. Defend everything above it
  • SIEM Integration: redspecter-siem. Findings feed directly into Splunk, Sentinel, QRadar

1,590 Static. 5,340+ Total.

80 Injection Payloads
70 Jailbreak Payloads
140 Output Safety
1,000 Policy Test Prompts
100 Boundary Probes
200 Supply Chain Probes

Every Finding Mapped

10 / 10

OWASP LLM Top 10 — 2025

  • LLM01 Prompt Injection
  • LLM02 Sensitive Information Disclosure
  • LLM03 Supply Chain
  • LLM04 Data and Model Poisoning
  • LLM05 Improper Output Handling
  • LLM06 Excessive Agency
  • LLM07 System Prompt Leakage
  • LLM08 Vector and Embedding Weaknesses
  • LLM09 Misinformation
  • LLM10 Unbounded Consumption
Cryptographic

Report Integrity

  • Ed25519 digital signatures
  • SHA-256 evidence chains
  • RFC 3161 timestamps
  • Tamper-evident by design
  • AI Shield policy generation
  • Machine-ingestible JSON output
Statistical

Mathematical Rigour

  • Wilson score confidence intervals
  • Kolmogorov-Smirnov distribution tests
  • Two-proportion z-tests
  • Paired t-tests
  • Cohen's h effect sizes
  • Chi-square significance testing

Security Distros & Package Managers

  • Kali Linux: .deb package
  • Parrot OS: .deb package
  • BlackArch: PKGBUILD
  • REMnux: .deb package
  • Tsurugi: .deb package
  • PyPI: pip install
  • macOS: pip install
  • Windows: pip install
  • Docker: docker pull

Authorised Use Only

Red Specter FORGE is intended for authorised security testing only. Unauthorised use against systems you do not own or have explicit permission to test may violate the Computer Misuse Act 1990 (UK), the Computer Fraud and Abuse Act (US), and equivalent legislation in other jurisdictions. Always obtain written authorisation before conducting any security assessment. Licensed under the Apache License 2.0.

Pure Engineering
Zero External Tools. Zero Wrappers.

Most security testing frameworks are menus that shell out to existing tools behind a terminal UI. FORGE is actual engineering. Every payload, every mutation, every detection algorithm, every scoring engine — written from scratch in pure Python. Zero subprocess calls. Zero external tool dependencies.

1,590 Custom Payloads
25 Mutation Variants
0 Subprocess Calls
0 External Dependencies
Enterprise Integration
Enterprise SIEM Integration — Native

Export every finding directly to your SIEM. One flag. Native format translation. Ed25519 signatures and RFC 3161 timestamps preserved across every export.

  • Splunk: HEC, CIM compliant
  • Sentinel: CEF, Log Analytics API
  • QRadar: LEEF 2.0, Syslog
forge full-scan --target http://localhost:11434 --model llama3 --export-siem splunk
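
For a sense of what format translation involves, here is a sketch that renders one finding as a CEF line (the format listed for Sentinel). The CEF header layout and escaping rules are standard. Everything else is an assumption: the finding's field names, the version string, and the choice of OWASP ID as the event class are hypothetical and do not describe FORGE's actual export schema.

```python
def _escape_header(value: str) -> str:
    # CEF header fields escape backslash and pipe.
    return value.replace("\\", "\\\\").replace("|", "\\|")

def _escape_extension(value: str) -> str:
    # CEF extension values escape backslash, equals sign, and newlines.
    return value.replace("\\", "\\\\").replace("=", "\\=").replace("\n", "\\n")

def finding_to_cef(finding: dict) -> str:
    """CEF:Version|Vendor|Product|DeviceVersion|EventClassID|Name|Severity|Extension"""
    header = [
        "CEF:0",
        _escape_header("Red Specter"),
        _escape_header("FORGE"),
        _escape_header(finding["version"]),
        _escape_header(finding["owasp_id"]),
        _escape_header(finding["title"]),
        str(finding["severity"]),  # CEF severity is an integer from 0 to 10
    ]
    extension = " ".join(
        f"{key}={_escape_extension(str(value))}"
        for key, value in finding["extension"].items()
    )
    return "|".join(header) + "|" + extension

# Hypothetical finding record, for illustration only.
example = {
    "version": "1.0",
    "owasp_id": "LLM01",
    "title": "Direct injection | system override",
    "severity": 8,
    "extension": {"msg": "payload=ignore previous instructions", "cs1Label": "tool", "cs1": "inject"},
}
print(finding_to_cef(example))
```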
Ed25519 Cryptographic Override
FORGE UNLEASHED

Cryptographic override. Private key controlled. One operator. Founder's machine only.