pip install red-specter-janus
Every AI vendor ships guardrails. Content filters. Refusal mechanisms. Safety classifiers. They publish safety cards. They claim the model is aligned. And none of it has been tested under adversarial conditions. You deployed a guardrail you never validated. JANUS validates it.
A firewall enforces stateful packet rules with cryptographic certainty. A guardrail is a probabilistic classifier trained on a finite dataset. It can be fooled by any input outside its training distribution. Base64-encoded payloads, homoglyph substitutions, zero-width character injections -- none of these exist in typical guardrail training data.
Vendor safety testing uses benign evaluation sets. They measure refusal rates against obvious harmful prompts. They never test persona switches, encoding evasion, many-shot context flooding, crescendo multi-turn escalation, or payload splitting across messages. The gap between vendor testing and real-world attacks is enormous.
Most deployments use a single guardrail layer -- Lakera Guard, Llama Guard, OpenAI Moderation, or Azure Content Safety. A single layer means a single bypass surface. One persona switch. One encoding trick. One context manipulation. If it gets past that one layer, there is nothing else. No defence in depth.
You inherited a deployment. You acquired a company. You are auditing a third-party system. You do not know which guardrail is deployed. Is it Lakera? NeMo Guardrails? Llama Guard? Anthropic constitutional AI? Each has different bypass profiles. You need to fingerprint before you can test. JANUS fingerprints.
Your regulator asks: "Prove your AI safety controls are effective." You have no evidence. No bypass testing results. No cryptographically signed reports. No SHA-256 evidence chains. No SIEM-integrated audit trail. JANUS generates all of this. Every finding is Ed25519-signed and hash-chained.
Individual bypass techniques have moderate success rates. But chained sequences -- ROT13 encode, then persona switch, then deceptive delight camouflage, then many-shot context flood -- achieve 80% effectiveness against all guardrail types. If you have not tested chains, you have not tested your guardrail.
Six subsystems. Each one targets a different aspect of guardrail security. Fingerprint identifies the guardrail. Bypass exploits known weaknesses. Encoder evades keyword filters. Fuzzer discovers zero-day bypasses. Chainer combines techniques into multi-step campaigns. Reporter generates the signed evidence.
Identify and classify the deployed guardrail system. 7 fingerprint probes (JFING-001 through JFING-007) test against Lakera Guard, NeMo Guardrails, Llama Guard, OpenAI Moderation, Azure Content Safety, Anthropic Constitutional AI, custom implementations, and unguarded targets. Response signature pattern matching with confidence scoring.
Comprehensive library of proven guardrail bypass techniques: persona switch, encoding evasion, context manipulation, multi-turn crescendo, payload splitting, semantic disguise, token manipulation, and meta-prompt attacks. Each technique has success indicators, failure indicators, and effectiveness ratings from 15% to 70%.
Transform payloads to evade content filters. Base64, ROT13, zero-width Unicode characters, homoglyph substitution (Cyrillic lookalikes), leetspeak, pig latin, string reversal, and character splitting. Each encoding includes decoder instructions for the target model. Automatic encoding detection for defensive assessment.
Automated discovery of novel guardrail bypasses through systematic prompt mutation. Case variation, whitespace injection, synonym substitution, delimiter insertion, prefix injection, suffix injection, Unicode substitution, word order shuffle, negation flip, and context wrapping. Configurable mutations per type. Generates JFUZZ-prefixed candidates.
Combine multiple bypass techniques into multi-step attack sequences. Encode-then-roleplay (55%). Crescendo with encoding finale (75%). Deceptive delight with homoglyphs (65%). Split-translate-recombine (60%). Many-shot with token manipulation (70%). Full stack evasion -- ROT13 + persona + deceptive delight + many-shot (80%).
Orchestrates all five attack subsystems into a unified guardrail assessment. Generates risk scores (0-100), A+ to F grading with 13 grade thresholds, severity breakdown (CRITICAL/HIGH/MEDIUM/LOW/INFO), subsystem breakdown, bypass rate calculation, and overall guardrail resistance rating from STRONG to CRITICAL.
Every technique is catalogued with an ID, category, severity rating, target guardrails, effectiveness rating, prompt template, and both success and failure indicators. This is not a wish list -- it is an attack library built from real-world guardrail research.
| ID | Technique | Category | Severity | Effectiveness | Targets |
|---|---|---|---|---|---|
| JBYP-001 | Developer Mode | persona_switch | HIGH | 15% | Lakera, NeMo, Custom |
| JBYP-002 | Roleplay Expert | persona_switch | HIGH | 25% | Lakera, OpenAI, Custom |
| JBYP-003 | Base64 Encoded Payload | encoding_evasion | CRITICAL | 40% | Lakera, Llama Guard, OpenAI |
| JBYP-004 | ROT13 Encoding | encoding_evasion | HIGH | 35% | Lakera, Llama Guard |
| JBYP-005 | Many-Shot Context Window | context_manipulation | CRITICAL | 65% | All guardrails |
| JBYP-006 | Crescendo Multi-Turn | multi_turn | CRITICAL | 70% | All guardrails |
| JBYP-007 | Payload Splitting | payload_splitting | HIGH | 45% | Lakera, OpenAI |
| JBYP-008 | Translation Evasion | semantic_disguise | HIGH | 55% | Lakera, OpenAI, Llama Guard |
| JBYP-009 | Token Boundary Manipulation | token_manipulation | HIGH | 30% | Lakera, Llama Guard |
| JBYP-010 | Deceptive Delight | semantic_disguise | CRITICAL | 60% | All guardrails (Unit 42) |
Content filters match keywords. Encoders destroy keywords. Every encoding type produces a transformed payload that looks nothing like the original but instructs the model to decode and follow. JANUS also detects encoded payloads defensively -- base64 blocks, zero-width clusters, and Cyrillic-Latin homoglyph mixing.
Known techniques get patched. The fuzzer discovers new ones. It takes a seed prompt that the guardrail blocks, then systematically mutates it until a variant slips through. Each mutation type is applied 3 times by default. 9 mutation types times 3 variants equals 27 fuzz candidates per seed prompt.
Individual techniques have moderate success rates. Chains compound them. Each chain applies techniques in a specific order to defeat layered guardrail stacks. The Full Stack Evasion chain achieves 80% expected effectiveness against all guardrail types.
| ID | Chain | Steps | Effectiveness | Targets |
|---|---|---|---|---|
| JCHN-001 | Encode Then Roleplay | Base64 encode + roleplay expert persona | 55% | Lakera, OpenAI |
| JCHN-002 | Crescendo + Encoding Finale | Multi-turn escalation + base64 final payload | 75% | All guardrails |
| JCHN-003 | Deceptive Delight + Homoglyphs | Camouflage technique + homoglyph-obfuscated keywords | 65% | Lakera, Llama Guard |
| JCHN-004 | Split, Translate, Recombine | Payload splitting + language translation + recombine | 60% | Lakera, OpenAI, Llama Guard |
| JCHN-005 | Many-Shot + Token Manipulation | Context flood with 100+ examples + zero-width tokens | 70% | All guardrails |
| JCHN-006 | Full Stack Evasion | ROT13 + persona switch + deceptive delight + many-shot | 80% | All guardrails |
JANUS is a command-line tool. Built on Typer and Rich. 7 commands: fingerprint, bypass, encoder, fuzzer, chainer, scan, and report. Every command accepts --target and --verbose. Full scan runs all six subsystems in sequence.
Fingerprint the guardrail system on a target endpoint:
Run the 10 bypass techniques against a target:
Run the full guardrail assessment -- all 6 subsystems, signed report:
Every JANUS assessment produces a cryptographically signed evidence chain. Each finding is appended to a SHA-256 hash chain where every entry references the previous hash. The final report is Ed25519-signed with the operator's private key. Tamper with any entry and the chain breaks. This is not a PDF -- it is forensic evidence.
JANUS does not operate alone. It is the guardrail bypass specialist in a three-tool AI safety attack chain. JANUS finds the bypass. SERPENT exploits the reasoning chain once past the guardrail. HARBINGER validates the end-to-end attack path. Together they prove whether your AI safety stack holds under adversarial conditions.
JANUS outputs structured JSON that maps directly to SIEM ingestion pipelines. Every finding includes severity, category, timestamp, payload, response, and remediation guidance. The evidence chain provides tamper-proof audit trails. Feed the output into Splunk, Sentinel, Elastic, QRadar, or any CEF/JSON-compatible SIEM.
Every finding is a structured JSON object with finding_id, test_name, category, severity, score, grade, payload_used, response, description, remediation, tool_name, subsystem, and timestamp.
CRITICAL, HIGH, MEDIUM, LOW, INFO severity levels map directly to SIEM alert priorities. Weight-based scoring: CRITICAL=10, HIGH=7, MEDIUM=4, LOW=2, INFO=0.5.
Findings categorised by attack type: GUARDRAIL_FINGERPRINT, GUARDRAIL_BYPASS, GUARDRAIL_ENCODING_EVASION, GUARDRAIL_FUZZ_BYPASS, GUARDRAIL_CHAIN_BYPASS, GUARDRAIL_MISSING.
SHA-256 evidence chain with Ed25519 signatures. Each entry references the previous hash. Immutable append-only log. Verify integrity with a single function call.
Every finding tags which subsystem produced it: fingerprinter, bypass, encoder, fuzzer, or chainer. Enables per-subsystem dashboards and alert routing.
Every finding, every evidence chain entry, and every report signature carries an ISO 8601 UTC timestamp. Precise temporal correlation across your security stack.
JANUS fingerprints and attacks 7 guardrail implementations. Each guardrail has unique response signatures, bypass profiles, and weakness patterns. The fingerprinter identifies the guardrail type with confidence scoring, then the bypass engine selects the most effective techniques for that specific implementation.
Standard mode detects and maps guardrails. UNLEASHED mode actively exploits them. Ed25519 cryptographic dual-gate. One private key. One operator. The key never leaves the founder's machine. Every UNLEASHED execution is signed and logged to the evidence chain.
Maps guardrail implementations. Identifies safety mechanism types and vendors. Runs fingerprint probes. Reports bypass surface area without attempting exploitation. Safe for initial assessment.
Plans full guardrail bypass campaigns. Shows exactly which techniques would work against the identified guardrail. Calculates expected effectiveness. Ed25519 key required. No actual bypass execution.
Cryptographic override. Private key controlled. Executes all bypass techniques, encoding evasion, fuzzer mutations, and multi-technique chains against live targets. One operator. Founder's machine only. Every action signed.
THIS TOOL IS FOR AUTHORISED SECURITY TESTING ONLY. EVERY EXECUTION IS SIGNED AND LOGGED.
JANUS is intended for authorised security testing only. Unauthorised use against systems you do not own or have explicit permission to test is illegal and unethical. Always obtain written authorisation before conducting any guardrail security assessments. Every execution is cryptographically signed with Ed25519 and logged to an immutable SHA-256 evidence chain. Red Specter Security Research Ltd accepts no liability for unauthorised use.
6 subsystems. 10 bypass techniques. 8 encoding types. 9 mutation types. 6 bypass chains. 73 tests. Ed25519-signed evidence. The tool that proves your AI safety mechanisms are not safe.