RustMizan

A compilable, contamination-aware benchmarking framework for Rust vulnerability analysis.

RustMizan ("mizan" is Arabic for "scale" or "balance") evaluates both traditional and LLM-based vulnerability analysis techniques in Rust. It pairs a curated dataset of real-world vulnerabilities with the infrastructure to evaluate them.

The dataset is a curated set of real-world memory-safety CVEs, each packaged as compilable variants at the crate, file, and function levels. Every variant ships with ground-truth annotations for four tasks: Crate Vulnerability Classification (CVC), CWE classification, function localization, and line localization.

[Figure: RustMizan overview]

Design principles

Fully compilable. Every variant compiles, so it can be analyzed by traditional tools (static analyzers, formal verification) and explored by agents that build and run the code. See the Dataset.
Multi-level context. Each vulnerability is available at crate, file, and function levels, so you can study how context granularity affects analysis.
Contamination-aware. A pluggable mutation framework applies semantic-preserving transformations that change syntax while preserving the vulnerability, so you can probe memorization versus reasoning.
Extensible. Adding a vulnerability or a mutation is a small, well-defined task. See Contributing.
Transparent. Every evaluation run is published as a complete agent trajectory (prompts, reasoning, tool calls, and scoring), browsable in an Inspect log viewer and linked from each result on the Leaderboard. Every run is also analyzed automatically with Docent for contamination signals.

How it compares

Most vulnerability benchmarks use non-compilable snippets, fix a single context level, focus on binary detection, and rarely handle contamination or target Rust. RustMizan addresses all of these gaps in one benchmark: compilable variants, the same vulnerability at multiple context levels, the full analysis pipeline (CVC, CWE classification, and function- and line-level localization), built-in contamination and robustness testing, and a focus on Rust.

Where to go next

If you want to...	Read
Install and run the full pipeline	Getting Started
Understand the dataset and its layout	Dataset
Use the `mizan` command-line tool	The mizan CLI
Learn the mutations and how they preserve ground truth	Mutations
See how models are scored	Evaluation
Read or submit results	Leaderboard
See how runs are analyzed for contamination	Trajectory analysis
Add a vulnerability, a mutation, or results	Contributing

Citation

@misc{elsayed2026rustmizancompilablecontaminationawarebenchmarking,
title={RustMizan: A Compilable, Contamination-Aware Benchmarking Framework for Rust Vulnerabilities},
author={Tarek Elsayed and Shiping Yang and Eunsong Koh and Sanika Goyal and Vincent Huang and Paul Ngo and Nathan Young and Mohammad Omidvar Tehrani and Alvyn Kang and Arnell Kang and Zeyu Chen and Angélica Moreira and Xuan Feng and Angel X. Chang and Nick Sumner and Steven Y. Ko},
year={2026},
eprint={2607.04729},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2607.04729},
}

Acknowledgements

This work was carried out at the Reliable Systems Lab at Simon Fraser University, led by Dr. Steven Ko.

Licensed under the Apache License, Version 2.0.

Getting Started

Setup and a complete run, from building the dataset to viewing evaluation results.

Requirements

A nightly Rust toolchain. mizan-mut depends on rust-analyzer crates that need nightly features.
Poetry for the Python CLI.
Docker, used by the evaluation harness to sandbox each sample.

Get the code

Clone the repository; everything below runs from its root.

git clone https://github.com/sfu-rsl/rust-mizan.git
cd rust-mizan

Build the dataset

All variants are members of one Cargo workspace. Build them with:

cargo +nightly build --workspace

Install the CLI

cd mizan-cli
poetry install

# Run mizan through poetry
poetry run mizan checkout --help

# Or add it to your PATH
export PATH="$(poetry env info --path)/bin:$PATH"

All mizan commands run from a directory that contains mizan.json (the dataset root).

End-to-end run

# 1. Select samples into an output directory
mizan checkout -v vuln-0001 -v vuln-0002 -l function -o output
cd output

# 2. Apply semantic-preserving mutations (optional)
mizan mutate -m remove-comments

# 3. Convert to a parquet dataset for evaluation
mizan evaluate prepare-dataset --tag comments_removed -o mizan_comments_removed.parquet

# 4. Run the evaluation (edit mizan-cli/run_eval.py with your dataset path and config)
python ../mizan-cli/run_eval.py

# 5. View results
inspect view

Each step is documented in detail:

The mizan CLI covers checkout, mutate, and evaluate prepare-dataset.
Mutations lists every mutation and explains ground-truth tracking.
Evaluation describes the task, the metrics, and how to configure a run.

Dataset

RustMizan focuses on Rust memory-safety vulnerabilities: use-after-free, buffer overflow, double free, and related issues. Every variant traces back to a publicly disclosed CVE. The benchmark is built on real vulnerabilities, not synthetic or injected ones.

Multi-level compilable variants

Each CVE is packaged as up to three standalone compilable crates of decreasing scope.

[Figure: Multi-level variants]

Crate level: the full original project.
File level: the vulnerable file plus the files and type definitions needed to compile, packaged as a standalone crate.
Function level: just the vulnerable function and its compile dependencies.

The same vulnerability appears at every level, so a difference in analysis accuracy between levels can be attributed to context rather than to the vulnerability itself being harder or easier. Two exceptions apply: single-file crates get only file- and function-level variants, and the function level is skipped when the file is essentially a single function.

Sourcing

The dataset draws from the RustSec Advisory Database, a community-maintained repository of security advisories for Rust crates. Each RustSec entry is mapped to its individual CVE.

Vulnerable version: the commit before the fix, or the version immediately before the patched release.
Patched version: the commit corresponding to the patched release from RustSec. When no official patch is recorded, only the vulnerable variant is included.

All variants are constructed manually and verified to compile. Annotations are derived from CVE descriptions, GitHub issue discussions, commit messages, and code review, and every annotation is peer reviewed by at least one additional researcher.

Directory layout

samples/
├── deps/                      # shared dependency crates
├── vuln-0001/
│   ├── README.md              # CVE description and vulnerability explanation
│   ├── sample-00001-crate/    # vulnerable, crate level
│   ├── sample-00001-file/     # vulnerable, file level
│   ├── sample-00001-function/ # vulnerable, function level
│   ├── sample-10001-crate/    # fixed, crate level
│   ├── sample-10001-file/     # fixed, file level
│   └── sample-10001-function/ # fixed, function level
└── ...

Naming convention

The convention is clear to developers but not immediately obvious to LLMs.

Vulnerable samples: sample-0XXXX-level (first digit 0).
Fixed samples: sample-1XXXX-level (first digit 1).
XXXX is the 4-digit vulnerability ID. level is function, file, or crate.

For example, sample-00042-crate is the vulnerable crate-level variant of vuln-0042, and sample-10042-crate is its fixed counterpart.
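The naming convention can be decoded mechanically. The following is a minimal sketch, assuming directory names follow exactly the pattern described above; the `parse_sample_name` helper and the `SampleName` type are hypothetical and are not part of the mizan CLI.

```python
import re
from typing import NamedTuple

# "sample-" + status digit (0 = vulnerable, 1 = fixed) + 4-digit vulnerability ID
# + "-" + level, as described in the naming convention above.
SAMPLE_NAME = re.compile(r"^sample-([01])(\d{4})-(function|file|crate)$")


class SampleName(NamedTuple):
    vuln_id: str         # e.g. "vuln-0042"
    is_vulnerable: bool  # True when the first digit is 0
    level: str           # "function", "file", or "crate"


def parse_sample_name(name: str) -> SampleName:
    """Split a sample directory name into its parts (hypothetical helper)."""
    match = SAMPLE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a RustMizan sample name: {name!r}")
    status, digits, level = match.groups()
    return SampleName(f"vuln-{digits}", status == "0", level)


print(parse_sample_name("sample-00042-crate"))
print(parse_sample_name("sample-10042-crate"))
```

The two calls correspond to the vulnerable and fixed crate-level variants of vuln-0042 from the example above.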

`mizan.json`

mizan.json at the dataset root holds the ground truth. Its top level has general_information (benchmark name, Rust version, dataset version) and a list of vulnerabilities.

Each vulnerability records its id, crate_name, year, source link, and a list of code_samples. Each code sample has:

Field	Type	Meaning
`path_to_crate`	string	Path relative to `samples/`, e.g. `vuln-0001/sample-00001-function`
`is_vulnerability`	bool	`true` for vulnerable samples, `false` for fixed
`cwe_type`	list of strings	CWE identifiers, e.g. `["CWE-416"]`
`vulnerable_functions`	map	File path to the list of vulnerable function signatures
`vulnerable_lines`	map	File path to the list of vulnerable line numbers (1-indexed)
`deps`	list of strings	Dependency crate names from `samples/deps/` (empty if none)

The level (granularity) is derived from path_to_crate.
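As an illustration of how these fields fit together, here is a small sketch that walks a mizan.json-like structure and derives each sample's level from path_to_crate. The per-sample field names come from the table above; the top-level `vulnerabilities` key, the values, and the contents of the fixed sample are assumptions made for the example, so check them against a real mizan.json before relying on them.

```python
import json

# Hand-written stand-in for mizan.json. Only the per-sample field names are
# taken from the documentation; everything else here is illustrative.
MIZAN_JSON = """
{
  "vulnerabilities": [
    {
      "id": "vuln-0001",
      "code_samples": [
        {
          "path_to_crate": "vuln-0001/sample-00001-function",
          "is_vulnerability": true,
          "cwe_type": ["CWE-416"],
          "vulnerable_functions": {"src/lib.rs": ["pub fn read_byte(buf: &[u8], idx: usize) -> u8"]},
          "vulnerable_lines": {"src/lib.rs": [4]},
          "deps": []
        },
        {
          "path_to_crate": "vuln-0001/sample-10001-function",
          "is_vulnerability": false,
          "cwe_type": [],
          "vulnerable_functions": {},
          "vulnerable_lines": {},
          "deps": []
        }
      ]
    }
  ]
}
"""


def level_of(path_to_crate: str) -> str:
    """Derive the granularity from the trailing '-<level>' of the crate path."""
    return path_to_crate.rsplit("-", 1)[1]


def iter_samples(data: dict):
    """Yield (vulnerability id, level, sample) for every code sample."""
    for vuln in data["vulnerabilities"]:
        for sample in vuln["code_samples"]:
            yield vuln["id"], level_of(sample["path_to_crate"]), sample


data = json.loads(MIZAN_JSON)
for vuln_id, level, sample in iter_samples(data):
    line_count = sum(len(lines) for lines in sample["vulnerable_lines"].values())
    print(vuln_id, level, sample["is_vulnerability"], line_count)
```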

Dependencies

Some samples depend on other crates from the original project's workspace. Those dependency crates live in samples/deps/, and each sample lists the ones it needs in its deps field. mizan checkout copies the referenced dependencies alongside the samples.

To add a vulnerability to the dataset, see Add a vulnerability.

The mizan CLI

mizan is the Python CLI for working with the dataset. It selects samples, applies mutations, and prepares datasets for evaluation.

All commands run from a directory containing mizan.json (the dataset root).

Installation

cd mizan-cli
poetry install
export PATH="$(poetry env info --path)/bin:$PATH"

Configuration

Optional configuration lives at ~/.config/mizan/config.json:

Option	Description	Default
`log_level`	`DEBUG`, `INFO`, `WARNING`, or `ERROR`	`INFO`
`log_file`	Path to a log file	none

`checkout`

Select and export samples from the dataset into an output directory.

mizan checkout [OPTIONS]

Option	Short	Description	Default
`--output`	`-o`	Output directory	`./output`
`--level`	`-l`	`function`, `file`, `crate`, or `all`	`all`
`--vuln-ids`	`-v`	Specific vulnerability IDs (repeatable)	none
`--year`	`-y`	Filter by year	none
`--cwe-types`	`-c`	Filter by CWE type (repeatable)	none
`--include-fixed`		Include fixed samples too	`false`

# All function-level samples
mizan checkout --level function

# Two specific vulnerabilities
mizan checkout -v vuln-0001 -v vuln-0002

# Combine filters
mizan checkout --level function --year 2019 --cwe-types CWE-416 -o ./my-samples

checkout copies the selected samples and any dependencies they need, writes a workspace Cargo.toml, and emits a filtered mizan.json into the output directory.

`mutate`

Apply semantic-preserving mutations to checked-out samples. Run it from inside the checkout output directory.

cd output
mizan mutate [OPTIONS]

Option	Short	Description	Default
`--mutations`	`-m`	Mutations to apply (repeatable)	`all`
`--seed`	`-s`	Random seed for reproducibility	`42`

# A single mutation
mizan mutate -m remove-comments

# Several, applied in order
mizan mutate -m format-compact -m benign-comments

The full list of mutations, their categories, and ordering caveats are on the Mutations page. mutate updates mizan.json with corrected line numbers and writes a mizan_mutations.json log.

`evaluate prepare-dataset`

Convert checked-out samples into a parquet file for evaluation. Run it from the output directory.

mizan evaluate prepare-dataset [OPTIONS]

Option	Short	Description	Default
`--output`	`-o`	Output parquet file	`dataset.parquet`
`--tag`	`-t`	Optional tag to identify the dataset	none

The parquet file bundles each sample's files and ground truth, plus dataset metadata (Rust version, tag, applied mutations). It is the only artifact the evaluation harness consumes. See Evaluation.

Running evaluations

Use the run_eval.py script for full control over models, limits, and the agent scaffold:

cd mizan-cli
# Edit run_eval.py: dataset path, models, message/time limits
python run_eval.py

The script exposes the full evaluation configuration, including the agent, which can be replaced with a custom implementation. See Evaluation.

Mutations

RustMizan pairs the dataset with an extensible mutation framework. Every mutation is semantically preserving: it changes code syntax without altering program behavior, so the underlying vulnerability is intact but its surface form differs.

Mutations serve two purposes. Contamination mutations break token-level memorization to test whether a model recalls a benchmark rather than reasoning about it. Robustness mutations inject misleading cues to test whether a model resists surface-level deception.

For the before/after form of each mutation, see Mutation specification. For the underlying Rust AST tool, see mizan-mut.

Contamination

Mutation	Description
`remove-comments`	Remove all Rust comments
`format-compact`	Apply compact `rustfmt` formatting
`format-expanded`	Apply expanded `rustfmt` formatting
`mizan-mut-for-to-while`	Convert `for` loops to `while` loops
`mizan-mut-while-to-loop`	Convert `while` loops to `loop` blocks with breaks
`mizan-mut-if-else-reorder`	Reorder if-else branches by negating conditions
`benign-comments`	Insert neutral comments around vulnerable lines
`benign-blocks`	Insert neutral code blocks around vulnerable lines
`benign-rename-fn`	Rename functions to neutral names (e.g. `fn_1_abc123`)
`benign-rename-var`	Rename variables to neutral names (e.g. `var_1_xyz789`)

Robustness

Mutation	Description
`malignant-comments`	Insert comments falsely suggesting the code is safe
`malignant-blocks`	Insert code blocks falsely suggesting safety
`malignant-rename-fn`	Rename functions to safety-implying names (e.g. `safe_fn_1`)
`malignant-rename-var`	Rename variables to safety-implying names (e.g. `secure_var_1`)

Rust-specific

Mutation	Description
`derive-reorder`	Reorder traits in `#[derive(...)]` attributes
`trait-bound-reorder`	Reorder trait bounds in `where` clauses
`use-reorder`	Reorder items in `use` statements
`arithmetic-identity`	Wrap integer literals with a multiplication identity (`N * 1`)
`explicit-where`	Add an explicit `where` clause to a signature
`explicit-where-to-type-params`	Move simple type bounds from a `where` clause into the type parameters
`rename-lifetime`	Rename lifetime parameters consistently
`impl-trait-to-generic`	Convert `impl Trait` bounds into generic parameters
`option-wrap`	Wrap expressions in a redundant `Some(...).unwrap()`
`maybeuninit-wrap`	Round-trip a value through `MaybeUninit<T>`
`manuallydrop-wrap`	Wrap an owned variable in `ManuallyDrop`, then unwrap it
`explicit-return`	Convert implicit returns to explicit `return` statements
`unreachable-panic`	Guard a function body with an unreachable `panic!()` arm
`repeated-shadowing`	Add redundant repeated shadows for `let` bindings

The pipeline

For each sample, the framework backs up the original, applies the mutation, then validates that the result still compiles and that the ground truth is preserved. If any step fails, it rolls back to the backup. Successful mutations are saved; the rest are logged.
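The back-up, mutate, validate, roll-back cycle can be sketched as follows. This is an illustration of the control flow only, not the framework's implementation: `mutate_with_rollback` and the toy mutations are hypothetical, and the validation step here is a simple content check standing in for the real compile and ground-truth checks.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def mutate_with_rollback(
    sample_dir: Path,
    mutation: Callable[[Path], None],
    validate: Callable[[Path], bool],
) -> bool:
    """Back up, mutate, validate; restore the backup if any step fails."""
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "backup"
        shutil.copytree(sample_dir, backup)
        try:
            mutation(sample_dir)
            if validate(sample_dir):
                return True
        except Exception:
            pass  # an error during mutation is handled like a failed validation
        # Roll back: discard the mutated tree and restore the original.
        shutil.rmtree(sample_dir)
        shutil.copytree(backup, sample_dir)
        return False


def strip_line_comments(root: Path) -> None:
    """Toy mutation: drop lines that consist only of a // comment."""
    for path in root.rglob("*.rs"):
        kept = [line for line in path.read_text().splitlines()
                if not line.lstrip().startswith("//")]
        path.write_text("\n".join(kept) + "\n")


def truncate_sources(root: Path) -> None:
    """Toy broken mutation: destroys the code, so validation must fail."""
    for path in root.rglob("*.rs"):
        path.write_text("")


def keeps_vulnerable_line(root: Path) -> bool:
    """Stand-in validator: the vulnerable call must still be present."""
    return any("get_unchecked" in path.read_text() for path in root.rglob("*.rs"))


def demo():
    with tempfile.TemporaryDirectory() as work:
        sample = Path(work) / "sample"
        (sample / "src").mkdir(parents=True)
        (sample / "src" / "lib.rs").write_text(
            "// SAFETY: caller must ensure idx < buf.len()\n"
            "pub fn read_byte(buf: &[u8], idx: usize) -> u8 {\n"
            "    unsafe { *buf.get_unchecked(idx) }\n"
            "}\n"
        )
        saved = mutate_with_rollback(sample, strip_line_comments, keeps_vulnerable_line)
        rolled_back = not mutate_with_rollback(sample, truncate_sources, keeps_vulnerable_line)
        intact = keeps_vulnerable_line(sample)
        return saved, rolled_back, intact


print(demo())
```

The first mutation passes validation and is kept; the second fails validation, and the sample is restored to its state before that mutation.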

[Figure: Mutation pipeline]

Ground-truth tracking

Mutations change the ground truth: renaming a function invalidates annotations that reference it by name, and inserting code shifts line numbers. The framework keeps annotations accurate with three mechanisms.

Marker tracking. For most mutations, a unique comment marker (e.g. // MIZAN_MARKER_vuln0001) is inserted before each vulnerable line. After the mutation, the marker's new position gives the corrected line number, and the marker is removed.
Content-based tracking. AST-based mizan-mut-* mutations remove all comments (including markers) when they parse and regenerate the code, so vulnerable lines are tracked by their content instead. If a line appears multiple times or cannot be found after mutation, that file is excluded and the mutation is re-applied. Such cases are recorded as partial_mutations.
Rename tracking. Rename mutations legitimately change line content, so the validator allows content differences for them.
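The first two mechanisms can be sketched in a few lines. This is a simplified illustration under stated assumptions: files are treated as lists of lines, the helper names are hypothetical, and the real framework's matching rules may differ (for example in how whitespace is normalized).

```python
MARKER = "// MIZAN_MARKER_{}"


def insert_markers(lines, vulnerable_lines, tag):
    """Insert a marker comment before each vulnerable line (1-indexed)."""
    out = []
    for number, line in enumerate(lines, start=1):
        if number in vulnerable_lines:
            out.append(MARKER.format(tag))
        out.append(line)
    return out


def recover_lines(lines, tag):
    """Strip markers; return (clean lines, corrected 1-indexed line numbers)."""
    marker = MARKER.format(tag)
    clean, found = [], []
    for line in lines:
        if line.strip() == marker:
            found.append(len(clean) + 1)  # the marker sits right before its line
        else:
            clean.append(line)
    return clean, found


def track_by_content(original, mutated, vulnerable_lines):
    """Relocate vulnerable lines by content; None if missing or ambiguous."""
    stripped = [line.strip() for line in mutated]
    relocated = []
    for number in sorted(vulnerable_lines):
        target = original[number - 1].strip()
        if stripped.count(target) != 1:
            return None  # the file would be excluded in this case
        relocated.append(stripped.index(target) + 1)
    return relocated


# Marker tracking: a toy mutation prepends a comment, shifting everything down.
source = [
    "pub fn read_byte(buf: &[u8], idx: usize) -> u8 {",
    "    unsafe { *buf.get_unchecked(idx) }",
    "}",
]
marked = insert_markers(source, {2}, "vuln0001")
mutated = ["// TODO: refactor"] + marked
clean, new_lines = recover_lines(mutated, "vuln0001")
print(new_lines)

# Content tracking: the while loop was regenerated as a loop, with no comments left.
before = ["while i < n {", "    sum += i;", "    i += 1;", "}"]
after = ["loop {", "    if !(i < n) { break; }", "    sum += i;", "    i += 1;", "}"]
print(track_by_content(before, after, {2}))
```

Markers give exact positions but only survive mutations that keep comments, which is why the AST-based mutations fall back to matching on content.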

[Figure: Ground-truth tracking]

Output files

Updated mizan.json with corrected vulnerable line numbers.
mizan_mutations.json logging mutations_applied, skipped (mutations or samples that were skipped), and partial_mutations.

A "successful" mutation means the process completed without error, not that code necessarily changed. Applying for-to-while to code with no for loops succeeds without making changes.

Ordering caveats

Mutations are applied in the order you list them. Be deliberate:

Don't run for-to-while then while-to-loop unless you intend to turn for loops into loop blocks.
Don't run benign-comments then remove-comments; the inserted comments will be stripped.

To add a new mutation, see Add a mutation.

Mutation specification

The before/after form of each mutation. All mutations are semantically preserving.

The mutations below are exactly those available through mizan mutate (see the registry on the Mutations overview).

Contamination

`remove-comments`

Removes all Rust comments (line, block, and doc), stripping natural-language hints a model may have memorized.

// SAFETY: caller must ensure idx < buf.len()
pub fn read_byte(buf: &[u8], idx: usize) -> u8 {
    /* fast path, no bounds check */
    unsafe { *buf.get_unchecked(idx) }
}

becomes

pub fn read_byte(buf: &[u8], idx: usize) -> u8 {
    unsafe { *buf.get_unchecked(idx) }
}

`format-compact`

Reformats the crate with a compact rustfmt profile (fewer blank lines, tighter braces).

pub fn add(
    a: i32,
    b: i32,
) -> i32 {
    a + b
}

becomes

pub fn add(a: i32, b: i32) -> i32 { a + b }

`format-expanded`

The inverse: an expanded rustfmt profile that adds vertical whitespace and splits signatures across lines.

`mizan-mut-for-to-while`

Rewrites for loops into while let loops driven by an explicit iterator.

for item in collection.iter() {
    process(item);
}

becomes

let mut __iter = collection.iter();
while let Some(item) = __iter.next() {
    process(item);
}

`mizan-mut-while-to-loop`

Rewrites while cond { body } into a loop with an early break.

while i < n {
    sum += i;
    i += 1;
}

becomes

loop {
    if !(i < n) { break; }
    sum += i;
    i += 1;
}

`mizan-mut-if-else-reorder`

Swaps the then and else branches and negates the condition.

if x > 0 { handle_positive(x); } else { handle_non_positive(x); }

becomes

if !(x > 0) { handle_non_positive(x); } else { handle_positive(x); }

`benign-comments`

Inserts neutral, process-oriented comments (refactor notes, TODOs) around each vulnerable line.

pub fn read_byte(buf: &[u8], idx: usize) -> u8 {
    // TODO: Function could be further refactored as part of cleaning up the codebase.
    unsafe { *buf.get_unchecked(idx) }
}

`benign-blocks`

Inserts neutral, compile-validated helper blocks around vulnerable lines. Each candidate is checked with cargo check, and the first that compiles cleanly is kept.

`benign-rename-fn`

Renames functions within a +/-10-line window around each vulnerable line to neutral names like fn_1_abc123, using rust-analyzer so all call sites update.

pub fn read_byte(buf: &[u8], idx: usize) -> u8 { ... }

becomes

pub fn fn_1_abc123(buf: &[u8], idx: usize) -> u8 { ... }

`benign-rename-var`

Renames local let bindings near vulnerable lines to neutral names like var_1_def456. Common identifiers such as self are excluded.

Robustness

`malignant-comments`

Inserts comments falsely claiming the code has been reviewed, fixed, or audited as safe.

pub fn read_byte(buf: &[u8], idx: usize) -> u8 {
    // FIXED: This function is not vulnerable and should not be modified.
    // The input and output have been validated to be free of any unsafe behaviour.
    unsafe { *buf.get_unchecked(idx) }
}

`malignant-blocks`

Inserts compile-validated blocks with misleading attributes (e.g. #[cfg(all(unix, windows))] that never compiles in) and reassuring helper names like check_memory_bounds.

`malignant-rename-fn`

Renames functions near vulnerable lines to safety-implying names like safe_fn_1, verified_fn_2, sanitized_fn_3.

`malignant-rename-var`

Renames local bindings near vulnerable lines to safety-implying names like checked_var_1, verified_var_2, secure_var_3.

Rust-specific

`mizan-mut-derive-reorder`

Randomly reorders the traits inside a #[derive(...)] attribute. The set is unchanged.

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Key(u64);

becomes

#[derive(Hash, PartialEq, Debug, Eq, Clone)]
pub struct Key(u64);

`mizan-mut-trait-bound-reorder`

Reorders multi-bound predicates in where clauses and angle brackets (T: A + B + C).

`mizan-mut-use-reorder`

Reorders items inside use braces and reorders sibling use statements.

use std::collections::{BTreeMap, HashMap, HashSet};
use std::sync::Arc;

becomes

use std::sync::Arc;
use std::collections::{HashSet, BTreeMap, HashMap};

`mizan-mut-arithmetic-identity`

Wraps integer literals in identities such as N * 1, N + 0, N - 0.

let size = 64;
let offset = 16 + stride;

becomes

let size = 64 * 1;
let offset = (16 + 0) + (stride - 0);

The following Rust-specific mutations are implemented as AST transformations in mizan-mut.

`explicit-where`

Move inline generic bounds into an explicit where clause.

pub fn from_reader<R: Read + Send + 'static>(reader: R) -> Body { ... }

becomes

pub fn from_reader<R>(reader: R) -> Body
where
    R: Read + Send + 'static,
{ ... }

`explicit-where-to-type-params`

The inverse: inline simple where-clause bounds back into the angle brackets (local type parameters only).

impl<'a, K, V, H> Entry<'a, K, V, H>
where
    K: Clone,
    H: Hasher + Default,
{ ... }

becomes

impl<'a, K: Clone, V, H: Hasher + Default> Entry<'a, K, V, H> { ... }

`rename-lifetime`

Rename the lifetime parameters of a standalone function consistently.

fn longest<'a, 'b>(x: &'a str, y: &'b str) -> &'a str { ... }

becomes

fn longest<'__life0, '__life1>(x: &'__life0 str, y: &'__life1 str) -> &'__life0 str { ... }

`impl-trait-to-generic`

Convert impl Trait parameters into explicit generic parameters.

pub fn fun(d: impl Debug + 'static) { ... }

becomes

pub fn fun<T: Debug + 'static>(d: T) { ... }

`option-wrap`

Wrap expressions in a redundant Some(...).unwrap().

let x = a + b;

becomes

let x = Some(a + b).unwrap();

`maybeuninit-wrap`

Round-trip a value through MaybeUninit<T> and assume_init().

let x = a + b;

becomes

let x = unsafe {
    let mut tmp = MaybeUninit::new(a + b);
    tmp.assume_init()
};

`manuallydrop-wrap`

Shadow an owned binding through ManuallyDrop, then extract it back out.

let x = a + b;

becomes

let x = a + b;
let x = std::mem::ManuallyDrop::new(x);
let x = std::mem::ManuallyDrop::into_inner(x);

`explicit-return`

Convert implicit returns to explicit return statements.

fn bar() -> i32 { 1234 }

becomes

fn bar() -> i32 { return 1234; }

`unreachable-panic`

Guard a function body with a match that has an unreachable panic!() arm.

fn foo() {
    println!("Hello");
}

becomes

const __MIZAN_PANIC_FLAG: bool = true; // value is randomized

fn foo() {
    match __MIZAN_PANIC_FLAG {
        true => { println!("Hello"); }
        false => panic!(),
    }
}

`repeated-shadowing`

Add redundant repeated shadows for let bindings within a scope.

let x = 10;

becomes

let x = 10;
let x = x;
let x = x;

mizan-mut

mizan-mut is the Rust tool behind the AST-based mutations and the rename mutations. It provides:

Semantic-preserving AST transformations of Rust source.
Symbol renaming via rust-analyzer.

The mizan mutate CLI calls this binary for any mizan-mut-* mutation and for all rename mutations, so it must be installed and on your PATH.

Installation

mizan-mut depends on rust-analyzer crates that require nightly.

cargo install --path mizan-mut
# Or build directly
cargo build --release --bin mizan-mut

`mutate` subcommand

Apply AST mutations to a crate in place.

mizan-mut mutate -r <ROOT_DIR> -m <MUTATION>... [-i <FILE_TO_IGNORE>...]

Argument	Short	Description
`--root`	`-r`	Root directory of the crate to mutate
`--mutations`	`-m`	Mutations to apply (repeatable)
`--ignore`	`-i`	File paths to skip (repeatable)

mizan-mut mutate -r ./my-crate -m for-to-while
mizan-mut mutate -r ./my-crate -m all
mizan-mut mutate --help          # list all mutations

Available mutations

Mutation	Description
`all`	Apply all available mutations
`for-to-while`	Convert `for` loops to `while` loops
`while-to-loop`	Convert `while` loops to `loop` blocks with breaks
`if-else-reorder`	Reorder if-else branches by negating conditions
`derive-reorder`	Reorder traits in `#[derive(...)]` attributes
`trait-bound-reorder`	Reorder trait bounds in `where` clauses
`use-reorder`	Reorder items in `use` statements
`arithmetic-identity`	Wrap integer literals with a multiplication identity (`N * 1`)
`explicit-where`	Add an explicit `where` clause to a signature
`explicit-where-to-type-params`	Move simple type bounds from a `where` clause into the type params
`rename-lifetime`	Rename lifetime parameters consistently
`impl-trait-to-generic`	Convert `impl Trait` bounds into generic parameters
`option-wrap`	Wrap expressions in a redundant `Some(...).unwrap()`
`maybeuninit-wrap`	Round-trip a value through `MaybeUninit<T>`
`manuallydrop-wrap`	Wrap an owned variable in `ManuallyDrop`, then unwrap it
`explicit-return`	Convert implicit returns to explicit `return` statements
`unreachable-panic`	Guard a function body with an unreachable `panic!()` arm
`repeated-shadowing`	Add redundant repeated shadows for `let` bindings

See Mutation specification for the before/after form of each.

Limitations

for-to-while: handles simple patterns only.
while-to-loop: does not transform while let.
if-else-reorder: only transforms if statements that have an else.
manuallydrop-wrap: unwraps immediately after the initial let.
explicit-return: applies at the function level only.
repeated-shadowing: adds shadows directly after the initial binding only.
explicit-where: incompatible with explicit-where-to-type-params.
rename-lifetime: applies to standalone functions only.

`rename` subcommand

Rename any symbol and update all references across the crate.

mizan-mut rename -c <CRATE_ROOT> -f <FILE> -o <OFFSET> -n <NEW_NAME>

Argument	Short	Description
`--crate-root`	`-c`	Crate root (directory containing `Cargo.toml`)
`--file`	`-f`	File containing the symbol, relative to the crate root
`--offset`	`-o`	Byte offset of the symbol (zero-based)
`--new-name`	`-n`	New name

mizan-mut rename -c examples/test_project -f src/main.rs -o 70 -n handle_data

To find a byte offset, use grep -b -o "name" path/to/file.rs (the result is zero-based).
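If grep is not available, the same offset can be computed in a few lines of Python. This is a sketch with a hypothetical `byte_offsets` helper; unlike the grep command above, it matches whole identifiers only. It works on bytes rather than characters because --offset is a byte offset, and the two differ as soon as the file contains non-ASCII text.

```python
import re


def byte_offsets(source: bytes, name: str) -> list:
    """Zero-based byte offsets of every whole-word occurrence of `name`."""
    pattern = re.compile(rb"\b" + re.escape(name.encode("utf-8")) + rb"\b")
    return [match.start() for match in pattern.finditer(source)]


# The comment contains a two-byte character, so byte and character offsets diverge.
TEXT = (
    "fn main() {\n"
    "    // naïve demo\n"
    "    let data = load();\n"
    "    process(data);\n"
    "}\n"
)
SOURCE = TEXT.encode("utf-8")

offsets = byte_offsets(SOURCE, "data")
print(offsets)
print(TEXT.index("data"))  # character index, one less than the first byte offset
```

For a real file, read it with `Path(...).read_bytes()` and pass the chosen offset to --offset.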

Testing mutations

A Docker-based suite checks that mutations are semantic-preserving by applying them to real crates (itertools, num-traits, num-bigint, byteorder) and verifying their test suites still pass.

docker build -f docker/Dockerfile.mutations-test -t mizan-mut-test .
docker run mizan-mut-test

If you add a mutation, add it to the MUTATIONS array in docker/Dockerfile.mutations-test and run the suite. See Add a mutation.

Notes

The mutate subcommand modifies files in place.
Mutated code is reformatted with rustfmt afterward.
Comments are lost during mutation, since the code is parsed to an AST and regenerated. This is why AST mutations use content-based ground-truth tracking (see the Mutations overview).

Evaluation

RustMizan evaluates models on the full vulnerability analysis pipeline, not just Crate Vulnerability Classification (CVC), the binary judgment of whether the code is vulnerable.

The task

Each evaluation places an agent in a sandboxed Docker container holding one compilable variant and a shell. The agent can explore the codebase, compile it, and read any file before producing its analysis. cargo and rustc are available; other tools (clippy, miri, static analyzers) are not.

The agent writes a results.json file covering four tasks:

{
  "explanation": "reasoning and recall",
  "is_vulnerable": true,
  "cwe_type": ["CWE-416"],
  "vulnerable_functions": { "src/lib.rs": ["pub fn read_byte(buf: &[u8], idx: usize) -> u8"] },
  "vulnerable_lines": { "src/lib.rs": [4] }
}

All agent steps and reasoning traces are logged, which enables trajectory analysis (for example, spotting a model that recalls a CVE identifier from memory). The complete trajectories are published to the rust-mizan-logs Inspect log viewer and linked from each result on the Leaderboard.
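Before scoring, a results.json answer has to be parsed and checked. The sketch below shows one way to do that; the `parse_results` helper is hypothetical, and requiring all five keys with these types is an assumption based on the example above, not a description of the harness's actual validation.

```python
import json

# Expected top-level keys and their JSON types, taken from the example above.
REQUIRED = {
    "explanation": str,
    "is_vulnerable": bool,
    "cwe_type": list,
    "vulnerable_functions": dict,
    "vulnerable_lines": dict,
}


def parse_results(text: str):
    """Return the parsed answer, or None if it is not JSON of the expected shape."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, kind in REQUIRED.items():
        if not isinstance(data.get(key), kind):
            return None
    return data


good = """{
  "explanation": "idx is never checked against buf.len()",
  "is_vulnerable": true,
  "cwe_type": ["CWE-125"],
  "vulnerable_functions": {"src/lib.rs": ["pub fn read_byte(buf: &[u8], idx: usize) -> u8"]},
  "vulnerable_lines": {"src/lib.rs": [4]}
}"""

print(parse_results(good) is not None)
print(parse_results("The code looks vulnerable.") is None)
```

An answer that fails this kind of check is what the metrics below count as invalid JSON.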

Harness

The harness is built on Inspect-AI. Each sample runs in its own Docker sandbox. The default configuration uses a ReAct (reasoning + acting) scaffold with bash access, a message limit, and a per-task timeout. The setup reflects interactive analysis: the agent decides what to examine and in what order, rather than receiving a pre-cut snippet.

Metrics

Crate Vulnerability Classification (CVC) is a binary metric. CWE classification and the two localization tasks are set-based: predicted elements are compared against the ground-truth set, and true/false positives and negatives are counted per sample. The F1, precision, and recall figures are micro-averaged: TP, FP, and FN are summed across all variants first, then combined into one score. An invalid JSON response contributes zeros.

Metric	Definition
CVC Accuracy	Fraction of samples where the binary `is_vulnerable` prediction matches ground truth. Over all samples.
CWE F1 / Precision / Recall	Micro-averaged set overlap between predicted and ground-truth CWE types.
Function F1 / Precision / Recall	Micro-averaged set overlap between predicted and ground-truth vulnerable functions.
Line F1 / Precision / Recall	Micro-averaged set overlap between predicted and ground-truth vulnerable lines.
Success@1-Function	Fraction of vulnerable samples where at least one correct function was identified. Over vulnerable samples only.
Success@1-Line	Fraction of vulnerable samples where at least one correct line was identified. Over vulnerable samples only.
Invalid JSON Rate	Fraction of samples where the model returned invalid JSON.

These are the same metrics shown on the Leaderboard.
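To make the micro-averaging concrete, here is a small sketch of the set-based metrics. The function names are hypothetical, and treating an invalid-JSON answer as an empty prediction is one reading of "contributes zeros"; the harness may handle it differently.

```python
def micro_prf(pairs):
    """Micro-averaged precision, recall, and F1 over (predicted, truth) set pairs.

    A prediction of None stands for an invalid-JSON answer.
    """
    tp = fp = fn = 0
    for predicted, truth in pairs:
        predicted = predicted or set()
        tp += len(predicted & truth)
        fp += len(predicted - truth)
        fn += len(truth - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def success_at_1(pairs):
    """Fraction of vulnerable samples with at least one correct element."""
    vulnerable = [(predicted or set(), truth) for predicted, truth in pairs if truth]
    if not vulnerable:
        return 0.0
    hits = sum(1 for predicted, truth in vulnerable if predicted & truth)
    return hits / len(vulnerable)


# Line-level example: each element is a (file, line) pair.
pairs = [
    ({("src/lib.rs", 4), ("src/lib.rs", 9)}, {("src/lib.rs", 4)}),   # 1 TP, 1 FP
    ({("src/a.rs", 10)}, {("src/a.rs", 12), ("src/a.rs", 13)}),      # 1 FP, 2 FN
    (None, {("src/b.rs", 7)}),                                       # invalid JSON: 1 FN
    (set(), set()),                                                  # fixed sample
]

precision, recall, f1 = micro_prf(pairs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"success@1={success_at_1(pairs):.3f}")
```

Here TP = 1, FP = 2, and FN = 3 across all samples, so precision is 1/3, recall is 1/4, and F1 is 2/7. Success@1 is 1/3 because only one of the three vulnerable samples has a correct line.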

Running an evaluation

The evaluation consumes a parquet file produced by mizan evaluate prepare-dataset. Configure and launch a run with run_eval.py:

cd mizan-cli
# Edit run_eval.py: DATASET_PATH, MODELS, MESSAGE_LIMIT, TIME_LIMIT
python run_eval.py

# Inspect the results
inspect view

run_eval.py exposes the full configuration as a script, including the agent scaffold, which can be replaced with a custom implementation to evaluate different prompting strategies. See the Inspect-AI documentation for supported models and options.

To publish your results to the public leaderboard, see Submit leaderboard results.

Leaderboard

The RustMizan Leaderboard reports how models perform across the dataset variants. It is a Gradio app hosted on Hugging Face Spaces.

[Figure: The RustMizan leaderboard]

Tabs

Leaderboard. Aggregate metrics per model and dataset variant. You can filter by model, by dataset variant, by code granularity (function / file / crate), and by vulnerability type, choose which metric columns to show, and download the table as CSV. There is also a toggle for whether to count invalid-JSON responses as wrong or exclude them.
Sample-wise Comparison. Per-CVE correctness across models. Each cell shows three markers, one each for the crate, file, and function variants. Each marker indicates one of four states: correct, wrong, not present at that level, or invalid JSON. Hover over any result to see its Docent contamination verdict (green = none, red = contamination evidence) and links to the run's full trajectory (prompt, reasoning, tool calls, and scoring) in the Inspect log viewer and its Docent analysis.

Dataset variants

The leaderboard groups results by variant. Each variant is a fixed set of mutations:

Variant	What it tests
Vanilla	The original, unmutated code (baseline)
Benign	Contamination: surface rewrites that break memorization
Malignant	Robustness: adversarial cues that falsely suggest safety
Rust-Specific	Idiomatic structural rewrites specific to Rust

Trajectories

Every run's complete agent trajectory is published to the rust-mizan-logs Inspect log viewer. From the Sample-wise Comparison tab, each result's hover card links directly to its trajectory, so any score can be traced back to the model's prompts, reasoning, tool calls, and the scoring that produced it. Each run is also analyzed automatically (see Trajectory analysis).

Sample-wise comparison across models

A full agent trajectory in the log viewer

Submitting results

To add your own results, see Submit leaderboard results.

Trajectory analysis

Every evaluation run is published as a complete agent trajectory (see the Leaderboard and the Inspect log viewer). Beyond browsing them by hand, we analyze every run automatically with Docent.

Docent (Transluce) is a tool for understanding agent transcripts at scale: you can run rubrics (LLM-judge criteria) across many runs, search and cluster them, and chat with an individual transcript. Every RustMizan run lives in a public Docent collection.

Browse the analysis

Contamination check

We run a rubric over every run that judges whether the model genuinely reasons about the provided code, or shows contamination-like behavior such as naming a known CVE or advisory for the crate.

The leaderboard's Sample-wise Comparison tab surfaces each run's verdict on hover.

Hovering a result shows its contamination verdict and links

The rubric is judged by a smaller model (Gemini 3 Flash, low reasoning effort) to keep analysis costs low.
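
One signal the rubric looks for, naming a known CVE or advisory, can be approximated with a plain pattern match. The sketch below is only a crude first-pass filter under that assumption; it is not the Docent rubric, which relies on an LLM judge reading the whole transcript. The function name and the example IDs are placeholders.

```python
import re
from typing import List

# Public identifier formats: CVE-YYYY-NNNN (four or more digits) and RUSTSEC-YYYY-NNNN.
ADVISORY_ID = re.compile(r"\b(?:CVE-\d{4}-\d{4,}|RUSTSEC-\d{4}-\d{4})\b", re.IGNORECASE)


def advisory_mentions(transcript: str) -> List[str]:
    """Return the distinct advisory IDs mentioned in a transcript, upper-cased and sorted."""
    return sorted({m.group(0).upper() for m in ADVISORY_ID.finditer(transcript)})


# Placeholder IDs, not real advisories.
text = "This looks like CVE-2099-12345, also tracked as rustsec-2099-0001."
print(advisory_mentions(text))
print(advisory_mentions("The slice index is not bounds-checked before the unsafe read."))
```

A match is only a hint: a model can recall an advisory without naming it, and can name one after reasoning its way to the bug.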

Chat with a trajectory

Opening a run in Docent shows the rubric's full explanation and lets you chat with that trajectory. You can ask questions about what the model did, summarize it, or have it explain mistakes and flag unusual behavior.

A run's Docent analysis and transcript chat

To refresh the analysis after adding runs, see Update the analysis.

Contributing

There are several ways to contribute to RustMizan. Each has its own guide.

Contribution	Guide
Add a new vulnerability to the dataset	Add a vulnerability
Add a new mutation	Add a mutation
Submit evaluation results to the leaderboard	Submit leaderboard results
Update the contamination analysis	Update the analysis

To report a problem (a mislabeled sample, a compile failure, a bug) or ask a question, please open an issue.

All contributions are licensed under the Apache License, Version 2.0.

Add a vulnerability

Adding a vulnerability means providing the compilable variants and the metadata; the existing tooling handles the rest. See the Dataset page for the layout, the naming convention, and the mizan.json schema.

Steps

Identify the vulnerability. Use the CVE identifier from MITRE, not the RustSec-assigned ID.
Create a directory. Make a new vuln-XXXX folder (increment the latest ID) under samples/. It will hold all variants for this CVE.
Find the vulnerable and fixed commits.
- Vulnerable commit: the commit before the fix if clear from the GitHub issue, otherwise the version before the patched release listed by RustSec.
- Fixed commit: the commit corresponding to the patched release. If no patched version is listed, skip the fixed samples.
Generate the vulnerable samples. From the vulnerable commit, create:
- sample-0XXXX-crate: the full crate
- sample-0XXXX-file: a minimal crate with the vulnerable file
- sample-0XXXX-function: a minimal crate with the vulnerable function
Set each sample's Cargo.toml package name to match (e.g. name = "sample-00043-crate"). Make sure every crate compiles, applying minimal changes if needed (e.g. fixing outdated syntax).
Generate the fixed samples (if a fix exists). From the fixed commit, create sample-1XXXX-crate, sample-1XXXX-file, sample-1XXXX-function. The leading 1 marks them as fixed.
Write the sample README.md. Include the CVE ID, crate name, before/after commit links, the list of variants, and an explanation of the vulnerability with a code snippet pointing out the vulnerable line and a justification (referencing the CVE, RustSec, or the GitHub issue).
Handle dependencies (if needed). If samples depend on other crates from the project's workspace, place those crates in samples/deps/ and list them in the deps field of each sample in mizan.json.
Update mizan.json. Add an entry with the sample paths, is_vulnerability flag, CWE type(s), the file-to-vulnerable-functions map, the file-to-vulnerable-lines map, and the deps array (empty if none). When unsure, prefer over-reporting: include both the vulnerable API and the functions that call it.

Notes

All crates must compile. If needed, make minimal edits without changing behavior.
Follow the structure of existing sample README.md files.
Vulnerable line and function annotations should capture all relevant surface area, not just the line that panics.
Only use official, peer-reviewed fixes. If no fix exists, include only the vulnerable samples.

Naming convention

The sample naming convention (sample-0XXXX for vulnerable, sample-1XXXX for fixed) is documented on the Dataset page.
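
As an illustration of the convention, the sketch below parses a sample name into its parts and lists the names the steps above would produce for one vulnerability. It assumes XXXX is exactly four digits (inferred from the sample-00043-crate example); the function names and dictionary keys are hypothetical, and the Dataset page remains the authoritative description.

```python
import re
from typing import Dict, List

# Leading 0 = vulnerable, leading 1 = fixed; four-digit ID is an assumption.
SAMPLE_NAME = re.compile(r"sample-(?P<kind>[01])(?P<vuln_id>\d{4})-(?P<level>crate|file|function)")
LEVELS = ("crate", "file", "function")


def parse_sample_name(name: str) -> Dict[str, object]:
    """Split a sample directory name into ID, vulnerable/fixed flag, and level."""
    match = SAMPLE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid sample name: {name!r}")
    return {
        "vuln_id": match["vuln_id"],
        "is_vulnerability": match["kind"] == "0",
        "level": match["level"],
    }


def expected_sample_names(vuln_id: str, has_fix: bool) -> List[str]:
    """Names for the vulnerable samples, plus the fixed ones when a fix exists."""
    kinds = ["0", "1"] if has_fix else ["0"]
    return [f"sample-{kind}{vuln_id}-{level}" for kind in kinds for level in LEVELS]


print(parse_sample_name("sample-00043-crate"))
print(expected_sample_names("0043", has_fix=False))
```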

Add a mutation

A mutation must be semantics-preserving: it changes syntax without changing behavior. The framework handles backup, compilation checks, ground-truth tracking, and rollback, so a new mutation only has to perform the transformation. See the Mutations overview for how the pipeline works.

There are two ways to add one, depending on what the transformation needs.

Option 1: a Python mutation

Most mutations (formatting, comment and block insertion, renames) are implemented in the CLI. They subclass BaseMutation, whose interface is a single apply method:

from abc import ABC, abstractmethod

class BaseMutation(ABC):
    @abstractmethod
    def apply(self, base_dir: str) -> bool:
        ...

Steps:

Add a class under mizan-cli/src/mizan_cli/commands/mutate/mutations/ that subclasses BaseMutation and implements apply(self, base_dir) -> bool. Return True on success. base_dir is the checkout directory (it contains samples/ and mizan.json).
Register it in MUTATION_REGISTRY in mutations/__init__.py, keyed by the identifier users pass to mizan mutate -m.
If your mutation removes comments or otherwise breaks the line markers, follow the content-based tracking approach used by the AST mutations (see the Mutations overview). For most insertions, the default marker tracking is sufficient.

The orchestrator validates that each mutated sample still compiles and that the ground truth is preserved, rolling back any sample that fails. You do not need to handle backup or validation yourself.
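
A minimal sketch of what such a class might look like follows. To stay self-contained it redefines a local stand-in for BaseMutation instead of importing it from the CLI; the mutation itself (InsertBannerComment), its identifier, and the registry shape (identifier mapped to class) are all hypothetical and should be checked against the existing mutations before copying.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseMutation(ABC):
    """Local stand-in mirroring the interface described in the docs."""

    @abstractmethod
    def apply(self, base_dir: str) -> bool:
        ...


class InsertBannerComment(BaseMutation):
    """Hypothetical mutation: prepend a plain line comment to every .rs file."""

    BANNER = "// NOTE: formatting pass applied.\n"

    def apply(self, base_dir: str) -> bool:
        samples = Path(base_dir) / "samples"
        if not samples.is_dir():
            return False  # not a valid checkout directory
        for source in sorted(samples.rglob("*.rs")):
            text = source.read_text(encoding="utf-8")
            source.write_text(self.BANNER + text, encoding="utf-8")
        return True


# Assumed registry shape; the real MUTATION_REGISTRY may store something else.
MUTATION_REGISTRY = {"insert-banner-comment": InsertBannerComment}

# Demo on a throwaway checkout layout.
with tempfile.TemporaryDirectory() as base_dir:
    lib = Path(base_dir) / "samples" / "sample-00001-function" / "src" / "lib.rs"
    lib.parent.mkdir(parents=True)
    lib.write_text("pub fn f() {}\n", encoding="utf-8")
    applied = MUTATION_REGISTRY["insert-banner-comment"]().apply(base_dir)
    mutated = lib.read_text(encoding="utf-8")
print(applied)
print(mutated)
```

The class only rewrites files and reports success; compilation checks, ground-truth verification, and rollback are left to the orchestrator, as described above.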

Option 2: an AST mutation in mizan-mut

Structural transformations that need real Rust AST manipulation belong in mizan-mut.

Steps:

Add the mutation under mizan-mut/src/mutations/ using syn and quote, and wire it into the mutation dispatch in mizan-mut/src/mutate.rs.
Add it to the MUTATIONS array in docker/Dockerfile.mutations-test so it is covered by the test suite (see Testing below).
To expose it through the CLI, add a thin MizanMutMutation subclass in mizan_mut.py and register it in MUTATION_REGISTRY with a mizan-mut- prefix, exactly like the existing AST mutations.

Testing

A mutation should preserve program behavior. The mizan-mut repository ships a Docker-based test suite that applies each mutation to real-world crates and checks that their test suites still pass. Use it to test and iterate:

docker build -f docker/Dockerfile.mutations-test -t mizan-mut-test .
docker run mizan-mut-test

Add your mutation to the MUTATIONS array in docker/Dockerfile.mutations-test so it is included in the run, then iterate until the report is clean. For CLI mutations, the orchestrator also compiles each mutated sample and verifies the ground truth before saving, rolling back anything that fails.
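
The compile-then-roll-back behavior can be pictured with the sketch below: back up the sample, apply the mutation, run a check, and restore the backup if anything fails. This is an illustration of the pattern under stated assumptions, not the orchestrator's code. The function names are hypothetical, the ground-truth verification step is omitted, and the check is injectable so the demo runs without cargo installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def cargo_check(sample_dir: Path) -> bool:
    """Return True if `cargo check` succeeds inside sample_dir."""
    result = subprocess.run(["cargo", "check", "--quiet"], cwd=sample_dir, capture_output=True)
    return result.returncode == 0


def mutate_with_rollback(
    sample_dir: Path,
    mutate: Callable[[Path], bool],
    check: Callable[[Path], bool] = cargo_check,
) -> bool:
    """Apply mutate; keep the result only if mutate and check both succeed."""
    sample_dir = Path(sample_dir)
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "backup"
        shutil.copytree(sample_dir, backup)
        try:
            ok = bool(mutate(sample_dir)) and bool(check(sample_dir))
        except Exception:
            ok = False  # treat any error as a failed mutation
        if not ok:
            shutil.rmtree(sample_dir)
            shutil.copytree(backup, sample_dir)  # restore the original sample
    return ok


def break_syntax(sample_dir: Path) -> bool:
    """Deliberately bad mutation used only for the demo."""
    (sample_dir / "lib.rs").write_text("fn broken( {", encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "sample-00001-function"
    sample.mkdir()
    (sample / "lib.rs").write_text("pub fn ok() {}\n", encoding="utf-8")
    # A stub check that always fails stands in for the compiler here.
    kept = mutate_with_rollback(sample, break_syntax, check=lambda d: False)
    restored = (sample / "lib.rs").read_text(encoding="utf-8")
print(kept, restored == "pub fn ok() {}\n")
```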

Submit leaderboard results

The leaderboard is a separate repository (the Hugging Face Space). Adding results means contributing the processed output of an Inspect-AI run to that repo.

First, run an evaluation and produce an Inspect-AI .eval file (see Evaluation). Then, in the leaderboard repo:

Add the .eval file to data/eval_files/.

cp your_experiment.eval data/eval_files/

Register it in data/leaderboard_config.json by adding an entry to the experiments array:

{
  "name": "Agent + Model",
  "eval_path": "data/eval_files/your_experiment.eval"
}

Add the variant (if new). If your eval uses a new tag, map it to a display name in data/dataset_info.json:
```
{ "your_tag": "Display Name" }
```
Run preprocessing.
```
python preprocess_evals.py
```
This reads each .eval file, extracts the per-sample scores into data/experiments/<name>_<tag>.json, and regenerates data/processed_config.json, which the app loads at startup.
Open a pull request against the Space with your changes. You can browse and create pull requests from the Space's Community tab: open pull requests.

The committed JSON files in data/experiments/ (not the large .eval files) are what the app serves. See the leaderboard repo's CONTRIBUTING.md for the canonical version of these steps.
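
The registration step can also be scripted. The sketch below appends an entry to the experiments array and skips it if the same eval_path is already listed. It assumes leaderboard_config.json is a JSON object with a top-level "experiments" key and that entries carry only the two fields shown above; the helper name is hypothetical, and the demo runs against a temporary file, not the real config.

```python
import json
import tempfile
from pathlib import Path


def register_experiment(config_path, name: str, eval_path: str) -> bool:
    """Append an experiment entry; return False if eval_path is already registered."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    experiments = config.setdefault("experiments", [])
    if any(entry.get("eval_path") == eval_path for entry in experiments):
        return False
    experiments.append({"name": name, "eval_path": eval_path})
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


# Demo against a throwaway config file.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "leaderboard_config.json"
    cfg.write_text(json.dumps({"experiments": []}), encoding="utf-8")
    first = register_experiment(cfg, "Agent + Model", "data/eval_files/your_experiment.eval")
    second = register_experiment(cfg, "Agent + Model", "data/eval_files/your_experiment.eval")
    final_config = json.loads(cfg.read_text(encoding="utf-8"))
print(first, second, len(final_config["experiments"]))
```

Preprocessing still has to be run afterwards so that data/processed_config.json is regenerated.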

Publish the trajectories

The Sample-wise Comparison tab links each result to its full trajectory in the rust-mizan-logs Inspect log viewer. That viewer is regenerated from the raw .eval files (which are not stored in the repo), so refresh it after adding runs:

export HF_TOKEN=hf_...   # write access to sfu-rsl
python publish_logs.py   # defaults to ../agentic_evals/logs

This bundles the .eval files into a static Inspect viewer and uploads it to the Space, replacing the previous contents. Pass --logs-dir / --space to override the defaults.

Update the analysis

The leaderboard's contamination verdicts come from a Docent rubric (see Trajectory analysis). To refresh them after adding runs:

You need access to the RustMizan Docent collection.
Upload the new .eval files to the collection and run the rubric on them.
Fetch the verdicts and commit the result:
```
python fetch_docent.py   # needs DOCENT_API_KEY (or a DOCENT_TOKEN file)
```
This writes data/docent.json; commit it and open a pull request.

Limitations

RustMizan makes some deliberate trade-offs. They are worth knowing before drawing conclusions from results.

Manually curated. The dataset is manually curated and verified to compile, which favors quality over quantity. It does not aim to cover every Rust vulnerability.
Labeling assumption. Pre-patch code is treated as vulnerable and post-patch code as non-vulnerable. This follows standard practice in vulnerability research, but it assumes the patch resolves the intended issue and that no other vulnerability remains, which may not hold in every case.
Uneven mutation coverage. Some mutations need specific constructs (loop rewrites need loops, conditional rewrites need branches), so a given variant is transformed only by the applicable operators. Contamination mitigation is therefore uneven across the dataset. The per-variant mutation log records which mutations were applied, so this is visible rather than hidden.
Published mutations and contamination. Once mutated variants are released, they can be ingested into future training corpora and lose their contamination-testing value. The framework regenerates fresh variants on demand from the vanilla split to mitigate this, and contamination mitigation remains an active research area.

License & provenance

Code (the framework, CLI, and tooling) is licensed under Apache-2.0.
Dataset is licensed under CC-BY-4.0.

The dataset is derived from publicly disclosed memory-safety vulnerabilities in open-source Rust crates, indexed by the RustSec Advisory Database. Each source crate retains its own upstream license. The crates carry a mix of common open-source licenses, including MIT, Apache-2.0, MPL-2.0, and BSD-3-Clause. The full list of source repositories and their licenses is maintained alongside the dataset in the repository.

Only the unmodified vanilla split is published as a dataset. The mutated splits (benign, malignant, rust-specific) are not hosted separately; they are regenerated on demand by running the mutation framework on the vanilla split, via a single-command Docker recipe in the repository.