- Start revision: `e816fc3e4af1`
- Branch: `codex/perry-performance-20260617`
- Worker assignment: single Codex pass in this worktree
- Benchmark environment: Linux `/usr/bin/time -v`; local `node` cannot execute `.ts` benchmark inputs, so Node columns and correctness comparisons were skipped by the harness
- Baseline commands:
  - `cargo build --release`
  - `./benchmarks/quick.sh`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-baseline-e816fc3e4.json`
- Baseline results:
  - quick: fibonacci 260ms/18MB, math_intensive 73ms/18MB, nested_loops 3508ms/17MB, factorial 95ms/18MB, matrix_multiply 6462ms/27MB
  - compare quick medians: loop_overhead 74ms/18772KB, fibonacci 262ms/18696KB, math_intensive 70ms/18696KB, nested_loops 3383ms/17724KB, factorial 96ms/18836KB
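
The memory figures above are peak RSS values. As a minimal sketch of how such a value can be read on Linux, the snippet below extracts the `Maximum resident set size (kbytes)` line that GNU `/usr/bin/time -v` writes to stderr. The function name, the sample text, and the `./bench` command are illustrative assumptions; this is not the harness code from `benchmarks/compare.sh`.

```python
import re

# GNU time -v reports peak RSS on a line of this form (value in kilobytes).
_MAX_RSS = re.compile(r"Maximum resident set size \(kbytes\):\s*(\d+)")


def peak_rss_kb(time_v_stderr: str) -> int:
    """Return the peak RSS in KB found in captured `/usr/bin/time -v` output."""
    match = _MAX_RSS.search(time_v_stderr)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1))


# Abbreviated, hand-written sample; real output contains many more fields.
sample = """\
\tCommand being timed: "./bench"
\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
\tMaximum resident set size (kbytes): 18772
\tExit status: 0
"""

print(peak_rss_kb(sample))  # 18772
```
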
- Selected gap and evidence:
  - `nested_loops` dominated the quick compare set at 3383ms; `matrix_multiply` was the slowest `quick.sh` case at 6462ms.
  - LLVM trace for `benchmarks/suite/10_nested_loops.ts` showed `js_typed_feedback_register_site(...)` emitted inside the hot `for.body.21` inner loop before each typed-feedback array guard.
- Change:
  - Added `LlFunction::entry_setup_call_void` and changed typed-feedback site registration to emit once in function-entry setup instead of at every guard use site.
  - Kept guard, fallback, pass, and counter calls at original use sites so runtime evidence semantics remain per-use.
  - Updated benchmark harnesses to support Linux RSS measurement and skip Node `.ts` columns when the installed Node cannot run TypeScript directly.
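
The effect of the registration hoist can be illustrated with a toy model. The sketch below is not Perry's codegen: `ToyFunction`, `build`, and the callee names are hypothetical, and the de-duplication inside `entry_setup_call_void` is an assumption made for the illustration. It only shows why moving a call from the loop body to entry setup changes its execution count from once per iteration to once per function call, while guards left at their use sites still run every iteration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFunction:
    """Hypothetical stand-in for an IR function builder; not Perry's LlFunction."""

    entry_setup: list = field(default_factory=list)  # runs once per call
    loop_body: list = field(default_factory=list)    # runs once per iteration

    def entry_setup_call_void(self, callee: str) -> None:
        # Assumption: a setup call is emitted at most once per callee.
        if callee not in self.entry_setup:
            self.entry_setup.append(callee)

    def executed_calls(self, iterations: int) -> dict:
        """Count how often each callee runs during one call of the function."""
        counts = {}
        for callee in self.entry_setup:
            counts[callee] = counts.get(callee, 0) + 1
        for callee in self.loop_body:
            counts[callee] = counts.get(callee, 0) + iterations
        return counts


def build(hoist: bool) -> ToyFunction:
    fn = ToyFunction()
    for site in (0, 1):  # two guarded array accesses in the loop body
        register = f"register_site_{site}"
        if hoist:
            fn.entry_setup_call_void(register)
        else:
            fn.loop_body.append(register)
        fn.loop_body.append(f"guard_{site}")  # the guard stays at its use site
    return fn


before = build(hoist=False).executed_calls(iterations=1000)
after = build(hoist=True).executed_calls(iterations=1000)
print(before["register_site_0"], after["register_site_0"])  # 1000 1
print(before["guard_0"], after["guard_0"])                  # 1000 1000
```
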
- Post-change benchmark commands:
  - `cargo build --release`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-final-e816fc3e4.json`
  - `./benchmarks/quick.sh`
- Post-change results:
  - compare quick medians: loop_overhead 74ms/18768KB, fibonacci 261ms/18920KB, math_intensive 69ms/18944KB, nested_loops 956ms/19152KB, factorial 94ms/18896KB
  - quick: fibonacci 262ms/18MB, math_intensive 55ms/18MB, nested_loops 965ms/18MB, factorial 75ms/18MB, matrix_multiply 1842ms/28MB
- Measured impact:
  - `10_nested_loops` compare median: 3383ms -> 956ms, 71.7% faster
  - `16_matrix_multiply` quick: 6462ms -> 1842ms, 71.5% faster
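
The percentages above are consistent with reading "faster" as the share of baseline wall time removed, `(before - after) / before`. The short check below recomputes them from the reported timings and also prints the equivalent speedup ratio; the helper names are illustrative and not part of the harness.

```python
def percent_time_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage of the baseline wall time removed by the change."""
    return (before_ms - after_ms) / before_ms * 100.0


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the post-change run is."""
    return before_ms / after_ms


# (before, after) timings in milliseconds, as reported in this entry.
measurements = {
    "10_nested_loops (compare median)": (3383, 956),
    "16_matrix_multiply (quick)": (6462, 1842),
}

for name, (before_ms, after_ms) in measurements.items():
    reduction = percent_time_reduction(before_ms, after_ms)
    ratio = speedup(before_ms, after_ms)
    print(f"{name}: {reduction:.1f}% less time, {ratio:.2f}x speedup")
```

Both cases come out at roughly a 3.5x speedup.
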
- Verification:
  - `bash -n benchmarks/quick.sh`
  - `bash -n benchmarks/compare.sh`
  - `cargo fmt --check`
  - `cargo test -p perry-codegen --test typed_feedback`
  - `PERRY_BIN=target/release/perry python3 tests/test_typed_feedback_runtime_evidence.py`
  - `tests/test_benchmark_output_verifier.sh`
  - `target/release/perry compile --no-cache benchmarks/suite/10_nested_loops.ts -o /tmp/perry-nested-loops-final --trace llvm --quiet`; trace confirmed registration calls in entry setup only and no registration calls in `for.body.21`
  - `/tmp/perry-nested-loops-final` produced `nested_loops:963` and `sum:26991000000`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-multiply-final --quiet && /tmp/perry-matrix-multiply-final` produced `matrix_multiply:1778` and `checksum:41079519680`
- Notes:
  - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local baseline JSON.
  - Follow-up candidates remain in typed array and numeric array hot paths, but this cycle stopped at the isolated registration-hoist optimization.
- PR: #5295