Skip to content

Latest commit

 

History

History
379 lines (284 loc) · 22.6 KB

File metadata and controls

379 lines (284 loc) · 22.6 KB

ByteSpan + SourceMap Optimization Plan

Context

Problem: Every token, trivia token, AST node (~42 types), and parse error stores a GraphQLSourceSpan (~104 bytes: two SourcePosition values + Option<PathBuf>). The lexer eagerly computes line/col/utf16 at every character boundary (5 field writes per consume() call), and clones a PathBuf on every make_span() call (~26 sites). For a 1.2MB schema this means ~250K PathBuf clones.

Why now: Custom AST (Section 4.2 Phase 4d) is complete. 672 tests + benchmark suite provide a safety net. This is the natural time to attempt the optimization.

Intended outcome: Replace GraphQLSourceSpan with ByteSpan (8 bytes, Copy) everywhere. Build a shared SourceMap once via O(n) pre-pass. Resolve line/col lazily on demand. Both UTF-8 and UTF-16 columns losslessly recoverable. Net perf win confirmed by benchmarks before committing.

Previous attempt failed because it tried to collect line_starts inside lexer hot paths. This plan uses a pre-pass strategy that completely decouples newline scanning from lexing.


Architecture After Change

  StrGraphQLTokenSource
  ┌────────────────────────────────┐
  │ source: &'src str              │
  │ source_map: SourceMap<'src>    │──build in ctor (pre-pass)
  │ curr_byte_offset: usize        │
  │ pending_trivia, finished       │
  └────────────────────────────────┘
       │ impl GraphQLTokenSource<'src>
       │   - Iterator<Item = GraphQLToken<'src>>
       │   - source_map(&self) -> &SourceMap<'src>
       │   - into_source_map(self) -> SourceMap<'src>
       │
       │ emits tokens with ByteSpan
       v
  GraphQLParser<S: GraphQLTokenSource<'src>>
       │
       │ consumes S → calls S.into_source_map()
       v
  ParseResult<'src, TAst>
  ┌─────────────────────────────────────┐
  │ Ok { ast, source_map }              │
  │ Recovered { ast, errors, source_map}│
  └─────────────────────────────────────┘
       │
       │ source_map.resolve_span(ByteSpan) → SourceSpan (transient, on demand)
       v
  Error formatting, IDE/LSP, etc.

Naming Conventions

Type Role
ByteSpan Compact storage span (8 bytes, Copy). Stored on all tokens, AST nodes, errors.
SourceSpan Rich resolved span with line/col/file. Transient — produced on demand, never stored. Renamed from GraphQLSourceSpan.
SourcePosition Single resolved position (line, col_utf8, col_utf16, byte_offset). Kept as-is.
SourceMap<'src> Maps byte offsets → line/col. Built once per parse, shared across all lookups.

Resolution methods:

  • ByteSpan::resolve_source_span(source_map: &SourceMap) → SourceSpan — convenience on ByteSpan
  • SourceMap::resolve_span(ByteSpan) → SourceSpan — resolve both endpoints + attach file_path
  • SourceMap::resolve_offset(u32) → SourcePosition — resolve a single byte offset

Key Design Decisions

  1. Pre-pass for line_startsStrGraphQLTokenSource scans source for \n/\r/\r\n in its constructor, building line_starts into its internal SourceMap. Other token sources build their SourceMaps differently (e.g. SourceMap::empty() for proc-macro). A future streaming token source would build line_starts incrementally as chunks arrive.
  2. SourceMap on the GraphQLTokenSource trait — all token sources implement source_map(&self) -> &SourceMap<'src> and into_source_map(self) -> SourceMap<'src>. This ensures uniform line/col resolution for all consumers (error formatting, IDEs over token streams, IDEs over ASTs, etc). The blanket impl<T: Iterator> GraphQLTokenSource for T is removed — token sources become intentional implementations.
  3. SourceMap inside ParseResultParseResult<'src, TAst> gains a lifetime and carries source_map: SourceMap<'src> on both variants. TAst already implicitly carries 'src (e.g. Document<'src>), so this is a natural extension. The parser consumes the token source via into_source_map() and bundles the result.
  4. ByteSpan everywhere — public API change. All types use ByteSpan { start: u32, end: u32 }.
  5. Column recovery on demand — binary search line_starts → line, then count chars/UTF-16 units from line start to byte offset. Only happens on error formatting and IDE queries (cold path).
  6. GraphQLSourceSpan renamed to SourceSpan — kept as a transient "resolved" type returned by SourceMap::resolve_span(). Not stored in any struct.

Phase Completion Protocol

Every phase (except Phase 0) ends with:

  1. cargo test — full workspace (all tests pass)
  2. cargo clippy --tests — full workspace (clean)
  3. 15-minute fuzz test run (skip for Phase 1 since no behavioral changes)
  4. sl commit with thorough description of the phase's changes

Phases

Phase 0: Baseline Benchmarks ✅ COMPLETE

  • Run cargo bench 3x, save criterion output
  • Record: schema_parse (github, shopify_admin), executable_parse (simple, complex), lexer (github)
  • Verify <5% variance between runs

Pre-ByteSpan baseline captured in crates/libgraphql-parser/benchmark-optimizations.md (B19 entry, Mar 4). No separate criterion --save-baseline run was performed; B19 results serve as the effective Phase 0 baseline. Key pre-migration numbers: github schema parse 14.9ms (default), 8.58ms (lean); lexer github 3.66ms (~319 MiB/s).

Phase 1: Add ByteSpan + SourceMap Types (Additive Only) ✅ COMPLETE

No existing behavior modified. All 752 tests pass (including 58 new tests).

New files:

  • src/byte_span.rsByteSpan { start: u32, end: u32 }, #[derive(Copy, Default)], #[repr(C)]
  • src/source_map.rs — Dual-mode SourceMap<'src> with internal SourceMapData enum
  • src/tests/byte_span_tests.rs — 10 tests covering size, construction, merge, Copy, Hash, etc.
  • src/tests/source_map_tests.rs — 23 tests covering both modes, Unicode, edge cases, round-trip validation

SourceMap dual-mode design (deviation from plan): Plan originally had single-mode SourceMap with source text. Implemented as dual-mode:

  • SourceMap::new_with_source(source, file_path) — source-text mode with line_starts pre-pass
  • SourceMap::new_precomputed(file_path) — pre-computed columns mode for token sources without source text (e.g. RustMacroGraphQLTokenSource)
  • Internal SourceMapData enum dispatches between modes (non-pub)

SourceMap key methods:

  • resolve_offset(u32) → Option<SourcePosition> — returns None for unresolvable offsets (no debug_asserts on query path)
  • resolve_span(ByteSpan) → Option<GraphQLSourceSpan> — returns None if either endpoint fails
  • insert_computed_position(u32, SourcePosition) — pre-computed mode only, debug_asserts on monotonic ordering + correct mode
  • source() → Option<&'src str>, file_path() → Option<&Path>

Deviations from plan:

  1. SourceMap is dual-mode (concrete struct with enum dispatch) instead of single-mode — avoids viral generics on ParseResult
  2. resolve_offset() returns Option<SourcePosition> instead of bare SourcePosition — cleaner error signaling
  3. No ByteSpan::resolve_source_span() convenience method added — deferred, may not be needed
  4. debug_asserts removed from resolve_offset() query path (kept on insert_computed_position() producer path) — None is the contract for unresolvable offsets, debug_asserts would panic before None was returned in debug builds
  5. col_utf8 naming TODO added — counts Unicode scalar values not UTF-8 bytes, name is misleading

Conversion bridge (temporary):

  • GraphQLSourceSpan::to_byte_span() → ByteSpan

Phase completion: cargo test (752 pass), cargo clippy --tests (clean), sl commit

Phase 2: Rename GraphQLSourceSpanSourceSpan ✅ COMPLETE

Mechanical rename across 63 files in all 3 workspace crates. No behavioral changes.

  • Renamed struct GraphQLSourceSpanSourceSpan
  • Renamed file graphql_source_span.rssource_span.rs (tracked via sl mv)
  • Updated module declaration + re-export in lib.rs
  • All imports and references updated across libgraphql-parser, libgraphql-macros, libgraphql

Note: Serena's LSP rename_symbol tool corrupted several files (partial substring matches inside other identifiers). Reverted and used sed with literal string replacement instead.

Phase completion: cargo test (1,077 pass), cargo clippy --tests (clean), sl commit

Phase 3: Update GraphQLTokenSource Trait + Migrate Tokens + Lexer ✅ COMPLETE

This is the highest-impact phase. The trait gains SourceMap methods, the lexer simplifies.

Update GraphQLTokenSource trait (src/token_source/graphql_token_source.rs):

  • Remove blanket impl impl<T: Iterator<Item = GraphQLToken>> GraphQLTokenSource for T
  • Add required methods:
    pub trait GraphQLTokenSource<'src>: Iterator<Item = GraphQLToken<'src>> {
        /// Borrow the SourceMap. Available at any point during
        /// tokenization — useful for IDE-like tools that need
        /// line/col lookups mid-stream.
        fn source_map(&self) -> &SourceMap<'src>;
    
        /// Consume this token source and return the owned SourceMap.
        /// Called by the parser after consuming all tokens (EOF).
        fn into_source_map(self) -> SourceMap<'src>;
    }

Modify GraphQLToken.span and GraphQLTriviaToken spans → ByteSpan

  • src/token/graphql_token.rs: pub span: ByteSpan
  • src/token/graphql_trivia_token.rs: all variant spans → ByteSpan

Simplify StrGraphQLTokenSource (src/token_source/str_to_graphql_token_source.rs):

Add field: source_map: SourceMap<'src> (built in constructor via pre-pass)

Remove 4 fields from struct:

  • curr_line (line tracking moves to SourceMap pre-pass)
  • curr_col_utf8 (column computed on demand from SourceMap)
  • curr_col_utf16 (same)
  • last_char_was_cr (only needed for line tracking)

Remaining mutable state: curr_byte_offset, pending_trivia, finished

Implement trait methods:

  • source_map(&self) -> &SourceMap<'src>&self.source_map
  • into_source_map(self) -> SourceMap<'src>self.source_map

Simplify methods:

  • consume(): becomes ~6 lines (just advance curr_byte_offset by char's UTF-8 byte len)
  • curr_position(): removed entirely
  • make_span(start: u32): ByteSpan::new(start, self.curr_byte_offset as u32) — no PathBuf clone
  • skip_whitespace(): ~20 lines (just advance past whitespace/newline bytes, no line/col tracking)
  • lex_block_string(): ~30 lines (byte-scan for closing """, no line/col tracking)
  • lex_comment(): byte-scan to EOL, remove compute_columns_for_span() call
  • lex_name(): byte-scan for name chars, self.curr_byte_offset = i (remove col updates)
  • lex_dot_or_ellipsis(start: u32): remove first_dot_line / self.curr_line checks — these are redundant because skip_whitespace_same_line() never crosses line boundaries, so dots separated by newlines always fall through to the default error case naturally
  • skip_whitespace_same_line(): unchanged (already only uses peek_char/consume)
  • All let start = self.curr_position()let start = self.curr_byte_offset as u32

Remove compute_columns_for_span() — no longer called by any lexer method

Update lexer tests:

  • Position tests: use source_map.resolve_offset() to verify line/col/utf16
  • Token kind tests: update span type annotations

Phase completion: cargo test, cargo clippy --tests, 15min fuzz test, sl commit

Benchmark checkpoint: Run cargo bench, compare lexer/parse throughput vs Phase 0 baseline. Expect improvement. If regression >3%: investigate pre-pass overhead.

Phases 3-5 landed as a single commit (c023ba07, Mar 5). Deviation: phases not committed separately due to tight coupling — AST node spans, parser spans, token spans, and error spans all needed to change together for tests to pass.

All items completed: blanket impl removed, source_map()/into_source_map() added to trait, tokens use ByteSpan, StrGraphQLTokenSource simplified (source_map field added, 4 tracking fields removed), consume() simplified, make_span() returns ByteSpan. compute_columns_for_span() removed. Lexer tests updated.

Phase completion: cargo test (1,077+ pass), cargo clippy --tests (clean), sl commit.

Phase 4: Migrate Parser + ParseResult to ByteSpan ✅ COMPLETE

Update ParseResult (src/parse_result.rs):

  • Add lifetime: ParseResult<'src, TAst>
  • Both variants gain source_map: SourceMap<'src>:
    pub enum ParseResult<'src, TAst> {
        Ok {
            ast: TAst,
            source_map: SourceMap<'src>,
        },
        Recovered {
            ast: TAst,
            errors: Vec<GraphQLParseError>,
            source_map: SourceMap<'src>,
        },
    }
  • Update all methods (valid_ast(), into_valid(), into_ast(), errors(), format_errors(), From impl, etc.) to carry the lifetime and propagate source_map
  • format_errors() uses the bundled source_map directly (no external parameter needed)

Update GraphQLParser (src/graphql_parser.rs, ~3350 lines):

Mechanical changes:

  • last_end_position: Option<SourcePosition>Option<u32>
  • make_span(start: SourceSpan) → SourceSpanmake_span(start: ByteSpan) → ByteSpan
  • make_span_ref(&SourceSpan) → SourceSpanmake_span_ref(&ByteSpan) → ByteSpan
  • eof_span() / document_span() → return ByteSpan
  • OpenDelimiter.spanByteSpan
  • 18 direct SourceSpan::new(...) calls → ByteSpan::new(...) (.start/.end instead of .start_inclusive/.end_exclusive)
  • 44 .span.clone() calls → just .span (Copy)
  • All parse_* methods: start.start_inclusivestart.start, start.end_exclusivestart.end

Parser completion: after consuming EOF token, call self.token_stream.into_source_map() and bundle into ParseResult:

let source_map = self.token_stream.into_source_map();
ParseResult::Ok { ast: document, source_map }

Note: GraphQLTokenStream wraps the token source for peek/consume. It will need an into_source_map() method that forwards to the underlying token source.

Update public parse functions (in lib.rs or convenience wrappers):

  • parse_schema_document(source) → returns ParseResult<'_, SchemaDocument<'_>>
  • Same for parse_executable_document, parse_mixed_document
  • No separate wrapper needed — ParseResult carries the SourceMap

Update parser tests — same pattern as lexer tests

Phase completion: cargo test, cargo clippy --tests, 15min fuzz test, sl commit

Completed in same commit as Phase 3 (c023ba07, Mar 5).

ParseResult gained 'src lifetime and carries SourceMap<'src> on both variants. last_end_position changed to Option<u32>. All parser span operations use ByteSpan. GraphQLTokenStream forwards into_source_map(). Public parse functions return ParseResult<'src, TAst>.

Phase completion: same commit as Phase 3.

Phase 5: Migrate AST Nodes + Errors to ByteSpan ✅ COMPLETE

AST nodes (src/ast/*.rs, ~42 files):

  • pub span: SourceSpanpub span: ByteSpan
  • Remove use crate::SourceSpanuse crate::ByteSpan

Errors (src/graphql_parse_error.rs):

  • GraphQLParseError.spanByteSpan
  • GraphQLErrorNote.spanOption<ByteSpan>
  • format_detailed(&self, source_map: &SourceMap) — resolve spans via SourceMap for display
  • format_oneline(&self, source_map: &SourceMap) — same

Compat layer (src/parser_compat/graphql_parser_v0_4/):

  • pos_from_span / end_pos_from_span / type_ext_pos_from_span → take &SourceMap parameter
  • Thread SourceMap through all to_* conversion functions via a context struct
  • from_* functions create ByteSpan from graphql_parser positions

Phase completion: cargo test, cargo clippy --tests, 15min fuzz test, sl commit

Completed in same commit as Phases 3-4 (c023ba07, Mar 5).

All ~42 AST node files migrated to pub span: ByteSpan. Parse errors use ByteSpan. Error formatting takes &SourceMap.

Compat layer partially updated: from_* functions create ByteSpan from graphql_parser positions. to_* functions use a context struct that builds SourceMap internally when source text provided; falls back to zero-spans otherwise. SourceMap NOT threaded through all function signatures as originally planned.

Phase completion: same commit as Phase 3.

Phase 6: Migrate RustMacroGraphQLTokenSource

In crates/libgraphql-macros/:

  • Implement GraphQLTokenSource trait explicitly (was previously via blanket impl):
    • source_map() → returns &self.source_map (a SourceMap::empty())
    • into_source_map() → returns self.source_map
  • Deviation: Used synthetic monotonically-increasing u32 offsets (Option B) instead of proc_macro2::Span::byte_range() — byte_range() is unreliable on stable Rust. Each token consumes 2 offsets (start + end). SpanMap maps u32proc_macro2::Span for compile_error! placement.
  • span_map: HashMap<(usize, usize), Span>HashMap<u32, Span> (keyed by synthetic offset)
  • Error display via compile_error! with original Span — unaffected by SourceMap changes
  • SourceMap::empty() in compat layer is correct (no source text in proc-macro path)

Phase completion: cargo test, cargo clippy --tests, sl commit

Status: ✅ COMPLETE. Landed in commit 4ec3f231 (Mar 5-6). All 139 macros tests + 6 doc-tests pass. Clippy clean.

Phase 7: Cleanup + Final Benchmarks

  • ✅ Removed SourceSpan::to_byte_span() bridge
  • ✅ Removed clippy::large_enum_variant allows on Definition, Nullability, TypeAnnotation, Selection — no longer needed with 8-byte ByteSpan
  • ✅ Cleaned up stale Phase TODOs in source comments
  • ✅ Ran full high-confidence benchmarks (300 samples, 20s measurement, 99% CI)
  • ✅ Updated libgraphql-parser README with new benchmark results

Results vs Feb 14 baseline: Schema parse 5-17% faster, lexer 2-3x throughput improvement (105→305 MiB/s small, 313→574 MiB/s shopify_admin). All CI widths <1.2%.

Status: ✅ COMPLETE. Landed in commits 4ec3f231 + 3146284c (Mar 6).


Critical Files

File Change
crates/libgraphql-parser/src/byte_span.rs NEW — ByteSpan type
crates/libgraphql-parser/src/source_map.rs NEW — SourceMap type
crates/libgraphql-parser/src/graphql_source_span.rs Renamed to source_span.rs, struct renamed GraphQLSourceSpanSourceSpan
crates/libgraphql-parser/src/token_source/graphql_token_source.rs Remove blanket impl, add source_map() + into_source_map() methods
crates/libgraphql-parser/src/token_source/str_to_graphql_token_source.rs MAJOR — add source_map field, remove 4 fields, simplify all hot paths, implement trait
crates/libgraphql-parser/src/token/graphql_token.rs span → ByteSpan
crates/libgraphql-parser/src/token/graphql_trivia_token.rs spans → ByteSpan
crates/libgraphql-parser/src/graphql_parser.rs ~3350 lines: mechanical ByteSpan migration, last_end_position → u32, call into_source_map()
crates/libgraphql-parser/src/parse_result.rs Add 'src lifetime, carry source_map: SourceMap<'src> on both variants
crates/libgraphql-parser/src/graphql_parse_error.rs span → ByteSpan, formatting takes &SourceMap
crates/libgraphql-parser/src/source_position.rs kept for SourceMap::resolve_offset return type
crates/libgraphql-parser/src/ast/*.rs ~42 files: span field → ByteSpan
crates/libgraphql-parser/src/ast/ast_node.rs append_span_source_slice updated for ByteSpan
crates/libgraphql-parser/src/lib.rs module renames + new declarations + re-exports
crates/libgraphql-parser/src/parser_compat/graphql_parser_v0_4/*.rs thread SourceMap through compat layer
crates/libgraphql-macros/src/rust_macro_graphql_token_source.rs explicit trait impl, emit ByteSpan, simplify span_map
crates/libgraphql-parser/benches/parse_benchmarks.rs update for new ParseResult API

Reusable Existing Code

  • SourcePosition::new() — reused as return type from SourceMap::resolve_offset()
  • SourceSpan::new() / ::with_file() — reused inside SourceMap::resolve_span() for transient resolved spans
  • Benchmark infrastructure (benches/parse_benchmarks.rs) — existing 7 groups provide before/after comparison
  • is_name_continue_byte() in lexer — unchanged, used in simplified lex_name
  • Fuzz test infrastructure (fuzz/) — used for 15min fuzz runs at end of each phase

Verification (per phase)

  1. cargo test — full workspace, all tests pass
  2. cargo clippy --tests — full workspace, clean
  3. 15-minute fuzz test (Phase 3+): ./scripts/run-fuzz-tests.sh or equivalent
  4. Round-trip validation (Phase 1): SourceMap-resolved positions match original SourcePosition for every token in every test fixture
  5. Benchmarks (Phase 3, Phase 7): cargo bench --package libgraphql-parser — compare all 7 groups vs Phase 0 baseline

Complexity Warnings (from previous attempt)

Risk Mitigation
line_starts in hot paths Pre-pass strategy: completely decoupled from lexer. Internal to StrGraphQLTokenSource ctor.
UTF-8 mid-codepoint byte offsets Lexer always produces char-boundary-aligned offsets. SourceMap uses str::chars() from line start — which starts at a valid char boundary. Debug assert for safety.
UTF-16 column recovery On-demand from source text via char::len_utf16() count. No pre-computed Utf16LineInfo table needed.
Two-phase migration Phases are testable independently. Phase 1 is purely additive. Phase 3 can be benchmarked before continuing.
BOM handling BOM (0xEF 0xBB 0xBF) is NOT a line terminator → not in line_starts. Column counting via chars() naturally handles BOM as 1 char (U+FEFF).
lex_dot_or_ellipsis same-line check skip_whitespace_same_line() never crosses newlines → the curr_line == first_dot_line guard is redundant and can be safely removed.
Blanket impl removal GraphQLTokenSource blanket impl removed. Both StrGraphQLTokenSource and RustMacroGraphQLTokenSource need explicit impls. This is intentional — token sources should be deliberate.
ParseResult lifetime ParseResult<'src, TAst> adds 'src. Since TAst already carries 'src implicitly (e.g. Document<'src>), most callers already have the lifetime in scope.
Streaming token sources Not blocked. A future streaming source implements the trait: builds SourceMap incrementally as chunks arrive, source_map() returns partial-but-correct map for content seen so far.