Skip to content

Harden edgezero #269 runtime config-store load (HTTP layer)#783

Open
prk-Jr wants to merge 28 commits into
mainfrom
feature/edgezero-269-http
Open

Harden edgezero #269 runtime config-store load (HTTP layer)#783
prk-Jr wants to merge 28 commits into
mainfrom
feature/edgezero-269-http

Conversation

@prk-Jr

@prk-Jr prk-Jr commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • HTTP-layer (runtime) half of edgezero stackpop/edgezero#269. Stacked on feature/ts-cli-next (which carries the #269 repin, Body fixes, the Fastly adapter migration, and config-store-backed Settings load). Draft — base is unmerged; retarget to main once ts-cli-next lands.
  • Core change (spec "option Y"): a config-store read failure (store unseeded, transient backend, or a listed key missing) now maps to a new TrustedServerError::ConfigStoreUnavailable503, while a reconstruct/verify failure (settings_from_config_entries: hash mismatch / unparseable) stays 500. One new error variant; no platform-layer change.
  • Security-aware: the actionable hint (run \ts config push``) goes to server logs (error chain); the public 503 body stays generic by design.
  • Includes the design spec + implementation plan, and the upstream finding/plan docs for context.

Base note: diff is against feature/ts-cli-next, so it shows only this branch's work. Against main it would include all of ts-cli-next.

Changes

File Change
crates/trusted-server-core/src/error.rs New ConfigStoreUnavailable { store_name, message } variant → 503 (+ exhaustiveness guard, unit test)
crates/trusted-server-core/src/settings_data.rs read_config_entry read failures → ConfigStoreUnavailable; tests (unseeded→503, malformed-hash→500, missing-listed-key→503, hint-in-chain)
crates/trusted-server-adapter-fastly/src/error.rs Test: ConfigStoreUnavailable renders 503 to client via to_error_response
crates/integration-tests/Cargo.lock Reconciled to edgezero #269
`docs/superpowers/specs plans/*`

Closes

n/a — issue linking skipped by request.

Test plan

  • cargo test --workspace (core 1376 / adapter 39, 0 fail)
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo fmt --all -- --check
  • cargo build -p trusted-server-adapter-fastly --release --target wasm32-wasip1
  • integration-tests workspace builds; docs format clean

Checklist

  • Changes follow CLAUDE.md conventions (error-stack, log, colocated tests)
  • No unwrap() in production code
  • No secrets committed
  • New code has tests

@prk-Jr prk-Jr self-assigned this Jun 18, 2026
Replace the custom trusted-server CLI lifecycle and config payload plumbing with a thin EdgeZero delegation layer using typed config push/validate flows.

Add TrustedServerAppConfig wrapper in core with deploy-time validation and move blob reconstruction into runtime helpers.

Drop flattened config entry publishing and route app-config through EdgeZero blob envelope handling while keeping edgezero flag reads in trusted_server_config.

Update CLI and architecture docs for the new model and adjust fastly adapter store selection.
@prk-Jr prk-Jr changed the base branch from feature/ts-cli-next to feature/ts-cli-audit June 24, 2026 05:45
@prk-Jr prk-Jr force-pushed the feature/edgezero-269-http branch from 76031c1 to 84f35fe Compare June 24, 2026 06:38
prk-Jr added 4 commits June 27, 2026 22:18
Reads (blob key + each chunk) map to ConfigStoreUnavailable (503); envelope/
chunk verification and settings validation stay Configuration (500). Covers the
blob/chunk load model on the updated ts-cli-audit base.
@prk-Jr prk-Jr force-pushed the feature/edgezero-269-http branch from 84f35fe to 854edef Compare June 27, 2026 16:49
@aram356 aram356 marked this pull request as ready for review June 29, 2026 15:24

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review:

Review Summary

Reviewed PR #783 against feature/ts-cli-audit, focusing on the runtime config-store load path, error-stack context/status behavior, Fastly/EdgeZero HTTP error rendering, public error-body leakage, and the added regression tests. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues in the changed code.

Findings

No blocking findings.

CI / Existing Reviews

  • GitHub checks currently show prepare integration artifacts failing, with downstream jobs skipped. The failure is dependency-lock parity drift in crates/trusted-server-integration-tests/Cargo.lock (for example anyhow, cssparser, html5ever, scraper, uuid, and related transitive dependencies). This should be resolved or confirmed as inherited from the stacked/base branch before merge.
  • Existing PR reviews: none returned by the GitHub API. Existing inline review comments: none.
  • Local focused check passed: cargo test -p trusted-server-core settings_data --target wasm32-wasip1. A local Fastly adapter-focused test could not complete because this environment has viceroy 0.16.4 while the repo pins 0.17.0.

@aram356 aram356 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Scoped, well-tested runtime hardening: a config-store read failure now maps to TrustedServerError::ConfigStoreUnavailable503 (retryable), while verify/parse failures stay 500. The real change is 4 files / +279 −13; verified locally at head (all new core + adapter tests pass). The two blocking items concern the PR's base/CI plumbing, not the code itself.

1 of the inline comments below carries a one-click GitHub suggestion. The remaining findings are body-level because they concern the PR base, CI, or code outside the diff hunks and can't be auto-applied.

Blocking

🔧 wrench

  • Wrong PR base inflates the diff to 66 files — see Cross-cutting below.
  • CI prepare integration artifacts is failing — see Cross-cutting below.

Non-blocking

⛏ nitpick

  • Chunk-read failures reuse the top-level "not seeded" hint — see inline at crates/trusted-server-core/src/settings_data.rs:95.

🤔 thinking / 🌱 seedling

  • 503 body says "An internal error occurred" — see Cross-cutting.
  • with_capacity(envelope_len) trusts an unvalidated length — see Cross-cutting.

Cross-cutting / body-level findings

  • 🔧 Wrong PR base → unreviewable diff & merge risk. The PR targets feature/ts-cli-audit, but the branch is stacked on feature/ts-cli-next (per the PR body) and is diverged from both (vs ts-cli-audit: 18 ahead / 25 behind). GitHub shows 66 files / +9,550 −1,772; the actual change is 4 files / +279 −13 (the 4 topmost commits: the ConfigStoreUnavailable variant, the read-failure reclassification, the adapter 503 lock, and the design doc). Merging against this base risks pulling ~60 unrelated files into the target, and the design-doc header ("stacked on feature/ts-cli-audit") disagrees with the PR body ("stacked on feature/ts-cli-next"). Recommend rebasing onto the true parent's current tip (so it's a clean ancestor), then retargeting the base — or main once the stack lands — so the PR shows only the 4 files.

  • 🔧 CI prepare integration artifacts is FAILING. Transitive-dependency parity drift (markup5ever, match_token, scraper, selectors, uuid, wasm-bindgen-futures, web-sys) between the workspace and trusted-server-integration-tests. The failure originates in the stacked base (integration-tests lock), not the 4-file 503 change, and isn't branch-protection-required here (base is a feature branch) — but it's red on the PR and CLAUDE.md treats it as a gate. It should clear once the branch sits on the correct, lock-reconciled parent (see above).

  • 🌱 String::with_capacity(pointer.envelope_len) (settings_data.rs:119) pre-allocates from an unvalidated length. Pre-existing and outside this diff, but squarely in this PR's "harden the config-store load" theme: a corrupted pointer with a huge envelope_len triggers a large allocation on the per-request settings-load path — a potential trap/abort instead of the graceful 503/500 this PR adds. Consider bounding envelope_len (and cross-checking it against the sum of chunk.len) before pre-allocating. Follow-up, not this PR's job.

  • 🤔 503 body reads "An internal error occurred". The generic catch-all user_message() describes a 500, but 503 is retryable. The generic body is deliberate (design §4 — no user_message arm added, which correctly avoids leaking internals), so this is only a suggestion: a 503-appropriate phrase (e.g. "Service temporarily unavailable") would let clients/monitoring distinguish retryable from terminal without leaking detail.

CI Status

  • integration tests: SKIPPED
  • integration tests (EdgeZero entry point): SKIPPED
  • browser integration tests: SKIPPED
  • prepare integration artifacts: FAIL
  • .github/dependabot.yml: PASS

Comment thread crates/trusted-server-core/src/settings_data.rs
Comment thread crates/trusted-server-core/src/error.rs
Base automatically changed from feature/ts-cli-audit to main July 1, 2026 21:09
prk-Jr added 3 commits July 2, 2026 15:15
Resolve 22 conflicts from main's EdgeZero adapter/canary evolution against
the branch's config-store 503 work.

- settings_data.rs: union the branch's ConfigStoreUnavailable (503) read
  classification with main's Fastly chunk-length hardening; keep both test sets.
- adapter-fastly/main.rs: adopt main's rollout-percentage canary and
  settings_snapshot finalize refactor (branch 503 logic lives in error.rs).
- CLI audit: adopt main's #800 implementation over the branch's parallel copy.
- Manifests/lockfiles: take main's per-platform adapters and edgezero rev e483.
- proxy.rs: drop a duplicate request_body_bytes left by a false auto-merge.

Verified: fmt, clippy (5 targets), all adapter + core + cli test suites,
integration parity, and JS vitest all pass.
- read_config_entry: reword to cover unseeded/missing/transient reads, since
  it is the shared seam for both the blob key and each chunk key.
- user_message: return 'Service temporarily unavailable' for the retryable 503
  variants (ConfigStoreUnavailable, KvStore) instead of the 500-flavored
  generic body; still leaks no internal detail. Tests updated.

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review:

Review Summary

Reviewed PR #783 against main, focusing on the config-store settings load path, HTTP status/body mapping, Fastly EdgeZero routing behavior, CI coverage, and the previously raised review threads. The core 503 classification is generally well covered and I did not find a blocking correctness, security, data-loss, authorization, or severe compatibility issue. I did find a few non-blocking issues worth addressing before merge, especially an env-config compatibility regression in store-name resolution.

Findings

P0 / Blockers

None.

P1 / High

  1. Store-name env resolution no longer honors EdgeZero fallback semantics — see inline comment on crates/trusted-server-core/src/settings_data.rs.

P2 / Medium

  1. The JA4 debug preflight still returns 500 for config-store read failurescrates/trusted-server-adapter-fastly/src/main.rs:401. This line is outside the PR diff, so I could not leave it inline, but the new ConfigStoreUnavailable behavior is incomplete on this HTTP path: GET /_ts/debug/ja4 calls load_settings_from_config_store() before app construction and hard-codes 500 Internal Server Error on every load failure instead of using e.current_context().status_code() / user_message(). In an unseeded or transient config-store case, normal EdgeZero startup responses now return 503 while this debug endpoint still returns 500. Suggested fix: map the Report<TrustedServerError> through the same IntoHttpResponse methods used by the startup/error response path.
  2. The EdgeZero integration job can pass without proving the EdgeZero entry point was used — see inline comment on crates/trusted-server-integration-tests/tests/common/ec.rs.

P3 / Low

  1. The implemented 503 body behavior contradicts the new design doc — see inline comment on docs/superpowers/specs/2026-06-27-edgezero-http-config-503-design.md.
  2. Duplicate CI/local verification steps were added — the diff adds exact duplicate host CLI clippy/test steps in .github/workflows/format.yml and .github/workflows/test.yml, duplicate Viceroy config generation in .github/workflows/integration-tests.yml / scripts/integration-tests.sh, and duplicate CLI-test docs in CLAUDE.md. This is not correctness-blocking, but it increases CI time and maintenance noise.

CI / Existing Reviews

  • GitHub checks are currently passing.
  • Existing review context: earlier automated reviews/comments raised the chunk-read hint and 503 body wording; the latest commits appear to have addressed those specific comments.
  • Local focused checks run during this review: cargo test -p trusted-server-core settings_data --target x86_64-unknown-linux-gnu, cargo test -p trusted-server-core config_store_unavailable --target x86_64-unknown-linux-gnu, and git diff --check $(git merge-base HEAD origin/main)...HEAD all passed.

Comment thread crates/trusted-server-core/src/settings_data.rs Outdated
Comment thread crates/trusted-server-integration-tests/tests/common/ec.rs
Comment thread docs/superpowers/specs/2026-06-27-edgezero-http-config-503-design.md Outdated
prk-Jr and others added 7 commits July 2, 2026 21:53
…arity

- Route default_config_store_name through EnvConfig::store_name so a blank,
  whitespace-only, or control-character EDGEZERO__STORES__CONFIG__APP_CONFIG__NAME
  override falls back to the logical id instead of opening an empty store name;
  add regression tests via the new config_store_name_from seam.
- Render EdgeZero JA4 preflight settings-load failures through to_error_response
  so an unseeded/transient config store returns 503 there like every other path,
  instead of a hard-coded 500.
- Update the design doc: 503 body is now the shared retry-flavored
  user_message arm; note the local chunk resolver duplicates edgezero's
  crate-private chunked_config and the upstream-export path to delete it.
The merge of main doubled the host-target CLI clippy/test steps in
format.yml and test.yml, the Viceroy config generation step in
integration-tests.yml and scripts/integration-tests.sh, and the
host-target CLI test snippet in CLAUDE.md. Keep one copy of each.
The config store the HTTP layer bootstraps from is opened by a hardcoded
logical id (`app_config` = DEFAULT_CONFIG_STORE_ID/CONFIG_BLOB_KEY) before
Settings load, and `ts config push` targets `[stores.config].default`. The
unified-manifest rewrite had renamed the config store to
`trusted_server_config`, so push and provision would target a store the
runtime never reads, 503-ing every request on a fresh deploy. Restore the
config id to `app_config` (matching main, the runtime, and integration
seeding) and add a guard comment. Revert the kv/secrets ids to main
(`ec_identity_store`, `secrets`) too, updating the Spin manifest test lock.

Also drop drift the branch introduced incidentally:
- Re-exclude `trusted-server-openrtb-codegen` from the workspace (main
  excludes it; dropping the exclude orphaned it so its codegen `cargo run`
  failed with "believes it's in a workspace when it's not").
- Remove the unused `get_settings_from_services` (no callers; not on main).
- Revert `body_as_reader` in proxy.rs/publisher.rs to main's signature; the
  branch had wrapped it in an infallible `Result`, inconsistent with the
  other `into_bytes().unwrap_or_default()` sites.
The Spin adapter compiled the example config in at build time
(`include_str!`), so it ignored `ts config push --adapter spin` and could
never serve real operator config. Load Settings at startup from a Spin
key-value store instead, matching the Fastly/Axum config-store path.

- Add `SpinKvConfigStore`, a Spin key-value-backed `PlatformConfigStore`
  (async `spin_sdk::key_value`, wasm+spin gated). This is distinct from the
  variable-backed `ConfigStoreHandleAdapter`: a multi-kilobyte app-config
  blob does not fit Spin's flat variable namespace.
- `build_state` now loads via `get_settings_from_config_store` on the Spin
  runtime (store id and blob key both `app_config`); native test builds keep
  the embedded example config since Spin host KV is unavailable there.
- Grant the component the `app_config` KV label in spin.toml and declare its
  backend in a new runtime-config.toml; `serve` passes `--runtime-config-file`.
- Unseeded/unreadable store fails closed (503) via the startup error router.
- Ignore `.spin/` local runtime state; update the architecture doc.

Verified: wasm build, native tests, clippy (wasm + native), and a live
`spin up` unseeded run (503). The seeded read is verified on every local
layer (the pushed blob is a valid envelope at store/key `app_config`) but
not live end-to-end: the installed Spin CLI (4.0) is newer than the Spin 2-3
SQLite KV schema that `ts config push --local` writes, so a Spin-2/3 runtime
(or an edgezero-cli update) is needed to confirm seeded -> 200.
Two gaps surfaced while validating the Spin config-store load path, both in
how the Spin adapter reports a failed startup:

- The adapter had no `log` backend (`edgezero_adapter_spin` only calls a no-op
  `init_logger`), so every `log::error!` — including the startup-error chain —
  was silently dropped, making a config-load failure undebuggable. Install a
  minimal stderr `log` backend before dispatch; Spin captures component stderr
  to `.spin/logs/` and the Fermyon Cloud log stream. Confirmed live: the
  config-store read error now prints with its full error-stack chain.

- `startup_error_router` hard-coded 503 for every startup failure, so a
  blob that reads but fails envelope/settings verification (a 500-class
  `Configuration` error) was misreported as a retryable 503. Map the response
  to the error's `status_code()` with the generic `user_message()` body,
  matching the Fastly/Axum startup routers. Add a test that a verify failure
  maps to 500, and make the existing method-coverage test use a realistic
  `ConfigStoreUnavailable` (it previously built a 400 but asserted 503, which
  only passed because of the hard-code).
@prk-Jr prk-Jr requested a review from ChristianPavilonis July 3, 2026 12:50

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review:

Review Summary

Reviewed PR #783 at head 03fc21c00528544717b398d481556842e3757026 against main, focusing on the config-store load/error-classification changes, Fastly/Axum/Spin HTTP response behavior, Spin runtime config-store integration, manifests, integration workflow changes, and prior review feedback. Overall risk looks moderate: CI is green and the main 503-vs-500 split is well covered, but I found one non-blocking correctness gap in the new Spin KV-backed app-config path where corrupt stored bytes are reported as retryable store unavailability.

Findings

P0 / Blockers

None.

P1 / High

None.

P2 / Medium

  1. Spin app-config values that are present but not UTF-8 are classified as retryable 503s — see inline comment on crates/trusted-server-adapter-spin/src/platform.rs.

P3 / Low

None.

CI / Existing Reviews

  • GitHub checks are currently passing.
  • Existing review context: the previously raised store-name fallback, JA4 preflight status mapping, design-doc body mismatch, duplicate CI-step drift, and 503 body wording comments appear addressed by later commits. The earlier EdgeZero-entry-point integration probe comment remains a non-blocking coverage concern, but I am not duplicating it here.
  • I did not run additional local test commands beyond inspecting the diff and CI results.

"key `{key}` not found in Spin key-value store `{label}`"
))
})?;
String::from_utf8(bytes).map_err(|e| {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review: Present-but-corrupt Spin app-config bytes are reported as retryable store unavailability.

This String::from_utf8 failure means the app_config key was successfully opened and read, but the stored value is corrupt/not a valid text config blob. Because the error is returned as PlatformError::ConfigStore, read_config_entry() wraps it in TrustedServerError::ConfigStoreUnavailable, so Spin returns the new retryable 503 path instead of the intended 500-class "read succeeded but reconstruct/verify failed" path. In practice, an operator who accidentally seeds binary/corrupt bytes gets misleading retryable behavior and clients may keep retrying a terminal bad config.

Suggested fix: keep open/get/missing-key failures on the 503 path, but make a present value that cannot be decoded feed the verification/configuration path (for example, have this adapter convert invalid UTF-8 into a TrustedServerError::Configuration, or decode lossily/otherwise pass bytes forward so settings_from_config_blob fails as a 500), and add a Spin-specific regression test for non-UTF-8 bytes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants