feat(examples): end-to-end Spark bridge demo (6/6) #109
Draft
timsaucer wants to merge 6 commits into
…e-common Move the standalone `native` crate into a root Cargo workspace and extract shared JNI plumbing (error->exception mapping, Tokio runtime singleton, StreamingReader) into a new `datafusion-jni-common` crate under `native-common/`. `native/src/errors.rs` moves to `native-common/src/errors.rs`; the nine native modules now import error/runtime helpers from `datafusion_jni_common`. Build glue follows: single root `Cargo.lock`, `.cargo/config.toml` redirects output to `rust-target/`, Makefile/CI/poms updated to build `--workspace` and target `-p datafusion-jni`. Core javadoc build commands updated to match. Pure refactor; no behavior change. First of a 6-PR stack splitting the Spark DataSource V2 connector work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New `spark/bridge` workspace crate providing the `export_bridge!` macro that generates the six JNI entry points a Spark connector bridge exposes (providerSchemaIpc, createScan, partitionCount, executeStreamPartition, executeStream, closeScan). Includes the options decoder, scan planning/execution glue, and the Arrow type-widening layer (wraps any TableProvider for Spark type compatibility). Self-contained SDK with no Java/Scala coupling. Depends only on datafusion-jni-common. Second of the 6-PR connector stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Introduce the `spark` Maven module and the pure-Java contracts a bridge implements: BridgeProviderFactory (no-arg factory + scanBackend()), ScanBackend (delegates to the bridge's JNI methods), NativeLibraryLoader (cdylib extraction/loading), OptionsCodec (cross-language options encoder), PartitionInfo (one entry per Spark task), and ReportedPartitioning (optional shuffle-elision declaration). Compiles standalone with no Scala main yet. Includes the two SPI-only tests (OptionsCodecTest, BridgeProviderFactoryDefaultsTest). Third of the 6-PR stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The connector implementation on top of the Java SPI and the bridge SDK: DatafusionSource/Table/Scan/ScanBuilder DSv2 wiring, per-partition columnar read path (FfiStream + Arrow->Spark batch conversion), V2 predicate pushdown (SparkPredicateTranslator), shared-scan mode with a per-executor refcounted cache (SharedScanCache, SharedScanPartitionReader, NativeSharedScanResources, PinnedSessionConfig), and SupportsReportPartitioning for shuffle elision. These pieces share the DatafusionScanMode sealed trait and the scan builder, so they land together. Includes the connector test suite and the module README. DataSourceRegister SPI file registers DatafusionSource. Fourth of the 6-PR stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `spark/scaffold/new_bridge.py` plus the `bridge-template/` it stamps out: a standalone Maven+Cargo bridge project wired to the datafusion-spark-bridge SDK — a Rust cdylib with `export_bridge!` + a demo in-memory provider, the four Java classes, the DataSourceRegister service file, a shaded-jar pom that bundles the cdylib, and a pyspark smoke test. Stdlib-only generator. Standalone tooling. Fifth of the 6-PR stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a worked example exercising the full stack: examples/native cdylib (a demo provider built on datafusion-spark-bridge, added as a workspace member), the ExampleBridge Java classes implementing the SPI, a pyspark bridge_demo.py end-to-end smoke test, and READMEs. Validates connector + bridge + SDK together. Last of the 6-PR stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
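For readers unfamiliar with the shared-scan mode described in the fourth commit, here is a minimal sketch of the general idea behind a refcounted cache: the first task to ask for a scan opens it, later tasks reuse it, and the last release closes it. This is not the connector's code. `RefCountedCache`, `open_fn`, and `close_fn` are hypothetical names chosen for illustration, and the real `SharedScanCache` may be structured quite differently.

```python
import threading
from typing import Callable, Dict, Generic, Hashable, List, TypeVar

T = TypeVar("T")


class RefCountedCache(Generic[T]):
    """Hypothetical stand-in for a per-process refcounted resource cache."""

    def __init__(self, open_fn: Callable[[Hashable], T],
                 close_fn: Callable[[T], None]) -> None:
        self._open = open_fn      # assumed: creates the shared resource for a key
        self._close = close_fn    # assumed: tears the shared resource down
        self._lock = threading.Lock()
        self._entries: Dict[Hashable, List] = {}  # key -> [resource, refcount]

    def acquire(self, key: Hashable) -> T:
        """Return the shared resource for key, opening it on first use."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = [self._open(key), 0]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key: Hashable) -> None:
        """Drop one reference; close the resource when the count hits zero."""
        with self._lock:
            entry = self._entries[key]
            entry[1] -= 1
            if entry[1] == 0:
                del self._entries[key]
                self._close(entry[0])
```

In this sketch every partition reader would call `acquire` when it starts and `release` when it finishes, so the underlying resource lives exactly as long as at least one reader holds it. A single lock keeps the example short; it also means `open_fn` runs while the lock is held, which a production cache might want to avoid.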
This was referenced Jun 12, 2026
Repo-layout docs follow-up (from #104 review): #104 trimmed
timsaucer added a commit to timsaucer/datafusion-java that referenced this pull request Jun 12, 2026
Address review feedback on the workspace-foundation PR:

- development.md: trim the repo-layout section to the crates this PR actually ships (native, native-common). It was forward-referencing spark/, spark/bridge, datafusion-spark-bridge, and examples/native -- none of which exist until later PRs in the stack -- and called the member list "three" while listing four. Later PRs (apache#105/apache#106/apache#107/apache#109) carry notes to re-add their own slice when those dirs land.
- rat_exclude_files.txt: the Rust lockfile moved to the workspace root, so the stale native/Cargo.lock entry left the root Cargo.lock with no RAT exclude for the source-tarball check (check-rat-report.py). Point it at Cargo.lock.
- native-common: dedupe the panic-payload downcast -- StreamingReader::next now calls errors::panic_message instead of repeating the String/&str match inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Purpose
A worked example exercising the whole stack end to end: an `examples/native` cdylib (demo provider on the SDK, added as a workspace member), the `ExampleBridge` Java classes implementing the SPI, and a pyspark `bridge_demo.py` smoke test with READMEs. The reference integration.

🤖 Generated with Claude Code