Index GRIB1 messages in scan_grib (edition-aware _split_file) by atverm · Pull Request #587 · fsspec/kerchunk

atverm · 2026-06-21T00:30:07Z

Summary

scan_grib currently indexes GRIB2 only — GRIB1 messages are silently skipped. This teaches the message splitter to read both editions, so GRIB1 (and mixed GRIB1/GRIB2) files work. References point at the original GRIB1 bytes; no conversion or companion file is needed.

Refs #358 — this is very likely the cause behind "variables present in cfgrib but missing from scan_grib"; maintainers can confirm/close if that file is mixed-edition.

Why

_split_file locates message boundaries by reading the total-length field from the GRIB2 indicator section (64-bit length, low word at bytes 12–15). GRIB1 stores its length as a 24-bit value at bytes 4–6 (plus ECMWF's "large message" extension), so a GRIB1 message gets a wrong length, the scan desyncs, and the message is dropped.

Crucially, the decode path is already edition-agnostic — GRIBCodec → eccodes.codes_new_from_message reads GRIB1 and GRIB2 alike — so only the splitter needs fixing; no new dependency and no change to how chunks are decoded.

This matters because a lot of archived data is still GRIB1 (ERA-Interim, older ECMWF/NCEP/JRA reanalyses, many NWP archives), and rewriting it to GRIB2 just to use kerchunk is wasteful.

What changed

kerchunk/grib2.py::_split_file now:

reads the edition byte (offset 7) and parses the length per edition — GRIB1: 24-bit at bytes 4–6 with the ECMWF & 0x800000 → × 120 large-message rule; GRIB2: full 64-bit length at bytes 8–15 (widened from the low 32 bits — identical for < 4 GiB, correct above);
breaks on a final partial read with no GRIB marker. This fixes a latent infinite loop: trailing bytes after the last message make the existing seek(-4) "marker may straddle the read boundary" branch bounce between size-4 and size. It was masked before because the GRIB2-only mis-parse of GRIB1 messages happened to land elsewhere at EOF.

Testing

Adds two cases to tests/test_grib.py that build GRIB1 and mixed GRIB1+GRIB2 samples with eccodes (regular_ll_sfc_grib1 / regular_ll_sfc_grib2) and assert the messages are indexed and decode.

On a real ECMWF file mixing GRIB2 model levels with GRIB1 surface fields:

	before	after
`scan_grib(file)`	470 refs (GRIB2 only)	497 (all messages; matches `eccodes`)
`scan_grib(file, filter={'shortName': '2t'})`	0	1 — chunk `[file, 94441288, 195588]`, decodes to 6.9 °C

Notes for reviewers

The scaled-length branch mirrors eccodes' own GRIB1 length logic but only triggers for messages > ~8 MiB, which the small fixtures don't reach. Happy to add a large-message fixture if you'd like it covered.
An alternative is to delegate framing to eccodes (already a scan_grib dependency) — robust against every edge case but a larger behavioural change. This keeps the existing byte-scan approach and the diff small; glad to switch if you'd prefer the eccodes route.

_split_file read the total-length field from the GRIB2 indicator section only (64-bit length, low word at bytes 12-15), so GRIB1 messages (24-bit length at bytes 4-6, plus ECMWF's scaled-length extension) got a wrong length, desynced the scan, and were dropped. The decode path (GRIBCodec -> eccodes) already reads both editions, so only the splitter needed to be edition-aware. Also adds an EOF guard: trailing bytes after the last message could spin the seek(-4) "marker may straddle the boundary" branch forever (latent; it only surfaced once GRIB1 messages were parsed correctly). Refs fsspec#358 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

martindurant · 2026-06-22T19:34:44Z

What a long summary for a small code change :).

I don't know about GRIB1, I'd be glad if someone more familiar could chime in.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Index GRIB1 messages in scan_grib (edition-aware _split_file)#587

Index GRIB1 messages in scan_grib (edition-aware _split_file)#587
atverm wants to merge 1 commit into
fsspec:mainfrom
atverm:grib1-edition-aware-split

atverm commented Jun 21, 2026

Uh oh!

martindurant commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

atverm commented Jun 21, 2026

Summary

Why

What changed

Testing

Notes for reviewers

Uh oh!

martindurant commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants