Skip to content

Index GRIB1 messages in scan_grib (edition-aware _split_file)#587

Open
atverm wants to merge 1 commit into
fsspec:mainfrom
atverm:grib1-edition-aware-split
Open

Index GRIB1 messages in scan_grib (edition-aware _split_file)#587
atverm wants to merge 1 commit into
fsspec:mainfrom
atverm:grib1-edition-aware-split

Conversation

@atverm

@atverm atverm commented Jun 21, 2026

Copy link
Copy Markdown

Summary

scan_grib currently indexes GRIB2 only — GRIB1 messages are silently skipped. This teaches the message splitter to read both editions, so GRIB1 (and mixed GRIB1/GRIB2) files work. References point at the original GRIB1 bytes; no conversion or companion file is needed.

Refs #358 — this is very likely the cause behind "variables present in cfgrib but missing from scan_grib"; maintainers can confirm/close if that file is mixed-edition.

Why

_split_file locates message boundaries by reading the total-length field from the GRIB2 indicator section (64-bit length, low word at bytes 12–15). GRIB1 stores its length as a 24-bit value at bytes 4–6 (plus ECMWF's "large message" extension), so a GRIB1 message gets a wrong length, the scan desyncs, and the message is dropped.

Crucially, the decode path is already edition-agnostic — GRIBCodececcodes.codes_new_from_message reads GRIB1 and GRIB2 alike — so only the splitter needs fixing; no new dependency and no change to how chunks are decoded.

This matters because a lot of archived data is still GRIB1 (ERA-Interim, older ECMWF/NCEP/JRA reanalyses, many NWP archives), and rewriting it to GRIB2 just to use kerchunk is wasteful.

What changed

kerchunk/grib2.py::_split_file now:

  • reads the edition byte (offset 7) and parses the length per edition — GRIB1: 24-bit at bytes 4–6 with the ECMWF & 0x800000 → × 120 large-message rule; GRIB2: full 64-bit length at bytes 8–15 (widened from the low 32 bits — identical for < 4 GiB, correct above);
  • breaks on a final partial read with no GRIB marker. This fixes a latent infinite loop: trailing bytes after the last message make the existing seek(-4) "marker may straddle the read boundary" branch bounce between size-4 and size. It was masked before because the GRIB2-only mis-parse of GRIB1 messages happened to land elsewhere at EOF.

Testing

Adds two cases to tests/test_grib.py that build GRIB1 and mixed GRIB1+GRIB2 samples with eccodes (regular_ll_sfc_grib1 / regular_ll_sfc_grib2) and assert the messages are indexed and decode.

On a real ECMWF file mixing GRIB2 model levels with GRIB1 surface fields:

before after
scan_grib(file) 470 refs (GRIB2 only) 497 (all messages; matches eccodes)
scan_grib(file, filter={'shortName': '2t'}) 0 1 — chunk [file, 94441288, 195588], decodes to 6.9 °C

Notes for reviewers

  • The scaled-length branch mirrors eccodes' own GRIB1 length logic but only triggers for messages > ~8 MiB, which the small fixtures don't reach. Happy to add a large-message fixture if you'd like it covered.
  • An alternative is to delegate framing to eccodes (already a scan_grib dependency) — robust against every edge case but a larger behavioural change. This keeps the existing byte-scan approach and the diff small; glad to switch if you'd prefer the eccodes route.

_split_file read the total-length field from the GRIB2 indicator section only
(64-bit length, low word at bytes 12-15), so GRIB1 messages (24-bit length at
bytes 4-6, plus ECMWF's scaled-length extension) got a wrong length, desynced
the scan, and were dropped. The decode path (GRIBCodec -> eccodes) already
reads both editions, so only the splitter needed to be edition-aware.

Also adds an EOF guard: trailing bytes after the last message could spin the
seek(-4) "marker may straddle the boundary" branch forever (latent; it only
surfaced once GRIB1 messages were parsed correctly).

Refs fsspec#358

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@martindurant

Copy link
Copy Markdown
Member

What a long summary for a small code change :).

I don't know about GRIB1, I'd be glad if someone more familiar could chime in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants