Index GRIB1 messages in scan_grib (edition-aware _split_file) #587
Open
atverm wants to merge 1 commit into
_split_file read the total-length field from the GRIB2 indicator section only (64-bit length, low word at bytes 12-15), so GRIB1 messages (24-bit length at bytes 4-6, plus ECMWF's scaled-length extension) got a wrong length, desynced the scan, and were dropped. The decode path (GRIBCodec -> eccodes) already reads both editions, so only the splitter needed to be edition-aware.
Also adds an EOF guard: trailing bytes after the last message could spin the seek(-4) "marker may straddle the boundary" branch forever (latent; it only surfaced once GRIB1 messages were parsed correctly).
Refs fsspec#358
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Member
What a long summary for a small code change :). I don't know about GRIB1, I'd be glad if someone more familiar could chime in.
Summary
scan_grib currently indexes GRIB2 only; GRIB1 messages are silently skipped. This teaches the message splitter to read both editions, so GRIB1 (and mixed GRIB1/GRIB2) files work. References point at the original GRIB1 bytes; no conversion or companion file is needed.
Refs #358: this is very likely the cause behind "variables present in cfgrib but missing from scan_grib"; maintainers can confirm/close if that file is mixed-edition.
Why
_split_file locates message boundaries by reading the total-length field from the GRIB2 indicator section (64-bit length, low word at bytes 12–15). GRIB1 stores its length as a 24-bit value at bytes 4–6 (plus ECMWF's "large message" extension), so a GRIB1 message gets a wrong length, the scan desyncs, and the message is dropped.
Crucially, the decode path is already edition-agnostic: GRIBCodec → eccodes.codes_new_from_message reads GRIB1 and GRIB2 alike, so only the splitter needs fixing; no new dependency and no change to how chunks are decoded.
This matters because a lot of archived data is still GRIB1 (ERA-Interim, older ECMWF/NCEP/JRA reanalyses, many NWP archives), and rewriting it to GRIB2 just to use kerchunk is wasteful.
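The two length layouts described above can be illustrated with a minimal in-memory sketch. This is not kerchunk's _split_file: the names message_length and split_messages are hypothetical, the loop works on a bytes object rather than a file, and the GRIB1 large-message branch applies only the × 120 scaling mentioned in this PR (any further correction that ECMWF's convention may require is deliberately not modelled).

```python
import struct


def message_length(header: bytes) -> int:
    """Return the total message length encoded in a GRIB indicator section.

    Hypothetical helper for illustration; `header` is the start of a message.
    """
    if len(header) < 8 or header[:4] != b"GRIB":
        raise ValueError("not a GRIB indicator section")
    edition = header[7]  # byte 7 holds the edition in both GRIB1 and GRIB2
    if edition == 1:
        # GRIB1: 24-bit big-endian length at bytes 4-6.
        length = int.from_bytes(header[4:7], "big")
        if length & 0x800000:
            # Assumption: simplified large-message rule, x120 scaling only.
            length = (length & 0x7FFFFF) * 120
        return length
    if edition == 2:
        if len(header) < 16:
            raise ValueError("truncated GRIB2 indicator section")
        # GRIB2: full 64-bit big-endian length at bytes 8-15.
        return struct.unpack(">Q", header[8:16])[0]
    raise ValueError(f"unsupported GRIB edition {edition}")


def split_messages(data: bytes):
    """Yield (offset, length) for each GRIB1 or GRIB2 message in `data`."""
    pos = 0
    while True:
        start = data.find(b"GRIB", pos)
        if start == -1:
            return  # EOF guard: no further marker, so stop instead of retrying
        try:
            length = message_length(data[start:start + 16])
        except ValueError:
            pos = start + 4  # not a usable header; resume after this marker
            continue
        if length < 8 or start + length > len(data):
            pos = start + 4  # implausible length; always move forward
            continue
        yield start, length
        pos = start + length
```

Every branch of the loop either returns or advances pos, which is the property the EOF guard in this PR is meant to restore: trailing bytes after the last message end the scan rather than being re-read.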
What changed
kerchunk/grib2.py::_split_file now:
- reads the total length according to the message's edition. GRIB1: 24-bit length at bytes 4–6, with ECMWF's & 0x800000 → × 120 large-message rule; GRIB2: full 64-bit length at bytes 8–15 (widened from the low 32 bits; identical for < 4 GiB, correct above);
- stops at EOF when there is no further GRIB marker. This fixes a latent infinite loop: trailing bytes after the last message make the existing seek(-4) "marker may straddle the read boundary" branch bounce between size-4 and size. It was masked before because the GRIB2-only mis-parse of GRIB1 messages happened to land elsewhere at EOF.
Testing
Adds two cases to tests/test_grib.py that build GRIB1 and mixed GRIB1+GRIB2 samples with eccodes (regular_ll_sfc_grib1 / regular_ll_sfc_grib2) and assert the messages are indexed and decode.
On a real ECMWF file mixing GRIB2 model levels with GRIB1 surface fields:
- scan_grib(file): … eccodes)
- scan_grib(file, filter={'shortName': '2t'}): [file, 94441288, 195588], decodes to 6.9 °C
Notes for reviewers
… scan_grib dependency): robust against every edge case, but a larger behavioural change. This keeps the existing byte-scan approach and the diff small; glad to switch if you'd prefer the eccodes route.