fix: preserve leading newline of multiline string built with string()#551
Open
Sanjays2402 wants to merge 1 commit into
Open
fix: preserve leading newline of multiline string built with string()#551Sanjays2402 wants to merge 1 commit into
Sanjays2402 wants to merge 1 commit into
Conversation
`tomlkit.string(value, multiline=True)` rendered the value as
`"""<value>"""`. When `value` starts with a newline the output looks
like `"""\n..."""`, and per the TOML spec a newline immediately
following the opening delimiter is trimmed on parse. So a value that
begins with a newline was silently lost on the very next round-trip:
>>> s = tomlkit.string("\nfoo", multiline=True)
>>> str(s)
'\nfoo'
>>> s.as_string()
'"""\nfoo"""'
>>> str(tomlkit.parse(f"k = {s.as_string()}")["k"])
'foo' # leading newline dropped
The value the String object reports is correct; only its serialized
form was wrong, so the corruption only surfaced after re-parsing.
Both tomllib and tomlkit's own parser agree the old output decodes to
the wrong value, so this is a genuine serialization bug (not a TOML
1.0 vs 1.1 difference). Affects basic and literal multiline strings,
including values that start with `\r\n`.
Fix: in `String.from_raw`, when the string is multiline and its
rendered body starts with a newline, emit an extra leading newline
(the one the parser trims) so the value survives round-tripping. This
is exactly how the spec suggests representing such a value. Single-line
strings and multiline strings without a leading newline are unchanged.
An existing `test_create_string` case asserted the old lossy output
(`"""\nMy\nString\n"""`); it only checked `as_string()`, never a
round-trip, so it had locked in the bug. Updated it to the correct,
round-trippable rendering and added a dedicated round-trip regression
test covering basic/literal multiline and `\r\n`-leading values.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
tomlkit.string(value, multiline=True)produces output that silently loses a leading newline on round-trip.Per the TOML spec: "A newline immediately following the opening delimiter will be trimmed." So
"""\nfoo"""decodes tofoo, not\nfoo. TheStringobject reports the right value, but its rendered form is wrong, so the data loss only appears after the output is parsed again.This is a genuine serialization bug, not a TOML 1.0 vs 1.1 difference — both
tomllib(1.0.0) and tomlkit's own parser (1.1.0) decode the old output to the wrong value:string(value, multiline=True).as_string()"\n"→"""\n"""""❌"\n"✅"\nabc"→"""\nabc""""abc"❌"\nabc"✅"\r\nx"→"""\r\nx""""x"❌"\r\nx"✅"abc\n"→"""abc\n""""abc\n"✅ (unchanged)"abc\n"✅Affects both basic (
""") and literal (''') multiline strings, and values starting with\r\n.Root cause
In
String.from_raw(tomlkit/items.py), the rendered body is placed directly between the delimiters. For a multiline type,\n/\rare intentionally kept literal (not escaped) for readability — but nothing accounts for the parser's rule that trims the first newline after the opening delimiter, so a body beginning with a newline is one newline short after re-parsing.Fix
When the string is multiline and its rendered body begins with a newline, emit an extra leading newline — the one the parser trims — so the intended value survives the round-trip. This is exactly the representation the spec prescribes for such values. Single-line strings and multiline strings that don't start with a newline are unchanged (surgical, +7 lines).
Tests
test_create_stringcase asserted the old lossy output ("""\nMy\nString\n"""). It only checkedas_string(), never a round-trip, so it had locked in the bug. Updated it to the correct, round-trippable rendering.test_create_multiline_string_with_leading_newline_round_trips, covering basic/literal multiline and\r\n-leading values, assertingparse(render(v)) == v.Proof the regression test guards the fix (source change stashed, tests kept):
Full suite: 1040 passed (was 1035),
ruff checkclean,ruff format --checkclean on changed files.