Skip to content

fix: drop partial multibyte char when write length is too small#374

Open
spokodev wants to merge 1 commit into
feross:masterfrom
spokodev:fix/write-explicit-length-partial-char
Open

fix: drop partial multibyte char when write length is too small#374
spokodev wants to merge 1 commit into
feross:masterfrom
spokodev:fix/write-explicit-length-partial-char

Conversation

@spokodev

Copy link
Copy Markdown

Problem

Buffer#write(string, offset, length, encoding) with an explicit length smaller than the next character's byte length writes a partial, corrupt multibyte sequence and over-reports the byte count.

The Node docs state: "If buf did not contain enough space to fit the entire string, only part of string will be written. However, partially encoded characters will not be written." The return value counts only complete characters.

Repro (node Buffer oracle vs this shim)

const Node = require('buffer').Buffer   // built-in oracle
const Shim = require('buffer/')         // this package

// 'é' is 2 utf8 bytes (c3 a9); length 1 cannot hold it
Node.alloc(8, 0).write('é', 0, 1, 'utf8')  // 0, buffer stays 00 00 ...
Shim.alloc(8, 0).write('é', 0, 1, 'utf8')  // 1, buffer = c3 00 ...   (lone lead byte)

// '€' is 3 utf8 bytes (e2 82 ac); length 2 cannot hold it
Node.alloc(8, 0).write('€', 0, 2, 'utf8')  // 0, buffer stays 00 00 ...
Shim.alloc(8, 0).write('€', 0, 2, 'utf8')  // 2, buffer = e2 82 00 ...

// 'ab' utf16le is 4 bytes; length 3 holds one full 2-byte unit only
Node.alloc(8, 0).write('ab', 0, 3, 'utf16le')  // 2, buffer = 61 00 00 ...
Shim.alloc(8, 0).write('ab', 0, 3, 'utf16le')  // 3, buffer = 61 00 62 ...

Root cause

utf8Write and ucs2Write passed buf.length - offset (the whole remaining buffer) as the byte cap to utf8ToBytes / utf16leToBytes, ignoring the explicit length:

function utf8Write (buf, string, offset, length) {
  return blitBuffer(utf8ToBytes(string, buf.length - offset), buf, offset, length)
}

So the encoders always generated the full multibyte sequence for the next character, and blitBuffer then copied only length bytes of it, truncating the character mid-sequence and returning the truncated byte count.

utf8ToBytes and utf16leToBytes already cap output atomically per character via their units argument (a multibyte char is pushed only if all of its bytes fit). Passing the effective length as that cap makes the encoder drop an incomplete trailing character instead of emitting a partial one.

Fix

function utf8Write (buf, string, offset, length) {
  return blitBuffer(utf8ToBytes(string, length), buf, offset, length)
}

function ucs2Write (buf, string, offset, length) {
  return blitBuffer(utf16leToBytes(string, length), buf, offset, length)
}

Verification

  • Added a test in test/write.js asserting node parity for the cases above. It fails on master (6 failing assertions) and passes with the fix.
  • Full suite green: npm test reports all assertions passing.
  • Differential sweep against the built-in Buffer over 5 encodings, 18 strings (ASCII, multibyte, surrogate pairs, emoji, lone surrogates), offsets 0 to 4, and lengths 0 to 12 (5940 cases): zero mismatches in return value and buffer contents.

Closed issue #48 covered the implicit length case (a buffer too small for the whole string, no length argument); that path was already fixed by the buf.length - offset cap. This is the distinct case where an explicit length argument is smaller than a single character.

Buffer#write(string, offset, length, encoding) with an explicit length
smaller than the next character's byte length wrote a partial, corrupt
multibyte sequence and over-reported the byte count.

utf8Write/ucs2Write passed buf.length - offset (the whole remaining
buffer) as the byte cap to utf8ToBytes/utf16leToBytes, so a full
multibyte char was always generated and then byte-truncated mid-sequence
by blitBuffer. Node guarantees partially encoded characters are never
written and counts only complete characters in the return value.

Pass the effective length as the cap so an incomplete trailing character
is dropped, matching node Buffer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant