fix(sqlite): correct StmtLocation/StmtLen for non-ASCII characters in comments #4314

Open
mem wants to merge 1 commit into sqlc-dev:main from mem:mem/fix-comment-parsing

mem commented Feb 25, 2026

ANTLR4-go stores its input stream as []rune, so all token positions returned by GetStart().GetStart() and GetStop().GetStop() are rune indices, not byte offsets. The SQLite parser was storing these values directly as StmtLocation and StmtLen, which are later consumed by source.Pluck() using byte-based Go string slicing (source[head:tail]).

For source files that contain multi-byte UTF-8 (non-ASCII) characters in comments, the rune index diverges from the byte offset, so the plucked query text is truncated. Each n-byte character drops n-1 bytes from the end of the query: a 2-byte character (e.g. Ü, é) drops one byte, a 3-byte character (e.g. ♥) drops two, and so on.
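
To illustrate the truncation, here is a minimal standalone sketch. It does not use sqlc or ANTLR; naivePluck is a hypothetical helper that only mimics the bug by treating a rune count as a byte length when slicing.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// naivePluck is a hypothetical stand-in for the buggy behavior: it takes the
// statement length in runes (what a []rune-based lexer reports) and uses it
// directly as a byte length when slicing the source string.
func naivePluck(src string) string {
	runeLen := utf8.RuneCountInString(src)
	return src[:runeLen]
}

func main() {
	// One 2-byte character in the comment: the trailing ';' is lost.
	fmt.Printf("%q\n", naivePluck("-- Ü\nSELECT 1;")) // "-- Ü\nSELECT 1"

	// One 3-byte character in the comment: the trailing "1;" is lost.
	fmt.Printf("%q\n", naivePluck("-- ♥\nSELECT 1;")) // "-- ♥\nSELECT "
}
```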

Fix this by building a map from rune index to byte offset from the source string before processing the ANTLR parse tree, then converting the ANTLR rune positions to byte offsets before storing them in the AST. The internal loc tracking variable continues to use rune indices (for consistency with the ANTLR token positions); only the values written into StmtLocation and StmtLen are converted to byte offsets.
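
The conversion could look roughly like the sketch below. This is not the code from the commit: runeToByteOffsets and byteSpan are hypothetical names, and the inclusive stop index is an assumption modelled on how GetStop().GetStop() is described above.

```go
package main

import "fmt"

// runeToByteOffsets returns a slice m where m[i] is the byte offset of the
// i-th rune of src. A final entry equal to len(src) is appended so that an
// exclusive end position can be looked up as well.
func runeToByteOffsets(src string) []int {
	offsets := make([]int, 0, len(src)+1)
	for byteOff := range src { // range over a string yields rune start offsets
		offsets = append(offsets, byteOff)
	}
	return append(offsets, len(src))
}

// byteSpan converts an inclusive rune range [runeStart, runeStop] into a
// byte location and byte length suitable for src[loc : loc+length].
func byteSpan(offsets []int, runeStart, runeStop int) (loc, length int) {
	loc = offsets[runeStart]
	length = offsets[runeStop+1] - loc
	return loc, length
}

func main() {
	src := "-- ♥\nSELECT 1;\nSELECT 2;"
	offsets := runeToByteOffsets(src)

	// First statement (with its comment): runes 0..13.
	loc, n := byteSpan(offsets, 0, 13)
	fmt.Printf("%d %d %q\n", loc, n, src[loc:loc+n]) // 0 16 "-- ♥\nSELECT 1;"

	// Second statement: runes 15..23, which is bytes 17..25.
	loc, n = byteSpan(offsets, 15, 23)
	fmt.Printf("%d %d %q\n", loc, n, src[loc:loc+n]) // 17 9 "SELECT 2;"
}
```

Building the table once per source string keeps each lookup constant-time, at the cost of one int per rune.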

Add TestParseNonASCIIComment covering 2-, 3-, and 4-byte characters in dash comments, multiple non-ASCII characters, and the multi-statement case where an incorrect loc for one statement would propagate and corrupt the StmtLocation of the following statement.

dosubot bot added the labels size:L (This PR changes 100-499 lines, ignoring generated files.) and 🔧 golang on Feb 25, 2026