fix(sqlite): correct StmtLocation/StmtLen for non-ASCII characters in comments by mem · Pull Request #4314 · sqlc-dev/sqlc

mem · 2026-02-25T02:19:58Z

ANTLR4-go stores its input stream as []rune, so all token positions returned by GetStart().GetStart() and GetStop().GetStop() are rune indices, not byte offsets. The SQLite parser was storing these values directly as StmtLocation and StmtLen, which are later consumed by source.Pluck() using byte-based Go string slicing (source[head:tail]).

For source files that contain multi-byte UTF-8 characters (non-ASCII) in comments, the rune index diverges from the byte offset, causing the plucked query text to be truncated. Each 2-byte character (e.g. Ü, é) caused one byte to be dropped from the end of the query; each 3-byte character (e.g. ♥) caused two bytes to be dropped; and so on.
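The truncation can be reproduced outside sqlc with plain string slicing. The following is a minimal sketch, not the parser's code: pluckByRunePositions is a hypothetical helper that slices a string using positions counted in runes, which is the mismatch described above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// pluckByRunePositions is a hypothetical stand-in for the buggy behaviour:
// the positions were counted in runes, but Go string slicing counts bytes.
func pluckByRunePositions(src string, runeStart, runeLen int) string {
	return src[runeStart : runeStart+runeLen]
}

func main() {
	for _, src := range []string{
		"-- Ü\nSELECT 1;", // Ü is 2 bytes in UTF-8
		"-- ♥\nSELECT 1;", // ♥ is 3 bytes in UTF-8
	} {
		runes := utf8.RuneCountInString(src)
		got := pluckByRunePositions(src, 0, runes)
		fmt.Printf("bytes=%d runes=%d dropped=%d plucked=%q\n",
			len(src), runes, len(src)-len(got), got)
	}
}
```

With the 2-byte character the trailing ";" is lost; with the 3-byte character the trailing "1;" is lost.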

Fix this by building a rune-index-to-byte-offset map from the source string before processing the ANTLR parse tree, then converting the ANTLR rune positions to byte offsets before storing them in the AST. The internal loc tracking variable continues to use rune indices (for consistency with the ANTLR token positions), while only the values written into StmtLocation and StmtLen are converted to byte offsets.
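One way to build such a map is sketched below. This is an illustration under assumptions, not the code in this PR: the names runeToByteOffsets and byteRange are hypothetical, and the stop position is treated as an inclusive rune index, as ANTLR token stop positions are.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToByteOffsets returns a slice where element i is the byte offset of
// rune i in src. One extra entry holding len(src) is appended so that an
// exclusive end position (last rune index + 1) can also be looked up.
func runeToByteOffsets(src string) []int {
	offsets := make([]int, 0, utf8.RuneCountInString(src)+1)
	for byteOff := range src { // range over a string yields rune start offsets
		offsets = append(offsets, byteOff)
	}
	return append(offsets, len(src))
}

// byteRange converts an inclusive rune range into a byte location and length.
func byteRange(offsets []int, runeStart, runeStop int) (location, length int) {
	location = offsets[runeStart]
	return location, offsets[runeStop+1] - location
}

func main() {
	src := "-- ♥\nSELECT 1;"
	offsets := runeToByteOffsets(src)
	lastRune := utf8.RuneCountInString(src) - 1
	loc, n := byteRange(offsets, 0, lastRune)
	fmt.Printf("%q\n", src[loc:loc+n]) // the full statement, nothing dropped
}
```

Building the map once per source string keeps each later lookup O(1), at the cost of one int per rune.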

Add TestParseNonASCIIComment covering 2-, 3-, and 4-byte characters in dash comments, multiple non-ASCII characters, and the multi-statement case where an incorrect loc for one statement would propagate and corrupt the StmtLocation of the following statement.
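The multi-statement case can be illustrated without the parser. The sketch below assumes hand-picked rune ranges for two statements; stmtSpan and byteSpans are hypothetical names, not types or functions from sqlc.

```go
package main

import "fmt"

// stmtSpan is a hypothetical pair mirroring StmtLocation and StmtLen in bytes.
type stmtSpan struct{ location, length int }

// byteSpans converts half-open rune ranges [start, end) into byte spans.
func byteSpans(src string, runeSpans [][2]int) []stmtSpan {
	offsets := make([]int, 0, len(src)+1)
	for i := range src { // byte offset at which each rune starts
		offsets = append(offsets, i)
	}
	offsets = append(offsets, len(src))

	out := make([]stmtSpan, 0, len(runeSpans))
	for _, rs := range runeSpans {
		start, end := offsets[rs[0]], offsets[rs[1]]
		out = append(out, stmtSpan{location: start, length: end - start})
	}
	return out
}

func main() {
	src := "-- é\nSELECT 1;\n-- ♥\nSELECT 2;"
	runeSpans := [][2]int{{0, 14}, {15, 29}} // assumed statement boundaries, in runes

	// Using rune positions as byte positions: the second statement starts one
	// byte early (on the newline) and loses its tail.
	fmt.Printf("unconverted: %q\n", src[15:29])

	for _, s := range byteSpans(src, runeSpans) {
		fmt.Printf("converted:   %q\n", src[s.location:s.location+s.length])
	}
}
```

Here the 2-byte é in the first comment already shifts the start of the second statement by one byte, which is why a test with more than one statement is needed to catch the location error and not only the length error.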

dosubot (bot) added the size:L (This PR changes 100-499 lines, ignoring generated files.) and 🔧 golang labels on Feb 25, 2026

mem wants to merge 1 commit into sqlc-dev:main from mem:mem/fix-comment-parsing
