Skip to content

Conversation

@codeshaunted
Copy link
Collaborator

@codeshaunted codeshaunted commented Dec 17, 2025

Adds a BEP for a BAML formatter, as well as a prototype CST-based formatter implementation with LSP integration.

The code for the formatter is currently contained in just a single file, and is a bit verbose in some places. It could definitely be simplified with some further abstractions. One possibility for simplifying this could be creating known-valid AST/CST wrappers for SyntaxNodes that are only constructed if the code parses without errors. With this it would be far easier to get specific tokens from CST nodes, and could possibly provide an interface for future code transformation tooling (see: rust-clippy).

Known Issues/Limitations

  • Comments that are on their own lines are not attached properly in some cases
  • User-intended whitespace is ignored currently
  • For-loop initializer, condition, and update are printed verbatim
  • Minor spacing issues on config blocks
  • No current implementation for any sort of line wrapping
  • Comments are not allowed in some places where we may want to respect comments in the future
  • No current tests for transforming specific constructs, tests are limited in scope
  • Does not handle dedenting on raw strings

Some of these are due to limitations of/possible issues with the current parser, such as:

  • expressions are not properly parsed for config values, they are just parsed as tokens verbatim
  • string literals are not allowed as keys in config blocks
  • void functions are not allowed

How it works

At a basic level this traverses the CST top-down, calling a specific format function for each type of SyntaxNode. Each node-specific formatting function generally iterates over its child nodes and tokens, selectively parsing semantics and outputting equivalent code.

All formatted code is pushed to the output alongside its span with push_format. The end of the last span that was formatted is also stored internally. push_format first searches for unprocessed trivia in the range between the end of the last formatted span by calling format_missing, which has logic for reattaching comments and other trivia. Once this preceding trivia is processed, it then will add the formatted span to the final output. This approach ensures that we do not lose any comments.

format_node, and format_token are helper functions that automatically iterate over a provided children iterator, consuming until it reaches a specified type of SyntaxNode or SyntaxToken. Once it reaches this specified type, it will call a provided format function.

Sometimes it's not possible to use these helper functions because of various extra context that may be needed for formatting. In this case, the children are manually iterated over for formatting.

@vercel
Copy link

vercel bot commented Dec 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
promptfiddle Ready Ready Preview, Comment Dec 22, 2025 9:27pm

@github-actions
Copy link

github-actions bot commented Dec 17, 2025

Latest build completeView logs

@github-actions
Copy link

🐑 BEPs Preview Ready

Preview URL: https://dj7ggjkp4tlhz.cloudfront.net/formatter/

Commit: 4e71b9417b911e2ebfbe36702b4e9e3b6a5a8fb9Workflow run

@github-actions
Copy link

🐑 BEPs Preview Ready

Preview URL: https://dj7ggjkp4tlhz.cloudfront.net/formatter/

Commit: f81691f2100f455a177a060463ed954722b370acWorkflow run

@github-actions
Copy link

🐑 BEPs Preview Ready

Preview URL: https://dj7ggjkp4tlhz.cloudfront.net/formatter/

Commit: c5e54a20e9b18ae2152c381e2d8d2e13fb193222Workflow run

@codspeed-hq
Copy link

codspeed-hq bot commented Dec 19, 2025

CodSpeed Performance Report

Merging #2830 will not alter performance

Comparing formatter (89f66af) with canary (1f09944)

Summary

✅ 15 untouched
⏩ 14 skipped1

Footnotes

  1. 14 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Comment on lines 72 to 110
while current_pos < start {
let token = self.root.token_at_offset(current_pos).right_biased();

if let Some(token) = token {
// check if token is within our target range and fix trivia if necessary
if token.text_range().start() < start {
match token.kind() {
SyntaxKind::NEWLINE => on_same_line = false,
SyntaxKind::LINE_COMMENT | SyntaxKind::BLOCK_COMMENT => {
if !on_same_line {
self.push_text(format!("\n{}", self.gen_indent()));
} else {
self.push_text(" ".to_string());
}

self.push_text(token.text().to_string());
}
_ => (), // throw away all other tokens
}
current_pos = token.text_range().end();
} else {
break;
}
} else {
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I highly suggest rewriting this using let-else early exits to avoid so much code nesting:

while current_pos < start {
    let token = self.root.token_at_offset(current_pos).right_biased();

    let Some(token) = token else {
        break;
    };

    // check if token is within our target range and fix trivia if necessary
    if token.text_range().start() >= start {
        break;
    }

    match token.kind() {
        SyntaxKind::NEWLINE => on_same_line = false,

        SyntaxKind::LINE_COMMENT | SyntaxKind::BLOCK_COMMENT => {
            if !on_same_line {
                self.push_text(format!("\n{}", self.gen_indent()));
            } else {
                self.push_text(" ".to_string());
            }

            self.push_text(token.text().to_string());
        }

        _ => {} // throw away all other tokens
    }

    current_pos = token.text_range().end();
}

It's a nit but flat code is much easier to follow than nested.

Comment on lines 2460 to 2462
while parser.current < parser.tokens.len() {
let token = &parser.tokens[parser.current];
let kind = token_kind_to_syntax_kind(token.kind);
parser.events.push(Event::Token {
kind,
text: token.text.clone(),
});
parser.current += 1;
// let token = &parser.tokens[parser.current];
// let kind = token_kind_to_syntax_kind(token.kind);
// parser.events.push(Event::Token {
// kind,
// text: token.text.clone(),
// });
// parser.current += 1;
parser.bump();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a temporary comment or code here was broken? bump() impl seems to add the Event so I suppose this piece was wrong. We can remove the commented code if so.

@github-actions
Copy link

🐑 BEPs Preview Ready

Preview URL: https://dj7ggjkp4tlhz.cloudfront.net/formatter/

Commit: 3c583bf2c09ce2b3532635f8dc03a94b1fa5e0f3Workflow run

@github-actions
Copy link

🐑 BEPs Preview Ready

Preview URL: https://dj7ggjkp4tlhz.cloudfront.net/formatter/

Commit: 0e36893c3a62e17c24d44df66c4d9094d14c36d4Workflow run

@github-actions
Copy link

🐑 BEPs Preview Ready

Preview URL: https://dj7ggjkp4tlhz.cloudfront.net/formatter/

Commit: 1c463eb17bfc45013c88eb11a97d66f1269b9b7dWorkflow run

@skip
// Stream done attribute
summary string
@stream.done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do attributes always go to the next line? I think for cases where you have a few attributes it's better to put them in the same line as the field, cases where you have many attributes and one line becomes unreadable then jumping to the next one does help. Maybe it could even be indented:

class Example {
    f1 string @one_attr

    f2 string
        @attr_1
        @attr_2
        @attr_3
        @attr_4
}

Hard case is with trailing comments:

// Original
class Example {
    f2 string @attr_1 @attr_2 @attr_3 // Trailing comment

Comment on lines +468 to +474
self.format_node(
children,
SyntaxKind::TYPE_EXPR,
false,
false,
Self::format_type_expr,
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit here, these types of functions that take 4+ parameters are kinda hard to read when literal booleans are passed in, LSP can help with inlay hints for param names but otherwise reading false, false doesn't make it clear what is false. Naming the parameters would make it clear at every call site and also helps LLM codegen these calls:

self.format_node(FormatArgs {
    children,
    syntax_kind: SyntaxKind::TYPE_EXPR,
    prepend_newline: false,
    should_collect_preceding_trivia: false,
    f: Self::format_type_expr
});

You could also use defaults:

self.format_node(FormatArgs {
    children,
    syntax_kind: SyntaxKind::TYPE_EXPR,
    f: Self::format_type_expr,
    ..FormatArgs::default()
});

But being explicit makes it more clear.

Not important though, we merge as is and we may refactor it later.

if should_collect_preceding_trivia {
f.format_missing(token.text_range().start());
}
f.push_text("\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one appears multiple times, and since line breaks can be different in each OS / encoding, I'd make it a function just in case: f.push_line_break().

Same applies for f.push_white_space().

Comment on lines +5 to +22
=== ORIGINAL ===
function Foo() -> int {
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
x + 1;;;;;;
}

=== FORMATTED ===
function Foo() -> int {
x + 1;
x + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this break on C style for loops with no expressions? They're usually used for infinite loops:

for (;;) {
    // Infinite loop
}

Comment on lines +116 to +122
// get rid of duplicate semicolons
SyntaxKind::SEMICOLON => {
if !already_pushed_semicolon {
self.push_text(";");
already_pushed_semicolon = true;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to special case infinite for-loops? for (;;) {}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants