Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve unknown start of token error message for invisible characters #2637

Merged
merged 5 commits into from
Feb 25, 2025

Conversation

CramBL
Copy link
Contributor

@CramBL CramBL commented Feb 23, 2025

Resolves #1016

The approach I implemented is displaying the unknown token in single quotes, but if it's not an ASCII character then it is also escaped and displayed in parenthesis.

Invisible character

Before

error: Unknown start of token:
  |
1 | test:
  | ^

After

error: Unknown start of token '' (\u{200b}):
  |
1 | test:
  | ^

ASCII character

Before

error: Unknown start of token:
  |
1 | %test:
  | ^

After

error: Unknown start of token '%':
  |
1 | %test:
  | ^

@CramBL CramBL force-pushed the improve-start-of-token-err-msg branch 2 times, most recently from 4569d6d to d99c184 Compare February 23, 2025 12:45
Copy link
Owner

@casey casey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, see comments!

@@ -153,7 +153,9 @@ pub(crate) enum CompileErrorKind<'src> {
UnknownSetting {
setting: &'src str,
},
UnknownStartOfToken,
UnknownStartOfToken {
token: char,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps call this start? Since it's not really a whole token. Also, start is the name of the variable in lexer.rs, so you can use shorthand initializer syntax there.

if token.is_ascii() {
format!("'{token}'")
} else {
format!("'{token}' ({})", token.escape_unicode())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the escape_unicode format isn't super useful. Let's use the U+NNNN representation, which is commonly used to represent unicode codepoints. NNNN should be four hex digits. Leading zeros should not be elided.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, let's use char::is_ascii_graphic as the test for whether to print the unicode codepoint. This will cause it to be printed for ASCII control characters, as well as ascii whitespace, which I think is probably good.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think this should just be done with two calls to write:

write!(f, "Unknown start of token '{start}')?;
if !start.is_ascii_graphic() {
  // print (U+NNNN)
}
Ok(())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented. It needs 3 write! however, since there's a colon at the end

Copy link
Owner

@casey casey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, missed something. New tests should use the Test struct directly. Also, you can remove the : at the end of the error message. I'm not sure why it's there, and other messages don't have it.

tests/misc.rs Outdated
@@ -1679,6 +1679,36 @@ assembly_source_files = %(wildcard src/arch/$(arch)/*.s)
status: EXIT_FAILURE,
}

test! {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't catch this before: The test! macro is kind of deprecated. I don't think it's very clear, and it's often inflexible, so anything custom often requires workarounds, so I try to use the Test struct and constructor for new tests (which the test! macro uses under the hood). So these should be converted to use the Test struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you said for new tests but I took the liberty to convert the other tests that this PR touches as well.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, that's great. Maybe one day I'll get around to actually converting all the tests 😅 It strikes me that AI might actually be really good at this kind of tedious but complex refactoring.

@CramBL CramBL force-pushed the improve-start-of-token-err-msg branch from f22aca4 to 0efb648 Compare February 25, 2025 09:54
@casey casey merged commit cb12c60 into casey:master Feb 25, 2025
5 checks passed
@casey
Copy link
Owner

casey commented Feb 25, 2025

Excellent, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve unknown start of token error message for invisible characters
2 participants