Making escaped characters first class citizens... #366
Comments
Con: We're creating an abstract syntax tree, not a concrete syntax tree. In commonmark Pro: Considerations of practicality might override the conceptual argument above. It would be fairly easy to do this. For a related issue (regarding entities), see commonmark/commonmark-spec#442. I like the idea of having special AST nodes for entities, and one might make a similar case for escapes. There are some complications noted there, and some of them apply to escapes as well. |
@kivikakk @nwellnhof I'd be interested in your thoughts on this issue. |
Thanks for the quick response @jgm!
I think because escaping, to me, really is behavior, like italics - I interpret it to mean "I don't care what context I'm in or what you think I should be doing, I am this character". It adds a behavior to that character. And you typically wouldn't use it unless you wanted that specific behavior, to override some other behavior.
In terms of the fact that we consider it special and what to do about it, I agree. But I still think it would be helpful to let the users of the AST know that the user specifically escaped this character (they wanted that specific behavior), the same way we let them know the user marked something as italics. |
I broadly agree with @jgm's assessment: the escape character feels like it doesn't belong in an abstract syntax tree, but, practically speaking I think it'd enable library consumers to do something they want to do. I don't imagine GitHub or GitLab will be trying to move their HTML pipeline into cmark itself any time soon, so exposing |
You could also argue that But we should keep in mind that adding a new node type requires a new parser option to avoid breaking the API. |
Indeed. Creating new nodes and generating a new parser seems a lot of work and overhead. Nonetheless access to the markup flavour can be very useful and allow context-specific customisations:
The parser only collects and transmits the information. Knowledge of the flavour is irrelevant for parsing and is only intended for consumers. In the best of worlds I would imagine this as an extension of the |
Maybe there could be another issue with a request to make more information about the concrete syntax available in the AST. As for this particular topic, I think the OP's original issue has to do with GitLab and GitHub's treatment of |
You're right, we (GitLab) do post processing on the HTML using a set of pipelines and filters. For example, the The problem is, of course, there is no way in the html to know if an But using So if we have the information in the AST (through a new node or a flavor on a node or whatever), then we could determine whether a user escaped a character. And I think it's important to be able to preserve that CommonMark feature through to the final rendering, no matter what post processing is done on html. The easiest solution would be to surround an escaped character with
|
I'm no longer at GitHub, but I think I'd be correct in saying they still use the same process they did when I was there working on it: they convert the GFM into HTML using Commonmarker, using cmark's builtin HTML renderer. They then parse that HTML into a DOM and transform it in successive stages using an internal library mostly based on Gumbo, which calls into Ruby when fragments of the DOM match filters. Without other changes, only modifying how the escaped character was represented in the AST wouldn't do much, but they could possibly modify |
I think I can bring this issue to a close. We solved this by pre-processing the markdown, and then post-processing it. Here's the comment from the code explaining this:
|
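For readers who want a feel for the pre-process/post-process approach described above, here is a minimal sketch. It is not GitLab's actual code: the placeholder format, the `REFERENCE_CHARS` set, and the `link_issues` stand-in filter are all assumptions made up for illustration. The idea is to swap each backslash-escaped reference character for a token the reference filters ignore, and to restore the literal character once those filters have run.

```python
import re

# Assumed set of characters that trigger reference expansion (hypothetical).
REFERENCE_CHARS = "#@"

# Matches a backslash followed by one of the reference characters.
_ESCAPED = re.compile(r"\\([" + re.escape(REFERENCE_CHARS) + r"])")
# Hypothetical placeholder: private-use code points around the character's code.
_TOKEN = re.compile("\ue000(\\d+)\ue001")


def protect_escapes(markdown):
    """Pre-process: replace each escaped reference character with a placeholder."""
    return _ESCAPED.sub(lambda m: f"\ue000{ord(m.group(1))}\ue001", markdown)


def restore_escapes(html):
    """Post-process: turn each placeholder back into the literal character."""
    return _TOKEN.sub(lambda m: chr(int(m.group(1))), html)


def link_issues(html):
    """Stand-in for a reference filter: links every #<number> it sees."""
    return re.sub(r"#(\d+)", r'<a href="/issues/\1">#\1</a>', html)


text = "See #2 but not \\#2"
result = restore_escapes(link_issues(protect_escapes(text)))
```

A real pipeline would render the Markdown between the two steps and would need to leave code spans and code blocks alone, which this sketch does not attempt.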
When using certain characters, for example `@` or `#`, in GitHub or GitLab, they trigger special features, such as linking to a user or an issue.
But if I want to specifically use the syntax without the expansion, it's not possible, even when backslashing the character:
`Referring to point #2` --> Referring to point Fixed implicit casts from `size_t` to `int`. #2
`Referring to point \#2` --> Referring to point Fixed implicit casts from `size_t` to `int`. #2
`Referring to point #2` --> Referring to point Fixed implicit casts from `size_t` to `int`. #2
`Referring to point \#2` --> Referring to point #2
`handle_backslash` adds it as a normal text node to the AST. There is no way for us to tell in the output that this used to be an escaped character and should be ignored in any additional processing. `commonmark.js` leaves the text node in the AST so you can guess that it might have been escaped, but `cmark` collapses text nodes, so even that hint is gone.
I outline a few ideas in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/45922
But I think the best would be to make it a first class citizen, meaning adding a new node type for it, like `CMARK_NODE_ESCAPED` or `CMARK_NODE_LITERAL`. Default rendering would just output the character, but it would allow a different renderer to make a better decision about how to render that character. Maybe by wrapping it in a span tag, for example, which would bypass a special character scanner.
I just read about Roundtripping issues with escaped entities so I'm not sure yet how that fits in.
A couple currently open issues:
wdyt?
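To make the proposal a bit more concrete, here is a rough sketch of the idea in Python rather than in cmark's C. Everything in it is hypothetical: the `Text` and `Escaped` classes stand in for AST node types, `parse_inline` handles only backslash escapes, and the bare `<span>` wrapper is just one possible marker. It shows a default renderer that outputs the character unchanged next to an alternative renderer that marks escaped characters so a later scanner could skip them.

```python
import html
from dataclasses import dataclass


# Hypothetical node types for this sketch; these are not cmark's structures.
@dataclass
class Text:
    literal: str


@dataclass
class Escaped:
    literal: str  # the single character that followed the backslash


# ASCII punctuation characters that CommonMark allows to be backslash-escaped.
ESCAPABLE = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")


def parse_inline(source):
    """Split a string into Text and Escaped nodes (escapes only, nothing else)."""
    nodes, buf, i = [], [], 0
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and source[i + 1] in ESCAPABLE:
            if buf:  # flush pending plain text before the escaped character
                nodes.append(Text("".join(buf)))
                buf = []
            nodes.append(Escaped(source[i + 1]))
            i += 2
        else:
            buf.append(ch)
            i += 1
    if buf:
        nodes.append(Text("".join(buf)))
    return nodes


def render_default(nodes):
    """Default rendering: an escaped character is output like any other text."""
    return "".join(html.escape(n.literal) for n in nodes)


def render_marked(nodes):
    """Alternative rendering: wrap escaped characters in a span."""
    out = []
    for n in nodes:
        text = html.escape(n.literal)
        out.append(f"<span>{text}</span>" if isinstance(n, Escaped) else text)
    return "".join(out)


nodes = parse_inline("Referring to point \\#2")
```

With the span in place, a post-processing filter that looks for `#2` in a single text run would no longer see the `#` and the digit as adjacent text.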