~jonsterling/forester#17: Support for code highlighting

Status: REPORTED
Submitter: ~jonsterling
Assigned to: No-one
Submitted: 1 year, 8 months ago
Updated: 11 months ago
Labels: feature

~kentookura 1 year, 3 months ago*

I think I can take a stab at this, but I think it will initially be a proof of concept which does not fit perfectly with the current architecture of forester.

The tree-sitter binary can emit highlighted HTML for any language.

One possible implementation for code highlighting in forester would be to call the binary during the build step and to incorporate the emitted html in the final xml, but that would introduce a (forest-)build time dependency, which we want to avoid. Another possibility is to generate the html at view time. That is probably the quickest way for me to hack together a proof of concept for this.

It would require the following changes to forester:

  • parse \code{lang}{prog}
  • solid escaping functionality

I could then use the wasm-bindings for tree-sitter to render the highlighted HTML in the browser. This fits well into what I want to do with forester-html and the first Proof of Concept will probably appear there.

It will be possible to load the wasm blob as an asset and load it with the XSLT sheet, so syntax highlighting can become an "addon" like the live reloading feature.

The change to the parser will also allow me to integrate Penrose, which is very exciting!

Asides:

Here is the implementation of the highlight command: https://github.com/tree-sitter/tree-sitter/blob/5f63074057f90c191868e39a4025725b75eb5917/cli/src/highlight.rs#L374

There does not seem to be an easy way to access the highlight API from the ocaml bindings: https://github.com/search?q=repo%3Asemgrep%2Focaml-tree-sitter-core+highlight&type=code

https://github.com/tree-sitter/tree-sitter/issues/663

~jonsterling 1 year, 3 months ago

This sounds cool. I think the biggest problem to solve first is the escaping issue; I really don't have the experience to know exactly what should be done there, but we've had a few patches so far that don't work... Does anyone know what real languages and systems do?

~kentookura 1 year, 3 months ago*

Perhaps this is a reasonable approach? Inspired by markdown code blocks, adapted to the style of forester.

\\{lang}
\\{{{
foo bar
}}}

~jonsterling 1 year, 3 months ago

Can you explain how this helps with escaping?

~kentookura 1 year, 3 months ago

Perhaps we are talking about different things when we say "escaping". What I think we need for the feature is for the parser to be able to handle arbitrary strings inside the code block. This is not the same thing as escaping arbitrary tokens in other locations.

~jonsterling 1 year, 3 months ago

Well, yes — that is precisely what we need, and this is the thing that we don't have a good solution for at the moment. The closest thing we have is \startverbatim and \stopverbatim, but even this doesn't completely work — for instance, if you use % between those, it will be treated as a comment, but you may actually want to display a percent sign. That's just the tip of the iceberg of things that don't work in the current implementation.
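
To make the % problem concrete, here is a toy lexer sketch in TypeScript. It is not Forester's actual (OCaml) lexer, and every name in it is made up for illustration. The point is only that verbatim has to be a separate lexer mode that is checked before the comment rule, so that a % between \startverb and \stopverb stays literal.

```typescript
// Toy lexer, NOT Forester's implementation: it only illustrates why
// verbatim text needs its own mode that bypasses the comment rule.
type Token =
  | { kind: "text"; value: string }
  | { kind: "comment"; value: string }
  | { kind: "verbatim"; value: string };

const START = "\\startverb";
const STOP = "\\stopverb";

function lex(src: string): Token[] {
  const tokens: Token[] = [];
  let text = "";
  let i = 0;
  // Emit any pending plain text as a single token.
  const flush = (): void => {
    if (text.length > 0) tokens.push({ kind: "text", value: text });
    text = "";
  };
  while (i < src.length) {
    if (src.startsWith(START, i)) {
      // Verbatim mode: copy everything up to \stopverb without inspecting it.
      flush();
      const begin = i + START.length;
      const end = src.indexOf(STOP, begin);
      if (end === -1) throw new Error("unterminated verbatim block");
      tokens.push({ kind: "verbatim", value: src.slice(begin, end) });
      i = end + STOP.length;
    } else if (src.charAt(i) === "%") {
      // Outside verbatim mode, % starts a comment that runs to end of line.
      flush();
      const nl = src.indexOf("\n", i);
      const end = nl === -1 ? src.length : nl;
      tokens.push({ kind: "comment", value: src.slice(i + 1, end) });
      i = end;
    } else {
      text += src.charAt(i);
      i += 1;
    }
  }
  flush();
  return tokens;
}
```

With this ordering, lexing `a % note` followed by `\startverb 100% \stopverb` yields one comment token for ` note` and one verbatim token whose value still contains the percent sign.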

~dannypsnl 1 year, 3 months ago

On the lexer side, maybe identifiers could be lexed with a similar rule that does no escaping, while ordinary text is lexed with escaping.

~jonsterling 1 year, 2 months ago

We can now resume working on this as the verbatim lexer is mostly fixed!

~kentookura 1 year, 2 months ago

On Fri Feb 16, 2024 at 8:31 AM UTC, ~jonsterling wrote:

We can now resume working on this as the verbatim lexer is mostly fixed!

Awesome. Do you have a syntax design in mind for specifying the language for embedded code?

I think that in the end, Sem.node should have a constructor like

Embed_code of {language: string; source: string list}

(Maybe source should have a different type than string list here?)

~jonsterling 1 year, 2 months ago

So there's a couple things to consider. Many systems (like various extensions of Markdown, etc.) conflate the following tasks:

  1. Highlighting code and choosing to render it as a block, preformatted (like <pre>, preserving whitespace)
  2. Escaping (e.g. in a code block in Markdown, you don't accidentally make something bold or a heading)

This is a mess, and we can do better. First of all, both inline and "display" (== <pre>) code could in principle be highlighted. Second of all, escaping should be dealt with separately using the lexer commands \startverb and \stopverb. Lastly, we have to decide whether the actual highlighting (== potentially calling out another tool, I don't know what you had in mind) should take place in the renderer or in the evaluator. Putting it in the renderer would cohere with our treatment of math via KaTeX, etc. In this case, the body of the code constructor would not be a list of strings but just Sem.t and we would use Render_verbatim to send this, at the last possible moment, to whatever program is actually doing the highlighting.

Long story short, I am imagining a typical code block to look something like this in the source language:

\code[lang]{haskell}{\startverb
main :: IO ()
main = putStrLn "hello world"
\stopverb}

I understand this is a bit more verbose than in some languages, but I prefer it because it is easier to understand/predict what will happen.
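
A rough sketch of the "highlight in the renderer" option described above, written in TypeScript purely for illustration. Forester's real Sem.t and Render_verbatim are OCaml and will look different; SemNode, Highlighter, renderVerbatim and renderCode are hypothetical names. The body of the code node stays structured, and it is flattened to a string only at the moment it is handed to whatever tool does the highlighting.

```typescript
// Illustrative node type; Forester's real Sem.t is an OCaml type and differs.
type SemNode =
  | { kind: "text"; value: string }
  | { kind: "verbatim"; value: string }
  | { kind: "code"; language: string; body: SemNode[] };

// Stand-in for whatever external tool ends up doing the highlighting.
type Highlighter = (language: string, source: string) => string;

// Flatten a body to plain source text, in the spirit of Render_verbatim.
function renderVerbatim(nodes: SemNode[]): string {
  return nodes
    .map((n) => (n.kind === "code" ? renderVerbatim(n.body) : n.value))
    .join("");
}

// The highlighter only sees the source at render time, not in the evaluator.
function renderCode(node: SemNode, highlight: Highlighter): string {
  if (node.kind !== "code") throw new Error("expected a code node");
  return highlight(node.language, renderVerbatim(node.body));
}
```

Because the highlighter is just a function argument here, the same node could be rendered inline or as a <pre> block, and the choice of tool (or no tool at all) stays out of the evaluator.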

~caimeo 1 year, 2 months ago

A few days ago, I tried to add highlighting to forester without modifying forester itself. Similar to embedding KaTeX, I added the highlighting library prism.js. Most of the work was done by modifying the theme:

  • Add prism.js and stylesheet to forest.xsl:
<link rel="stylesheet" href="prism.css" />
<script type="module" src="prism.js"></script>
  • Change the css selector code into code:not(code[class*="language-"]) in style.css to prevent conflict
  • Finally add a macro for creating code block
\def\codeblock[lang][body]{
  \pre{
    \xml{code}[class]{\lang}{\body}
  }
}

It works well and doesn't rely on redundant build steps, which saves time.
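
For reference, Prism looks for code elements whose class is of the form language-xxxx, and it expects < and & in the code to be escaped. The TypeScript sketch below shows the markup the macro above would need to end up producing; escapeHtml and codeBlockHtml are hypothetical helper names, and it is an assumption on my part that the \lang argument of the macro carries the full class name (e.g. language-haskell) rather than the bare language.

```typescript
// Escape the characters that would otherwise be read as markup.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build the <pre><code class="language-..."> wrapper that Prism picks up.
// `lang` is the bare language name here; the prefix is added for you.
function codeBlockHtml(lang: string, body: string): string {
  return `<pre><code class="language-${lang}">${escapeHtml(body)}</code></pre>`;
}
```

Since the class contains "language-", such a block is also excluded by the code:not(code[class*="language-"]) selector from the list above, which is what keeps the theme's own code styling from clashing with Prism's.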

~kentookura 1 year, 2 months ago

Looks great! Much easier than using tree-sitter. I'll still be exploring that avenue though.

~jonsterling 1 year, 2 months ago

That's a cool idea! Nice.

~utensil 11 months ago*

I just want to share that I've also had a js only solution for code highlighting inspired by this issue. I've chosen shiki because it can reuse themes and language syntaxes from VSCode. I want to reuse my favorite theme from VSCode, and highlight a language that most highlighting libraries don't support, Lean 4.

The script is here: https://github.com/utensil/forest/blob/main/assets/shiki.js . It would be much simpler if not for using themes/languages that're not bundled with Shiki.

The real support I need from Forester is a truly robust way to put the code in a verbatim form. For now, anything involving \, unpaired brackets, or things similar to forester syntax could trigger issues.

I also hope \startverb and \stopverb could be written as just \verb<<<< and <<<<, where <<<< is a sequence of one or more characters chosen by the user. It could be """, ||| or anything else for which the user can easily ensure that the string does not occur in the quoted content. This would disable all Forester parsing until the same sequence of characters appears again. Such delimiters are also easier to type and visually less distracting.

There is also a need to enable verbatim from a macro, e.g. \verb{\body} in the \codeblock macro above will make whatever in \body verbatim, no further parsing and expansion.

~dannypsnl 11 months ago

  1. Does \startverb changing part related to the shiki solution? Since as I know that will affect existed trees.
  2. What if verbatim from a macro in a verbatim string, what behavior will be proper?

~jonsterling 11 months ago

~utensil: I think it is possible to have those user-chosen tokens, in fact I had considered adding this feature at some point in the recent past. I'm definitely open to it.

~dannypsnl: I'm sorry but I wasn't able to parse these questions... could you elaborate or rephrase?

~dannypsnl 11 months ago

~jonsterling: The first is more like: will we keep the legacy way? The second one doesn't make sense to me now, so just skip it.

~utensil 11 months ago*

~dannypsnl, for your questions:

  1. No, it's an independent feature request. The Shiki solution works fine for most non-edge cases, so there is no need to change the verbatim part. Also, I merely suggest a new way to do verb; the existing \startverb and \stopverb could still work as a more verbose version and to keep compatibility.

  2. I'll elaborate a little bit. My idea is to expand only one level of macro, i.e. \body in the example, and do no further macro expansion or parsing. This would serve as syntactic sugar, so you don't have to write the \verb<<< <<< pair every time for the code. This is yet another feature request that complements the first, so people can write a codeblock macro to save boilerplate.

~jonsterling 11 months ago

~utensil wrote:

There is also a need to enable verbatim from a macro, e.g. \verb{\body} in the \codeblock macro above will make whatever in \body verbatim, no further parsing and expansion.

Unfortunately, I think there is no reasonable way to make this behaviour consistent.

~jonsterling 11 months ago

~utensil wrote:

The real support I need from Forester is a truly robust way to put the code in a verbatim form. For now, anything involving \, unpaired brackets, or things similar to forester syntax could trigger issues.

I do not understand this comment — startverb is already a robust way to put code into verbatim form, and unpaired brackets and backslashes do not cause issues within verbatim mode afaik. (I agree that the syntax is ugly, but I am open to the \verb<<< thing that you mentioned.) Can you clarify what specific aspect of the verbatim mode needs to be improved, ideally with examples that we can use to guide development?

~jonsterling 11 months ago*

Update: I have now added support for custom verbatim heralds, in the sense that the following can be written (note the pipe delimiter):

\verb<<<|
hello world
<<<

Note that in the above, <<< could have been essentially anything. This makes it easier to quote code that might incidentally involve \stopverb, etc.

EDIT: This feature also works "inline" without line breaks. You can write \verb!|hello! to get hello.
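
For readers who want a mental model of the herald syntax, here is a small TypeScript sketch. It is not Forester's implementation (which is OCaml), and readVerb is a hypothetical name: it assumes the herald is whatever sits between \verb and the first |, and that the body runs until the herald's next occurrence. Details such as whether surrounding newlines are trimmed are not modelled.

```typescript
// Hypothetical helper: reads one `\verb<herald>|<body><herald>` span that
// starts at `pos`. Returns the body and the position just after the span,
// or null if no such span starts there.
function readVerb(
  src: string,
  pos: number
): { body: string; next: number } | null {
  const prefix = "\\verb";
  if (!src.startsWith(prefix, pos)) return null;
  const heraldStart = pos + prefix.length;
  const bar = src.indexOf("|", heraldStart);
  if (bar === -1 || bar === heraldStart) return null; // no herald given
  const herald = src.slice(heraldStart, bar);
  const bodyStart = bar + 1;
  // Nothing inside the body is interpreted; only the herald ends the span.
  const close = src.indexOf(herald, bodyStart);
  if (close === -1) {
    throw new Error("unterminated verbatim span, expected " + herald);
  }
  return { body: src.slice(bodyStart, close), next: close + herald.length };
}
```

Under these assumptions, the inline example `\verb!|hello!` gives the body `hello`, and a block delimited by <<< can contain \stopverb without ending the span.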

~jonsterling 11 months ago*

Finally, I have also made it so that verbatim spans are treated as "groups" by the parser and the evaluator, so you may write things like:

\pre\verb<<<|
this is my code
<<<

instead of the following more annoying code:

\pre{\verb<<<|
this is my code
<<<}

~utensil 11 months ago

I've just updated to the HEAD of forester and gave it a try; this is working, cool! Thanks!

Can you clarify what specific aspect of the verbatim mode needs to be improved, ideally with examples that we can use to guide development?

Sorry, I have now recalled why I was under the impression that verb was not working. It was a mixture of a few things:

  1. I was under the impression that \tex{preambles}{texcontents} provides a verb environment for texcontents, but that's not the case. I need to use \verb explicitly each time I write tex contents, and this can't be absorbed into a macro.

  2. The same misunderstanding applied to body in \codeblock{lang}{body}, so I wrongly counted that as another case of verb not working.

  3. \taxon{} can't parse any macros/variables inside its {}, and gives a parse error.

So it's totally my memory's mistake.

But I can give another example where I was using verb and which could use some improvement:

\p{\code{nlabref[name][citeid][pageid]}: A reference to a nLab article.}

\def\nlabref[name][citeid][pageid]{
  \title{\name - nLab}
  \meta{external}{https://ncatlab.org/nlab/show/\pageid}
  \taxon{reference}
  \meta{bibtex}{\startverb
@misc{nlab-\stopverb\citeid\startverb,
  title  = {\stopverb\name - nLab\startverb},
  author = {nLab},
  year   = {2024},
  url    = {\stopverbhttps://ncatlab.org/nlab/show/\pageid\startverb}
}
  \stopverb}
}

Clearly I was trying to use a template to format some variables but I (mis?)use verb to achieve it. (Sorry if I'm a bit off-topic to the issue)
