Oct 31st, 2024 @ justine’s web page
I just learned 42 programming languages this month to build a new syntax
highlighter for
llamafile. I
feel like I’m up to my eyeballs in programming languages right now. Now
that it’s halloween, I thought I’d share some of the spookiest most
surprising syntax I’ve seen.
The languages I decided to support are Ada, Assembly, BASIC, C, C#, C++,
COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML, Java, JavaScript,
Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make, Markdown, MATLAB, Pascal,
Perl, PHP, Python, R, Ruby, Rust, Scala, Shell, SQL, Swift, Tcl, TeX,
TXT, TypeScript, and Zig. That crosses off pretty much everything on
the TIOBE Index
except Scratch,
which can’t be highlighted, since it uses blocks instead of text.
How To Code a Syntax Highlighter
It’s really not difficult to implement a syntax highlighter. You could
probably write one over the course of a job interview. My favorite tools
for doing this have been C++ and
GNU gperf. The hardest
problem here is avoiding the need to do a bunch of string comparisons to
determine if something is a keyword or not. Most developers would just
use a hash table, but gperf lets you create a perfect hash table. For
example:
%{
#include <string.h>
%}
%pic
%compare-strncmp
%language=ANSI-C
%readonly-tables
%define lookup-function-name is_keyword_java_constant
%%
true
false
null
gperf was originally invented for gcc and it’s a great way to squeeze
out every last drop of performance. If you run the gperf
command on the code above, it’ll
generate this .c file. You’ll
notice its hash function only needs to consider a single character in
a string. A perfect hash never maps two keywords to the same slot, so a
lookup costs one cheap hash plus one comparison, and that means better
performance. I’m not sure who wants to be able to syntax highlight C at
35 MB per second, but I am now able to do so, even though I’ve defined
about
4,000 keywords for the language. Thanks to gperf, those keywords
don’t slow things down.
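To make the idea concrete, here is a minimal Python sketch of a perfect hash over the same three keywords. It is not what gperf generates and not llamafile’s code: the modulus of 5 was picked by hand for this tiny set, and `slot` and `TABLE` are hypothetical names. Only `is_keyword_java_constant` is borrowed from the gperf input above.

```python
# Hypothetical miniature of what gperf automates: a collision-free
# ("perfect") hash over a fixed keyword set.
KEYWORDS = ("true", "false", "null")
TABLE_SIZE = 5  # assumption: found by trial for this set; gperf searches for you


def slot(word):
    """Hash on a single character: the first one."""
    return ord(word[0]) % TABLE_SIZE


# Build the table once, failing loudly if two keywords ever collide.
TABLE = [None] * TABLE_SIZE
for kw in KEYWORDS:
    assert TABLE[slot(kw)] is None, "hash is not perfect for this set"
    TABLE[slot(kw)] = kw


def is_keyword_java_constant(word):
    """One hash and one comparison, however many keywords there are."""
    return bool(word) and TABLE[slot(word)] == word
```

The final string comparison is still needed, because non-keywords such as `truth` land in an occupied slot too.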
The rest just boils down to finite state machines. You don’t really need
flex, bison, or ragel to build a basic syntax highlighter. You simply
need a for
loop and a switch
statement. At
least for my use case, where I’ve really only been focusing on strings,
comments, and keywords. If I wanted to highlight things like C function
names, well, then I’d probably need to do actual parsing. But focusing
on the essentials, we’re only really doing lexing at most. See
highlight_ada.cpp
as an example.
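As a rough illustration of the loop-plus-switch idea, here is a Python sketch of a four-state machine for a C-like language. It only knows about `//` comments and double-quoted strings, and the names `label_chars` and `runs` are made up for this example. The real highlighters are C++ and handle far more cases.

```python
from itertools import groupby

NORMAL, COMMENT, STRING, ESCAPE = range(4)


def label_chars(src):
    """Label every character as 'code', 'comment', or 'string'."""
    state = NORMAL
    labels = []
    for i, c in enumerate(src):
        if state == NORMAL:
            if c == "/" and src[i + 1 : i + 2] == "/":
                state = COMMENT
                labels.append("comment")
            elif c == '"':
                state = STRING
                labels.append("string")
            else:
                labels.append("code")
        elif state == COMMENT:
            if c == "\n":
                state = NORMAL
                labels.append("code")
            else:
                labels.append("comment")
        elif state == STRING:
            labels.append("string")
            if c == "\\":
                state = ESCAPE
            elif c == '"':
                state = NORMAL
        else:  # ESCAPE: the character after a backslash never ends the string
            labels.append("string")
            state = STRING
    return labels


def runs(src):
    """Group consecutive characters sharing a label into (label, text) runs."""
    out, pos = [], 0
    for kind, group in groupby(label_chars(src)):
        n = len(list(group))
        out.append((kind, src[pos : pos + n]))
        pos += n
    return out
```

Each run can then be wrapped in whatever color codes the output device wants.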
Demo
All the research you’re about to read about on this page went into
making one thing, which is llamafile’s new syntax highlighter. This is
probably the strongest advantage that llamafile has over ollama these
days, since ollama doesn’t do syntax highlighting at all. Here’s a demo
of it running on Windows 10, using the
Meta
LLaMA 3.2 3B Instruct model. Please note, these llamafiles will run
on MacOS, Linux, FreeBSD, and NetBSD too.
The new highlighter and chatbot interface has made llamafile so pleasant
for me to use, combined with the fact that open weights models
like gemma
27b it have gotten so good, that it’s become increasingly rare that
I’ll feel tempted to use Claude these days.
Examples of Surprising Lexical Syntax
Let’s talk about the kinds of lexical
syntax that surprised me while writing this highlighter.
C
The C programming language, despite claiming to be simple, actually has
some of the weirdest lexical elements of any language. For starters, we
have trigraphs, which were probably invented to help Europeans use C
when using keyboards that didn’t include #, [, \, ], ^, {, |, }, and ~.
You can replace those characters with ??=, ??(, ??/, ??), ??', ??<, ??!,
??>, and ??-. Intuitive, right? That means, for example, the
following is perfectly valid C code.
int main(int argc, char* argv??(??))
??<
    printf("hello world\n");
??>
That is, at least until trigraphs were removed in the C23 standard.
However compilers will be supporting this syntax forever for legacy
software, so a good syntax highlighter ought to too. But just because
trigraphs are officially dead, doesn’t mean the standards committees
haven’t thought up other weird syntax to replace it. Consider universal
characters:
int \uFEB2 = 1;
This feature is useful for anyone who wants, for example, variable names
with Arabic characters while still keeping the source code pure ASCII.
I’m not sure why anyone would use it. I was hoping I could abuse this to
say:
int main(int argc, char* argv\u005b\u005d)
\u007b
    printf("hello world\n");
\u007d
But alas, GCC raises an error if universal characters aren’t used on the
specific UNICODE planes that’ve been blessed by the standards committee.
This next one is one of my favorites. Did you know that a single line
comment in C can span multiple lines if you use backslash at the end of
the line?
//hi\
there
Most other languages don’t support this. Even languages that allow
backslash escapes in their source code (e.g. Perl, Ruby, and Shell)
don’t have this particular feature from C. The ones that do support this
too, as far as I can tell, are Tcl and GNU Make. Tools for syntax
highlighting oftentimes get this wrong, like Emacs and Pygments.
Although Vim seems to always be right about backslash.
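A minimal Python sketch of how a lexer might find the end of such a comment. `line_comment_end` is a hypothetical helper, and it deliberately ignores the `??/` trigraph spelling of backslash.

```python
def line_comment_end(src, start):
    """Return the index just past a // comment that begins at `start`."""
    i = start
    while i < len(src):
        if src[i] == "\\" and src[i + 1 : i + 2] == "\n":
            i += 2  # backslash-newline splices the next line into the comment
        elif src[i] == "\n":
            break  # an unescaped newline ends the comment
        else:
            i += 1
    return i
```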
Haskell
Every C programmer knows you can’t embed a multi-line comment in a
multi-line comment. For example:
/*
hello
/* again */
nope nope nope
*/
However with Haskell, you can. They finally fixed the bug. Although they
did adopt a different syntax.
-- Test nested comments within code blocks
let result3 = {- This comment contains {- a nested comment -} -} 10 - 5
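For a lexer, nesting means a single "in comment" state is no longer enough; it has to carry a depth counter. A small Python sketch of that, where `block_comment_end` is a name invented for this example:

```python
def block_comment_end(src, start):
    """src[start:] begins with '{-'. Return the index just past the
    matching '-}', or len(src) if the comment is never closed."""
    depth = 0
    i = start
    while i < len(src):
        two = src[i : i + 2]
        if two == "{-":
            depth += 1  # entering one more level of comment
            i += 2
        elif two == "-}":
            depth -= 1  # leaving one level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return len(src)
```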
Tcl
The thing that surprised me most about Tcl is that identifiers can have
quotes in them. For example, this program will print a"b:
puts a"b
You can even have quotes in your variable names, however you’ll only be
able to reference them if you use the ${a"b} notation, rather
than $a"b.
set a"b doge
puts ${a"b}
JavaScript
JavaScript has a builtin lexical syntax for regular expressions. However
it’s easy to lex it wrong if you aren’t paying attention. Consider the
following:
var foo = /[/]/g;
When I first wrote my lexer, I would simply scan for the closing slash,
and assume that any slashes inside the regex would be escaped. That
turned out to be wrong when I highlighted some minified code. If a slash
is inside the square brackets for a character set, then that slash doesn’t
need to be escaped!
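Here is a hedged Python sketch of the corrected scan. `regex_literal_end` is a hypothetical helper that assumes the caller has already decided the slash starts a regex rather than a division, which is the harder half of the problem.

```python
LINE_TERMINATORS = "\n\r\u2028\u2029"


def regex_literal_end(src, start):
    """src[start] is the opening slash. Return the index just past the
    closing slash (flags excluded), or -1 if the line ends first."""
    in_class = False
    i = start + 1
    while i < len(src):
        c = src[i]
        if c in LINE_TERMINATORS:
            return -1  # regex literals cannot span lines
        if c == "\\" and src[i + 1 : i + 2] not in LINE_TERMINATORS:
            i += 2  # skip the escaped character
            continue
        if c == "[":
            in_class = True
        elif c == "]":
            in_class = False
        elif c == "/" and not in_class:
            return i + 1  # only a slash outside [...] closes the literal
        i += 1
    return -1
```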
Now onto the even weirder.
There are some invisible UNICODE characters called the LINE SEPARATOR
(U+2028) and PARAGRAPH SEPARATOR (U+2029). I don’t know what the use case
is for these codepoints, but the ECMAScript standard defines
them as line terminators, which effectively makes them the same
thing as \n. Since these are Trojan Source characters,
I configure my Emacs to render them as ↵ and ¶. However most software
hasn’t been written to be aware of these characters, and will oftentimes
render them as question marks. Also as far as I know, no other language
does this. I was able to use that to my advantage for SectorLISP, since
it let me create C + JavaScript polyglots.
//¶`
... C only code goes here ...
//`
That’s how I’d insert C code into JavaScript files.
//¶`
#if 0
//`
... JavaScript only code goes here ...
//¶`
#endif
//`
And that’s how I’d insert JavaScript into my C source code. An example
of a piece of production code where I did this
is lisp.js which is what powers
my SectorLISP blog post. It both runs in
the browser, and you can compile it with GCC and run it locally too.
llamafile is able to correctly syntax highlight this stuff, but I’ve yet
to find another syntax highlighter that does too. Not that it matters,
since I doubt an LLM would ever print this. But it sure is fun to think
about these corner cases.
Shell
We’re all familiar with the heredoc syntax of shell scripts, e.g.
cat <<EOF
this is kind of a multi-line string
EOF
The above syntax allows you to put $foo in your heredoc
string, although there’s a quoted syntax which disables variable
substitution.
cat <<'END'
this won't print the contents of $var
END
If you ever want to confuse your coworkers, then one great way to abuse
this syntax is by replacing the heredoc marker with an empty string, in
which case the heredoc will end on the next empty line. For example,
this program will print “hello” and “world” on two lines:
cat <<''
hello

echo world
It’s also possible in languages that support heredocs (Shell, Ruby, and
Perl) to have multiple heredocs on the same line.
cat /dev/fd/3 3<< E1 /dev/fd/4 4<< E2
foo
E1
bar
E2
Another thing to look out for with shell, is it’s like Tcl in the sense
that special characters like #, which you might think would always begin
a comment, can actually be valid code depending on the context. For
example, inside a variable reference, # can be used to strip a prefix.
The following program will print “there”.
x=hi-there
echo ${x#hi-}
String Interpolation
Did you know that, from a syntax highlighting standpoint, a Kotlin
string can begin with " but end with the { character? That’s the way
its string interpolation syntax works. Many languages let you embed
variable name references in strings, but TypeScript, Swift, Kotlin, and
Scala take string interpolation to the furthest extreme of encouraging
actual code being embedded inside strings.
val s2 = "${s1.replace("is", "was")}, but now is $a"
So to highlight a string with Kotlin, Scala, and TypeScript, one must
count curly brackets and maintain a stack of parser states. With
TypeScript, this is relatively trivial, and only requires a couple
states to be added to your finite state machine. However with Kotlin and
Scala it gets real hairy, since they support both double quote and
triple quote syntax, and either one of them can have interpolated
values. So that ended up being about 13 independent states the FSM needs
for string lexing alone. Swift also supports triple quotes for its
"\(var)" interpolated syntax, however that only needed 10
states to support.
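The brace counting can be sketched in a few lines of Python. This is a simplified model with assumptions called out: it handles only double-quoted strings with `${...}`, leaves bare `$name` references and triple quotes alone, and `label` is a made-up function name. It marks each character `s` for string or `c` for code.

```python
def label(src):
    """Return one letter per character: 's' for string, 'c' for code."""
    labels = []
    stack = []  # one entry per open ${...}: how many plain { are open in it
    in_string = False
    i = 0
    while i < len(src):
        c = src[i]
        if in_string:
            if c == "\\" and i + 1 < len(src):
                labels += ["s", "s"]  # escaped character stays in the string
                i += 2
                continue
            if src.startswith("${", i):
                labels += ["c", "c"]  # interpolation begins: switch to code
                stack.append(0)
                in_string = False
                i += 2
                continue
            labels.append("s")
            if c == '"':
                in_string = False
        else:
            if c == '"':
                in_string = True
                labels.append("s")
            else:
                if stack:
                    if c == "{":
                        stack[-1] += 1
                    elif c == "}":
                        if stack[-1] == 0:
                            stack.pop()
                            in_string = True  # resume the enclosing string
                        else:
                            stack[-1] -= 1
                labels.append("c")
        i += 1
    return "".join(labels)
```

Supporting triple quotes means the stack has to remember which kind of string to resume, which is where the extra states come from.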
Swift
Swift has its own unique approach to the problem of embedding strings
inside a string. It allows "double quote", """triple quote""", and
/regex/ strings to all be surrounded with an arbitrary number of #hash#
marks, which must be mirrored on each side. This makes it possible to
write code like the following:
let threeMoreDoubleQuotationMarks = #"""
Here are three more double quotes: """
"""#
let threeMoreDoubleQuotationMarks = ##"""
Here are three more double quotes: #"""#
"""##
C#
C# supports Python’s triple quote multi-line string syntax, but with an
interesting twist that’s unique to this language. The way C# solves the
“embed a string inside a string” problem, is they let you do quadruple
quoted strings, or even quintuple quoted strings if you want. However
many quotes you put on the lefthand side, that’s what’ll be used to
terminate the string at the other end.
Console.WriteLine("");
Console.WriteLine("\"");
Console.WriteLine("""""");
Console.WriteLine("""""");
Console.WriteLine(""" yo "" hi """);
Console.WriteLine("""" yo """ hi """");
Console.WriteLine(""""First """100 Prime""" Numbers: """");
This is the way if you ask me, because it’s actually simpler for a
finite state machine to decode. With classic Python triple quoted
strings, you need extra rules, to ensure it’s either one double-quote
character, or exactly three. By letting it be an arbitrary number,
there’s fewer rules to validate. So you end up with a more powerful
expressive language that’s simpler to implement. This is the kind of
genius we’ve come to expect from Microsoft.
What will they think of next?
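A rough Python sketch of why the arbitrary-count rule is easy to scan: count the opening quotes, then look for a run of the same length. `raw_string_end` is a hypothetical helper, and it assumes the opening run is read greedily and that the first run of that many quotes closes the string.

```python
def raw_string_end(src, start):
    """src[start] is the first of three or more opening quotes. Return the
    index just past the closing run, or len(src) if it never closes."""
    n = 0
    while src[start + n : start + n + 1] == '"':
        n += 1  # count the opening quotes
    if n < 3:
        raise ValueError("not a raw string literal")
    run = 0
    for i in range(start + n, len(src)):
        run = run + 1 if src[i] == '"' else 0
        if run == n:  # a run as long as the opener closes the string
            return i + 1
    return len(src)
```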
FORTH
Normally when code is simpler for a computer to decode, it’s more
difficult for a human to understand, and FORTH is proof of that. FORTH
is probably the simplest language there is, because it tokenizes
everything on whitespace boundaries. Even the syntax for starting a
string is a token. For example:
c" hello world"
Would mean the same thing as saying "hello world" in every
other language.
FORTRAN and COBOL
One of the use cases I envision for llamafile is that it can help the
banking system not collapse once all the FORTRAN and COBOL programmers
retire. Let’s say you’ve just been hired to maintain a secretive
mainframe full of confidential information written in the COmmon
Business-Oriented Language. Thanks to llamafile, you can ask an
air-gapped AI you control,
like Gemma
27b, to write your COBOL and FORTRAN code for you. It can’t print
punch cards, but it can highlight punch card syntax. Here’s what FORTRAN
code looks like, properly syntax highlighted:
*
*     Quick return if possible.
*
      IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
     +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
*
*     And if alpha.eq.zero.
*
      IF (ALPHA.EQ.ZERO) THEN
          IF (BETA.EQ.ZERO) THEN
              DO 20 J = 1,N
                  DO 10 I = 1,M
                      C(I,J) = ZERO
   10             CONTINUE
   20         CONTINUE
          ELSE
              DO 40 J = 1,N
                  DO 30 I = 1,M
                      C(I,J) = BETA*C(I,J)
   30             CONTINUE
   40         CONTINUE
          END IF
          RETURN
      END IF
FORTRAN has the following fixed column rules.
- Putting *, c, or C in column 1 will make the line a comment
- Putting non-space in column 6 lets you extend a line past 80 characters
- Labels are created by putting digits in columns 1-5
Now here’s some properly syntax highlighted COBOL code.
000100*Hello World in COBOL
000200 IDENTIFICATION DIVISION.
000300 PROGRAM-ID. HELLO-WORLD.
000400
000500 PROCEDURE DIVISION.
000600     DISPLAY 'Hello, world!'.
000700     STOP RUN.
With COBOL, the rules are:
- Putting * in column 7 makes the line a comment
- Putting - in column 7 lets you extend a line past 80 characters
- Line numbers go in columns 1-6.
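Both sets of column rules reduce to peeking at one fixed position per line. A small Python sketch, following only the rules listed above; the function names are invented here, and a real lexer would also have to decide what tabs mean.

```python
def fortran_line_kind(line):
    """Classify a fixed-form FORTRAN line by columns 1 and 6."""
    if line[:1] in ("*", "c", "C"):  # column 1
        return "comment"
    if len(line) > 5 and line[5] != " ":  # column 6
        return "continuation"
    return "code"


def cobol_line_kind(line):
    """Classify a fixed-form COBOL line by its indicator in column 7."""
    indicator = line[6:7]
    if indicator == "*":
        return "comment"
    if indicator == "-":
        return "continuation"
    return "code"
```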
Zig
Zig has its own unique solution for multi-line strings, which are
prefixed with two backslashes.
const copyright =
    \\ Copyright (c) 2024, Zig Incorporated
    \\ All rights reserved.
;
What I like about this syntax, is it eliminates that need we’ve always
had for calling textwrap.dedent() with Python’s triple
quoted strings. The tradeoff is that the semicolon is ugly. This is a
string syntax that really ought to be considered by one of the languages
that don’t need semicolons, e.g. Go, Scala, Python, etc.
Lua
Lua has a very unique multi-line string syntax, and it uses an approach
similar to C# and Swift when it comes to solving the “embed a string
inside a string” problem. It works by using double square brackets, and
it lets you put an arbitrary number of equal signs in between them.
-- this is a comment
[[hi [=[]=] ]] there
[[hi [=[]=] ]] there
[==[hi [=[]=] ]==] hello
[==[hi ]=]==] hello
[==[hi ]===]==] hello
[====[hi ]===]====] hello
What’s really interesting is that it lets you do this with comments too.
--[[ comment #1 ]]
print("hello")
--[==[ comment [[#2]] ]==]
print("world")
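Matching these brackets only requires counting the equal signs in the opener and searching for a closer with the same count. A Python sketch, with `long_bracket_end` as a hypothetical name; the same routine would serve for comments once the leading `--` has been consumed.

```python
def long_bracket_end(src, start):
    """If src[start:] opens a long bracket such as [[ or [==[, return the
    index just past its matching closer (len(src) if unterminated).
    Return None when there is no long bracket at `start`."""
    i = start
    if src[i : i + 1] != "[":
        return None
    i += 1
    level = 0
    while src[i : i + 1] == "=":
        level += 1
        i += 1
    if src[i : i + 1] != "[":
        return None
    closer = "]" + "=" * level + "]"  # must carry the same number of =
    end = src.find(closer, i + 1)
    return len(src) if end < 0 else end + len(closer)
```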
Assembly
One of the most challenging languages to syntax highlight is assembly,
due to the fragmentation of all its various dialects. I’ve sought to
build something with llamafile that does a reasonably good job with
AT&T, nasm, etc. syntax. Here’s nasm syntax:
section .data
    message db 'Hello, world!', 0xa ; The message string, ending with a newline

section .text
    global _start

_start:
    ; Write the message to stdout
    mov rax, 1        ; System call number for write
    mov rdi, 1        ; File descriptor for stdout
    mov rsi, message  ; Address of the message string
    mov rdx, 13       ; Length of the message
    syscall

    ; Exit the program
    mov rax, 60       ; System call number for exit
    xor rdi, rdi      ; Exit code 0
    syscall
And here’s AT&T syntax:
/ syscall

.globl _syscall,csv,cret,cerror

_syscall:
    jsr r5,csv
    mov r5,r2
    add $04,r2
    mov $9f,r3
    mov (r2)+,r0
    bic $!0377,r0
    bis $sys,r0
    mov r0,(r3)+
    mov (r2)+,r0
    mov (r2)+,r1
    mov (r2)+,(r3)+
    mov (r2)+,(r3)+
    mov (r2)+,(r3)+
    mov (r2)+,(r3)+
    mov (r2)+,(r3)+
    sys 0; 9f
    bec 1f
    jmp cerror
1:
    jmp cret

    .data
9:  .=.+12.
And here’s GNU syntax:
/ setjmp() for x86-64
// this is a comment too
; so is this
# this too!
! hello sparc

setjmp: lea 8(%rsp),%rax
    mov %rax,(%rdi)
    mov %rbx,8(%rdi)
    mov %rbp,16(%rdi)
    mov %r12,24(%rdi)
    mov %r13,32(%rdi)
    mov %r14,40(%rdi)
    mov %r15,48(%rdi)
    mov (%rsp),%rax
    mov %rax,56(%rdi)
    xor %eax,%eax
    ret
With keywords I’ve found the simplest thing is to just treat the
first identifier on the line (that isn’t followed by a colon) as a
keyword. That tends to make most of the assembly I’ve tried look pretty
reasonable.
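That heuristic is short enough to sketch in Python. The regular expressions below are assumptions about what counts as a label or an identifier, not the rules llamafile uses, and `asm_keyword` is a name made up for this example.

```python
import re

LABEL = re.compile(r"\s*[\w.$]+:")  # e.g. "_start:" or the local label "1:"
IDENT = re.compile(r"\s*([A-Za-z_.$][\w.$]*)")


def asm_keyword(line):
    """Return the first identifier on the line that is not a label,
    or None if the line starts with something else (e.g. a comment)."""
    pos = 0
    while True:
        m = LABEL.match(line, pos)
        if not m:
            break
        pos = m.end()  # skip over each leading label
    m = IDENT.match(line, pos)
    return m.group(1) if m else None
```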
The comment syntax is real hairy. I really like the original UNIX
comments which only needed a single slash. GNU as still supports those
to this day, but only if they’re at the beginning of the line (UNIX
could originally put them anywhere, since as didn’t have
the ability to do arithmetic back then). Clang doesn’t support single-slash
comments at all, so they’re sadly not practical anymore to use in open
source code.
But this story gets even better. Another weird thing about the original
UNIX assembler is that it didn’t use a closing quote on character
literals. So where we’d say 'x' to get 0x78 for x, in the
original UNIX source code, you’d say 'x. This is another
thing GNU as continues to support, but sadly not LLVM. In any case,
since a lot of code exists that uses this syntax, any good syntax
highlighter needs to support it.
The GNU assembler allows identifiers to be quoted, so you can put pretty
much any character in a symbol.
Finally, it’s not enough to just highlight assembly when highlighting
assembly. The assembler is usually used in conjunction with either the C
preprocessor, or m4. Trust me, lots of open source code does this.
Therefore lines starting with dnl, m4_dnl, or C
should be taken as comments too.
Ada
Ada is a remarkably simple language to lex, but there’s one thing I
haven’t quite wrapped my head around yet, which is its use of the single
quotation mark. Ada can have character literals like C,
e.g. 'x'. But single quote can also be used to reference
attributes, e.g. Foo'Size. Single quote even lets you embed
expressions and call functions. For example, the program:
with Ada.Text_IO;

procedure main is
   S : String := Character'(')')'Image;
begin
   Ada.Text_IO.Put_Line("The value of S is: " & S);
end main;
Will print out:
The value of S is: ')'
Because we’re declaring a character, giving it a value, and then sending
it through the Image function, which converts it to
a String representation.
BASIC
Let’s talk about the Beginner’s All-purpose Symbolic Instruction Code.
While digging through the repos I’ve git cloned, I came across this old
Commodore BASIC program that broke many of my assumptions about syntax
highlighting.
10 rem cbm basic v2 example
20 rem comment with keywords: for, data
30 dim a$(20)
35 rem the typical space efficient form of leaving spaces out:
40 fort=0to15:poke646,t:print"{revers on} ";:next
50 geta$:ifa$=chr$(0):goto40
55 rem it is legal to omit the closing " on line end
60 print"{white}":print"bye...
70 end
We’ll notice that this particular BASIC implementation didn’t require a
closing quote on strings, variable names have these weird sigils, and
keywords like goto are lexed eagerly out of identifiers.
Visual BASIC also has this weird date literal syntax:
Dim v As Variant    ' Declare a Variant
v = #1/1/2024#      ' Hold a date
That’s tricky to lex, because VB even has preprocessor directives.
#If DEBUG Then
Public Function SomeFunction() As String
#Else
Public Function SomeFunction() As String
#End If
Perl
One of the trickier languages to highlight is Perl. It exists in the
spiritual gulf between shells and programming languages, and inherits
the complexity of both. Perl isn’t as popular today as it once was, but
its influence continues to be prolific. Perl made regular expressions a
first class citizen of the language, and the way regex works in Perl has
since been adopted by many other programming languages, such as Python.
However the regex lexical syntax itself continues to be somewhat unique.
For example, in Perl, you can replace text similar to sed as follows:
my $string = "HELLO, World!";
$string =~ s/hello/Perl/i;
print $string;  # Output: Perl, World!
Like sed, Perl also allows you to replace the slashes with an arbitrary
punctuation character, since that makes it easier for you to put slashes
inside your regex.
$string =~ s!hello!Perl!i;
What you might not have known, is that it’s possible to do this with
mirrored characters as well, in which case you need to insert an
additional character:
$string =~ s{hello}{Perl}i;
However s/// isn’t the only weird thing that needs to be
highlighted like a string. Perl has a wide variety of other magic
prefixes.
/case sensitive match/
/case insensitive match/i
y/abc/xyz/e
s!hi!there!
m!hi!i
m;hi;i
qr!hi!u
qw!hi!h
qq!hi!h
qx!hi!h
m-hi-
s-hi-there-g
s"hi"there"g
s@hi@there@ yo
s{hi}{there}g
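The delimiter rules for s/// can be sketched in Python as follows. This is an illustrative model, not llamafile’s lexer: `substitution_end` and `_skip_section` are invented names, the caller is assumed to have already recognized the `s` prefix, and trailing flags are left for the caller.

```python
MIRROR = {"(": ")", "[": "]", "{": "}", "<": ">"}


def _skip_section(src, i, opener):
    """src[i] is a delimiter. Return the index just past the delimiter
    that closes the section starting there, honoring backslash escapes."""
    closer = MIRROR.get(opener, opener)
    depth = 0
    i += 1
    while i < len(src):
        c = src[i]
        if c == "\\":
            i += 2
            continue
        if c == closer and depth == 0:
            return i + 1
        if closer != opener:  # mirrored delimiters may nest
            if c == opener:
                depth += 1
            elif c == closer:
                depth -= 1
        i += 1
    return len(src)


def substitution_end(src, start):
    """src[start] is the delimiter right after `s`. Return the index just
    past the replacement's closing delimiter."""
    opener = src[start]
    i = _skip_section(src, start, opener)
    if opener in MIRROR:
        # Mirrored form: the replacement gets its own opening delimiter.
        while i < len(src) and src[i].isspace():
            i += 1
        if i >= len(src):
            return len(src)
        return _skip_section(src, i, src[i])
    # Plain form: the middle delimiter closes one part and opens the next.
    return _skip_section(src, i - 1, opener)
```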
One thing that makes this tricky to highlight, is you need to take
context into consideration, so you don’t accidentally think
that y/x/y/ is a division formula. Thankfully, Perl makes
this relatively easy, because variables can always be counted upon to
have sigils, which are usually $ for scalars, @ for arrays, and % for
hashes.
my $greeting = "Hello, world!";

# Array: A list of names
my @names = ("Alice", "Bob", "Charlie");

# Hash: A dictionary of ages
my %ages = ("Alice" => 30, "Bob" => 25, "Charlie" => 35);

# Print the greeting
print "$greeting\n";

# Print each name from the array
foreach my $name (@names) {
    print "$name\n";
}
This helps us avoid the need for parsing the language grammar.
Perl also has this goofy convention for writing man pages in your source
code. Basically, any =word at the start of the line will get it going,
and =cut will finish it.
#!/usr/bin/perl

=pod

=head1 NAME

my_silly_script - A Perl script demonstrating =cut syntax

=head1 SYNOPSIS

 my_silly_script [OPTIONS]

=head1 DESCRIPTION

This script does absolutely nothing useful, but it showcases
the quirky =cut syntax for POD documentation in Perl.

=head1 OPTIONS

There are no options.

=head1 AUTHOR

Your Name

=head1 COPYRIGHT

Copyright (c) 2023 Your Name. All rights reserved.

=cut

print "Hello, world!\n";
Ruby
Of all the languages, I’ve saved the best for last, which is Ruby. Now
here’s a language whose syntax evades all attempts at understanding.
Ruby is the union of all earlier languages, and it’s not even formally
documented. Their manual has a section on
Ruby syntax, but it’s very light on details. Whenever I try to test
my syntax highlighting, by concatenating all the .rb files on my hard
drive, there’s always another file that finds some way to break it.
def `(command)
  return "just testing a backquote override"
end
Since Ruby supports backquote syntax like var = `echo hello`, I’m not
exactly sure how to tell that the backquote above isn’t meant to be
highlighted as a string. Another example is this:
when /\.*\.h/
  options[:includes] <<arg; true
when /--(\w+)=\"?(.*)\"?/
  options[$1.to_sym] = $2; true
Ruby has a << operator, and it also supports heredocs
(just like Perl and Shell). So I’m not exactly sure how to tell that the
code above isn’t a heredoc. Yes that code actually exists in the wild.
Even Emacs gets this wrong. Out of all 42 languages I’ve evaluated,
that’s probably the biggest shocker so far. It might be the case that
Ruby isn’t possible to lex without parsing. Even with parsing, I’m still
not sure how it’s possible to make sense of that.
Complexity of Supported Languages
If I were to rank the complexity of programming languages by how many
lines of code each one takes to syntax highlight, then FORTH would be
the simplest language, and Ruby would be the most complicated.
125 highlight_forth.cpp      266 highlight_lua.cpp
132 highlight_m4.cpp         282 highlight_csharp.cpp
149 highlight_ada.cpp        282 highlight_rust.cpp
160 highlight_lisp.cpp       297 highlight_python.cpp
163 highlight_test.cpp       300 highlight_java.cpp
166 highlight_matlab.cpp     321 highlight_haskell.cpp
186 highlight_cobol.cpp      335 highlight_markdown.cpp
199 highlight_basic.cpp      337 highlight_js.cpp
200 highlight_fortran.cpp    340 highlight_html.cpp
211 highlight_sql.cpp        371 highlight_typescript.cpp
216 highlight_tcl.cpp        387 highlight_kotlin.cpp
218 highlight_tex.cpp        387 highlight_scala.cpp
219 highlight.cpp            447 highlight_asm.cpp
220 highlight_go.cpp         449 highlight_c.cpp
225 highlight_css.cpp        455 highlight_swift.cpp
225 highlight_pascal.cpp     560 highlight_shell.cpp
230 highlight_zig.cpp        563 highlight_perl.cpp
235 highlight_make.cpp       624 highlight_ruby.cpp
239 highlight_ld.cpp
263 highlight_r.cpp
llamafile is a Mozilla project, and Mozilla
sponsors me to work on it. My work on open source is also made possible
by my GitHub sponsors and Patreon subscribers.
Thank you for giving me the opportunity to serve you all these last four
years. Since you’ve read this far, I’d like to invite you to join both
the Mozilla AI Discord and
the Redbean Discord servers
where you can chat with me and other people who love these projects.