AI prompts
base on A programming language benchmark **TL;DR**: see the figure below. Note that nqueen and matmul are implemented in
all languages but sudoku and bedcov are only implemented in some.
<img align="left" width="100%" src="https://i.ibb.co/zGgbmVG/template.png?v32">
## Table of Content
- [Introduction](#intro)
- [Results](#result)
- [Overall impression](#overall)
- [Caveats](#caveat)
- [Startup time](#startup)
- [Elapsed time vs CPU time](#cputime)
- [Subtle optimizations](#opt)
- [Optimizing inner loops](#matmul)
- [Controlling memory layout](#memlayout)
- [Discussions](#conclusion)
- [Appendix: Timing on Apple M1 Macbook Pro](#table)
## <a name="intro"></a>Introduction
Programming Language Benchmark v2 (plb2) evaluates the performance of 25
programming languages on four CPU-intensive tasks. It is a follow-up to
[plb][plb] conducted in 2011. In plb2, all implementations use the same
algorithm for each task and their performance bottlenecks do not fall in
library functions. We do not intend to compare different algorithms or the
quality of the standard libraries in these languages. Plb2 aims to
evaluate the performance of a language when you have to implement a new
algorithm in the language - this may happen if you can't find the algorithm in
existing libraries.
The four tasks in plb2 all take a few seconds for a fast implementation to
complete. The tasks are:
* **nqueen**: solving a [15-queens problem][8queen]. The algorithm was inspired
by the second C implementation [from Rosetta Code][8qrc]. It involves nested
loops and integer bit operations.
* **matmul**: multiplying two square matrices of 1500x1500 in size.
* **sudoku**: solving 4000 hard [Sudokus][sudoku] (20 puzzles repeated for 200
times) using the [kudoku algorithm][kudoku]. This algorithm heavily uses
small fixed-sized arrays with a bit complex logic.
* **bedcov**: finding the overlaps between two arrays of 1,000,000 intervals
with [implicit interval trees][iitree]. The algorithm involves frequent
array access in a pattern similar to binary searches.
Every language has nqueen and matmul implementations. Some languages do not
have sudoku or bedcov implementations. Most programs were initially implemented
by me and a few were contributed by others. As I am mostly a C programmer,
implementations in other languages may be suboptimal. **Pull requests are welcomed!**
## <a name="result"></a>Results
The figure at the top of the page summarizes the elapsed time of each implementation
measured on an Apple M1 MacBook Pro. [Hyperfine][hyperfine] was used for timing
except for a few slow implementations which were timed with the "time" bash
command without repetition. A plus sign "+" indicates [ahead-of-time
compilation][aot] (AOT). Exact timing can be found in the [table below](#table). The
figure was [programmatically generated](analysis) from the table.
### <a name="overall"></a>Overall impression
Programming language implementations in plb2 can be classified into three groups
depending on how and when compilation is done:
1. Purely interpreted (QuickJS, Perl and [CPython][cpy], the official Python
implementation). Not surprisingly, these are among the slowest language
implementations in this benchmark.
2. JIT compiled (Dart, Bun/Node, Java, Julia, LuaJIT, PHP, PyPy and Ruby3 with
[YJIT][yjit]). They are generally faster than pure interpretation.
Nonetheless, there is a large variance in this group. While PHP and Ruby3
are faster than Perl and CPython, they are still an order of magnitude
slower than PyPy. The two JavaScript engines (Bun and Node) and Julia
perform well. They are about twice as fast as PyPy.
3. AOT compiled (the rest). Optimizing binaries for specific hardware, these
compilers tend to generate the fastest executables.
### <a name="caveat"></a>Caveats
#### <a name="startup"></a>Startup time
Some JIT-based language runtimes take up to ~0.3 second to compile and warm-up.
We are not separating out this startup time. Nonetheless, because most
benchmarks run for several seconds, including the startup time does not greatly
affect the results.
#### <a name="cputime"></a>Elapsed time vs CPU time
Although no implementations use multithreading, language runtimes may be doing
extra work, such as garbage collection, in a separate thread. In this case, the
CPU time (user plus system) may be longer than elapsed wall-clock time. Julia,
in particular, takes noticeably more CPU time than wall-clock time even for the
simplest nqueen benchmark. In plb2, we are measuring the elapsed wall-clock
time because that is the number users often see. The ranking of CPU time may be
slightly different.
### <a name="opt"></a>Subtle optimizations
#### <a name="memlayout"></a>Controlling memory layout
When implementing bedcov in Julia, C and many compiled languages, it is
preferred to have an array of objects in a contiguous memory block such that
adjacent objects are close in memory. This helps cache efficiency. In most
scripting languages, unfortunately, we have to put references to objects in an
array at the cost of cache locality. The issue can be alleviated by cloning
objects to a new array. This doubles the speed of PyPy and Bun.
#### <a name="matmul"></a>Optimizing inner loops
The bottleneck of matrix multiplication falls in the following nested loop:
```cpp
for (int i = 0; i < n; ++i)
for (int k = 0; k < n; ++k)
for (int j = 0; j < n; ++j)
c[i][j] += a[i][k] * b[k][j];
```
It is obvious that `c[i]`, `b[k]` and `a[i][k]` can be moved out of the inner
loop to reduce the frequency of matrix access. The Clang compiler can apply
this optimization. Manual optimization may actually hurt performance.
However, **many other languages cannot optimize this nested loop.** If we
manually move `a[i][k]` to the loop above it, we can often improve their
performance. Some C/C++ programmers say compilers often optimize better than
human, but this might not be the case in other languages.
## <a name="conclusion"></a>Discussions
The most well-known and the longest running language benchmark is the [Computer
Language Benchmark Games][clbg]. Plb2 differs in that it includes different
languages (e.g. Nim and Crystal), different language runtimes (e.g. PyPy and
LuaJIT) and new tasks, and it comes with more uniform
implementations and focuses more on the performance of the language itself
without library functions. **Plb2 complements the Computer Language Benchmark
Games.**
One important area that plb2 does not evaluate is the performance of memory
allocation and/or garbage collection. This may contribute more to practical
performance than generating machine code. Nonetheless, it is challenging to
design a realistic micro-benchmark to evaluate memory allocation. If the
built-in allocator in a language implementation does not work well, we can
implement customized memory allocator just for the specific task but this, in
my view, would not represent typical use cases.
When plb was conducted in 2011, half of the languages in the figure above were
not mature or even did not exist. It is exciting to see many of them have
reached the 1.0 milestone and are gaining popularity among modern programmers.
On the other hand, Python remains one of the two most used scripting languages
despite its poor performance. In my view, this is because PyPy would not be
officially endorsed while other JIT-based languages are not general or good
enough. Will there be a language to displace Python in the next decade? I am
not optimistic.
## <a name="table"></a>Appendix: Timing on Apple M1 Macbook Pro
In the following table, star "\*" indicates AOT compilation and plus "+"
indicates JIT compilation.
|Label |Language |Runtime|Version| Plot | nqueen | matmul | sudoku | bedcov |
|:--------|:---------|:------|:------|:----:|-------:|-------:|-------:|-------:|
|c:clang* |C |Clang |15.0.0 | Y | 2.57 | 0.54 | 1.56 | 0.84 |
|cl:sbcl* |Lisp |SBCL |2.4.0 | Y | 3.19 | 3.84 | 3.40 | |
|codon\* |Codon | |0.16.3 | N | 2.91 | 2.48 | 3.10 | |
|crystal* |Crystal | |1.10.0 | Y | 3.28 | 2.45 | 3.14 | 0.87 |
|c#:.net* |C# |.NET |8.0.100| Y | 2.82 | 1.38 | 1.62 | 0.99 |
|d:ldc2* |D |LDC2 |1.35.0 | Y | 2.68 | 0.57 | 1.60 | 0.98 |
|dart:jit+|Dart |(JIT) |3.2.4 | Y | 3.62 | 2.74 | 3.24 | 2.85 |
|elixir+ |Elixir | |1.15.7 | Y | 26.17 | 67.39 | | |
|f90:gcc* |Fortran |GCC |13.2.0 | Y | 2.67 | 0.51 | 1.84 | |
|go* |Go | |1.21.5 | Y | 2.94 | 1.14 | 2.04 | 0.94 |
|java+ |Java |OpenJDK|20.0.1 | Y | 3.92 | 1.14 | 3.20 | 3.04 |
|js:bun+ |JavaScript|Bun |1.0.20 | Y | 3.11 | 1.75 | 3.07 | 2.32 |
|js:deno+ |JavaScript|Deno |1.39.1 | N | 4.00 | 3.06 | 4.04 | 2.50 |
|js:k8+ |JavaScript|k8 |1.0 | N | 3.79 | 2.99 | 3.76 | 2.60 |
|js:node+ |JavaScript|Node |21.5.0 | Y | 3.73 | 2.88 | 3.77 | 2.45 |
|js:node |JavaScript|Node-nojit|21.5.0|N | 55.48 | 162.84 | 63.91 | 20.81 |
|js:qjs |JavaScript|QuickJS|23-12-09|Y | 59.04 | 135.66 | 67.55 | 37.56 |
|julia+ |Julia | |1.10.0 | Y | 3.02 | 0.76 | 2.18 | 1.96 |
|luajit+ |Lua |LuaJIT |2.1 | Y | 5.31 | 2.66 | 4.48 | 10.52 |
|mojo* |Mojo | |0.6.1 | Y | 3.24 | 1.12 | | |
|nim* |Nim | |2.0.2 | Y | 2.57 | 0.56 | 1.64 | 1.07 |
|ocaml* |OCaml | |4.14.1 | Y | 3.56 | 2.14 | | |
|perl |Perl | |5.34.1 | Y | 158.34 | 158.01 | 90.78 | |
|php+ |PHP | |8.3 | Y | 48.15 | 71.20 | | |
|py:cpy |Python |CPython|3.11.7 | Y | 159.97 | 117.81 | 52.88 | 42.84 |
|py:graal+|Python |Graal EE|23.1.1| N | 4.38 | 16.22 | 59.52 | 12.32 |
|py:pypy+ |Python |PyPy |7.3.14 | Y | 6.91 | 4.89 | 8.82 | 6.27 |
|rb:crb+ |Ruby |CRuby+yjit|3.3.0| Y | 87.53 | 64.95 | 17.47 | 37.07 |
|rb:graal+|Ruby |Graal EE|23.1.1| Y | 6.54 | 4.10 | 4.11 | 5.23 |
|rust* |Rust | |1.75.0 | Y | 2.49 | 0.56 | 1.65 | 0.94 |
|scm:ch+ |Scheme |Chez |9.5.8 | Y | 3.54 | 18.98 | | |
|swift* |Swift | |5.9.0 | Y | 2.92 | 0.56 | 1.78 | 1.21 |
|v* |V | |0.4.3 | Y | 2.55 | 0.57 | 1.59 | 1.23 |
|zig* |Zig | |0.11.0 | Y | 2.72 | 0.56 | | |
[plb]: https://github.com/attractivechaos/plb
[8queen]: https://en.wikipedia.org/wiki/Eight_queens_puzzle
[8qrc]: https://rosettacode.org/wiki/N-queens_problem#C
[sudoku]: https://en.wikipedia.org/wiki/Sudoku
[kudoku]: https://attractivechaos.github.io/plb/kudoku.html
[iitree]: https://academic.oup.com/bioinformatics/article/37/9/1315/5910546
[hyperfine]: https://github.com/sharkdp/hyperfine
[cpy]: https://en.wikipedia.org/wiki/CPython
[pypy]: https://www.pypy.org
[bun]: https://bun.sh
[luablog]: https://attractivechaos.wordpress.com/2011/01/23/amazed-by-luajit/
[yjit]: https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md
[aot]: https://en.wikipedia.org/wiki/Ahead-of-time_compilation
[clbg]: https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html
[axpy]: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1
", Assign "at most 3 tags" to the expected json: {"id":"6663","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"