Trendshift - Ask AI

base on Investigates optimizing 'wc', the Unix word count program # wc2 - asynchronous state machine parsing There have been multiple articles lately implementing the classic `wc` program in various programming *languages*, to "prove" their favorite language can be "just as fast" as C. This project does something different. Instead of a different *language* it uses a different *algorithm*. The new algorithm is significantly faster -- implementing in a slow language like JavaScript is still faster than the original `wc` program written in C. The algorithm is known as an "asynchronous state-machine parser". It's a technique for *parsing* that you don't learn in college. It's more *efficient*, but more importantly, it's more *scalable*. That's why your browser uses a state-machine to parse GIFs, and most web servers use state-machiens to parse incoming HTTP requests. This projects contains three versions: * `wc2o.c` is a simplified 25 line version highlighting the idea * `wc2.c` is the full version in C, supporting Unicode * `wc2.js` is the version in JavaScript ## The basic algorithm The algorithm reads input and passes each byte one at a time to a state-machine. It looks something like: ```c length = fread(buf, 1, sizeof(buf), fp); for (i=0; i<length; i++) { c = buf[i]; state = table[state][c]; counts[state]++; } ``` No, you aren't suppose to be able to see how the word-count works by looking at this code. The complexity happens elsewhere, setting up the state-machine. The state-machine table is the difference between the simple version (`wc2o.c`) and complex version (`wc2.c`) of the program. The algorithm is the same, the one shown above, the difference is in how they setup the table. The simple program creates a table for ASCII, the complex program creates a much larger table supporting UTF-8. ## How `wc` works The `wc` word-count program counts the number of words in a file. A "word" is some non-space characters separate by space. Those who re-implement `wc` simplify the problem by only doing ASCII instead of the full UTF-8 Unicode. This is cheating, because much of the speed of `wc` comes from its need to handle character-sets like UTF-8. The real programs spend most of their time in functions like `mbrtowc()` to parse multi-byte characters and `iswspace()` to test if they are spaces -- which re-implementations of `wc` skip. For this reason, we've implemented a full UTF-8 version in this project, to prove that it works without cheating. Now the real `wc` works with a lot more character-sets, and we don't do that. But by implementing UTF-8, we've shown that it's possible, and that the speed for any character-set is the same. Another simplification is how invalid input is handled. The original `wc` program largley ignores errors, but it's still an important factor in making sure you are doing things correctly. ## Benchmark input files This project uses a number of large input files for benchmarking. The traditional `wc` program has wildly different performance depending upon input, such as whether the file is full of illegal characters, or whether UTF-8 is being handled. The first test file is downloaded from the Internet as "real-world data", while the others are generated using a program built with this project (`wctool`). * `pocorgtfo18.pdf` a large 92-million byte PDF file that contains binary/illegal characters * `ascii.txt` a file the same size containing random words, ASCII-only * `utf8.txt` a file containing random UTF-8 sequences of 1, 2, 3, and 4 bytes * `word.txt` a file containing 92-million 'x' characters * `space.txt` a file containing 92-million ' ' (space) characters Before benchmarking the old `wc`, set the character-set to UTF-8. It's probably already set to this on new systems, but do this to make sure: $ export LC_CTYPE=en_US.UTF-8 When running `wc`, the `-lwc` is the default for counting words in ASCII text. To convert it into UTF-8 "multi-byte" mode, change `c` o `m`, as in `-lwm`. The numbers are reported come from the Unix `time` command, the number of seconds for `user` time. In other words, `elapsed` time or `system` time aren't reported. The following table shows benchmarking a 2019 x86 MacBook Air of the old `wc` program. As you can see, it has a wide variety of speeds depending on input. The `wc` program included with macOS and Linux are completely different. Therefore, the following table shows them benchmarked against each other on the same hardware. | Command | Input File | macOS | Linux | |---------|---------------|------:|------:| | wc -lwc |pocorgtfo18.pdf|0.709 | 5.591 | | wc -lwm |pocorgtfo18.pdf|0.693 | 5.419 | | wc -lwc |ascii.txt |0.296 | 2.509 | | wc -lwm |utf8.txt |0.532 | 1.840 | | wc -lwc |space.txt |0.296 | 0.284 | | wc -lwm |space.txt |0.295 | 0.298 | | wc -lwc |word.txt |0.302 | 1.268 | | wc -lwm |word.txt |0.294 | 1.337 | These results tell us: * Illegal characters (in `pocorgtfo18.pdf`) slow things down a lot, twice as slow on macOS, 10x slower on Linux. * Text that randomly switches between spaces and words is much slower than text containing all the same character. * On Linux, the code path that reads all spaces is significantly faster. * The macOS program is in general much faster than the Linux version. * Processing Unicode (the file `utf8.txt` with the `-m` option) is slower than processing ASCII (the file `ascii.txt` with the `-c` option). ## Our benchmarks The time for our algorithm, in C and JavaScript, are the following. The state-machine parser is immune to input type, all the input files show the same results. | Program | Input File | macOS | Linux | |---------|--------------|------:|------:| | wc2.c | (all) |0.206 | 0.278 | | wc2.js | (all) |0.281 | 0.488 | These results tell us: * This state machine approach always results in the same speed, regardless of input. * This state machine approach is faster than the built-in programs. * Even written in JavaScript, the state machine approach is competitive in speed. * The difference in macOS and Linux speed is actually the difference in `clang` and `gcc` speed. The LLVM `clang` compiler is doing better optimizations for x86 processors here. * I don't know why Node.js behaves differently on macOS and Linux, it's probably just due to different versions. * A JIT (like NodeJS) works well with simple compute algorithms. This tells us little about it's relative performance in larger programs. All languages that have a JIT should compile this sort of algorithm to roughly the same speed. ## Asynchronous scalability The algorithm is *faster*, but more importantly, it's more *scalable*. Such scalability isn't usefull for `wc`, but is incredibly important for network programs. Consider an HTTP web-server. The traditional way that the Apache web-server worked was by reading the entire header in and buffering it, before then parsing the header. This need to buffer the entire header caused an enormous scalability problem. In contrast, asynchronous web-servers like Nginx use a state-machine parser. They parse the bytes as they arrive, and discard them. This is analogous to NFA and DFA regular-expressions. If you use the NFA approach, you need to buffer the entire chunk of data, so that the regex can backtrack. Using the DFA approach, input can be provided as a stream, one byte at a time, without needing buffering. DFAs are more scalable than NFAs. ## State machine parsers This project contains a minimalistic `wc2o.c` program to highlight the algorithm, without all the fuss of building UTF-8 tables, supporting only ASCII. ```c #include <stdio.h> int main(void) { static const unsigned char table[4][4] = { {2,0,1,0,}, {2,0,1,0,}, {3,0,1,0,}, {3,0,1,0,} }; static const unsigned char column[256] = { 0,0,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0, }; unsigned long counts[4] = {0,0,0,0}; int state = 0; int c; while ((c = getchar()) != EOF) { state = table[state][column[c]]; counts[state]++; } printf("%lu %lu %lu\n", counts[1], counts[2], counts[0] + counts[1] + counts[2] + counts[3]); return 0; } ``` The key part that does all the word counting is in the two lines inside: ```c while ((c = getchar()) != EOF) { state = table[state][column[c]]; counts[state]++; } ``` This is only defined for ASCII, so you can see the state-machine on a single-line in the code (`table`). ## Additional tools This project includes additional tools: * `wctool` to generate large test files * `wcdiff` to find difference between two implementatins of `wc` * `wcstream` to fragment input files (demonstrates a bug in macOS's `wc`) The program `wc2.c` has the same logic, the difference being that it generates a larger state-machine for parsing UTF-8. ## Pointer arithmetic C has a peculiar idiom called "pointer arithmetic", where pointers can be incremented. Looping through a buffer is done with an expression like `*buf++` instead of `buf[i++]`. Many programmers think pointer-arithmetic is faster. To test this, the `wc2.c` program has an option `-P` that makes this small change, to test the difference in speed. ", Assign "at most 3 tags" to the expected json: {"id":"10910","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"

AI prompts

AI prompts