## μDSV
A [faster](#performance) CSV parser in [5KB (min)](https://github.com/leeoniya/uDSV/tree/main/dist/uDSV.iife.min.js) _(MIT Licensed)_
---
### Introduction
uDSV is a fast JS library for parsing well-formed CSV strings, either from memory or incrementally from disk or network.
It is mostly [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180) compliant, with support for quoted values containing commas, escaped quotes, and line breaks¹.
The aim of this project is to handle the 99.5% use-case without adding complexity and performance trade-offs to support the remaining 0.5%.
¹ Line breaks (`\n`,`\r`,`\r\n`) within quoted values must match the row separator.
---
### Features
What does uDSV pack into 5KB?
- [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180) compliant
- Incremental or full parsing, with optional accumulation
- Auto-detection and customization of delimiters (rows, columns, quotes, escapes)
- Schema inference and value typing: `string`, `number`, `boolean`, `date`, `json`
- Defined handling of `''`, `'null'`, `'NaN'`
- Whitespace trimming of values & skipping empty lines
- Multi-row header skipping and column renaming
- Multiple outputs: arrays (tuples), objects, nested objects, columnar arrays
Of course, _most_ of these are table stakes for CSV parsers :)
---
### Performance
Is it Lightning Fast™ or Blazing Fast™?
No, those are too slow! uDSV has [Ludicrous Speed™](https://www.youtube.com/watch?v=ygE01sOhzz0);
it's faster than the parsers you recognize and faster than those you've never heard of.
Most CSV parsers have one happy/fast path -- the one without quoted values, without value typing, and only when using the default settings & output format.
Once you're off that path, you can generally throw any self-promoting benchmarks in the trash.
In contrast, uDSV remains fast with any datasets and all options; its happy path is _every path_.
On a Ryzen 7 ThinkPad running Linux v6.13.3 and NodeJS v22.14.0, a diverse set of benchmarks shows a 1x-5x performance boost relative to the [popular](https://github.com/search?q=csv+parser&type=repositories&s=stars&o=desc), [proven-fast](https://leanylabs.com/blog/js-csv-parsers-benchmarks/) [Papa Parse](https://www.papaparse.com/).
**customers-100000.csv (17 MB, 12 cols x 100K rows), parsing to strings**

| Name                   | Rows/s | Throughput (MiB/s) |
|------------------------|-------:|-------------------:|
| csv-simple-parser      | 1.45M  | 240                |
| uDSV                   | 1.39M  | 230                |
| PapaParse              | 1.13M  | 187                |
| tiddlycsv              | 1.09M  | 180                |
| ACsv                   | 1.07M  | 176                |
| but-csv                | 1.05M  | 174                |
| d3-dsv                 | 987K   | 163                |
| csv-rex                | 887K   | 147                |
| csv42                  | 781K   | 129                |
| achilles-csv-parser    | 687K   | 114                |
| arquero                | 567K   | 93.6               |
| comma-separated-values | 545K   | 90                 |
| node-csvtojson         | 456K   | 75.3               |
| @vanillaes/csv         | 427K   | 70.5               |
| SheetJS                | 415K   | 68.5               |
| csv-parser (neat-csv)  | 307K   | 50.7               |
| CSVtoJSON              | 297K   | 49.1               |
| dekkai                 | 221K   | 36.5               |
| csv-js                 | 206K   | 34.1               |
| @gregoranders/csv      | 202K   | 33.3               |
| csv-parse/sync         | 177K   | 29.3               |
| jquery-csv             | 155K   | 25.6               |
| @fast-csv/parse        | 114K   | 18.9               |
| utils-dsv-base-parse   | 74.3K  | 12.3               |
You might be thinking, "Okay, it's not _that_ much faster than PapaParse".
But things change significantly when parsing with types.
PapaParse is 50% slower without even creating the 100k `Date` objects that other libs do.
**customers-100000.csv (17 MB, 12 cols x 100K rows), parsing with types**

| Name                   | Rows/s | Throughput (MiB/s) | Types              |
|------------------------|-------:|-------------------:|--------------------|
| uDSV                   | 993K   | 164                | date,number,string |
| csv42                  | 686K   | 113                | number,string      |
| csv-simple-parser      | 666K   | 110                | date,number,string |
| csv-rex                | 627K   | 104                | number,string      |
| comma-separated-values | 536K   | 88.5               | number,string      |
| achilles-csv-parser    | 517K   | 85.3               | number,string      |
| arquero                | 478K   | 79                 | date,number,string |
| PapaParse              | 463K   | 76.4               | number,string      |
| d3-dsv                 | 389K   | 64.3               | date,number,string |
| @vanillaes/csv         | 312K   | 51.5               | NaN,number,string  |
| CSVtoJSON              | 284K   | 46.8               | number,string      |
| csv-parser (neat-csv)  | 265K   | 43.7               | number,string      |
| csv-js                 | 211K   | 34.8               | number,string      |
| dekkai                 | 209K   | 34.6               | number,string      |
| csv-parse/sync         | 101K   | 16.7               | date,number,string |
| SheetJS                | 64.5K  | 10.7               | number,string      |
And when the dataset also has many quoted values, the performance gap grows to 3x.
Once again, we're ignoring the fact that results with "object" types ran `JSON.parse()` 34k times.
**uszips.csv (6 MB, 18 cols x 34K rows), parsing with types**

| Name                   | Rows/s | Throughput (MiB/s) | Types                             |
|------------------------|-------:|-------------------:|-----------------------------------|
| uDSV                   | 521K   | 93                 | boolean,null,number,object,string |
| csv-simple-parser      | 416K   | 74.3               | boolean,null,number,object,string |
| achilles-csv-parser    | 342K   | 61.2               | boolean,null,number,object,string |
| d3-dsv                 | 284K   | 50.8               | null,number,string                |
| csv-rex                | 267K   | 47.7               | boolean,null,number,object,string |
| comma-separated-values | 262K   | 46.7               | number,string                     |
| dekkai                 | 258K   | 46.1               | NaN,number,string                 |
| arquero                | 251K   | 44.9               | null,number,string                |
| CSVtoJSON              | 236K   | 42.2               | number,string                     |
| csv42                  | 225K   | 40.1               | number,object,string              |
| csv-js                 | 215K   | 38.4               | boolean,number,string             |
| csv-parser (neat-csv)  | 198K   | 35.3               | boolean,null,number,object,string |
| @vanillaes/csv         | 179K   | 32                 | NaN,number,string                 |
| PapaParse              | 176K   | 31.4               | boolean,null,number,string        |
| SheetJS                | 98.6K  | 17.6               | boolean,number,string             |
| csv-parse/sync         | 91.8K  | 16.4               | number,string                     |
For _way too many_ synthetic and real-world benchmarks, head over to [/bench](/bench)...and don't forget your coffee!
---
### Installation
```
npm i udsv
```
or
```html
<script src="./dist/uDSV.iife.min.js"></script>
```
---
### API
The full API fits in a ~150 LoC TypeScript definition: [uDSV.d.ts](https://github.com/leeoniya/uDSV/blob/main/dist/uDSV.d.ts).
---
### Basic Usage
```js
import { inferSchema, initParser } from 'udsv';
let csvStr = 'a,b,c\n1,2,3\n4,5,6';
let schema = inferSchema(csvStr);
let parser = initParser(schema);
// native format (fastest)
let stringArrs = parser.stringArrs(csvStr); // [ ['1','2','3'], ['4','5','6'] ]
// typed formats (internally converted from native)
let typedArrs = parser.typedArrs(csvStr); // [ [1, 2, 3], [4, 5, 6] ]
let typedObjs = parser.typedObjs(csvStr); // [ {a: 1, b: 2, c: 3}, {a: 4, b: 5, c: 6} ]
let typedCols = parser.typedCols(csvStr); // [ [1, 4], [2, 5], [3, 6] ]
let stringObjs = parser.stringObjs(csvStr); // [ {a: '1', b: '2', c: '3'}, {a: '4', b: '5', c: '6'} ]
let stringCols = parser.stringCols(csvStr); // [ ['1', '4'], ['2', '5'], ['3', '6'] ]
```
Sometimes you may need to render the unmodified string values (like in an editable grid), but want to sort/filter using the typed values (e.g. number or date columns).
uDSV's `.typed*()` methods additionally accept the untyped string-tuples array returned by `parser.stringArrs(csvStr)`:
```js
let schema = inferSchema(csvStr);
let parser = initParser(schema);
// raw parsed strings for rendering
let stringArrs = parser.stringArrs(csvStr);
// typed values for sorting/filtering
let typedObjs = parser.typedObjs(stringArrs);
```
Need a custom or user-defined parser for a specific column?
No problem!
```js
const csvStr = `a,b,c\n1,2,a-b-c\n4,5,d-e`;
let schema = inferSchema(csvStr);
schema.cols[2].parse = str => str.split('-');
let parser = initParser(schema);
let rows = parser.typedObjs(csvStr);
/*
[
{a: 1, b: 2, c: ['a', 'b', 'c']},
  {a: 4, b: 5, c: ['d', 'e']},
]
*/
```
Nested/deep objects can be re-constructed from column naming via `.typedDeep()`:
```js
// deep/nested objects (from column naming)
let csvStr2 = `
_type,name,description,location.city,location.street,location.geo[0],location.geo[1],speed,heading,size[0],size[1],size[2]
item,Item 0,Item 0 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,3.4,5.1,0.9
`.trim();
let schema2 = inferSchema(csvStr2);
let parser2 = initParser(schema2);
let typedDeep = parser2.typedDeep(csvStr2);
/*
[
{
_type: 'item',
name: 'Item 0',
description: 'Item 0 description in text',
location: {
city: 'Rotterdam',
street: 'Main street',
geo: [ 51.9280712, 4.4207888 ]
},
speed: 5.4,
heading: 128.3,
size: [ 3.4, 5.1, 0.9 ],
}
]
*/
```
**CSP Note:**
uDSV uses dynamically-generated functions (via `new Function()`) for its `.typed*()` methods.
These functions are lazy-generated and use `JSON.stringify()` [code-injection guards](https://github.com/leeoniya/uDSV/commit/4e7472a7015c0a7ae5ae76e41f282bd4bdcf0c67), so the risk should be minimal.
Nevertheless, if you have strict [CSP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) without `unsafe-eval`, you won't be able to take advantage of the typed methods and will have to do the type conversion from the string tuples yourself.
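In that case, a manual coercion pass over the string tuples might look like the sketch below. The `toTyped` helper and its type-name strings are illustrative, not part of uDSV's API:

```js
// Hypothetical helper (not uDSV API): coerce string tuples using a
// per-column list of type names.
function toTyped(stringArrs, colTypes) {
  return stringArrs.map(row =>
    row.map((val, i) => {
      switch (colTypes[i]) {
        case 'number':  return val === '' ? null : Number(val);
        case 'boolean': return val === 'true';
        case 'date':    return new Date(val);
        case 'json':    return JSON.parse(val);
        default:        return val; // 'string'
      }
    })
  );
}

// e.g. toTyped(parser.stringArrs(csvStr), ['number', 'boolean', 'string'])
```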
---
### Incremental / Streaming
uDSV has no inherent knowledge of streams.
Instead, it exposes a generic incremental parsing API to which you can pass sequential chunks.
These chunks can come from various sources, such as a [Web Stream](https://css-tricks.com/web-streams-everywhere-and-fetch-for-node-js/) or [Node stream](https://nodejs.org/api/stream.html) via `fetch()` or `fs`, a [WebSocket](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API), etc.
Here's what it looks like with Node's [fs.createReadStream()](https://nodejs.org/api/fs.html#fscreatereadstreampath-options):
```js
let stream = fs.createReadStream(filePath);
let parser = null;
let result = null;
stream.on('data', (chunk) => {
// convert from Buffer
let strChunk = chunk.toString();
// on first chunk, infer schema and init parser
parser ??= initParser(inferSchema(strChunk));
// incremental parse to string arrays
parser.chunk(strChunk, parser.stringArrs);
});
stream.on('end', () => {
result = parser.end();
});
```
...and Web streams [in Node](https://nodejs.org/api/webstreams.html), or [Fetch's Response.body](https://developer.mozilla.org/en-US/docs/Web/API/Response/body):
```js
let stream = fs.createReadStream(filePath);
let webStream = Stream.Readable.toWeb(stream);
let textStream = webStream.pipeThrough(new TextDecoderStream());
let parser = null;
for await (const strChunk of textStream) {
parser ??= initParser(inferSchema(strChunk));
parser.chunk(strChunk, parser.stringArrs);
}
let result = parser.end();
```
The above examples show accumulating parsers -- they will buffer the full `result` into memory.
This may not be something you need (or want), for example with huge datasets where you're looking to get the sum of a single column, or want to filter only a small subset of rows.
To bypass this auto-accumulation behavior, simply pass your own handler as the third argument to `parser.chunk()`:
```js
// ...same as above
let sum = 0;
// sums fourth column
let reducer = (row) => {
sum += row[3];
};
for await (const strChunk of textStream) {
parser ??= initParser(inferSchema(strChunk));
parser.chunk(strChunk, parser.typedArrs, reducer); // typedArrs + reducer
}
parser.end();
```
Building on the non-accumulating example, Node's [Transform stream](https://nodejs.org/api/stream.html#implementing-a-transform-stream) will be something like:
```js
import { Transform } from "stream";
class ParseCSVTransform extends Transform {
#parser = null;
#push = null;
constructor() {
super({ objectMode: true });
this.#push = parsed => {
this.push(parsed);
};
}
_transform(chunk, encoding, callback) {
let strChunk = chunk.toString();
this.#parser ??= initParser(inferSchema(strChunk));
this.#parser.chunk(strChunk, this.#parser.typedArrs, this.#push);
callback();
}
_flush(callback) {
this.#parser.end();
callback();
}
}
```
---
### TODO?
- handle #comment rows
- emit empty-row and #comment events?