AI prompts
base on Code for collecting, processing, and preparing datasets for the Common Pile # The Common Pile
This repository tracks the code used to collect, process, and prepare the datasets for the Common Pile.
The code used for the preparation of each source in the Common Pile can be found in the `sources/` subdirectory.
Source-agnostic utility code and scripts are provided in the `common_pile` package.
If you are looking for the data itself or our trained models, please see [our Hugging Face organization](https://huggingface.co/common-pile/).
## Installation
The majority of packages required for dataset creation can be installed with `pip install -r requirements.txt`.
To make use of the shared functionality in the `common_pile` pckage, run `pip install -e .`.
If you are on a system that doesn't support automatic installation of pandoc with `pypandoc_binary`, change it to `pypandoc` in the `requirements.txt` and and install pandoc manually.
## Contributing
If you'd like to contribute a new source to the Common Pile, please [start an issue](https://github.com/r-three/common-pile/issues/new) to share details of the source.
Generally, we expect each source to include code that 1) downloads the data, 2) processes it appropriately to retain primarily plain text, and 3) write out the results in the Dolma format (gzipped jsonl).
You can find utilities to help with each of these steps in the `common_pile` library.
Alternatively, you can look at our existing sources for ideas as to how to prepare a source.
We use git pre-commit hooks to format code and keep style consistent.
You can install the pre-commit libraries with `pip install pre-commit` and insert the pre-commit hooks with `pre-commit install` from the repository root.
## Tips
The [scripts subdirectory](https://github.com/r-three/common-pile/tree/main/common_pile/scripts) has various scripts that can be helpful for inspecting or computing statistics over data.
Alternatively, the Dolma-formatted files can be inspected with [`jq`](https://jqlang.org/) by running
```
cat ${file}.jsonl.gz | gunzip | jq -s ${commmand}
```
", Assign "at most 3 tags" to the expected json: {"id":"13985","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"