# 📜 DocETL: Powering Complex Document Processing Pipelines

[![Website](https://img.shields.io/badge/Website-docetl.org-blue)](https://docetl.org)
[![Documentation](https://img.shields.io/badge/Documentation-docs-green)](https://ucbepic.github.io/docetl)
[![Discord](https://img.shields.io/discord/1285485891095236608?label=Discord&logo=discord)](https://discord.gg/fHp7B2X3xx)
[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2410.12189)

![DocETL Figure](docs/assets/readmefig.png)

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:

1. An interactive UI playground for iterative prompt engineering and pipeline development
2. A Python package for running production pipelines from the command line or Python code

> 💡 **Need Help Writing Your Pipeline?**
> You can use **Claude Code** (recommended) to help you write your pipeline; see the quickstart: https://ucbepic.github.io/docetl/quickstart-claude-code/
> If you'd rather use ChatGPT or the Claude app, see [docetl.org/llms.txt](https://docetl.org/llms.txt) for a large prompt you can copy and paste before describing your task.

### 🌟 Community Projects

- [Conversation Generator](https://github.com/PassionFruits-net/docetl-conversation)
- [Text-to-speech](https://github.com/PassionFruits-net/docetl-speaker)
- [YouTube Transcript Topic Extraction](https://github.com/rajib76/docetl_examples)

### 📚 Educational Resources

- [UI/UX Thoughts](https://x.com/sh_reya/status/1846235904664273201)
- [Using Gleaning to Improve Output Quality](https://x.com/sh_reya/status/1843354256335876262)
- [Deep Dive on Resolve Operator](https://x.com/sh_reya/status/1840796824636121288)

## 🚀 Getting Started

There are two main ways to use DocETL:

### 1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)
[DocWrangler](https://docetl.org/playground) helps you iteratively develop your pipeline:

- Experiment with different prompts and see results in real time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use

![DocWrangler](docs/assets/tutorial/one-operation.png)

DocWrangler is hosted at [docetl.org/playground](https://docetl.org/playground). To run the playground locally instead, you can either:

- Use Docker (recommended for a quick start): `make docker`
- Set up the development environment manually

See the [Playground Setup Guide](https://ucbepic.github.io/docetl/playground/) for detailed instructions.

### 2. 📦 Python Package (For Production Use)

If you want to use DocETL as a Python package:

#### Prerequisites

- Python 3.10 or later
- OpenAI API key

```bash
pip install docetl
```

Create a `.env` file in your project directory:

```bash
OPENAI_API_KEY=your_api_key_here # Required for LLM operations (or the key for the LLM of your choice)
```

> ⚠️ **Important: Two Different .env Files**
> - **Root `.env`**: Used by the backend Python server that executes DocETL pipelines
> - **`website/.env.local`**: Used by the frontend TypeScript code in DocWrangler (UI features like improve prompt and chatbot)

To see examples of how to use DocETL, check out the [tutorial](https://ucbepic.github.io/docetl/tutorial/).

### 3. 🎮 DocWrangler Setup

To run DocWrangler locally, you have two options:

#### Option A: Using Docker (Recommended for Quick Start)

The easiest way to get the DocWrangler playground running:
1. Create the required environment files:

   Create `.env` in the root directory (for the backend Python server that executes pipelines):

   ```bash
   OPENAI_API_KEY=your_api_key_here # Used by DocETL pipeline execution engine

   # BACKEND configuration
   BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
   BACKEND_HOST=localhost
   BACKEND_PORT=8000
   BACKEND_RELOAD=True

   # FRONTEND configuration
   FRONTEND_HOST=0.0.0.0
   FRONTEND_PORT=3000

   # Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
   FRONTEND_DOCKER_COMPOSE_PORT=3031
   BACKEND_DOCKER_COMPOSE_PORT=8081

   # Supported text file encodings
   TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
   ```

   Create `.env.local` in the `website` directory (for DocWrangler UI features like improve prompt and chatbot):

   ```bash
   OPENAI_API_KEY=sk-xxx # Used by TypeScript features: improve prompt, chatbot, etc.
   OPENAI_API_BASE=https://api.openai.com/v1
   MODEL_NAME=gpt-4o-mini # Model used by the UI assistant

   NEXT_PUBLIC_BACKEND_HOST=localhost
   NEXT_PUBLIC_BACKEND_PORT=8000
   NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
   ```

2. Run Docker:

   ```bash
   make docker
   ```

   This will:

   - Create a Docker volume for persistent data
   - Build the DocETL image
   - Run the container with the UI accessible at http://localhost:3000

To clean up Docker resources (note that this will delete the Docker volume):

```bash
make docker-clean
```

##### AWS Bedrock

This framework supports integration with AWS Bedrock. To enable it:

1. Configure AWS credentials:

   ```bash
   aws configure
   ```

2. Test your AWS credentials:

   ```bash
   make test-aws
   ```

3. Run with AWS support:

   ```bash
   AWS_PROFILE=your-profile AWS_REGION=your-region make docker
   ```

   Or using Docker Compose:

   ```bash
   AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up
   ```

Environment variables:

- `AWS_PROFILE`: Your AWS CLI profile (default: `default`)
- `AWS_REGION`: AWS region (default: `us-west-2`)

Bedrock models are prefixed with `bedrock`.
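For example, a Bedrock-hosted model is referenced with a `bedrock/`-prefixed, liteLLM-style identifier. The snippet below is an illustrative sketch of setting such a model as a pipeline's default; the exact model ID depends on what is enabled in your AWS account:

```yaml
# Illustrative: a bedrock/-prefixed model name in a pipeline config
default_model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
```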
See the liteLLM [docs](https://docs.litellm.ai/docs/providers/bedrock#supported-aws-bedrock-models) for more details.

#### Option B: Manual Setup (Development)

For development, or if you prefer not to use Docker:

1. Clone the repository:

   ```bash
   git clone https://github.com/ucbepic/docetl.git
   cd docetl
   ```

2. Set up environment variables in `.env` in the root/top-level directory (for the backend Python server):

   ```bash
   OPENAI_API_KEY=your_api_key_here # Used by DocETL pipeline execution engine

   # BACKEND configuration
   BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
   BACKEND_HOST=localhost
   BACKEND_PORT=8000
   BACKEND_RELOAD=True

   # FRONTEND configuration
   FRONTEND_HOST=0.0.0.0
   FRONTEND_PORT=3000

   # Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
   FRONTEND_DOCKER_COMPOSE_PORT=3031
   BACKEND_DOCKER_COMPOSE_PORT=8081

   # Supported text file encodings
   TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
   ```

   And create an `.env.local` file in the `website` directory (for DocWrangler UI features):

   ```bash
   OPENAI_API_KEY=sk-xxx # Used by TypeScript features: improve prompt, chatbot, etc.
   OPENAI_API_BASE=https://api.openai.com/v1
   MODEL_NAME=gpt-4o-mini # Model used by the UI assistant

   NEXT_PUBLIC_BACKEND_HOST=localhost
   NEXT_PUBLIC_BACKEND_PORT=8000
   NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
   ```

3. Install dependencies:

   ```bash
   make install    # Install Python deps with uv and set up pre-commit
   make install-ui # Install UI dependencies
   ```

   If you prefer using uv directly instead of Make:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   uv sync --all-groups --all-extras
   ```

4. Start the development server:

   ```bash
   make run-ui-dev
   ```

5. Visit http://localhost:3000/playground to access the interactive UI.
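Whichever setup you use, production pipelines themselves are YAML files run with the `docetl` CLI. The sketch below is illustrative rather than canonical: the file names, prompt, and output schema are invented for this example, so verify the exact configuration fields against the [tutorial](https://ucbepic.github.io/docetl/tutorial/):

```yaml
# pipeline.yaml -- illustrative sketch of a single map-operation pipeline
default_model: gpt-4o-mini

datasets:
  documents:
    type: file
    path: documents.json # hypothetical input: a JSON list of {"text": ...} items

operations:
  - name: extract_themes
    type: map
    prompt: |
      Extract the main themes discussed in the following document:
      {{ input.text }}
    output:
      schema:
        themes: "list[str]"

pipeline:
  steps:
    - name: theme_analysis
      input: documents
      operations:
        - extract_themes
  output:
    type: file
    path: themes.json
```

Assuming a config along these lines, the pipeline would be executed with `docetl run pipeline.yaml`, with `OPENAI_API_KEY` set in `.env` as described above.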
### 🛠️ Development Setup

If you're planning to contribute to or modify DocETL, you can verify your setup by running the test suite:

```bash
make tests-basic # Runs basic test suite (costs < $0.01 with OpenAI)
```

For detailed documentation and tutorials, visit our [documentation](https://ucbepic.github.io/docetl).