Vision infrastructure to turn complex documents into RAG/LLM-ready data <br />
<div align="center">
<a href="https://github.com/lumina-ai-inc/chunkr">
<img src="images/logo.svg" alt="Logo" width="80" height="80">
</a>
<h3 align="center">Chunkr | Open Source Document Intelligence API</h3>
<p align="center">
Production-ready API service for document layout analysis, OCR, and semantic chunking.<br />Convert PDFs, PPTs, Word docs & images into RAG/LLM-ready chunks.
<br /><br />
<b>Layout Analysis</b> | <b>OCR + Bounding Boxes</b> | <b>Structured HTML and Markdown</b> | <b>VLM Processing Controls</b>
<br />
<br />
<a href="https://www.chunkr.ai">Try it out!</a>
·
<a href="https://github.com/lumina-ai-inc/chunkr/issues/new">Report Bug</a>
·
<a href="#connect-with-us">Contact</a>
ยท
<a href="https://discord.gg/XzKWFByKzW">Discord</a>
</p>
</div>
<div align="center">
<a href="https://www.chunkr.ai">
<img src="https://chunkr.ai/og-image.png" alt="Chunkr" width="1200" height="630">
</a>
</div>
## Table of Contents
- [Table of Contents](#table-of-contents)
- [(Super) Quick Start](#super-quick-start)
- [Documentation](#documentation)
- [Self-Hosted Deployment Options](#self-hosted-deployment-options)
- [Quick Start with Docker Compose](#quick-start-with-docker-compose)
- [HTTPS Setup for Docker Compose](#https-setup-for-docker-compose)
- [Deployment with Kubernetes](#deployment-with-kubernetes)
- [LLM Configuration](#llm-configuration)
- [Using models.yaml (Recommended)](#using-modelsyaml-recommended)
- [Using environment variables (Basic)](#using-environment-variables-basic)
- [Common LLM API Providers](#common-llm-api-providers)
- [Licensing](#licensing)
- [Connect With Us](#connect-with-us)
## (Super) Quick Start
1. Go to [chunkr.ai](https://www.chunkr.ai)
2. Make an account and copy your API key
3. Install our Python SDK:
```bash
pip install chunkr-ai
```
4. Use the SDK to process your documents:
```python
from chunkr_ai import Chunkr
# Initialize with your API key from chunkr.ai
chunkr = Chunkr(api_key="your_api_key")
# Upload a document (URL or local file path)
url = "https://chunkr-web.s3.us-east-1.amazonaws.com/landing_page/input/science.pdf"
task = chunkr.upload(url)
# Export results in various formats
html = task.html(output_file="output.html")
markdown = task.markdown(output_file="output.md")
content = task.content(output_file="output.txt")
task.json(output_file="output.json")
# Clean up
chunkr.close()
```
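If you want to work with the chunks directly instead of a rendered export, the task result exposes them. A minimal sketch; the attribute names (`task.output.chunks`, `segment.segment_type`) are assumptions about the current SDK shape, so verify them against the [docs](https://docs.chunkr.ai):
```python
# Iterate the parsed chunks and their segments
# (attribute names are assumptions; confirm against the SDK docs)
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.segment_type, segment.content)
```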
## Documentation
Visit our [docs](https://docs.chunkr.ai) for more information and examples.
## Self-Hosted Deployment Options
### Quick Start with Docker Compose
1. Prerequisites:
- [Docker and Docker Compose](https://docs.docker.com/get-docker/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) (for GPU support, optional)
2. Clone the repo:
```bash
git clone https://github.com/lumina-ai-inc/chunkr
cd chunkr
```
3. Set up your configuration files:
```bash
# Copy the example environment file
cp .env.example .env
# Configure your LLM models
cp models.example.yaml models.yaml
```
For details on configuring LLMs, see [LLM Configuration](#llm-configuration).
4. Start the services:
```bash
# For GPU deployment, use the following command:
docker compose up -d
# For CPU deployment, use the following command:
docker compose -f compose-cpu.yaml up -d
# For Mac ARM architecture (e.g., M2, M3) deployment, use the following command:
docker compose -f compose-cpu.yaml -f compose-mac.yaml up -d
```
5. Access the services:
- Web UI: `http://localhost:5173`
- API: `http://localhost:8000`
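To confirm both containers are reachable before opening the UI, a quick probe from the host works; the `/health` path on the API is an assumption, so substitute whatever status route your deployment exposes:
```bash
# Probe the API (the /health path is an assumption; any route that returns 200 works)
curl -i http://localhost:8000/health

# Confirm the web UI is being served
curl -I http://localhost:5173
```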
6. Stop the services when done:
```bash
# For GPU deployment, use the following command:
docker compose down
# For CPU deployment, use the following command:
docker compose -f compose-cpu.yaml down
# For Mac ARM architecture (e.g., M2, M3) deployment, use the following command:
docker compose -f compose-cpu.yaml -f compose-mac.yaml down
```
#### HTTPS Setup for Docker Compose
This section explains how to set up HTTPS with a self-signed certificate when hosting Chunkr on a VM with Docker Compose. It lets you access the web UI, API, Keycloak (authentication service), and MinIO (object storage service) over HTTPS.
1. Generate a self-signed certificate:
```bash
# Create a certs directory
mkdir certs
# Generate the certificate
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout certs/nginx.key -out certs/nginx.crt -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
```
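Before wiring the certificate into the proxy, you can confirm it carries the subject and SANs set above (the `-ext` flag assumes OpenSSL 1.1.1 or newer):
```bash
# Print the certificate's subject, validity window, and subjectAltName
openssl x509 -in certs/nginx.crt -noout -subject -dates -ext subjectAltName
```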
2. Update the .env file with your VM's IP address:
> **Important**: Replace all instances of "localhost" with your VM's actual IP address. Use "https://" instead of "http://", and note that the ports differ from the HTTP setup (no port for the web UI, 8444 for the API, 8443 for Keycloak, 9100 for MinIO):
```bash
AWS__PRESIGNED_URL_ENDPOINT=https://your_vm_ip_address:9100
WORKER__SERVER_URL=https://your_vm_ip_address:8444
VITE_API_URL=https://your_vm_ip_address:8444
VITE_KEYCLOAK_POST_LOGOUT_REDIRECT_URI=https://your_vm_ip_address
VITE_KEYCLOAK_REDIRECT_URI=https://your_vm_ip_address
VITE_KEYCLOAK_URL=https://your_vm_ip_address:8443
```
3. Start the services:
```bash
# For GPU deployment, use the following command:
docker compose --profile proxy up -d
# For CPU deployment, use the following command:
docker compose -f compose-cpu.yaml --profile proxy up -d
# For Mac ARM architecture (e.g., M2, M3) deployment, use the following command:
docker compose -f compose-cpu.yaml -f compose-mac.yaml --profile proxy up -d
```
4. Access the services:
- Web UI: `https://your_vm_ip_address`
- API: `https://your_vm_ip_address:8444`
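Since the certificate is self-signed, browsers and curl will flag it; `curl -k` skips verification so you can confirm the proxy is terminating TLS:
```bash
# -k disables certificate verification, which is expected for a self-signed cert
curl -k -I https://your_vm_ip_address:8444
```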
5. Stop the services when done:
```bash
# For GPU deployment, use the following command:
docker compose --profile proxy down
# For CPU deployment, use the following command:
docker compose -f compose-cpu.yaml --profile proxy down
# For Mac ARM architecture (e.g., M2, M3) deployment, use the following command:
docker compose -f compose-cpu.yaml -f compose-mac.yaml --profile proxy down
```
### Deployment with Kubernetes
For production environments, we provide a Helm chart and detailed deployment instructions:
- See our detailed guide at [`kube/README.md`](kube/README.md)
- Includes configurations for high availability and scaling
For enterprise support and deployment assistance, [contact us](mailto:[email protected]).
## LLM Configuration
Chunkr supports two ways to configure LLMs:
1. **models.yaml file**: Advanced configuration for multiple LLMs with additional options
2. **Environment variables**: Simple configuration for a single LLM
### Using models.yaml (Recommended)
For more flexible configuration with multiple models, default/fallback options, and rate limits:
1. Copy the example file to create your configuration:
```bash
cp models.example.yaml models.yaml
```
2. Edit the models.yaml file with your configuration. Example:
```yaml
models:
- id: gpt-4o
model: gpt-4o
provider_url: https://api.openai.com/v1/chat/completions
api_key: "your_openai_api_key_here"
default: true
rate-limit: 200 # requests per minute - optional
```
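If you use more than one provider, additional entries reuse the same keys. A sketch of a two-model setup; the second entry's endpoint comes from the providers table below, and its model name is only illustrative:
```yaml
models:
  - id: gpt-4o
    model: gpt-4o
    provider_url: https://api.openai.com/v1/chat/completions
    api_key: "your_openai_api_key_here"
    default: true
  - id: gemini-flash
    model: gemini-2.0-flash
    provider_url: https://generativelanguage.googleapis.com/v1beta/openai/chat/completions
    api_key: "your_google_api_key_here"
    rate-limit: 150 # requests per minute - optional
```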
Benefits of using models.yaml:
- Configure multiple LLM providers simultaneously
- Set default and fallback models
- Add distributed rate limits per model
- Reference models by ID in API requests (see docs for more info)
>Read the `models.example.yaml` file for more information on the available options.
### Using environment variables (Basic)
You can use any OpenAI API-compatible endpoint by setting the following variables in your `.env` file, for example:
```
LLM__KEY=your_api_key
LLM__MODEL=gpt-4o
LLM__URL=https://api.openai.com/v1/chat/completions
```
### Common LLM API Providers
Below is a table of common LLM providers and their configuration details to get you started:
| Provider | API URL | Documentation |
| ---------------- | ------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------- |
| OpenAI | https://api.openai.com/v1/chat/completions | [OpenAI Docs](https://platform.openai.com/docs) |
| Google AI Studio | https://generativelanguage.googleapis.com/v1beta/openai/chat/completions | [Google AI Docs](https://ai.google.dev/gemini-api/docs/openai) |
| OpenRouter | https://openrouter.ai/api/v1/chat/completions | [OpenRouter Models](https://openrouter.ai/models) |
| Self-Hosted      | http://localhost:8000/v1/chat/completions                                 | [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) or [Ollama](https://ollama.com/blog/openai-compatibility)  |
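Before pointing Chunkr at any of these, it can save debugging time to hit the endpoint once by hand with a plain OpenAI-style request; swap in the URL, key, and model from your configuration:
```bash
# Smoke-test an OpenAI-compatible endpoint with your configured values
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'
```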
## Licensing
The core of this project is dual-licensed:
1. [GNU Affero General Public License v3.0 (AGPL-3.0)](LICENSE)
2. Commercial License
To use Chunkr without complying with the AGPL-3.0 license terms, you can [contact us](mailto:[email protected]) or visit our [website](https://chunkr.ai).
## Connect With Us
- 📧 Email: [[email protected]](mailto:[email protected])
- 📅 Schedule a call: [Book a 30-minute meeting](https://cal.com/mehulc/30min)
- 🌐 Visit our website: [chunkr.ai](https://chunkr.ai)
", Assign "at most 3 tags" to the expected json: {"id":"12909","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"