<div align="center">
<a href="https://thepi.pe/">
<img src="https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/pipeline_small%20(1).png" alt="Pipeline Illustration" style="width:96px; height:72px; vertical-align:middle;">
<h1>thepi.pe</h1>
</a>
<a href="https://github.com/emcf/thepipe/actions/workflows/python-ci.yml">
<img src="https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg" alt="python-gh-action">
</a>
<a href="https://codecov.io/gh/emcf/thepipe">
<img src="https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9" alt="codecov">
</a>
<a href="https://raw.githubusercontent.com/emcf/thepipe/main/LICENSE">
<img src="https://img.shields.io/badge/license-MIT-green" alt="MIT license">
</a>
<a href="https://www.pepy.tech/projects/thepipe-api">
<img src="https://static.pepy.tech/badge/thepipe-api" alt="PyPI">
</a>
</div>
## Extract clean data from tricky documents ⚡
thepi.pe is a package that scrapes clean markdown, multimodal media, and structured data from complex documents. It uses vision-language models (VLMs) under the hood for superior output quality, and works out of the box with any LLM, VLM, or vector database. It extracts well-formatted data from a wide range of sources, including PDFs, URLs, Word documents, PowerPoint presentations, Python notebooks, videos, audio, and more.
## Features 🌟
- Scrape clean markdown, tables, and images from any document
- Scrape text, images, video, and audio from any file or URL
- Works out of the box with vision-language models, vector databases, and RAG frameworks
- AI-native filetype detection, layout analysis, and structured data extraction
- Accepts a wide range of sources, including PDFs, URLs, Word documents, PowerPoint presentations, Python notebooks, GitHub repos, videos, audio, and more
## Get started in 5 minutes 🚀
thepipe can be installed via pip:
```bash
pip install thepipe-api
```
For full functionality with media-rich sources such as webpages, video, and audio, also install the following system dependencies:
```bash
apt-get update && apt-get install -y git ffmpeg
python -m playwright install --with-deps chromium
```
### Default setup (OpenAI)
By default, thepipe uses the [OpenAI API](https://platform.openai.com/docs/overview), so VLM features will work out of the box provided you have the `OPENAI_API_KEY` environment variable set.
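For example:
```bash
export OPENAI_API_KEY=sk-...
```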
### Custom VLM server setup (OpenRouter, OpenLLM, etc.)
If you wish to use a local vision-language model or a different cloud provider, set the `LLM_SERVER_BASE_URL` environment variable, for example `https://openrouter.ai/api/v1` for [OpenRouter](https://openrouter.ai/) or `http://localhost:3000/v1` for a local server such as [OpenLLM](https://github.com/bentoml/OpenLLM). Set the `LLM_SERVER_API_KEY` environment variable to authenticate with a non-OpenAI cloud provider. You can also set the `DEFAULT_AI_MODEL` environment variable to specify the model used for VLM features (for OpenAI, this defaults to `gpt-4o`).
### Scraping
```python
from thepipe.scraper import scrape_file
# scrape clean markdown and images from a PDF
chunks = scrape_file(filepath="paper.pdf", ai_extraction=True)
```
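URLs can be scraped in the same way. A minimal sketch, assuming `scrape_url` (referenced in the configuration section below) takes the target URL as its first argument and accepts the same `ai_extraction` flag:
```python
from thepipe.scraper import scrape_url

# scrape clean markdown and images from a webpage
chunks = scrape_url("https://thepi.pe/", ai_extraction=True)
```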
### Chunking
To satisfy token limits, the following chunking methods are available to split the scraped content into smaller chunks.
- `chunk_by_document`: Returns one chunk with the entire content of the file.
- `chunk_by_page`: Returns one chunk for each page (for example, each webpage, PDF page, or PowerPoint slide).
- `chunk_by_length`: Splits chunks by length.
- `chunk_by_section`: Splits chunks by markdown section.
- `chunk_by_keyword`: Splits chunks at specified keywords.
- `chunk_semantic` (experimental, requires [sentence transformers](https://pypi.org/project/sentence-transformers/)): Returns chunks split by spikes in semantic changes, with a configurable threshold.
- `chunk_agentic` (experimental, requires [OpenAI](https://pypi.org/project/openai/)): Returns chunks split by an LLM agent that attempts to find semantically meaningful sections.
For example:
```python
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_document, chunk_by_page
# optionally, pass in chunking_method
# chunk_by_document returns one chunk for the entire document
chunks = scrape_file(filepath="paper.pdf", chunking_method=chunk_by_document)
# you can also re-chunk later.
# chunk_by_page returns one chunk for each page (for example, each webpage, PDF page, or PowerPoint slide)
chunks = chunk_by_page(chunks)
```
### OpenAI Integration 🤖
```python
from openai import OpenAI
from thepipe.core import chunks_to_messages
# Initialize OpenAI client
client = OpenAI()
# Use OpenAI-formatted chat messages
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "What is the paper about?"
    }]
}]
# Simply add the scraped chunks to the messages
messages += chunks_to_messages(chunks)
# Call LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
```
`chunks_to_messages` accepts an optional `text_only` parameter to output only the text from the source document. This is useful for downstream use with LLMs that lack multimodal capabilities.
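Continuing the example above:
```python
# keep only the text content of each scraped chunk
text_messages = chunks_to_messages(chunks, text_only=True)
```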
> ⚠️ **It is important to be mindful of your model's token limit.**
> Be sure your prompt is within the token limit of your model. You can use chunking to split your messages into smaller chunks.
### LlamaIndex Integration 🦙
A chunk can be converted to a LlamaIndex `Document`/`ImageDocument` with `.to_llamaindex`.
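A minimal sketch, assuming `.to_llamaindex` returns a list of documents per chunk (the exact return shape is an assumption):
```python
# flatten the scraped chunks into a list of LlamaIndex documents
documents = [doc for chunk in chunks for doc in chunk.to_llamaindex()]
```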
### Structured extraction 🗂️
```python
from thepipe.extract import extract
schema = {
    "description": "string",
    "amount_usd": "float"
}
results, tokens_used = extract(
    chunks=chunks,
    schema=schema,
    multiple_extractions=True,  # extract multiple rows of data per chunk
)
```
## Sponsors
Please consider supporting thepipe by [becoming a sponsor](mailto:[email protected]).
Your support helps me maintain and improve the project while helping the open source community discover your work.
Visit [Cal.com](https://cal.com/) for an open source scheduling tool that helps you book meetings with ease. It's the perfect solution for busy professionals who want to streamline their scheduling process.
<a href="https://cal.com/emmett-mcf/30min"><img alt="Book us with Cal.com" src="https://cal.com/book-with-cal-dark.svg" /></a>
Looking for enterprise-ready document processing and intelligent automation? Discover
how [Trellis AI](https://runtrellis.com/) can streamline your workflows and enhance productivity.
## How it works 🛠️
thepipe uses a combination of computer vision models and heuristics to scrape clean content from the source and process it for downstream use with [large language models](https://en.wikipedia.org/wiki/Large_language_model) or [vision-language models](https://en.wikipedia.org/wiki/Vision_transformer). You can feed the resulting messages directly into a model, or chunk them for downstream storage in a vector database such as ChromaDB, or in a RAG framework such as LlamaIndex.
## Supported File Types 📚
| Source | Input types | Multimodal | Notes |
| ---------------------------- | ------------------------------------------------------------------------------------ | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI content extraction from the webpage's screenshot |
| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. `ai_extraction` available to use a VLM for complex or scanned documents |
| Word Document | `.docx` | ✔️ | Extracts text, tables, and images |
| PowerPoint | `.pptx` | ✔️ | Extracts text and images from slides |
| Video | `.mp4`, `.mov`, `.wmv` | ✔️ | Uses Whisper for transcription and extracts frames |
| Audio | `.mp3`, `.wav` | ✔️ | Uses Whisper for transcription |
| Jupyter Notebook | `.ipynb` | ✔️ | Extracts markdown, code, outputs, and images |
| Spreadsheet | `.csv`, `.xls`, `.xlsx` | ❌ | Converts each row to JSON format, including row index for each |
| Plaintext | `.txt`, `.md`, `.rtf`, etc | ❌ | Simple text extraction |
| Image | `.jpg`, `.jpeg`, `.png` | ✔️ | Uses VLM for OCR in text-only mode |
| ZIP File | `.zip` | ✔️ | Extracts and processes contained files |
| Directory | any `path/to/folder` | ✔️ | Recursively processes all files in directory. Optionally use `inclusion_pattern` to pass regex strings for file inclusion rules. |
| YouTube Video (known issues) | YouTube video URLs starting with `https://youtube.com` or `https://www.youtube.com`. | ✔️ | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your `pytube` installation to send a valid user agent header (see [this issue](https://github.com/pytube/pytube/issues/399)). |
| Tweet | URLs starting with `https://twitter.com` or `https://x.com` | ✔️ | Uses unofficial API, may break unexpectedly |
| GitHub Repository | GitHub repo URLs starting with `https://github.com` or `https://www.github.com` | ✔️ | Requires the `GITHUB_TOKEN` environment variable |
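For example, directory scraping with an inclusion rule might look like the sketch below; `scrape_directory` and its exact signature are assumptions based on the table above and the CLI's `inclusion_pattern` option:
```python
from thepipe.scraper import scrape_directory

# recursively scrape only Python files from a folder
# (function name and parameters are assumed, not confirmed)
chunks = scrape_directory("path/to/folder", inclusion_pattern=r".*\.py$")
```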
## Configuration & Environment
Set these environment variables to control API keys, hosting, and model defaults:
```bash
# If you want longer-term image storage and hosting (saves to ./images and serves via HOST_URL)
export HOST_IMAGES=true
# GitHub token for scraping private/public repos via `scrape_url`
export GITHUB_TOKEN=ghp_...
# Base URL + key for any custom LLM server (used in extract/scrape)
export LLM_SERVER_BASE_URL=https://openrouter.ai/api/v1
export LLM_SERVER_API_KEY=or-...
# Control scraping defaults
export DEFAULT_AI_MODEL=gpt-4o
export FILESIZE_LIMIT_MB=50
```
## CLI Reference
```shell
# Basic usage: scrape a file or URL
thepipe <source> [options]
# Options:
--ai_extraction Use AI for PDF/image/text extraction
--text_only Only output text (no images)
--inclusion_pattern=REGEX Only include files matching REGEX when scraping directories
--verbose Print detailed progress messages
```
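For example, using the options above:
```shell
# scrape a PDF with AI extraction, outputting text only
thepipe paper.pdf --ai_extraction --text_only

# recursively scrape a directory, keeping only Markdown files
thepipe path/to/folder --inclusion_pattern=".*\.md$" --verbose
```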
## Contributing
We welcome contributions! To get started:
1. Fork the repo and create a feature branch:
```bash
git checkout -b feature/my-new-feature
```
2. Install dependencies & run tests:
```bash
pip install -r requirements.txt
python -m unittest discover
```
3. Make your changes, format them, and commit them:
```bash
black .
git add .
git commit -m "..."
```
4. Push your branch to your fork:
```bash
git push origin feature/my-new-feature
```
5. Submit a pull request to the main repository.
6. Wait for review and feedback from the maintainers. This may take some time, so please be patient!
", Assign "at most 3 tags" to the expected json: {"id":"8905","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"