<p align="center">
<img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/41d9d151-acb5-44da-9c10-0058f76c2512" alt="Extract Thinker Logo" width="200"/>
</p>
<p align="center">
<img alt="Python Version" src="https://img.shields.io/badge/Python-3.9%2B-blue.svg" />
<a href="https://medium.com/@enoch3712">
<img alt="Medium" src="https://img.shields.io/badge/Medium-12100E?style=flat&logo=medium&logoColor=white" />
</a>
<img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/enoch3712/Open-DocLLM" />
<img alt="Github License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" />
</p>
# ExtractThinker
ExtractThinker is a flexible document intelligence tool that leverages Large Language Models (LLMs) to extract and classify structured data from documents, functioning like an ORM for seamless document processing workflows.
**TL;DR Document Intelligence for LLMs**
## 🚀 Key Features
- **Flexible Document Loaders**: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and more.
- **Customizable Contracts**: Define custom extraction contracts using Pydantic models for precise data extraction (see the sketch after this list).
- **Advanced Classification**: Classify documents or document sections using custom classifications and strategies.
- **Asynchronous Processing**: Utilize asynchronous processing for efficient handling of large documents.
- **Multi-format Support**: Seamlessly work with various document formats like PDFs, images, spreadsheets, and more.
- **ORM-style Interaction**: Interact with documents and LLMs in an ORM-like fashion for intuitive development.
- **Splitting Strategies**: Implement lazy or eager splitting strategies to process documents page by page or as a whole.
- **Integration with LLMs**: Easily integrate with different LLM providers like OpenAI, Anthropic, Cohere, and more.
- **Community-driven Development**: Inspired by the LangChain ecosystem with a focus on intelligent document processing.
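
Because contracts are Pydantic models, nested models, lists, and field descriptions work as you would expect. A minimal sketch, assuming `Contract` behaves like a standard Pydantic `BaseModel` (the field names below are illustrative):

```python
from typing import List
from pydantic import Field
from extract_thinker import Contract

class LineItem(Contract):
    description: str = Field(description="What was purchased")
    quantity: int
    unit_price: float

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    # Nested list of line items, populated by the LLM during extraction
    line_items: List[LineItem] = Field(default_factory=list)
```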

## 📦 Installation
Install ExtractThinker using pip:
```bash
pip install extract_thinker
```
## 🛠️ Usage
### Basic Extraction Example
Here's a quick example to get you started with ExtractThinker. It demonstrates how to load a document with the PyPdf document loader and extract the fields defined in a contract.
```python
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract
load_dotenv()
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Path to the document you want to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")
# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini") # or any other supported model
# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
```
### Classification Example
ExtractThinker allows you to classify documents or parts of documents using custom classifications:
```python
import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor, Classification, DocumentLoaderPyPdf, Contract
)
load_dotenv()
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

class DriverLicenseContract(Contract):
    name: str
    license_number: str
# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")
# Define classifications
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

# Classify the document directly using the extractor
result = extractor.classify(
    "path_to_your_document.pdf",  # Can be a file path or IO stream
    classifications,
    image=True  # Set to True for image-based classification
)
# The result will be a ClassificationResponse object with 'name' and 'confidence' fields
print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")
```
### Splitting Files Example
ExtractThinker allows you to split and process documents using different strategies. Here's how you can split a document and extract data based on classifications.
```python
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Process,
    Classification,
    ImageSplitter,
    DocumentLoaderPyPdf,
    Contract,
    SplittingStrategy,
)
load_dotenv()
class DriverLicenseContract(Contract):
    name: str
    license_number: str

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")
# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
]
# Initialize the process and load the splitter
process = Process()
process.load_document_loader(DocumentLoaderPyPdf())
process.load_splitter(ImageSplitter(model="gpt-4o-mini"))
# Load and process the document
path_to_document = "path_to_your_multipage_document.pdf"
split_content = (
    process.load_file(path_to_document)
    .split(classifications, strategy=SplittingStrategy.LAZY)
    .extract()
)
# Process the extracted content as needed
for item in split_content:
    if isinstance(item, InvoiceContract):
        print("Extracted Invoice:")
        print("Invoice Number:", item.invoice_number)
        print("Invoice Date:", item.invoice_date)
    elif isinstance(item, DriverLicenseContract):
        print("Extracted Driver License:")
        print("Name:", item.name)
        print("License Number:", item.license_number)
```
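The chain above uses `SplittingStrategy.LAZY`, which splits the document as it is processed. The feature list also mentions an eager strategy; a hedged variant of the same chain, assuming `SplittingStrategy.EAGER` evaluates the whole document up front and is otherwise a drop-in replacement:

```python
# Reuses the same `process`, `classifications`, and `path_to_document` as above;
# only the splitting strategy changes.
split_content = (
    process.load_file(path_to_document)
    .split(classifications, strategy=SplittingStrategy.EAGER)
    .extract()
)
```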
### Batch Processing Example
You can also perform batch processing of documents:
```python
import asyncio

from extract_thinker import Extractor, Contract

class ReceiptContract(Contract):
    store_name: str
    total_amount: float

extractor = Extractor()
extractor.load_llm("gpt-4o-mini")

# File path (or IO stream) to process
document = "receipt1.jpg"

batch_job = extractor.extract_batch(
    source=document,
    response_model=ReceiptContract,
    vision=True,
)

async def main():
    # Monitor the batch job status
    print("Batch Job Status:", await batch_job.get_status())

    # Retrieve results once processing is complete
    results = await batch_job.get_result()
    for result in results.parsed_results:
        print("Store Name:", result.store_name)
        print("Total Amount:", result.total_amount)

asyncio.run(main())
```
### Local LLM Integration Example
ExtractThinker supports custom LLM integrations. Here's how you can use a custom LLM:
```python
import os

from extract_thinker import Extractor, LLM, DocumentLoaderTesseract, Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor with a Tesseract-based document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))

# Point to a locally served model (e.g., Ollama) and load it
os.environ['API_BASE'] = "http://localhost:11434"
llm = LLM('ollama/phi3')
extractor.load_llm(llm)
# Extract data
result = extractor.extract("invoice.png", InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
```
## 📚 Documentation and Resources
- **Examples**: Check out the examples directory for Jupyter notebooks and scripts demonstrating various use cases.
- **Medium Articles**: Read articles about ExtractThinker on the author's Medium page.
- **Test Suite**: Explore the test suite in the tests/ directory for more advanced usage examples and test cases.
## 🧩 Integration with LLM Providers
ExtractThinker supports integration with multiple LLM providers (a sketch of switching between them follows this list):
- **OpenAI**: Use models like gpt-3.5-turbo, gpt-4, etc.
- **Anthropic**: Integrate with Claude models.
- **Cohere**: Utilize Cohere's language models.
- **Azure OpenAI**: Connect with Azure's OpenAI services.
- **Local Models**: Ollama-compatible models served locally.
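
The model identifier you pass to `load_llm` selects the provider. A hedged sketch of switching providers: the `ollama/phi3` example above suggests litellm-style identifiers, but the exact model names and environment variables below are assumptions based on common conventions, not ExtractThinker-specific guarantees:

```python
import os
from extract_thinker import Extractor, DocumentLoaderPyPdf

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())

# OpenAI (expects OPENAI_API_KEY in the environment)
extractor.load_llm("gpt-4o-mini")

# Anthropic (expects ANTHROPIC_API_KEY); model name is an assumption
# extractor.load_llm("claude-3-5-sonnet-20240620")

# Azure OpenAI (assumed "azure/<deployment-name>" convention, plus
# AZURE_API_KEY, AZURE_API_BASE, and AZURE_API_VERSION in the environment)
# extractor.load_llm("azure/my-gpt-4o-deployment")
```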
## ⚙️ How It Works
ExtractThinker uses a modular architecture inspired by the LangChain ecosystem; the sketch after this list maps each component to the call you make:
- **Document Loaders**: Responsible for loading and preprocessing documents from various sources and formats.
- **Extractors**: Orchestrate the interaction between the document loaders and LLMs to extract structured data.
- **Splitters**: Implement strategies to split documents into manageable chunks for processing.
- **Contracts**: Define the expected structure of the extracted data using Pydantic models.
- **Classifications**: Classify documents or document sections to apply appropriate extraction contracts.
- **Processes**: Manage the workflow of loading, classifying, splitting, and extracting data from documents.
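
A compact sketch tying the components together, recapping the same calls used in the examples above (file and field names are placeholders):

```python
from extract_thinker import (
    Extractor, Process, Classification, Contract,
    DocumentLoaderPyPdf, ImageSplitter, SplittingStrategy,
)

class MyContract(Contract):          # Contract: expected output structure
    field_a: str

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())  # Document Loader
extractor.load_llm("gpt-4o-mini")                      # LLM integration

classification = Classification(                       # Classification
    name="MyDoc",
    description="An example document",
    contract=MyContract,
    extractor=extractor,
)

process = Process()                                    # Process: orchestrates the workflow
process.load_document_loader(DocumentLoaderPyPdf())
process.load_splitter(ImageSplitter(model="gpt-4o-mini"))  # Splitter

result = (                                             # Extraction
    process.load_file("document.pdf")
    .split([classification], strategy=SplittingStrategy.LAZY)
    .extract()
)
```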

## 📝 Why Use ExtractThinker?
While general frameworks like LangChain offer a broad range of functionalities, ExtractThinker is specialized for Intelligent Document Processing (IDP). It simplifies the complexities associated with IDP by providing:
- **Specialized Components**: Tailored tools for document loading, splitting, and extraction.
- **High Accuracy with LLMs**: Leverages the power of LLMs to improve the accuracy of data extraction and classification.
- **Ease of Use**: Intuitive APIs and ORM-style interactions reduce the learning curve.
- **Community Support**: Active development and support from the community.
## 🤝 Contributing
We welcome contributions from the community! To contribute:
1. Fork the repository
2. Create a new branch for your feature or bugfix
3. Write tests for your changes
4. Run tests to ensure everything is working correctly
5. Submit a pull request with a description of your changes
## 🌟 Community and Support
Stay updated and connect with the community through the author's articles:
- [Scaling Document Extraction with o1, GPT-4o & Mini](https://medium.com/towards-artificial-intelligence/scaling-document-extraction-with-o1-gpt4o-and-mini-extractthinker-8f3340b4e69c)
- [Claude 3.5 — The King of Document Intelligence](https://medium.com/gitconnected/claude-3-5-the-king-of-document-intelligence-f57bea1d209d?sk=124c5abb30c0e7f04313c5e20e79c2d1)
- [Classification Tree for LLMs](https://medium.com/gitconnected/classification-tree-for-llms-32b69015c5e0?sk=8a258cf74fe3483e68ab164e6b3aaf4c)
- [Advanced Document Classification with LLMs](https://medium.com/gitconnected/advanced-document-classification-with-llms-8801eaee3c58?sk=f5a22ee72022eb70e112e3e2d1608e79)
- [Phi-3 and Azure: PDF Data Extraction | ExtractThinker](https://medium.com/towards-artificial-intelligence/phi-3-and-azure-pdf-data-extraction-extractthinker-cb490a095adb?sk=7be7e625b8f9932768442f87dd0ebcec)
- [ExtractThinker: Document Intelligence for LLMs](https://medium.com/towards-artificial-intelligence/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef)
## 📄 License
This project is licensed under the Apache License 2.0. See the LICENSE file for more details.
## Contact
For any questions or issues, please open an issue on the GitHub repository or reach out via email.