# LLM-Aided OCR Project

Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections.
## Introduction
The LLM-Aided OCR Project is an advanced system designed to significantly enhance the quality of Optical Character Recognition (OCR) output. By leveraging cutting-edge natural language processing techniques and large language models (LLMs), this project transforms raw OCR text into highly accurate, well-formatted, and readable documents.
## Example Outputs
To see what the LLM-Aided OCR Project can do, check out these example outputs:
- [Original PDF](https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main/160301289-Warren-Buffett-Katharine-Graham-Letter.pdf)
- [Raw OCR Output](https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main/160301289-Warren-Buffett-Katharine-Graham-Letter__raw_ocr_output.txt)
- [LLM-Corrected Markdown Output](https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main/160301289-Warren-Buffett-Katharine-Graham-Letter_llm_corrected.md)
## Features
- PDF to image conversion
- OCR using Tesseract
- Advanced error correction using LLMs (local or API-based)
- Smart text chunking for efficient processing
- Markdown formatting option
- Header and page number suppression (optional)
- Quality assessment of the final output
- Support for both local LLMs and cloud-based API providers (OpenAI, Anthropic)
- Asynchronous processing for improved performance
- Detailed logging for process tracking and debugging
- GPU acceleration for local LLM inference
## Detailed Technical Overview
### PDF Processing and OCR
1. **PDF to Image Conversion**
- Function: `convert_pdf_to_images()`
- Uses `pdf2image` library to convert PDF pages into images
- Supports processing a subset of pages with `max_pages` and `skip_first_n_pages` parameters
2. **OCR Processing**
- Function: `ocr_image()`
- Utilizes `pytesseract` for text extraction
- Includes image preprocessing with `preprocess_image()` function:
- Converts image to grayscale
- Applies binary thresholding using Otsu's method
- Performs dilation to enhance text clarity (both steps are sketched after this list)
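Both steps are mechanically simple. A minimal sketch, assuming `pdf2image` (plus its poppler dependency), `opencv-python`, and `pytesseract` are installed; parameter names mirror the description above rather than the repo's exact signatures:

```python
import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def convert_pdf_to_images(pdf_path: str, max_pages: int = 0, skip_first_n_pages: int = 0):
    """Convert selected pages of a PDF into PIL images."""
    first_page = skip_first_n_pages + 1
    last_page = first_page + max_pages - 1 if max_pages else None
    return convert_from_path(pdf_path, first_page=first_page, last_page=last_page)

def preprocess_image(image: Image.Image) -> Image.Image:
    """Grayscale -> Otsu binarization -> dilation, as described above."""
    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((1, 1), np.uint8)  # kernel size is a tunable placeholder
    return Image.fromarray(cv2.dilate(thresh, kernel, iterations=1))

def ocr_image(image: Image.Image) -> str:
    """Run Tesseract on the preprocessed image."""
    return pytesseract.image_to_string(preprocess_image(image))
```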
### Text Processing Pipeline
1. **Chunk Creation**
- The `process_document()` function splits the full text into manageable chunks
- Uses sentence boundaries for natural splits
- Implements an overlap between chunks to maintain context (see the sketch after this list)
2. **Error Correction and Formatting**
- Core function: `process_chunk()`
- Two-step process:
a. OCR Correction:
- Uses an LLM to fix OCR-induced errors
- Maintains original structure and content
b. Markdown Formatting (optional):
- Converts text to proper markdown format
- Handles headings, lists, emphasis, and more
3. **Duplicate Content Removal**
- Implemented within the markdown formatting step
- Identifies and removes exact or near-exact repeated paragraphs
- Preserves unique content and ensures text flow
4. **Header and Page Number Suppression (Optional)**
- Can be configured to remove or distinctly format headers, footers, and page numbers
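To make the pipeline concrete, here is a hedged sketch of the chunking and the two-step correction. The `complete` callable, the prompt wording, and the size constants are illustrative assumptions rather than the repo's exact API; note how duplicate removal and header suppression are folded into the formatting prompt, as described above:

```python
import re
from typing import Awaitable, Callable

def split_into_chunks(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split on sentence boundaries into ~chunk_size-character chunks,
    carrying a short tail of each chunk into the next for context."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current.strip())
            current = current[-overlap:]  # overlap with the previous chunk
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

async def process_chunk(
    chunk: str,
    reformat_as_markdown: bool,
    complete: Callable[[str], Awaitable[str]],  # any LLM backend, local or API
) -> str:
    """Two-step pass: correct OCR errors first, then optionally format."""
    corrected = await complete(
        "Correct any OCR-induced errors in the following text (split or merged "
        "words, misrecognized characters), preserving the original content and "
        "structure:\n\n" + chunk
    )
    if reformat_as_markdown:
        corrected = await complete(
            "Reformat the following text as clean markdown, converting headings, "
            "lists, and emphasis appropriately; remove near-duplicate paragraphs "
            "and any page headers, footers, or page numbers:\n\n" + corrected
        )
    return corrected
```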
### LLM Integration
1. **Flexible LLM Support**
- Supports both local LLMs and cloud-based API providers (OpenAI, Anthropic)
- Configurable through environment variables
2. **Local LLM Handling**
- Function: `generate_completion_from_local_llm()`
- Uses `llama_cpp` library for local LLM inference
- Supports custom grammars for structured output
3. **API-based LLM Handling**
- Functions: `generate_completion_from_claude()` and `generate_completion_from_openai()`
- Implements proper error handling and retry logic
- Manages token limits and adjusts request sizes dynamically
4. **Asynchronous Processing**
- Uses `asyncio` for concurrent processing of chunks when using API-based LLMs
- Maintains order of processed chunks for coherent final output (all three pieces are sketched after this list)
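A sketch of these pieces under stated assumptions: the GGUF path and model name are placeholders, and the retry policy is a generic exponential backoff rather than the repo's exact logic. The Anthropic path is analogous via the `anthropic` SDK:

```python
import asyncio

from llama_cpp import Llama      # local inference
from openai import AsyncOpenAI   # API-based inference

def generate_completion_from_local_llm(prompt: str, max_tokens: int = 1024) -> str:
    # In practice the model would be loaded once and reused across calls.
    llm = Llama(
        model_path="path/to/model.gguf",  # placeholder GGUF path
        n_ctx=4096,                       # context window
        n_gpu_layers=-1,                  # offload all layers to GPU when available
    )
    out = llm(prompt, max_tokens=max_tokens, temperature=0.7)
    return out["choices"][0]["text"]

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def generate_completion_from_openai(
    prompt: str, model: str = "gpt-4o-mini", max_tokens: int = 1024, retries: int = 3
) -> str:
    """Placeholder model name; retried with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            resp = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
            )
            return resp.choices[0].message.content
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)

async def correct_all_chunks(chunks: list[str]) -> list[str]:
    # asyncio.gather returns results in submission order, so corrected
    # chunks stay aligned with the original document order.
    return await asyncio.gather(*(generate_completion_from_openai(c) for c in chunks))
```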
### Token Management
1. **Token Estimation**
- Function: `estimate_tokens()`
- Uses model-specific tokenizers when available
- Falls back to `approximate_tokens()` for quick estimation
2. **Dynamic Token Adjustment**
- Adjusts `max_tokens` parameter based on prompt length and model limits
- Implements `TOKEN_BUFFER` and `TOKEN_CUSHION` for safe token management (see the sketch below)
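A sketch of the token bookkeeping, assuming `tiktoken` for the model-specific path; the constants and the four-characters-per-token heuristic are illustrative:

```python
TOKEN_BUFFER = 100    # illustrative headroom values; the repo's constants may differ
TOKEN_CUSHION = 300

def approximate_tokens(text: str) -> int:
    # Crude fallback: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def estimate_tokens(text: str, model: str) -> int:
    try:
        import tiktoken  # model-specific tokenizer, when available
        return len(tiktoken.encoding_for_model(model).encode(text))
    except Exception:
        return approximate_tokens(text)

def adjusted_max_tokens(prompt: str, model: str, context_limit: int) -> int:
    """Leave room for the prompt plus safety margins within the model's context."""
    used = estimate_tokens(prompt, model)
    return max(1, context_limit - used - TOKEN_BUFFER - TOKEN_CUSHION)
```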
### Quality Assessment
1. **Output Quality Evaluation**
- Function: `assess_output_quality()`
- Compares original OCR text with processed output
- Uses an LLM to provide a quality score and explanation (sketched below)
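A sketch of the evaluation call; the prompt wording, the 0-100 scale, and the truncation length are assumptions:

```python
from typing import Awaitable, Callable

async def assess_output_quality(
    raw_ocr: str,
    final_text: str,
    complete: Callable[[str], Awaitable[str]],  # any completion backend
) -> str:
    """Ask the LLM to score the corrected text against the raw OCR."""
    prompt = (
        "Compare the raw OCR text with the corrected version below. Give the "
        "corrected version a quality score from 0 to 100 and briefly justify it.\n\n"
        f"Raw OCR:\n{raw_ocr[:2000]}\n\nCorrected:\n{final_text[:2000]}"
    )
    return await complete(prompt)
```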
### Logging and Error Handling
- Comprehensive logging throughout the codebase
- Detailed error messages and stack traces for debugging
- Suppresses HTTP request logs to reduce noise (see the example below)
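The noise suppression can be done by raising the level on the HTTP client loggers; the logger names below are the usual ones for these SDKs, not verified against the repo:

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

# Quiet per-request chatter from the HTTP stacks used by the API SDKs.
for noisy_logger in ("httpx", "httpcore", "urllib3"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```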
## Configuration and Customization
The project uses a `.env` file for easy configuration. Key settings include:
- LLM selection (local or API-based)
- API provider selection
- Model selection for different providers
- Token limits and buffer sizes
- Markdown formatting options
## Output and File Handling
1. **Raw OCR Output**: Saved as `{base_name}__raw_ocr_output.txt`
2. **LLM Corrected Output**: Saved as `{base_name}_llm_corrected.md` or `.txt`
The script generates detailed logs of the entire process, including timing information and quality assessments.
## Requirements
- Python 3.12+
- Tesseract OCR engine
- `pdf2image` library
- `pytesseract`
- OpenAI API (optional)
- Anthropic API (optional)
- Local LLM support (optional, requires compatible GGUF model)
## Installation
1. Install Pyenv and Python 3.12 (if needed):
```bash
# Install pyenv and Python 3.12 if needed; the virtual environment is created in step 2:
if ! command -v pyenv &> /dev/null; then
sudo apt-get update
sudo apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev python3-openssl git
git clone https://github.com/pyenv/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init --path)"' >> ~/.zshrc
source ~/.zshrc
fi
cd ~/.pyenv && git pull && cd -
pyenv install 3.12
```
2. Set up the project:
```bash
# Use pyenv to create virtual environment:
git clone https://github.com/Dicklesworthstone/llm_aided_ocr
cd llm_aided_ocr
pyenv local 3.12
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install wheel
python -m pip install --upgrade setuptools wheel
pip install -r requirements.txt
```
3. Install Tesseract OCR engine (if not already installed):
- For Ubuntu: `sudo apt-get install tesseract-ocr`
- For macOS: `brew install tesseract`
- For Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
4. Set up your environment variables in a `.env` file:
```
USE_LOCAL_LLM=False
API_PROVIDER=OPENAI
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
```
## Usage
1. Place your PDF file in the project directory.
2. Update the `input_pdf_file_path` variable in the `main()` function with your PDF filename.
3. Run the script:
```
python llm_aided_ocr.py
```
4. The script will generate several output files, including the final post-processed text.
## How It Works
The LLM-Aided OCR project employs a multi-step process to transform raw OCR output into high-quality, readable text:
1. **PDF Conversion**: Converts input PDF into images using `pdf2image`.
2. **OCR**: Applies Tesseract OCR to extract text from images.
3. **Text Chunking**: Splits the raw OCR output into manageable chunks for processing.
4. **Error Correction**: Each chunk undergoes LLM-based processing to correct OCR errors and improve readability.
5. **Markdown Formatting** (Optional): Reformats the corrected text into clean, consistent Markdown.
6. **Quality Assessment**: An LLM-based evaluation compares the final output quality to the original OCR text (an end-to-end sketch follows this list).
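Wiring the sketches from the technical overview together, an end-to-end driver might look like the following; the names reuse the illustrative functions above, not the repo's exact API:

```python
import asyncio

async def run_pipeline(pdf_path: str) -> str:
    images = convert_pdf_to_images(pdf_path)                # step 1: PDF -> images
    raw_text = "\n".join(ocr_image(img) for img in images)  # step 2: OCR
    chunks = split_into_chunks(raw_text)                    # step 3: chunking
    corrected = await asyncio.gather(                       # steps 4-5: correct + format
        *(process_chunk(c, True, generate_completion_from_openai) for c in chunks)
    )
    final_text = "\n\n".join(corrected)
    report = await assess_output_quality(                   # step 6: quality check
        raw_text, final_text, generate_completion_from_openai
    )
    print(report)
    return final_text

if __name__ == "__main__":
    asyncio.run(run_pipeline("input.pdf"))
```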
## Code Optimization
- **Concurrent Processing**: When using API-based models, chunks are processed concurrently to improve speed.
- **Context Preservation**: Each chunk includes a small overlap with the previous chunk to maintain context.
- **Adaptive Token Management**: The system dynamically adjusts the number of tokens used for LLM requests based on input size and model constraints.
## Configuration
The project uses a `.env` file for configuration. Key settings include:
- `USE_LOCAL_LLM`: Set to `True` to use a local LLM, `False` for API-based LLMs.
- `API_PROVIDER`: Choose between "OPENAI" and "CLAUDE".
- `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`: API keys for respective services.
- `CLAUDE_MODEL_STRING`, `OPENAI_COMPLETION_MODEL`: Specify the model to use for each provider.
- `LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS`: Set the context size for local LLMs (see the full example below).
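A complete example `.env` combining these settings; the model strings and context size are examples only:

```
USE_LOCAL_LLM=False
API_PROVIDER=CLAUDE
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
CLAUDE_MODEL_STRING=claude-3-haiku-20240307
OPENAI_COMPLETION_MODEL=gpt-4o-mini
LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS=8192
```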
## Output Files
The script generates several output files:
1. `{base_name}__raw_ocr_output.txt`: Raw OCR output from Tesseract.
2. `{base_name}_llm_corrected.md`: Final LLM-corrected and formatted text.
## Limitations and Future Improvements
- The system's performance is heavily dependent on the quality of the LLM used.
- Processing very large documents can be time-consuming and may require significant computational resources.
## Contributing
Contributions to this project are welcome! Please fork the repository and submit a pull request with your proposed changes.
## License
This project is licensed under the MIT License.
", Assign "at most 3 tags" to the expected json: {"id":"11430","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"