base on 🦜⛏️ Did you say you like data?

**Archived**

This is an example repo that is no longer maintained, but you can use it as a reference for implementing an extraction service with a UI.

Do not use code from `main`. Instead, please check out code from [releases](https://github.com/langchain-ai/langchain-extract/releases).

This repository is not a library, but a jumping-off point for your own application, so do not be surprised to find breaking changes between releases!

# 🦜⛏️ LangChain Extract

https://github.com/langchain-ai/langchain-extract/assets/26529506/6657280e-d05f-4c0f-9c47-07a0ef7c559d

[![CI](https://github.com/langchain-ai/langchain-extract/actions/workflows/ci.yml/badge.svg)](https://github.com/langchain-ai/langchain-extract/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/langchainai.svg?style=social&label=Follow%20%40LangChainAI)](https://twitter.com/langchainai)
[![](https://dcbadge.vercel.app/api/server/6adMQxSpJS?compact=true&style=flat)](https://discord.gg/6adMQxSpJS)
[![Open Issues](https://img.shields.io/github/issues-raw/langchain-ai/langchain-extract)](https://github.com/langchain-ai/langchain-extract/issues)

`langchain-extract` is a simple web server that allows you to extract information from text and files using LLMs. It is built using [FastAPI](https://fastapi.tiangolo.com/), [LangChain](https://python.langchain.com/) and [PostgreSQL](https://www.postgresql.org/).

The backend closely follows the [extraction use-case documentation](https://python.langchain.com/docs/use_cases/extraction) and provides a reference implementation of an app that helps to do extraction over data using LLMs.

This repository is meant to be a starting point for building your own extraction application, which may have slightly different requirements or use cases.

## Functionality

- 🚀 FastAPI webserver with a REST API
- 📚 OpenAPI Documentation
- 📝 Use [JSON Schema](https://json-schema.org/) to define what to extract
- 📊 Use examples to improve the quality of extracted results
- 📦 Create and save extractors and examples in a database
- 📂 Extract information from text and/or binary files
- 🦜️🏓 [LangServe](https://github.com/langchain-ai/langserve) endpoint to integrate with LangChain `RemoteRunnable`

## Releases:

0.0.1: https://github.com/langchain-ai/langchain-extract/releases/tag/0.0.1

0.0.2: https://github.com/langchain-ai/langchain-extract/releases/tag/0.0.2

## 📚 Documentation

See the example notebooks in the [documentation](https://github.com/langchain-ai/langchain-extract/tree/main/docs/source/notebooks) to see how to create examples to improve extraction results, upload files (e.g., HTML, PDF) and more.

Documentation and server code are both under development!

## 🍯 Example API

Below are sample `curl` requests that demonstrate how to use the API.

These only provide minimal examples of how to use the API; see the [documentation](https://github.com/langchain-ai/langchain-extract/tree/main/docs/source/notebooks) for more information about the API and the [extraction use-case documentation](https://python.langchain.com/docs/use_cases/extraction) for more information about how to extract information using LangChain.

First we generate a user ID for ourselves. **The application does not properly manage users or include legitimate authentication**. Access to extractors, few-shot examples, and other artifacts is controlled via this ID. Consider it secret.

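For callers working from Python rather than a shell, the same idea can be sketched with the standard `uuid` module. This is a minimal sketch, assuming the server accepts any UUID string as the `x-key` header value, as the `curl` examples in this README suggest. `new_user_id` and `auth_headers` are hypothetical helper names used for illustration; they are not part of this repository.

```python
import uuid


def new_user_id() -> str:
    """Return a random UUID string, comparable to what `uuidgen` prints."""
    return str(uuid.uuid4())


def auth_headers(user_id: str) -> dict[str, str]:
    """Build request headers; `x-key` carries the user ID (assumed from the curl examples)."""
    return {"accept": "application/json", "x-key": user_id}


# Generate the ID once and reuse it; artifacts are tied to this value.
user_id = new_user_id()
headers = auth_headers(user_id)
```

Because the ID is the only thing guarding access, it should be stored like any other secret rather than regenerated on each run. The `curl` examples in this README use the shell equivalent:
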
```sh
USER_ID=$(uuidgen)
export USER_ID
```

### Create an extractor

```sh
curl -X 'POST' \
  'http://localhost:8000/extractors' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "x-key: ${USER_ID}" \
  -d '{
  "name": "Personal Information",
  "description": "Use to extract personal information",
  "schema": {
    "type": "object",
    "title": "Person",
    "required": [
      "name",
      "age"
    ],
    "properties": {
      "age": {
        "type": "integer",
        "title": "Age"
      },
      "name": {
        "type": "string",
        "title": "Name"
      }
    }
  },
  "instruction": "Use information about the person from the given user input."
}'
```

Response:

```json
{
  "uuid": "e07f389f-3577-4e94-bd88-6b201d1b10b9"
}
```

Use the extract endpoint to extract information from the text (or a file) using an existing pre-defined extractor.

```sh
curl -s -X 'POST' \
  'http://localhost:8000/extract' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "x-key: ${USER_ID}" \
  -F 'extractor_id=e07f389f-3577-4e94-bd88-6b201d1b10b9' \
  -F 'text=my name is chester and i am 20 years old. My name is eugene and I am 1 year older than chester.' \
  -F 'mode=entire_document' \
  -F 'file=' | jq .
```

Response:

```json
{
  "data": [
    {
      "name": "chester",
      "age": 20
    },
    {
      "name": "eugene",
      "age": 21
    }
  ]
}
```

Add a few-shot example:

```sh
curl -X POST "http://localhost:8000/examples" \
  -H "Content-Type: application/json" \
  -H "x-key: ${USER_ID}" \
  -d '{
  "extractor_id": "e07f389f-3577-4e94-bd88-6b201d1b10b9",
  "content": "marcos is 10.",
  "output": [
    {
      "name": "MARCOS",
      "age": 10
    }
  ]
}' | jq .
```

The response will contain a UUID for the example. Examples can be deleted with a `DELETE` request. This example is now persisted and associated with our extractor, and subsequent extraction runs will incorporate it.

## ✅ Running locally

The easiest way to get started is to use `docker-compose` to run the server.

**Configure the environment**

Add a `.local.env` file to the root directory with the following content:

```sh
OPENAI_API_KEY=... # Your OpenAI API key
```

Adding `FIREWORKS_API_KEY` or `TOGETHER_API_KEY` to this file enables additional models. You can list the models available to the server, along with other information, via a `GET` request to the `configuration` endpoint.

Build the images:

```sh
docker compose build
```

Run the services:

```sh
docker compose up
```

This will launch both the extraction server and the Postgres instance.

Verify that the server is running:

```sh
curl -X 'GET' 'http://localhost:8000/ready'
```

This should return `ok`.

The UI will be available at [http://localhost:3000](http://localhost:3000).

## Contributions

Feel free to develop in this project for your own needs! For now, we are not accepting pull requests, but would love to hear [questions, ideas or issues](https://github.com/langchain-ai/langchain-extract/discussions).

## Development

To set up for development, you will need to install [Poetry](https://python-poetry.org/).

The backend code is located in the `backend` directory.

```sh
cd backend
```

Set up the environment using Poetry:

```sh
poetry install --with lint,dev,test
```

Run the following script to create a database and schema:

```sh
python -m scripts.run_migrations create
```

From `/backend`, start the server:

```sh
OPENAI_API_KEY=[YOUR API KEY] python -m server.main
```

### Testing

Create a test database. The test database is used for running tests and is separate from the main database. It will have the same schema as the main database.

```sh
python -m scripts.run_migrations create-test-db
```

Run the tests:

```sh
make test
```

### Linting and format

Testing and formatting are done using a Makefile inside `[root]/backend`:

```sh
make format
```

", Assign "at most 3 tags" to the expected json: {"id":"8916","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine 
"},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"