
Generative AI extensions for onnxruntime

# ONNX Runtime GenAI

Note: between `v0.11.0` and `v0.10.1`, there is a breaking API usage change to improve model quality during multi-turn conversations.

Previously, the decoding loop could be written as follows.

```
while not IsDone():
    GenerateToken()
    GetLastToken()
    PrintLastToken()
```

In 0.11.0, the decoding loop should now be written as follows.

```
while True:
    GenerateToken()
    if IsDone():
        break
    GetLastToken()
    PrintLastToken()
```

Please read [this PR's description](https://github.com/microsoft/onnxruntime-genai/pull/1849) for more information.

## Status

[![Latest version](https://img.shields.io/nuget/vpre/Microsoft.ML.OnnxRuntimeGenAI.Managed?label=latest)](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntimeGenAI.Managed/absoluteLatest)

[![Nightly Build](https://github.com/microsoft/onnxruntime-genai/actions/workflows/linux-cpu-x64-nightly-build.yml/badge.svg)](https://github.com/microsoft/onnxruntime-genai/actions/workflows/linux-cpu-x64-nightly-build.yml)

## Description

Run generative AI models with ONNX Runtime. This API gives you an easy, flexible and performant way of running LLMs on device. It implements the generative AI loop for ONNX models, including pre and post processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.

See documentation at the [ONNX Runtime website](https://onnxruntime.ai/docs/genai) for more details.
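The 0.11.0 loop ordering from the note at the top of this README can be tried without downloading a model. The following is a minimal sketch, not the library itself: `StandInGenerator` and `decode_loop` are hypothetical names, and the stand-in simply assumes that producing an end marker flips `is_done()` to true. With that assumption, checking `is_done()` right after generating means the end marker is never read or printed.

```python
class StandInGenerator:
    """Hypothetical stand-in that mimics only the call order, not the real library."""

    def __init__(self, scripted_tokens, end_token):
        self._scripted = iter(scripted_tokens)
        self._end_token = end_token
        self._last = None
        self._done = False

    def generate_next_token(self):
        self._last = next(self._scripted)
        # Assumption for this sketch: producing the end token marks generation as done.
        self._done = self._last == self._end_token

    def is_done(self):
        return self._done

    def get_next_tokens(self):
        return [self._last]


def decode_loop(generator):
    """Collect tokens using the ordering: generate, check done, then read."""
    collected = []
    while True:
        generator.generate_next_token()
        if generator.is_done():
            break
        collected.append(generator.get_next_tokens()[0])
    return collected


tokens = decode_loop(StandInGenerator(["Hello", ",", " world", "<end>"], "<end>"))
print("".join(tokens))
```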
| Support matrix | Supported now | Under development | On the roadmap |
| -------------- | ------------- | ----------------- | -------------- |
| Model architectures | AMD OLMo, ChatGLM, DeepSeek, ERNIE 4.5, Gemma, gpt-oss, Granite, Llama, Mistral, Nemotron, Phi (language + vision), Qwen, SmolLM3, Whisper | Stable diffusion | Multi-modal models |
| API | Python, C#, C/C++, Java ^ | Objective-C | |
| Platform | Linux, Windows, Mac ^, Android ^ | | iOS |
| Architecture | x86, x64, Arm64 ~ | | |
| Hardware Acceleration | CPU, CUDA, DirectML, NvTensorRtRtx (TRT-RTX), OpenVINO, QNN, WebGPU | | AMD GPU |
| Features | Multi-LoRA, Continuous decoding, Constrained decoding | | Speculative decoding |

\~ Windows builds available, requires build from source for other platforms

## Installation

See [installation instructions](https://onnxruntime.ai/docs/genai/howto/install) or [build from source](https://onnxruntime.ai/docs/genai/howto/build-from-source.html)

## Sample code for Phi-3 in Python

1. Download the model

   ```shell
   huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
   ```

2. Install the API

   ```shell
   pip install numpy
   pip install --pre onnxruntime-genai
   ```

3. Run the model

   ```python
   import onnxruntime_genai as og

   model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
   tokenizer = og.Tokenizer(model)
   tokenizer_stream = tokenizer.create_stream()

   # Set the max length to something sensible by default,
   # since otherwise it will be set to the entire context length
   search_options = {}
   search_options['max_length'] = 2048
   search_options['batch_size'] = 1

   chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

   text = input("Input: ")
   if not text:
       print("Error, input cannot be empty")
       exit()

   prompt = f'{chat_template.format(input=text)}'

   input_tokens = tokenizer.encode(prompt)

   params = og.GeneratorParams(model)
   params.set_search_options(**search_options)
   generator = og.Generator(model, params)

   print("Output: ", end='', flush=True)

   try:
       generator.append_tokens(input_tokens)
       while True:
           generator.generate_next_token()
           if generator.is_done():
               break
           new_token = generator.get_next_tokens()[0]
           print(tokenizer_stream.decode(new_token), end='', flush=True)
   except KeyboardInterrupt:
       print("  --control+c pressed, aborting generation--")

   print()
   del generator
   ```

### Choosing the Right Examples: Release vs. Main Branch

Due to the evolving nature of this project and ongoing feature additions, examples in the `main` branch may not always align with the latest stable release. This section outlines how to ensure compatibility between the examples and the corresponding version. Most of the steps remain the same; only the package installation and the model example file change.

### Stable version

Install the package according to the [installation instructions](https://onnxruntime.ai/docs/genai/howto/install).
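To find out which release tag matches your install, you can ask Python for the installed package version. The following is a minimal sketch under two assumptions: that the distribution name is `onnxruntime-genai` (as in the `pip install` step above; a different package variant would need its own name passed in), and that a release is tagged as `v` followed by its version number. Pre-release or locally built versions may not have a matching tag. The helper names `release_tag` and `installed_release_tag` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tag(package_version: str) -> str:
    """Map a version such as '0.10.1' to the tag name 'v0.10.1' (assumed naming)."""
    return f"v{package_version}"


def installed_release_tag(distribution: str = "onnxruntime-genai"):
    """Return the assumed tag for the installed package, or None if it is not installed."""
    try:
        return release_tag(version(distribution))
    except PackageNotFoundError:
        return None


tag = installed_release_tag()
if tag is None:
    print("onnxruntime-genai is not installed")
else:
    # Prints the checkout command to run inside the cloned repository.
    print(f"git checkout {tag}")
```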
For example, if you installed version 0.10.1 of ONNX Runtime GenAI, the instructions would look like this:

```bash
# Clone the repo
git clone https://github.com/microsoft/onnxruntime-genai.git && cd onnxruntime-genai
# Checkout the branch for the version you are using
git checkout v0.10.1
cd examples
```

### Nightly version (Main Branch)

Build the package from source using these [instructions](https://onnxruntime.ai/docs/genai/howto/build-from-source.html). Then go to the folder that contains the examples.

```bash
# Clone the repo
git clone https://github.com/microsoft/onnxruntime-genai.git && cd onnxruntime-genai
cd examples
```

## Roadmap

See the [Discussions](https://github.com/microsoft/onnxruntime-genai/discussions) to request new features and up-vote existing requests.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [[email protected]](mailto:[email protected]) with any additional questions or comments.

### Linting

This project enables [lintrunner](https://github.com/suo/lintrunner) for linting.
You can install the dependencies and initialize with

```sh
pip install -r requirements-lintrunner.txt
lintrunner init
```

This will install lintrunner on your system and download all the necessary dependencies to run linters locally.

To format local changes:

```bash
lintrunner -a
```

To format all files:

```bash
lintrunner -a --all-files
```

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
