# GPTFast
Accelerate your Hugging Face Transformers 7.6-9x with GPTFast! Native to Hugging Face and PyTorch.

# Background
[GPTFast](https://github.com/pytorch-labs/gpt-fast) was originally a set of techniques developed by the PyTorch team to accelerate the inference speed of Llama-2-7b. This pip package generalizes those techniques to all Hugging Face models.

# Demo
GPTFast Inference Time|Eager Inference Time
--|--
![](https://github.com/MDK8888/GPTFast/assets/79173446/4d7ed04e-ba3d-49c7-aeca-8f2b96ac45a8)|![](https://github.com/MDK8888/GPTFast/assets/79173446/1a4f2236-d2f4-42c7-a689-553482871905)

# Roadmap
- ⟳ 0.7.x (xx/xx/xx): Medusa, Speculative Sampling, Eagle
- ⟳ 0.6.x (xx/xx/xx): BitNet and 1-bit quantization, AWQ, QoQ, GGUF, HQQ
- ⟳ 0.5.x (xx/xx/xx): PagedAttention (vLLM) + FlashAttention integration
- ⟳ 0.4.x (xx/xx/xx): Tensor parallelism + GPU distributed inference
- ✅ 0.3.x (06/20/24): GPTQ int4 quantization and optimized int4 matmul kernels enabled for all HF models (**9x inference acceleration**)
- ✅ 0.2.x (04/02/24): static key-value cache enabled for all HF models (**8.5x inference acceleration**)
- ✅ 0.1.x (02/22/24): torch.compile, int8 quantization, speculative decoding (**7x inference acceleration**)

# Getting Started
## WARNING: The documentation below is deprecated as of version 0.3.0. New docs will be up soon! ##
* Make sure that your Python version is >= 3.10 and that you are on a CUDA-enabled device.
* Make a virtual environment on your machine and activate it:
```bash
python3 -m venv VENV_NAME
source VENV_NAME/bin/activate  # ./VENV_NAME/scripts/activate if you are on Windows
```
* Call the following: ```pip install gptfast```
* Copy the following code into a Python file:
```python
import os
import torch
from transformers import AutoTokenizer
from GPTFast.Core import gpt_fast
from GPTFast.Helpers import timed

torch._dynamo.reset()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

device = "cuda" if torch.cuda.is_available() else "cpu"

def argmax_variation(self, probabilities:torch.Tensor, temperature:float = 1, k:int = 5):
    # Apply temperature scaling
    device = probabilities.device
    scaled_probabilities = probabilities / temperature

    # Ensure k is within a valid range
    k = min(k, probabilities.size(-1))

    # Get the indices of the top-k scaled probabilities along the specified dimension
    top_k_indices = torch.topk(scaled_probabilities, k, dim=-1).indices

    # Generate random indices for sampling
    random_indices = torch.randint(0, k, (1,) * probabilities.dim()).to(device)

    # Use the gathered indices to get the final sampled token
    sampled_token = top_k_indices.gather(-1, random_indices).to(device)

    return sampled_token.unsqueeze(0)

def argmax(self, probabilities):
    # Use argmax to get the token with the maximum probability
    max_prob_index = torch.argmax(probabilities, dim=-1)
    return max_prob_index.view(1, 1)

model_name = "gpt2-xl"
draft_model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
initial_string = "Write me a short story."
input_tokens = tokenizer.encode(initial_string, return_tensors="pt").to(device)

N_ITERS = 10
MAX_TOKENS = 50

cache_config = {
    "model_config": {
        "path_to_blocks": ["transformer", "h"],
        "child_ref_in_parent_forward": ["transformer", "block"],
    },
    "block_config": {
        "path_to_attn": ["attn"],
        "child_ref_in_parent_forward": ["attn"],
    },
    "attn_config": {
        "cache_update_config": {
            "kv_cache_condition": "if layer_past is not None",
            "key_name": "key",
            "value_name": "value",
        },
        "causal_mask_config": {
            "causal_mask_application": "conditional",
            "causal_mask_method": "_attn",
            "causal_mask_condition": "not self.is_cross_attention",
        },
    },
    "imports": [
        "import torch",
        "import transformers",
        "from transformers import *",
        "from torch import *",
        "from typing import *",
        "import types",
        "from transformers.modeling_outputs import *",
        "from torch import nn",
    ],
}

gpt_fast_model = gpt_fast(model_name, sample_function=argmax, max_length=60,
                          cache_config=cache_config, draft_model_name=draft_model_name)
gpt_fast_model.to(device)

fast_compile_times = []
for i in range(N_ITERS):
    with torch.no_grad():
        res, compile_time = timed(lambda: gpt_fast_model.generate(cur_tokens=input_tokens,
                                                                  max_tokens=MAX_TOKENS,
                                                                  speculate_k=6))
    fast_compile_times.append(compile_time)
    print(f"gpt fast eval time {i}: {compile_time}")
print("~" * 10)

print(tokenizer.decode(res[0]))
```
* Run it and watch the magic 🪄!

# Documentation

At its core, this library provides a simple interface to LLM inference acceleration techniques. All of the following functions can be imported from ```GPTFast.Core```:

* ```gpt_fast(model_name:str, sample_function:Callable[torch.Tensor, Dict[str, Any], torch.Tensor], max_length:int, cache_config:dict, draft_model_name:str) -> torch.nn.Module```
    * **Parameters**:
        * ```model_name```: This is the name of the Hugging Face model that you want to optimize.
        * ```sample_function```: This is a function which takes a PyTorch tensor as its first argument along with other ```**sampling_kwargs``` and returns a tensor of shape (1, 1).
        * ```max_length```: This is an int specifying up to how many tokens you will be generating. It is recommended that you set this value higher than the number of tokens you will actually generate.
        * ```cache_config```: This is a dictionary which specifies how the static key-value cache will be integrated into the model. More details on this dictionary follow below.
        * ```draft_model_name```: This is an **optional** argument which is the name of the Hugging Face draft model needed for [speculative decoding](https://arxiv.org/abs/2211.17192). Note that the model and the draft model must both use the same tokenizer, and the draft model must be **significantly** smaller to achieve inference acceleration. If ```draft_model_name``` is not specified, speculative decoding will not be applied to your model.
    * **Returns**:
        * An accelerated model with one method:
            * ```generate(self, cur_tokens:torch.Tensor, max_tokens:int, speculate_k:int, **sampling_kwargs) -> torch.Tensor```
                * **Parameters**:
                    * ```cur_tokens```: A PyTorch tensor of size (1, seq_len).
                    * ```max_tokens```: An int representing how many tokens you want to generate.
                    * ```speculate_k```: An int specifying how far you want the draft model to speculate in speculative decoding.
                    * ```**sampling_kwargs```: Additional parameters that are necessary for sampling from the distribution. Should match the ```**sampling_kwargs``` of ```sample_function``` above.
                * **Returns**:
                    * The generated tokens for your prompt, a tensor with dimensions ```(1, max_tokens)```.
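To make the ```sample_function``` contract concrete, here is a minimal sketch of a temperature-based sampler that satisfies it: the probability tensor comes in as the first argument, extra ```**sampling_kwargs``` (here a hypothetical ```temperature```) are forwarded from ```generate```, and the return value has shape (1, 1). The ```self``` parameter mirrors the ```argmax``` example in Getting Started; this is an illustration, not part of the library itself.

```python
import torch

def temperature_sample(self, probabilities: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    # Temperature-scale the probabilities: p ** (1 / T), renormalized, is
    # equivalent to softmax(logits / T) when probabilities = softmax(logits).
    scaled = probabilities ** (1.0 / temperature)
    scaled = scaled / scaled.sum(dim=-1, keepdim=True)

    # Draw a single token id from the scaled distribution.
    sampled = torch.multinomial(scaled.view(-1, scaled.size(-1)), num_samples=1)

    # The library expects a tensor of shape (1, 1).
    return sampled[-1].view(1, 1)

# Hypothetical usage: the temperature keyword passed to generate is forwarded
# to the sample function through **sampling_kwargs.
# gpt_fast_model = gpt_fast(model_name, sample_function=temperature_sample, max_length=60,
#                           cache_config=cache_config, draft_model_name=draft_model_name)
# output = gpt_fast_model.generate(cur_tokens=input_tokens, max_tokens=50,
#                                  speculate_k=6, temperature=0.7)
```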
***
* ```load_int8(model_name:str) -> torch.nn.Module```
    * **Parameters**:
        * ```model_name```: This is a string specifying the model that you are using.
    * **Returns**:
        * An ```int8``` quantized version of your model.
***
* ```add_kv_cache(transformer:nn.Module, sampling_fn:Callable[torch.Tensor, Dict[str, Any], torch.Tensor], max_length:int, cache_config:dict) -> KVCacheModel```
    * **Parameters**:
        * ```transformer```: This is the Hugging Face model that you are adding a static key-value cache to.
        * ```sampling_fn```: This is the same as the ```sample_function``` parameter of the ```gpt_fast``` function.
        * ```max_length```: This is the same as the ```max_length``` parameter of the ```gpt_fast``` function.
        * ```cache_config```: This is a dictionary which specifies how you **directly modify the source code of the forward pass of the model** so that a static cache can be accommodated. The full specification of this dictionary is below:
```
- model_config: this defines how your model should be modified to accommodate a static kv cache.
    - path_to_blocks (list[str]): starting from the model itself, this defines the child attributes on a parent nn.Module attribute/object that we access to reach the blocks of the transformer.
    - child_ref_in_parent_forward (list[str]): starting from the original model, this is how each child module/attribute in path_to_blocks is referenced in the forward pass of its parent module/attribute.
- block_config: this defines how your block needs to be modified to accommodate a static kv cache.
    - path_to_attn (list[str]): starting from the block itself, this defines the child attributes on a parent nn.Module attribute/object that we access to reach the attention layer itself.
    - child_ref_in_parent_forward (list[str]): starting from the block, this is how each child module/attribute in path_to_attn is referenced in the forward pass of its parent module/attribute.
- attn_config: this defines how the attention layer needs to be modified to accommodate a static kv cache.
    - cache_update_config: this defines how the key-value cache updates are modified now that the cache is static.
        - kv_cache_condition (str): the condition under which a kv cache update is triggered in the source code of the original forward pass of the attention layer, typically something like "if layer_past is not None".
        - key_name (str): how the keys are originally referenced pre-update.
        - value_name (str): how the values are originally referenced pre-update.
        - new_key_name (Optional[str]): how the keys are referenced post-update. If this is not specified, it defaults to key_name.
        - new_value_name (Optional[str]): how the values are referenced post-update. If this is not specified, it defaults to value_name.
    - causal_mask_config: this defines how the causal mask is applied - this is necessary because your keys and values now have length max_length along the second-to-last dimension.
        - causal_mask_module (str): the method of the attention layer where the causal mask is applied.
        - causal_mask_application (Union["conditional", Any]): this is either the string "conditional" or some other value.
            - if causal_mask_application is "conditional", you need to add the following additional key:
                - causal_mask_condition (str): the condition under which the causal mask is applied.
            - if it is not "conditional", you need to add the following additional keys:
                - causal_mask_line (str): the starting line we want to replace.
                - num_lines (int): how many lines we want to replace, including causal_mask_line.
- imports: these are the imports which are needed to compile your new functions after integrating a static kv cache.
```
    * **Returns**:
        * An instance of the ```KVCacheModel``` class, which is essentially just your model but with a key-value cache attached for accelerated inference.
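As a rough illustration of how these two functions might be used directly (rather than through ```gpt_fast```), the sketch below quantizes a model and then attaches a static key-value cache. It assumes the signatures documented above and reuses the ```argmax``` sampling function and the GPT-2 ```cache_config``` dictionary from the Getting Started example.

```python
import torch
from GPTFast.Core import load_int8, add_kv_cache

def argmax(self, probabilities):
    # Greedy sampling: return the highest-probability token id with shape (1, 1).
    return torch.argmax(probabilities, dim=-1).view(1, 1)

# int8-quantize the base Hugging Face model.
quantized_model = load_int8("gpt2-xl")

# Attach a static key-value cache; cache_config is the GPT-2 dictionary
# defined in the Getting Started example above.
kv_cached_model = add_kv_cache(quantized_model, sampling_fn=argmax,
                               max_length=60, cache_config=cache_config)
```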
***
* ```add_speculative_decoding(model:nn.Module, draft_model:nn.Module) -> nn.Module```
    * **Parameters**:
        * ```model```: This is the KV-cached version of your model.
        * ```draft_model```: This is the KV-cached version of your draft model.
    * **Returns**:
        * An accelerated model with the ```generate``` method described above under the ```gpt_fast``` section.
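Continuing the sketch above, a KV-cached draft model can be built the same way and combined with the main model via ```add_speculative_decoding```. This is only an illustration of how the pieces might compose by hand (and assumes GPT-2 and GPT-2-XL can share the same ```cache_config```); the ```gpt_fast``` wrapper handles this wiring for you.

```python
from GPTFast.Core import load_int8, add_kv_cache, add_speculative_decoding

# Build the (much smaller) draft model the same way as the main model above;
# gpt2 shares gpt2-xl's architecture, so the same cache_config is assumed to apply.
kv_cached_draft = add_kv_cache(load_int8("gpt2"), sampling_fn=argmax,
                               max_length=60, cache_config=cache_config)

# Combine the two KV-cached models with speculative decoding.
fast_model = add_speculative_decoding(kv_cached_model, kv_cached_draft)

# The result exposes the generate method documented under gpt_fast;
# input_tokens is a (1, seq_len) tensor of prompt token ids.
tokens = fast_model.generate(cur_tokens=input_tokens, max_tokens=50, speculate_k=6)
```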