Starcoder2 Quantization Experiment

With StarCoder2, we run inference at different quantization levels and compare output quality, inference time, and occupied GPU memory.

In this experiment, the "bigcode/starcoder2-15b-instruct-v0.1" model was run in several precision/quantization configurations: 32-bit multi-GPU, 32-bit single-GPU, 16-bit (FP16) single-GPU, 16-bit brain float (BF16) single-GPU, 8-bit single-GPU, and 4-bit single-GPU.

The inference code follows the example on the starcoder2-15b Hugging Face page (the 8-bit and 4-bit variants were later modified): [https://huggingface.co/bigcode/starcoder2-15b]

After each model answer, the kernel was restarted, it was verified that the GPU memory was completely freed, and only then was the next experiment started.
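The check between runs is not spelled out in the original code; a minimal sketch of it could look like the snippet below, which queries the free memory on the visible device with torch.cuda.mem_get_info (alternatively, nvidia-smi can simply be watched):

import torch

# Assumed check, not part of the original experiment code: after the kernel
# restart, confirm that the visible GPU reports (almost) all of its memory as free.
torch.cuda.empty_cache()
free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # device 0 = first visible GPU
print(f"free: {free_bytes / 1e9:.2f} GB / total: {total_bytes / 1e9:.2f} GB")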

The prompts were kept short and zero-shot, sampling was disabled, and no prompt engineering was applied, so that the prompts themselves would not influence the outputs.

What is the difference between FP16 and BF16?

“While bf16 has a worse precision than fp16, it has a much much bigger dynamic range. Therefore, if in the past you were experiencing overflow issues while training the model, bf16 will prevent this from happening most of the time.”

[https://huggingface.co/docs/transformers/v4.16.2/en/performance]
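To make the difference concrete, the short illustrative snippet below compares the two dtypes with torch.finfo: bf16 keeps the fp32 exponent range but has fewer mantissa bits, so it overflows far later than fp16 at the cost of precision.

import torch

# Illustrative comparison, not part of the experiment code.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "eps:", info.eps)

# fp16 overflows above ~65504, while bf16 can represent values up to ~3.4e38:
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # 70144. (finite, but rounded)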

Inference Code

Importing the necessary libraries

requirements.txt:

accelerate==0.28.0
bitsandbytes==0.43.0
transformers==4.39.0
huggingface-hub==0.21.4
python-multipart==0.0.7
torch==2.2.1

Imports:

import logging
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

Loading the model

32multi

In this configuration, the model is split across the two GPUs, dividing the required memory almost exactly in half between them.

os.environ["CUDA_VISIBLE_DEVICES"] = "4,5"
MODEL = "bigcode/starcoder2-15b-instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir="./model", device_map="auto")
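How the layers were actually split across the two cards is not shown in the original post; if desired, the illustrative lines below print the device map produced by device_map="auto" and the memory currently allocated on each visible GPU:

# Assumed inspection code, not part of the original experiment.
print(model.hf_device_map)  # mapping of module names to GPU indices 0 and 1
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1e9:.3f} GB allocated")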

32single

os.environ["CUDA_VISIBLE_DEVICES"] = "5"
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir="./model").to("cuda")

Fp16

os.environ["CUDA_VISIBLE_DEVICES"] = "5"
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir="./model", device_map="auto", torch_dtype=torch.float16)

Bf16

os.environ["CUDA_VISIBLE_DEVICES"] = "5"
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir="./model", device_map="auto", torch_dtype=torch.bfloat16)

8bit

A warning was printed when running inference with the 8-bit and 4-bit quantized models. Searching for this warning turned up the following explanation:

“An important part of TensorFlow is that it is supposed to be fast. With a suitable installation, it works with CPUs, GPUs, or TPUs. Part of going fast means that it uses different code depending on your hardware. Some CPUs support operations that other CPUs do not, such as vectorized addition (adding multiple variables at once). Tensorflow is simply telling you that the version you have installed can use the AVX (Advanced Vector Extensions) and AVX2 operations and is set to do so by default in certain situations (say inside a forward or back-prop matrix multiply), which can speed things up. This is not an error, it is just telling you that it can and will take advantage of your CPU to get that extra speed out.”

Additionally, a ticket regarding the slow inference speed was opened at [https://discuss.huggingface.co/t/correct-usage-of-bitsandbytesconfig/33809/4]. As suggested there, the llm_int8_threshold=200.0 parameter was added to the config, which brought the inference time down from an unreasonable 1.5 minutes to around 30 seconds. That is still slow, but no longer absurd.

os.environ["CUDA_VISIBLE_DEVICES"] = "5"
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir="./model", quantization_config=quantization_config, low_cpu_mem_usage=True)

4bit

When only load_in_4bit=True is set, the warning “Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.” is printed. In other words, the 4-bit weights should not be combined with the default fp32 compute dtype, so the additional parameters below are passed. With this configuration the model is loaded in 4 bits and computation is carried out in 16 bits.

os.environ["CUDA_VISIBLE_DEVICES"] = "5"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir="./model", quantization_config=quantization_config, low_cpu_mem_usage=True)
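As a sanity check (not in the original code), the memory footprint of an 8-bit or 4-bit load can be printed directly from the model object; it should be roughly consistent with the GPU memory figures reported in the experiments below.

# Assumed check: reports the size of the loaded (quantized) weights in bytes.
print(f"model weights: {model.get_memory_footprint() / 1e9:.2f} GB")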

Sending the request to the model

In this part, only the prompt changes between experiments; the rest of the code stays the same. Sampling (temperature, top_k, and top_p) was left disabled so that it would not affect the experimental results.

tokenizer = AutoTokenizer.from_pretrained(MODEL)
inputs = tokenizer.encode("java staircase from stars", return_tensors="pt").to("cuda")
#outputs = model.generate(inputs, max_new_tokens=640, pad_token_id=tokenizer.eos_token_id, do_sample=True, temperature=0.1, top_k=50, top_p=0.95)
outputs = model.generate(inputs, max_length=512, pad_token_id=tokenizer.eos_token_id)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
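The original post does not show how the inference time and GPU memory figures were collected; a minimal sketch of one way to measure them around the generate() call is given below (time.perf_counter for wall-clock time, torch.cuda.max_memory_allocated for peak memory):

import time

# Assumed measurement code, not part of the original snippet.
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
outputs = model.generate(inputs, max_length=512, pad_token_id=tokenizer.eos_token_id)
elapsed = time.perf_counter() - start
print(f"inference time: {elapsed:.1f} s, peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.3f} GB")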

Experiment 1: “Java staircase from stars”

32multi

Inference time 25.9 s, response quality medium (a hashtag character was used instead of stars), GPU memory 31.374 GB + 31.372 GB = 62.748 GB

32single

Inference time 23.4 s, response quality medium (a hashtag character was used instead of stars), GPU memory 62.392 GB

Fp16

Inference time 17.0 s, response quality medium (a hashtag character was used instead of stars), GPU memory 31.430 GB

Bf16

Inference time 17.7 s, response quality medium (a hashtag character was used instead of stars), GPU memory 31.414 GB

8bit

Inference time 36.8 s, response quality good (a hashtag character was used instead of stars), GPU memory 10.020 GB

4bit

Inference time 35.1 s, response quality good (a hashtag character was used instead of stars), GPU memory 10.020 GB

Experiment 2: “A Python FastAPI service app that does create read update delete operations on strings that are called prompt”

This prompt targets Python and leaves the model a little more room for creativity.

32multi

Inference time 22.7 s, response quality good (the code comes after installation steps and the URIs), GPU memory 31.472 GB + 31.472 GB = 62.944 GB

32single

Inference time 22.8 s, response quality good, GPU memory 62.392 GB

Fp16

Inference time 16.6 s, response quality good, GPU memory 31.430 GB

Bf16

Inference time 16.3 s, response quality good, GPU memory 31.430 GB

8bit

Inference time 34.7 s, response quality medium, GPU memory 10.020 GB

4bit

Inference time 34.7 s, response quality medium, GPU memory 10.020 GB.

Conclusion


Response quality varies across quantization levels and is noted separately for each configuration above. FP16 and BF16 had the fastest inference times in every experiment. The 8-bit and 4-bit models use the least GPU memory, but their inference times are the longest. The PyTorch quantization page [https://pytorch.org/docs/stable/quantization.html] states that quantized models are faster, so the conclusion here is that there is a software anomaly (in BitsAndBytesConfig, in other classes, or in how they are used) or a hardware anomaly.

If you have a guess as to why this might be, I'd appreciate it if you let me know in the comments.
