5 Ways to Apply Quantization #AI(LLM·Vision)
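
All snippets below share a common setup. As a minimal sketch, they assume the following imports (the 4-bit examples additionally require the bitsandbytes package to be installed):

import torch
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig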

1. Set torch_dtype when loading the model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any Hub model id works here
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
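
A quick sanity check (not in the original post) is to inspect the dtype of the loaded weights:

print(model.dtype)                     # torch.float16
print(next(model.parameters()).dtype)  # torch.float16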

2. Set load_in_4bit when loading the model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", 
                                             device_map="auto",
                                             load_in_4bit=True)  # set the flag here (requires the bitsandbytes package)

3. Use AutoConfig when loading the model

config = AutoConfig.from_pretrained("model_name", torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("model_name", config=config,
                                             torch_dtype="auto")  # "auto" picks up the dtype recorded in the config

4. Use BitsAndBytesConfig when loading the model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants, saving extra memory
    bnb_4bit_quant_type="nf4",             # NormalFloat4, designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16  # dtype used for the actual computations
)

model = AutoModelForCausalLM.from_pretrained("model_name", quantization_config=bnb_config)
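
Since the whole point is memory savings, it is worth checking the footprint; get_memory_footprint() is a standard transformers helper that reports the size of the model's parameters and buffers:

print(f"{model.get_memory_footprint() / 1e9:.2f} GB")  # 4-bit weights shrink this to a fraction of the fp16/fp32 size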

5. Set torch_dtype when creating a pipeline

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    model_kwargs={"torch_dtype": torch.bfloat16},  # specified here
)
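
The resulting pipeline is then called as usual; a minimal usage sketch (the prompt and generation settings are illustrative):

output = pipeline("Explain quantization in one sentence.", max_new_tokens=50)
print(output[0]["generated_text"])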
 
* bfloat16 has the same 8-bit exponent as float32, so it can represent a far wider range of values than float16.
* Because its exponent bits match float32's, underflow and overflow occur less often, making it numerically more stable.
* bfloat16 is well suited to deep learning models, especially Transformer-based ones.
* Compared with float32, bfloat16 halves memory usage while largely preserving model accuracy.
* float16 can be very slow when processed on CPU, whereas bfloat16 suffers less from this problem.
→ When working with large language models, bfloat16 is usually the better choice.
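
The range claims above are easy to verify with torch.finfo (a small check; the printed values come from PyTorch itself):

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    # bfloat16's max (~3.39e38) is close to float32's, while float16 tops out at 65504
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)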
