Evaluating quantized Large Language Models for code generation on low-resource language benchmarks

Date
2025-08-02
Authors
Nyamsuren, Enkhbold
Publisher
Elsevier Ltd.
Abstract
Democratization of AI, that is, making AI accessible and usable for everyone, is an important issue within the broader topic of the digital divide. It is especially relevant to Large Language Models (LLMs), which are becoming increasingly popular as AI co-pilots but suffer from limited accessibility due to their high computational demand. In this study, we evaluate whether LLM quantization is a viable approach to enabling LLMs on generic consumer devices. The study assesses the performance of five quantized code LLMs on Lua and Python code generation tasks. All code LLMs had approximately 7 billion parameters and were deployed on a generic CPU-only consumer laptop. To evaluate the impact of quantization, the models were tested at 2-, 4-, and 8-bit integer precisions. Pass@1 and pass@10 evaluations were done at varying temperatures and token sampling rates. Along with tasks such as question answering, text summarization, and text generation, programming is one of the popular applications of AI co-pilots. Furthermore, code generation is a high-precision task, which makes it a suitable benchmark for evaluating and comparing quantized models for everyday use by individuals. Lua was chosen as a low-resource language to avoid models' biases toward high-resource languages. Performance in Lua is contrasted against performance in Python, which was chosen as a high-resource language. The results suggest that models quantized at 4-bit integer precision offer the best trade-off between performance and model size; these models can be comfortably deployed on an average laptop without a dedicated GPU. The findings suggest that lower quantization precision degrades performance in low-resource languages more than in high-resource languages. However, the results also hint that quantizing from full precision to any integer precision affects performance in the high-resource language more strongly.
Models quantized at 8-bit integer precision require longer inference that does not effectively translate into better performance. While quantization indeed improves the accessibility of smaller LLMs with 7 billion parameters, these LLMs demonstrate overall low performance (below 50%) on high-precision, low-resource tasks such as Lua code generation. Thus, although accessibility is improved, usability still falls short of the practical level offered by foundational LLMs such as GPT-4o or Llama 3.1 with 405 billion parameters. Additionally, in most failed instances, the models generate code that is free of syntax errors but fails unit tests or has runtime issues. This means that any generated code requires extensive testing, which may negate any potential efficiency boost delivered by these smaller coding models.
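The pass@1 and pass@10 metrics mentioned above are conventionally computed with the standard unbiased pass@k estimator: given n generated samples per task, of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). The sketch below illustrates that estimator; it is an assumption about the general methodology, not code taken from this study.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    where c of the n generations pass all unit tests."""
    if n - c < k:
        # Fewer than k failing samples exist, so every draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations for a task, 3 of which pass the unit tests.
print(pass_at_k(10, 3, 1))   # 0.3  (equals the raw pass rate for k=1)
print(pass_at_k(10, 3, 10))  # 1.0  (drawing all 10 guarantees a pass)
```

For k=1 the estimator reduces to the plain fraction of passing samples, which is why pass@1 is often reported alongside pass@10 as a stricter single-attempt measure.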
Keywords
Large Language Model, Lua, Python, Code generation, Quantization, AI democratization
Citation
Nyamsuren, E. (2025) 'Evaluating quantized Large Language Models for code generation on low-resource language benchmarks', Journal of Computer Languages, 84, 101351 (15pp). https://doi.org/10.1016/j.cola.2025.101351