UCCIX: Irish-eXcellence Large Language Model

Field | Value | Language
dc.contributor.author | Tran, Khanh-Tung | en
dc.contributor.author | O’Sullivan, Barry | en
dc.contributor.author | Nguyen, Hoang D. | en
dc.contributor.funder | Science Foundation Ireland | en
dc.date.accessioned | 2025-04-23T15:33:48Z |
dc.date.available | 2025-04-23T15:33:48Z |
dc.date.issued | 2024 | en
dc.description.abstract | The development of Large Language Models (LLMs) has predominantly focused on high-resource languages, leaving extremely low-resource languages like Irish with limited representation. This work presents UCCIX, a pioneering effort on the development of an open-source Irish-based LLM. We propose a novel framework for continued pre-training of LLMs specifically adapted for extremely low-resource languages, requiring only a fraction of the textual data typically needed for training LLMs according to scaling laws. Our model, based on Llama 2-13B, outperforms much larger models on Irish language tasks with up to 12% performance improvement, showcasing the effectiveness and efficiency of our approach. We also contribute comprehensive Irish benchmarking datasets, including IrishQA, a question-answering dataset, and an Irish version of MT-bench. These datasets enable rigorous evaluation and facilitate future research in Irish LLM systems. Our work aims to preserve and promote the Irish language, knowledge, and culture of Ireland in the digital era while providing a framework for adapting LLMs to other indigenous languages. | en
dc.description.status | Peer reviewed | en
dc.description.version | Published Version | en
dc.format.mimetype | application/pdf | en
dc.identifier.citation | Tran, K.-T., O’Sullivan, B. and Nguyen, H.D. (2024) ‘UCCIX: Irish-excellence Large Language Model’. arXiv. https://doi.org/10.48550/ARXIV.2405.13010 | en
dc.identifier.doi | https://doi.org/10.48550/ARXIV.2405.13010 | en
dc.identifier.endpage | 4 | en
dc.identifier.startpage | 1 | en
dc.identifier.uri | https://hdl.handle.net/10468/17310 |
dc.language.iso | en | en
dc.publisher | arXiv | en
dc.relation.project | info:eu-repo/grantAgreement/SFI/Research Centres Programme::Phase 2/12/RC/2289_P2/IE/INSIGHT_Phase 2 / | en
dc.relation.project | info:eu-repo/grantAgreement/SFI/Centres for Research Training (CRT) Programme/18/CRT/6223/IE/SFI Centre for Research Training in Artificial Intelligence/ | en
dc.relation.uri | https://arxiv.org/abs/2405.13010 | en
dc.rights | © 2024, the Authors. | en
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en
dc.subject | Large Language Models | en
dc.subject | LLMs | en
dc.subject | Open-source Irish-based LLM | en
dc.subject | Irish LLM systems | en
dc.title | UCCIX: Irish-eXcellence Large Language Model | en
dc.type | Article (peer-reviewed) | en
Files
Original bundle
Name: 2405.13010v1.pdf
Size: 986.25 KB
Format: Adobe Portable Document Format
Description: Published Version
License bundle
Name: license.txt
Size: 2.71 KB
Format: Item-specific license agreed upon to submission