Addressing the gap between current language models and key-term-based clustering

DocEng2023__ModKT.pdf (4.4 MB)
Accepted Version
Cabral, Eric M.
Rezaeipourfarsangi, Sima
Oliveira, Maria Cristina F.
Milios, Evangelos E.
Minghim, Rosane
Association for Computing Machinery
This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections.
Document clustering, Interactive clustering, User-centred clustering, Clustering analysis, Document embeddings
Cabral, E. M., Rezaeipourfarsangi, S., Oliveira, M. C. F., Milios, E. E., and Minghim, R. (2023) 'Addressing the gap between current language models and key-term-based clustering', ACM Symposium on Document Engineering 2023 (DocEng ’23), Limerick, Ireland, 22-25 August, pp. 1-10. doi: 10.1145/3573128.3604900
© 2023, the Authors. Publication rights licensed to ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from