Addressing the gap between current language models and key-term-based clustering

Files
DocEng2023__ModKT.pdf(4.4 MB)
Accepted Version
Date
2023-08-22
Authors
Cabral, Eric M.
Rezaeipourfarsangi, Sima
Oliveira, Maria Cristina F.
Milios, Evangelos E.
Minghim, Rosane
Publisher
Association for Computing Machinery
Abstract
This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections.
Keywords
Document clustering, Interactive clustering, User-centred clustering, Clustering analysis, Document embeddings
Citation
Cabral, E. M., Rezaeipourfarsangi, S., Oliveira, M. C. F., Milios, E. E., and Minghim, R. (2023) 'Addressing the gap between current language models and key-term-based clustering', ACM Symposium on Document Engineering 2023 (DocEng ’23), Limerick, Ireland, 22-25 August, pp. 1-10. doi: 10.1145/3573128.3604900
Copyright
© 2023, the Authors. Publication rights licensed to ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org