Addressing the gap between current language models and key-term-based clustering
dc.contributor.author | Cabral, Eric M. | en |
dc.contributor.author | Rezaeipourfarsangi, Sima | en |
dc.contributor.author | Oliveira, Maria Cristina F. | en |
dc.contributor.author | Milios, Evangelos E. | en |
dc.contributor.author | Minghim, Rosane | en |
dc.contributor.funder | Conselho Nacional de Desenvolvimento Científico e Tecnológico | en |
dc.contributor.funder | Coordenação de Aperfeiçoamento de Pessoal de Nível Superior | en |
dc.contributor.funder | Fundação de Amparo à Pesquisa do Estado de São Paulo | en |
dc.contributor.funder | Natural Sciences and Engineering Research Council of Canada | en |
dc.contributor.funder | Boeing | en |
dc.contributor.funder | CALDO | en |
dc.contributor.funder | National Science and Technology Infrastructure Program | en |
dc.date.accessioned | 2023-11-03T14:35:38Z | |
dc.date.available | 2023-11-03T14:35:38Z | |
dc.date.issued | 2023-08-22 | en |
dc.description.abstract | This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections. | en |
dc.description.sponsorship | Conselho Nacional de Desenvolvimento Científico e Tecnológico (Brazil, grant 301847/2017-7); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Brazil, grant 88887.506862/2020-00); Fundação de Amparo à Pesquisa do Estado de São Paulo (Brazil, grant 2018/22214-6), | en |
dc.description.status | Peer reviewed | en |
dc.description.version | Accepted Version | en |
dc.format.mimetype | application/pdf | en |
dc.identifier.citation | Cabral, E. M., Rezaeipourfarsangi, S., Oliveira, M. C. F., Milios, E. E., and Minghim, R. (2023) 'Addressing the gap between current language models and key-term-based clustering', ACM Symposium on Document Engineering 2023 (DocEng ’23), Limerick, Ireland, 22-25 August, pp. 1-10. doi: 10.1145/3573128.3604900 | en |
dc.identifier.doi | 10.1145/3573128.3604900 | en |
dc.identifier.endpage | 10 | en |
dc.identifier.isbn | 979-8-4007-0027-9 | en |
dc.identifier.startpage | 1 | en |
dc.identifier.uri | https://hdl.handle.net/10468/15192 | |
dc.language.iso | en | en |
dc.publisher | Association for Computing Machinery | en |
dc.relation.ispartof | ACM Symposium on Document Engineering 2023 (DocEng ’23), Limerick, Ireland, 22-25 August. | en |
dc.rights | © 2023, the Authors. Publication rights licensed to ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org | en |
dc.subject | Document clustering | en |
dc.subject | Interactive clustering | en |
dc.subject | User-centred clustering | en |
dc.subject | Clustering analysis | en |
dc.subject | Document embeddings | en |
dc.title | Addressing the gap between current language models and key-term-based clustering | en |
dc.type | Conference item | en |