Addressing the gap between current language models and key-term-based clustering

dc.contributor.author: Cabral, Eric M. [en]
dc.contributor.author: Rezaeipourfarsangi, Sima [en]
dc.contributor.author: Oliveira, Maria Cristina F. [en]
dc.contributor.author: Milios, Evangelos E. [en]
dc.contributor.author: Minghim, Rosane [en]
dc.contributor.funder: Conselho Nacional de Desenvolvimento Científico e Tecnológico [en]
dc.contributor.funder: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior [en]
dc.contributor.funder: Fundação de Amparo à Pesquisa do Estado de São Paulo [en]
dc.contributor.funder: Natural Sciences and Engineering Research Council of Canada [en]
dc.contributor.funder: Boeing [en]
dc.contributor.funder: CALDO [en]
dc.contributor.funder: National Science and Technology Infrastructure Program [en]
dc.date.accessioned: 2023-11-03T14:35:38Z
dc.date.available: 2023-11-03T14:35:38Z
dc.date.issued: 2023-08-22 [en]
dc.description.abstract: This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections. [en]
dc.description.sponsorship: Conselho Nacional de Desenvolvimento Científico e Tecnológico (Brazil, grant 301847/2017-7); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Brazil, grant 88887.506862/2020-00); Fundação de Amparo à Pesquisa do Estado de São Paulo (Brazil, grant 2018/22214-6), [en]
dc.description.status: Peer reviewed [en]
dc.description.version: Accepted Version [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.citation: Cabral, E. M., Rezaeipourfarsangi, S., Oliveira, M. C. F., Milios, E. E., and Minghim, R. (2023) 'Addressing the gap between current language models and key-term-based clustering', ACM Symposium on Document Engineering 2023 (DocEng ’23), Limerick, Ireland, 22-25 August, pp. 1-10. doi: 10.1145/3573128.3604900 [en]
dc.identifier.doi: 10.1145/3573128.3604900 [en]
dc.identifier.endpage: 10 [en]
dc.identifier.isbn: 979-8-4007-0027-9 [en]
dc.identifier.startpage: 1 [en]
dc.identifier.uri: https://hdl.handle.net/10468/15192
dc.language.iso: en [en]
dc.publisher: Association for Computing Machinery [en]
dc.relation.ispartof: ACM Symposium on Document Engineering 2023 (DocEng ’23), Limerick, Ireland, 22-25 August. [en]
dc.rights: © 2023, the Authors. Publication rights licensed to ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org [en]
dc.subject: Document clustering [en]
dc.subject: Interactive clustering [en]
dc.subject: User-centred clustering [en]
dc.subject: Clustering analysis [en]
dc.subject: Document embeddings [en]
dc.title: Addressing the gap between current language models and key-term-based clustering [en]
dc.type: Conference item [en]
Files
Original bundle
Name: DocEng2023__ModKT.pdf
Size: 4.4 MB
Format: Adobe Portable Document Format
Description: Accepted Version