EWU Institutional Repository

Domain-Independent, User-centric Text Classification, and Clustering Framework


dc.contributor.author Yeasmin, Sumona
dc.contributor.author Afrin, Nazia
dc.contributor.author Saif, Kashfia
dc.date.accessioned 2022-08-01T07:35:35Z
dc.date.available 2022-08-01T07:35:35Z
dc.date.issued 2022-06-30
dc.identifier.uri http://dspace.ewubd.edu:8080/handle/123456789/3664
dc.description This thesis was submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering of East West University, Dhaka, Bangladesh. en_US
dc.description.abstract Traditional text document clustering and classification methods represent documents with uncontextualized word embeddings and vector space models. Recent text clustering and classification techniques often rely on word embeddings as a transfer learning component. We explored existing text document clustering and classification methodologies and evaluated their strengths and weaknesses. We started with models based on Bag of Words and moved towards transformer-based architectures, concluding that transformer-based embeddings are necessary to capture contextual meaning. The BERT (Bidirectional Encoder Representations from Transformers) architecture produces robust word embeddings by analyzing both the left and right context of each word. Several classification and clustering algorithms were applied to the word embeddings of the pre-trained, state-of-the-art BERT model. We conducted experimental analysis of both classification and clustering algorithms on two different datasets. The classification results show that the random forest classifier obtains around 75% accuracy, which is higher than the decision tree and k-nearest neighbor (KNN) algorithms. Compared with existing similar work, the obtained results show up to 50% improvement in accuracy. The clustering analysis shows that K-Means obtained a maximum of 0.654 on the Dunn index and 0.135 on the Silhouette coefficient, and DBSCAN obtained a maximum of 0.115. Our capstone project introduces a novel domain-independent, user-centric text clustering and classification framework. In the multi-domain text clustering search system, an agent acts on user behavior together with the user profile. Users can explore document collections by selecting multiple repositories. Users can upload an uncategorized document, and the developed framework will find similar documents. The developed prototype provides context for the similarity and also finds similar documents within the same domain based on user preferences. en_US
dc.language.iso en_US en_US
dc.publisher East West University en_US
dc.relation.ispartofseries ;CSE00197
dc.subject Natural Language Processing, Classification, Clustering, Transformer-based embedding, Contextual Similarity, Cluster-Domain Mapping en_US
dc.title Domain-Independent, User-centric Text Classification, and Clustering Framework en_US
dc.type Thesis en_US
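
The abstract reports Dunn index values for the clustering experiments. As a reading aid, the following is a minimal sketch of how a Dunn index can be computed: the smallest distance between points in different clusters, divided by the largest distance between points in the same cluster. It is not the thesis's implementation. The function name `dunn_index`, the Euclidean distance, and the toy 2-D points are all assumptions made for illustration; real BERT embeddings would be much higher-dimensional.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    # Group points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn index needs at least two clusters")

    # Diameter: largest pairwise distance inside any single cluster.
    max_diameter = max(
        (dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("every cluster is a single point; index undefined")

    # Separation: smallest distance between points in different clusters.
    min_separation = min(
        dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter


# Toy 2-D "embeddings" (hypothetical data, not taken from the thesis).
toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
toy_labels = [0, 0, 1, 1]
toy_score = dunn_index(toy_points, toy_labels)
```

Higher values indicate compact, well-separated clusters. This brute-force version compares every pair of points, so it is only suitable for small samples.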

