Domain-Independent, User-centric Text Classification, and Clustering Framework

Yeasmin, Sumona; Afrin, Nazia; Saif, Kashfia

dc.contributor.author	Yeasmin, Sumona
dc.contributor.author	Afrin, Nazia
dc.contributor.author	Saif, Kashfia
dc.date.accessioned	2022-08-01T07:35:35Z
dc.date.available	2022-08-01T07:35:35Z
dc.date.issued	2022-06-30
dc.identifier.uri	http://dspace.ewubd.edu:8080/handle/123456789/3664
dc.description	This thesis was submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering of East West University, Dhaka, Bangladesh.	en_US
dc.description.abstract	Traditional text document clustering and classification methods represent documents with uncontextualized word embeddings and vector space models. Recent text clustering and classification techniques often rely on word embeddings as a transfer learning component. We explored existing text document clustering and classification methodologies and evaluated their strengths and weaknesses, starting with models based on Bag of Words and moving towards transformer-based architectures. We concluded that transformer-based embeddings are necessary to capture contextual meaning. The BERT (Bidirectional Encoder Representations from Transformers) architecture produces robust word embeddings by analyzing both the left and right context of each word. Several classification and clustering algorithms were applied to the word embeddings of the pre-trained, state-of-the-art BERT model. This research conducted an experimental analysis of both classification and clustering algorithms on two different datasets. The classification results show that the random forest classifier obtains around 75% accuracy, which is higher than the decision tree and k-nearest neighbor (KNN) algorithms. Furthermore, compared with existing similar work, the obtained results show up to 50% improvement in accuracy. The clustering analysis shows that K-Means obtained a maximum of 0.654 in the Dunn index and 0.135 in the Silhouette coefficient, and DBSCAN obtained a maximum of 0.115. Our capstone project introduces a novel domain-independent, user-centric text clustering and classification framework. In the multi-domain text clustering search system, an agent acts on user behavior and the user profile. Users can explore document collections by selecting multiple repositories.
Users can upload an uncategorized document, and the developed framework will find similar documents. The developed prototype provides context for the similarity and also finds similar documents within the same domain based on user preferences.	en_US
dc.language.iso	en_US	en_US
dc.publisher	East West University	en_US
dc.relation.ispartofseries	;CSE00197
dc.subject	Natural Language Processing, Classification, Clustering, Transformer-based embedding, Contextual Similarity, Cluster-Domain Mapping	en_US
dc.title	Domain-Independent, User-centric Text Classification, and Clustering Framework	en_US
dc.type	Thesis	en_US