Abstract:
Traditional text document clustering and classification methods represent documents with uncontextualized word embeddings and vector space models. Recent text clustering and classification techniques often rely on word embeddings as a transfer learning component. We explored the existing text document clustering and classification methodologies and evaluated their strengths and weaknesses, starting with models based on Bag of Words and moving towards transformer-based architectures. We concluded that transformer-based embeddings are necessary to capture contextual meaning. The BERT (Bidirectional Encoder Representations from Transformers) architecture produces robust word embeddings by analyzing context in both the left-to-right and right-to-left directions. Several classification and clustering algorithms were applied to the word embeddings of the pre-trained, state-of-the-art BERT model. This research conducted an experimental analysis of both classification and clustering algorithms on two different datasets. The analysis of the classification results shows that the random forest classifier obtains around 75% accuracy, which is higher than that of the decision tree and k-nearest neighbor (KNN) algorithms. Furthermore, the obtained results were compared with existing similar work and show up to 50% improvement in accuracy. The clustering analysis shows that K-Means obtained a maximum of 0.654 on the Dunn index and 0.135 on the Silhouette coefficient, and DBSCAN obtained a maximum of 0.115. Our capstone project introduces a novel domain-independent, user-centric text clustering and classification framework. In its multi-domain text clustering search system, an agent acts on the basis of user behavior and the user profile. Users can explore document collections by selecting multiple repositories. Users can also upload an uncategorized document, and the developed framework will find similar documents.
The developed prototype provides context for the similarity and also finds similar documents within the same domain based on user preferences.
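As an illustration of the Dunn index reported above, the following is a minimal sketch of one common variant: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The abstract does not state which variant the thesis uses, so this choice is an assumption, as are the function name `dunn_index` and the toy 2-D points, which only stand in for BERT embedding vectors.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index (one common variant, assumed here):
    min distance between points of different clusters
    divided by max distance between points of the same cluster.
    Higher values indicate compact, well-separated clusters."""
    # Group points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn index needs at least two clusters")

    # Largest within-cluster distance (cluster diameter).
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; index is undefined")

    # Smallest distance between points that belong to different clusters.
    min_separation = min(
        math.dist(a, b)
        for members_a, members_b in combinations(clusters.values(), 2)
        for a in members_a
        for b in members_b
    )
    return min_separation / max_diameter


# Toy data: two tight groups of 2-D points (hypothetical stand-ins for embeddings).
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
toy_labels = [0, 0, 1, 1]
print(dunn_index(toy_points, toy_labels))
```

In practice the points would be document embedding vectors and the labels would come from a clustering algorithm such as K-Means or DBSCAN. The pairwise loops are quadratic in the number of points, so this sketch suits small collections only.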
Description:
This thesis was submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering of East West University, Dhaka, Bangladesh.