Abstract:
In the era of information and communications technology, data fuels the growth of organizations, and companies ingest raw data in massive volumes from countless sources. The question is how they can examine this data in a way that is insightful, useful, and meaningful. This is where Big Data comes into play. Apache Spark is an open-source framework used to process Big Data, and it is a leading platform for large-scale SQL, batch processing, stream processing, and machine learning. A major drawback of traditional algorithms is that they take much longer to run and have difficulty processing large volumes of data. Apache Spark multi-node clustering comes to the rescue here: a cluster is a collection of independent machines connected through a dedicated network that work as a single centralized data processing resource. Collaborative filtering has become very popular, but traditional recommender systems often face challenges when handling huge datasets. To overcome some of these difficulties and restrictions, we implemented a distributed approach that uses parallel computing to deal with big datasets, built on Apache Spark multi-node clustering. We used several clustering algorithms to find the similarity between users. Finally, we compare the overall performance of a single machine against a multi-node cluster. Apache Spark performs strongly in terms of scalability. We consider improvements to the scalability and robustness of the system and to the evaluation parameters. We also implemented a clustering algorithm using PySpark, which we use to build an efficient parallel implementation of our recommendation system.
Description:
This thesis was submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering of East West University, Dhaka, Bangladesh.