Data Analysis Using Pyspark In Machine Learning | Codersarts

Codersarts AI
Jun 27, 2020

Assignment Task

This assignment consists of two deliverables:

One code implementation (40%). The code file, in Jupyter Notebook format, and the relevant data set files should be placed in a folder named Task 3_YourName_StudentNumber. Zip the folder and upload it to Blackboard.
A report (60%). The report must be uploaded as a separate file.

Part I - PySpark source code

Important Note: For code reproduction, your code must be self-contained. That is, it should not require any libraries beyond the PySpark environment we have used during the semester. The data files must be packaged with your code file.

In this component, we need to utilise Python 3 and PySpark to complete the following data analysis tasks:

1. Exploratory data analysis

2. Recommendation engine

3. Classification

You need to choose a dataset from Kaggle (https://www.kaggle.com/datasets) to complete these tasks. Remember to include the data set file in your source code submission.

Note: In your notebook, please use a Heading 1 Markdown cell to separate each subtask.

Task 1.1: Exploratory data analysis

This subtask requires you to explore your dataset by

reporting its number of rows and columns,
cleaning the data (handling missing values or duplicated records) if necessary,
selecting 3 columns and drawing 1 plot (e.g. bar chart, histogram, or boxplot) for each to summarise it.

Task 1.2: Recommendation engine

This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include

Model training and predictions
Model evaluation using MSE

Task 1.3: Classification

This subtask requires you to implement a classification system with Logistic regression. You need to include

Logistic Regression model training
Model evaluation

Part II - Report

You are required to write a report with the following content:

Provide a high-level survey on the advances of data science in the past 2 years.
Explain how Spark fits into the field of data science. Compare Spark with its competitors.
Explain your design and implementation of the machine learning parts in your code, including the following topics:

1. Background of your selected data set

2. For each task, which learning algorithm is used, what its key parameters are, and how you set them up

3. For each task, provide comments/evaluation for the model learned

Your report should use the following template:

Table of Contents

1.0 Advancement of Data Science (550 words)

2.0 Spark in Data Science (200 words)

3.0 Machine Learning Implementation (250 words)

3.1 Data set

3.2 Collaborative filtering

Features of the model, key parameters and configuration

Evaluation

3.3 Logistic regression

Features of the model, key parameters and configuration

Evaluation

Feel free to contact us and take advantage of the Machine Learning assignment help services we offer. We are an assignment writing service provider and can help solve your academic worries. You can easily connect with us through phone, e-mail, or live chat. You can contact us anytime; our experts are always available to help. Besides this, we will also provide CONSULTANCY for your app for FREE!

So, if you are still reading this and have an app idea, drop us a message. We can talk through your project and get things done! You are just one step away from getting it done.
