You will work in teams of maximum three students from the previous assignment.

The programming code will be graded on both implementation and correctness. The written report will be graded on content, conclusions, and presentation. It must be formatted according to the given template (posted on Canvas). The report will be graded as if the values obtained from the code portion were correct. The report should be short and to the point. The length should be between 2 to 4 pages of text plus any tables and graphs.

**Assignment Goals and Tasks**

This assignment is intended to build the following skills:

Implementation of the iterative optimization Gradient Descent algorithms for solving a Linear Regression problem

Implementation of various regularization techniques

Polynomial regression

Learning curve

**Assignment Instructions**

**Note: **you are not allowed to use any Scikit-Learn models or functions.

**i.** The code should be written in a Jupyter notebook and the report should be

prepared as a PDF file.

**a. **Name the notebook

`<lastname1>_<firstname1>_<lastname2>_<firstname2>_assignment1.ipynb`

**b**. Name the PDF

`<lastname1>_<firstname1>_<lastname2>_<firstname2>_assignment1.pdf`

**ii. **The Jupyter notebook and the report should be submitted via webhandin. Only

one submission is required for each group.

**iii.** Use the cover page (posted on Canvas) as the front page of your report.

**iv. **Download the wine quality dataset from:

http://archive.ics.uci.edu/ml/datasets/Wine+Quality

You will be using the red wine dataset: “winequality-red.csv”

**Part A: Model Code **

**1. **Implement the following function that generates the polynomial and interaction features for a given degree of the polynomial.

**polynomialFeatures(X, degree)**

**Argument:**

**X : ndarray**

A numpy array with rows representing data samples and columns

representing features (d-dimensional feature).

**degree : integer**

The degree of the polynomial features. Default = 1.

**Returns:**

A new feature matrix consisting of all polynomial combinations of the features with degree equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [a, b, a2, ab, b2 ].

`mse(Y_true, Y_pred)`

**Arguments:**

**Y_true : ndarray**

1D array containing data with “float” type. True y values.

**Y_pred : ndarray**

1D array containing data with “float” type. Values predicted by your model.

**Returns:**

**cost : float**

It returns a float value containing mean squared error between Y_true and

**Y_pred.**

**Note**: these 1D arrays should be designed as column vectors.

**3.** Implement the following function to compute training and validation errors. It will be used to plot learning curves. The function takes the feature matrix X (usually the training data matrix) and the training size (from the “train_size” parameter) and by using cross-validation computes the average mse for the training fold and the validation fold. It iterates through the entire X with an increment step of the “train_size”. For example, if there are 50 samples (rows) in X and the “train_size” is 10, then the function will start from the first 10 samples and will successively add 10 samples in each iteration. During each iteration it will use k-fold cross-validation to compute the average mse for the training fold and the validation fold. Thus, for example, for 50 samples there will be 5 iterations (on 10, 20, 30, 40, and 50 samples) and for each iteration it will compute the cross-validated average mse for the training and the validation fold. For training the model (using the “fit” method) it will use the model parameters from the function argument. The function will return two arrays containing training and validation root-mean-square error (rmse) values.

```
learning_curve(model, X, Y, cv, train_size=1, learning_rate=0.01,
epochs=1000, tol=None, regularizer=None, lambd=0.0, **kwargs)
```

**Arguments:**

**model :**

object type that implements the “fit” and “predict” methods. An object of that type which is cloned for each validation.

**X : ndarray**

A numpy array with rows representing data samples and columns representing features.

**Y : ndarray**

A 1D numpy array with labels corresponding to each row of the feature matrix X.

**cv : int**

integer, to specify the number of folds in a k-fold cross-validation.

**train_sizes :** int or float

Relative or absolute numbers of training examples that will be used to generate the learning curve. If the data type is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets.

**learning_rate : **float

It provides the step size for parameter update.

**epochs :** int

The maximum number of passes over the training data for updating the weight vector.

**tol :** float or None

The stopping criterion. If it is not None, the iterations will stop when (error > previous_error - tol). If it is None, the number of iterations will be set by the “epochs”.

**regularizer : **string

The string value could be one of the following: l1, l2, None. If it’s set to None, the cost function without the regularization term will be used for computing the gradient and updating the weight vector.

However, if it’s set to l1 or l2, the appropriate regularized cost function needs to be used for computing the gradient and updating the weight vector.

**lambd :** float

It provides the regularization coefficient. It is used only when the “regularizer” is set to l1 or l2.

**Returns: **

**train_scores : **

ndarray root-mean-square error (rmse) values on training sets.

**val_scores :**

ndarray root-mean-square error (rmse) values on validation sets.

**4. **[Extra Credit for 478 and Mandatory for 878] Implement the following function to plot the training and validation root mean square error (rmse) values of the data matrix X for various polynomial degree starting from 1 up to the value set by “maxPolynomialDegree”. It takes the data matrix X (usually the training data matrix) and the maxPolynomialDegree; and for each polynomial degree it will augment the data matrix, then use k-fold cross-validation to compute the average mse for both the training and the validation fold. For training the model (using the “fit” method) it will use the model parameters from the function argument. Finally, the function will plot the root-mean-square error (rmse) values for the training and validation folds for each degree of the data matrix starting from 1 up to the maxPolynomialDegree

`plot_polynomial_model_complexity(model, X, Y, cv, maxPolynomialDegree, learning_rate=0.01, epochs=1000, tol=None, regularizer=None, lambd=0.0, **kwargs)`

**Arguments: **

**model :**

object type that implements the “fit” and “predict” methods. An object of that type which is cloned for each validation.

**X : ndarray**

A numpy array with rows representing data samples and columns representing features.

**Y : ndarray**

A 1D numpy array with labels corresponding to each row of the feature matrix X.

**cv : int**

integer, to specify the number of folds in a (Stratified)K-Fold,

**maxPolynomialDegree : int**

It will be used to determine the maximum polynomial degree for X. For example, if it is set to 3, then the function will compute both the training and validation mse values for degree 1, 2 and 3.

**learning_rate :**

float It provides the step size for parameter update.

**epochs : int**

The maximum number of passes over the training data for updating the weight vector.

**tol : float or None **

The stopping criterion. If it is not None, the iterations will stop when (error > previous_error - tol). If it is None, the number of iterations will be set by the “epochs”.

**regularizer : string**

The string value could be one of the following: l1, l2, None. If it’s set to None, the cost function without the regularization term will be used for computing the gradient and updating the weight vector. However, if it’s set to l1 or l2, the appropriate regularized cost function needs to be used for computing the gradient and updating the weight

**vector. lambd :**

float It provides the regularization coefficient. It is used only when the “regularizer” is set to l1 or l2.

**Returns: **

There is no return value. This function plots the root-mean-square error (rmse) values for both the training set and the validation set for degree of X between 1 and maxPolynomialDegree.

**5.** Implement a Linear_Regression model class. It should have the following three methods. Note the that “fit” method should implement the batch gradient descent algorithm.

**a)** fit(self, X, Y, learning_rate=0.01, epochs=1000, tol=None, regularizer=None, lambd=0.0, **kwargs)

**Arguments: **

**X : ndarray**

A numpy array with rows representing data samples and columns representing features.

**Y : ndarray**

A 1D numpy array with labels corresponding to each row of the feature matrix X.

**learning_rate : **

float It provides the step size for parameter update.

**epochs : int**

The maximum number of passes over the training data for updating the weight vector.

**tol : float or None**

The stopping criterion. If it is not None, the iterations will stop when (error > previous_error - tol). If it is None, the number of iterations will be set by the “epochs”.

**regularizer : string**

The string value could be one of the following: l1, l2, None. If it’s set to None, the cost function without the regularization term will be used for computing the gradient and updating the weight vector. However, if it’s set to l1 or l2, the appropriate regularized cost function needs to be used for computing the gradient and updating the weight vector.

**Note**: you may define two helper functions for computing the regularized cost for “l1” and “l2” regularizers.

**lambd : float**

It provides the regularization coefficient. It is used only when the “regularizer” is set to l1 or l2. Returns: No return value necessary

**Note:** the “fit” method should use a weight vector “theta_hat” that contains the parameters for the model (one parameter for each feature and one for bias). The “theta_hat” should be a 1D column vector.

Finally, it should update the model parameter “theta” to be used in “predict” method as follows. self.theta = theta_hat

**b) predict(self, X)**

**Arguments: **

**X : ndarray**

A numpy array containing samples to be used for prediction. Its rows represent data samples and columns represent features.

**Returns:**

1D array of predictions for each row in X. The 1D array should be designed as a column vector.

**Note:**

the “predict” method uses the self.theta to make predictions.

**__init__(self)**

It’s a standard python initialization function so we can instantiate the class. Just “pass” this

**Part B: Data Processing **

**6.** Read in the winequality-red.csv file as a Pandas data frame.

**7. **Use the techniques from the recitation to summarize each of the variables in the dataset in terms of mean, standard deviation, and quartiles. Include this in your report. [3 pts]

**8.** Shuffle the rows of your data. You can use def = df.sample(frac=1) as an idiomatic way to shuffle the data in Pandas without losing column names. [2 pts]

**9.** Generate pair plots using the seaborn package. This will be used to identify and report the redundant features, if there is any.

**Part C: Model Evaluation**

**10. **Model selection via Hyperparameter tuning: Use the kFold function (known as sFold function from previous assignment) to evaluate the performance of your model over each combination of lambd, learning_rate and regularizer from the following sets:

**a.** lambd = [1.0, 0, 0.1, 0.01, 0.001, 0.0001]

**b.** learning_rate = [0.1, 0.01, 0.001, 0.001]

**c.** regularizer = [l1, l2]

**d. **Store the returned dictionary for each and present it in the report.

**e.** Determine the best model (model selection) based on the overall performance (lowest average error). For the error_function argument of the kFold function (known as sFold function from previous assignment), use the “mse” function. For the model selection don’t augment the features. In other words, your model selection procedure should use the data matrix X as it is.

**11.** Evaluate your model on the test data and report the mean squared error

**12. **Using the best model plot the learning curve. Use the rmse values obtained from the “learning_curve” function to plot this curve.

**13. **Determine the best model hyperparameter values for the training data matrix with polynomial degree 3 and plot the learning curve. Use the rmse values obtained from the “learning_curve” function to plot this curve.

**14. **[Extra Credit for 478 and Mandatory for 878] Using the plot_polynomial_model_complexity function plot the rmse values for the training and validation folds for polynomial degree 1, 2, 3, 4 and 5. Use the training data as input for this function. You need to choose the hyperparameter values judiciously to work on the higher-degree polynomial models.

**15.** [Extra Credit for both 478 & 878] Implement the Stochastic Gradient Descent Linear Regression algorithm. Using cross-validation determine the best model.

**Part D: Written Report **

**16.** Describe whether or not you used feature scaling and why or why not.

**17.** Describe whether or not you dropped any feature and why or why not.

**18. **In the lecture we have studied two types of Linear Regression algorithm: closedform solution and iterative optimization. Which algorithm is more suitable for the current dataset? Justify your answer.

**19. **Would the batch gradient descent and the stochastic gradient descent algorithm learn similar values for the model weights? Justify your answer. Let’s say that you used a large learning rate. Would that make any difference in terms of learning the weights by both algorithms?

**20.** Consider the learning curve of your model (degree 1). What conclusion can you draw from the learning curve about (a) whether your model is overfitting/underfitting and (b) its generalization error. Justify your answer.

**21.** Consider the learning curve of the 3rd degree polynomial data matrix. What conclusion can you draw from the learning curve about (a) whether your model is overfitting/underfitting and (b) its generalization error. Justify your answer.

## Comments