
# Important Topics of Machine Learning

### Answer the below topics 1:

• Understand the measures that are used to evaluate the results of classification, describe what these are:

• Confusion matrix

• Precision, Recall, Accuracy rate

• Precision-Recall Curve

• ROC curve

• Explain in simple terms the concept of n-fold cross validation

Ans

Confusion Matrix

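A confusion matrix is a performance measurement for machine learning classification problems. It is an N×N matrix, where N is the number of target classes, and each cell counts how often a given actual class was assigned a given predicted class. For a binary classifier the four cells are:

TP: “true positive”, a positive instance correctly predicted as positive

FP: “false positive”, a negative instance incorrectly predicted as positive

FN: “false negative”, a positive instance incorrectly predicted as negative

TN: “true negative”, a negative instance correctly predicted as negative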
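As a minimal sketch, the four counts can be read from scikit-learn's `confusion_matrix()`. The labels below are made up for illustration (1 is the positive class, 0 the negative class).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (ordered 0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(cm)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```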

Precision, Recall, Accuracy rate

These metrics summarize the quality of a classifier. Using the counts from the confusion matrix above, they can be computed with the formulas below:

Precision = True positive/(True positive + False positive)

Recall = True positive/(True positive + False negative)

Accuracy Rate = (True positive + True negative)/(True positive + False positive + False negative + True negative)

The harmonic mean of precision and recall is a separate metric, the F1 score:

F1 Score = 2*((Precision * Recall)/(Precision + Recall))
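The sketch below computes precision and recall from confusion-matrix counts. It also computes accuracy (the fraction of all predictions that are correct) and the F1 score (the harmonic mean of precision and recall). The function name and the counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical counts, for illustration only
precision, recall, accuracy, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
```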

Precision-Recall Curve & ROC curve

Precision-Recall Curve

• These curves are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.

• This curve can be calculated in scikit-learn using the precision_recall_curve() function, which takes the true class labels and the predicted probabilities for the positive (typically the minority) class and returns the precision values, recall values, and thresholds.
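A minimal sketch of calling `precision_recall_curve()` on a tiny, made-up set of labels and scores. The returned precision and recall arrays have one more entry than the thresholds array, because scikit-learn appends a final point with precision 1 and recall 0.

```python
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Each threshold yields one (recall, precision) point on the curve
for p, r in zip(precision, recall):
    print(f"recall={r:.2f} precision={p:.2f}")
```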

ROC curve

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

• True Positive Rate

• False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

` TPR = TP/(TP + FN)`

False Positive Rate (FPR) is defined as follows:

` FPR = FP/(FP + TN)`
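The sketch below applies the TPR and FPR formulas at a single threshold (0.5). It then lets scikit-learn's `roc_curve()` compute the same two rates across all thresholds. The labels and scores are made up for illustration.

```python
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# TPR and FPR at one threshold, computed directly from the counts
threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in y_scores]
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
tpr_at_half = tp / (tp + fn)
fpr_at_half = fp / (fp + tn)
print("TPR:", tpr_at_half, "FPR:", fpr_at_half)

# The full ROC curve: one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(list(zip(fpr, tpr)))
```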

Explain in simple terms the concept of n-fold cross validation

Cross-validation is a technique for evaluating predictive models by partitioning the original sample into two parts:

• A training set, which is used to train the model,

• And a test set, which is used to evaluate it.

In n-fold cross-validation (more commonly called k-fold cross-validation), the original sample is randomly partitioned into k equal-size subsamples, or “folds”.

The steps are:

• Split your entire dataset into k “folds”

• Hold out one fold as the test set and build your model on the other k - 1 folds of the dataset

• Use the model to predict the held-out fold and record the error you see on those predictions

• Repeat this until each of the k folds has served as the test set

• The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model
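The steps above can be sketched with scikit-learn's `KFold`. The synthetic data, the choice of a linear regression model, and mean squared error as the recorded error are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (assumption): a noisy straight line
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=50)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kfold.split(X):
    # Build the model on k - 1 folds
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Record the error on the held-out fold
    preds = model.predict(X[test_idx])
    fold_errors.append(mean_squared_error(y[test_idx], preds))

# The cross-validation error is the average of the k recorded errors
cv_error = float(np.mean(fold_errors))
print("per-fold errors:", fold_errors)
print("cross-validation error:", cv_error)
```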

### Answer the below topics 2:

Linear Regression

• What is the cost function for Linear Regression

Polynomial Regression

• Describe how polynomial regression works based on linear regression.

Ans:

Linear Regression : Cost Function of Linear Regression

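• Linear Regression is a machine learning algorithm based on supervised learning. It is used to predict a real-valued output based on an input value.

• The cost function (F) of Linear Regression is usually the Mean Squared Error (MSE) between the predicted y value (pred) and the true y value (y). Its square root, the Root Mean Squared Error (RMSE), has the same minimum and is often reported because it is in the same units as y.

F = (1/n) * Σ (pred_i - y_i)^2, summed over i = 1 to n

Where pred_i is the predicted value, y_i is the actual value, and n is the number of training examples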
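A minimal sketch of the cost computation, assuming the mean squared error form of the cost. The function name `mse_cost` and the sample values are hypothetical.

```python
import numpy as np

def mse_cost(pred, y):
    """Mean squared error between predicted values and actual values."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((pred - y) ** 2))

# Hypothetical predictions and targets
pred = [2.0, 4.0, 6.0]
y = [1.0, 4.0, 8.0]

cost = mse_cost(pred, y)   # squared errors are 1, 0 and 4, so the mean is 5/3
rmse = cost ** 0.5         # root mean squared error, in the same units as y
print("MSE:", cost, "RMSE:", rmse)
```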

Polynomial Regression

Polynomial regression is a special case of linear regression where we fit a polynomial equation on the data with a curvilinear relationship between the target variable and the independent variables.

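Equation for Linear Regression:

Y = 𝜃0 + 𝜃1*x

where Y is the target, x is the predictor, 𝜃0 is the bias, and 𝜃1 is the weight in the regression equation.

This linear equation can only represent a linear relationship. In polynomial regression, we instead have a polynomial equation of degree n, represented as:

Equation for Polynomial Regression:

Y = 𝜃0 + 𝜃1*x + 𝜃2*x^2 + ... + 𝜃n*x^n

The powers x^2, ..., x^n are treated as additional input features, so the model is still linear in the weights 𝜃0, ..., 𝜃n and can be fitted with the same cost function and training procedure as linear regression.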
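One common way to fit a polynomial with a linear model is to expand x into its powers and run ordinary linear regression on the expanded features. The sketch below does this with scikit-learn's `PolynomialFeatures`; the noise-free quadratic data are an assumption chosen so the recovered weights are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): y = 1 + 2x + 3x^2 with no noise
x = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# Expand the single predictor x into the features [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

# An ordinary linear regression on the expanded features
model = LinearRegression().fit(X_poly, y)
print("bias:", model.intercept_)
print("weights for x and x^2:", model.coef_)
```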

### Answer the below topics 3:

Logistic Regression

• The formula that updates the weights of attributes for each iteration

Softmax Regression

• What is the purpose of Softmax Regression?

• Given a softmax Regression model, please calculate the probability that the input attribute belongs to each class.

Support Vector Machine

• Compare with logistic regression, what is the advantage of Support Vector Machine?

Ans:

Logistic Regression

The formula that updates the weights of attributes for each iteration

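Logistic regression uses an equation as the representation, very much like linear regression. Input values (x) are combined linearly using weights or coefficient values and passed through the sigmoid function to predict the probability of the positive class (y = 1).

Formula:

p = 1/(1 + e^(-(w0 + w1*x1 + ... + wn*xn)))

When the weights are trained with gradient descent on the log loss, each iteration updates every weight w_j as:

w_j = w_j - α * (1/m) * Σ (p_i - y_i) * x_ij, summed over the m training examples

where α is the learning rate, p_i is the predicted probability for example i, y_i is its true label (0 or 1), and x_ij is the value of attribute j for example i (with x_i0 = 1 for the bias w0).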
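A minimal sketch of one weight update for logistic regression, assuming batch gradient descent on the log loss with learning rate `alpha`. The function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, y, alpha):
    """One batch gradient-descent update of the weights w and bias b."""
    m = X.shape[0]
    p = sigmoid(X @ w + b)          # predicted probabilities
    error = p - y                   # prediction minus true label
    w_new = w - alpha * (X.T @ error) / m
    b_new = b - alpha * error.mean()
    return w_new, b_new

# Toy data (assumption): one attribute, label 1 for larger values
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = np.zeros(1), 0.0
w1, b1 = gradient_step(w, b, X, y, alpha=0.1)
print("weights after one iteration:", w1, "bias:", b1)
```

Repeating `gradient_step` until the weights stop changing gives the trained model.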

Softmax Regression

Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification.

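In softmax regression (SMR), we replace the sigmoid logistic function by the so-called softmax function φ:

P(y = j | z) = φ(z)_j = e^(z_j) / (e^(z_1) + e^(z_2) + ... + e^(z_K))

where K is the number of classes and we define the net input z_j for class j as

z_j = w_j0 + w_j1*x1 + ... + w_jm*xm

The softmax outputs are all between 0 and 1 and sum to 1, so they can be read as the probability that the input belongs to each class.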
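The sketch below calculates the class probabilities for one input under a softmax regression model. The weight matrix `W`, bias vector `b`, and input `x` are hypothetical values chosen so that the net inputs come out as 2.0, 1.0 and 0.1.

```python
import numpy as np

def softmax(z):
    """Turn a vector of net inputs into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())   # subtracting the max avoids overflow
    return exp_z / exp_z.sum()

def predict_proba(W, b, x):
    """Class probabilities for input x; W has one row of weights per class."""
    z = W @ x + b                 # net input for each class
    return softmax(z)

# Hypothetical model with 3 classes and 2 input attributes
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
b = np.array([0.0, 0.0, 0.1])
x = np.array([2.0, 1.0])

probs = predict_proba(W, b, x)    # net inputs are [2.0, 1.0, 0.1]
print("class probabilities:", probs)
print("predicted class:", int(np.argmax(probs)))
```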

Support Vector Machine

• Logistic regression and support vector machines are supervised machine learning algorithms. They are both used to solve classification problems.

• SVM tries to find the “best” margin, the widest possible gap between the decision boundary and the closest training points of each class, which reduces the risk of error on unseen data. Logistic regression does not do this; it can end up with different decision boundaries, with different weights, that are all near the optimal point.

Advantages of Support Vector Machine (SVM)

1. Regularization capabilities: SVM has L2 Regularization feature. So, it has good generalization capabilities which prevent it from over-fitting.

2. Handles non-linear data efficiently: SVM can efficiently handle non-linear data using Kernel trick.

3. Solves both Classification and Regression problems: SVM can be used to solve both classification and regression problems. SVC (Support Vector Classification) is used for classification problems while SVR (Support Vector Regression) is used for regression problems.
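As a rough illustration of the kernel trick advantage, the sketch below fits an RBF-kernel SVM and a plain logistic regression on two concentric circles, a dataset that no straight line can separate. The dataset and its parameters are assumptions chosen for illustration, and accuracy is measured on the training data only.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic non-linear data (assumption): a small circle inside a larger one
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# SVM with an RBF kernel can learn a circular decision boundary
svm_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

# Logistic regression is limited to a straight-line boundary here
lr_acc = LogisticRegression().fit(X, y).score(X, y)

print("SVM (RBF kernel) accuracy:", svm_acc)
print("Logistic regression accuracy:", lr_acc)
```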

### Answer the below topics 4:

Decision Tree

• How a Decision Tree Model is trained?

• How to make a prediction on a new instance and how to calculate the prediction probability?

• What is Gini Impurity Measure?

• How to calculate Gini Impurity Measure?

• What is Regularization?

• What are the typical ways to regularize a tree model?

Random Forest

• How a Random Forest is trained?

Ans:

Decision Tree

How a Decision Tree Model is trained?

The basic steps used before training the decision tree (Step 1 to Step 3), followed by the training step itself (Step 4), are:

Step 1: Importing the Libraries and Dataset

Example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Importing dataset
```

Step 2: Data Preprocessing

The most important part of Data Science is data preprocessing and feature engineering.

In this step we deal with the categorical variables in the data and impute the missing values.

Step 3: Creating Train and Test Sets

In this step we select the target variable and split the dataset into a training set and a test set, so that the model can be evaluated on data it has not seen.

Step 4: Building and Evaluating the Model (Train Model)

Now that we have both the training and testing sets, it’s time to train our models and classify data. First, we will train a decision tree on this dataset:

Example:

```python
from sklearn.tree import DecisionTreeClassifier

# X_train, Y_train: training features and labels from Step 3
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X_train, Y_train)
# Predict on the training set
dt_pred_train = dt.predict(X_train)
```
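To make a prediction on a new instance, a trained tree routes the instance from the root down to a leaf by checking one feature threshold at each node. The predicted class is the majority class of the training samples in that leaf, and the prediction probability for each class is the fraction of those samples that belong to it. The sketch below shows this with scikit-learn; the iris dataset and the measurements of the new instance are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Example dataset (assumption): iris, with 3 classes and 4 features
X, y = load_iris(return_X_y=True)

dt = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
dt.fit(X, y)

# Hypothetical new instance: sepal length, sepal width, petal length, petal width
new_instance = [[5.0, 3.4, 1.5, 0.2]]

# Majority class of the leaf the instance falls into
predicted_class = int(dt.predict(new_instance)[0])

# Fraction of training samples of each class in that leaf
class_probs = dt.predict_proba(new_instance)[0]

print("predicted class:", predicted_class)
print("class probabilities:", class_probs)
```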

What is Gini Impurity Measure?

Gini Impurity measures the disorder of a set of elements. It is calculated as the probability of mislabeling an element, assuming that the element is randomly labeled according to the distribution of all the classes in the set.

Formula:

GI = 1 - (p1^2 + p2^2 + ... + pn^2)

Where p1, p2, ..., pn are the probabilities of class 1, class 2, ..., class n.

How to calculate Gini Impurity Measure?

Suppose we are given 3 apples, 3 bananas and 6 cherries. The GI is then calculated as follows:

| | apples | bananas | cherries |
|---|---|---|---|
| count | 3 | 3 | 6 |
| p | 3/12 = 1/4 | 3/12 = 1/4 | 6/12 = 1/2 |

GI = 1 - [ (1/4)^2 + (1/4)^2 + (1/2)^2 ]

= 1 - [ 1/16 + 1/16 + 1/4 ]

= 1 - 6/16

= 10/16

= 0.625
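The same calculation as a short function, offered as a sketch; the name `gini_impurity` is hypothetical.

```python
def gini_impurity(counts):
    """Gini impurity of a set, given the number of elements in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# 3 apples, 3 bananas and 6 cherries
gi = gini_impurity([3, 3, 6])
print("GI =", gi)
```

A set that contains only one class has a Gini impurity of 0, for example `gini_impurity([5, 0])`.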

What is Regularization?

Regularization reduces the effective complexity of a model, for example by constraining a regression function without actually reducing the degree of the underlying polynomial. In other words, it is an attempt to solve the overfitting problem in statistical models.

What are the typical ways to regularize a tree model?

There are several simple regularization methods:

• minimum number of points per cell: require that each cell (i.e., each leaf node) covers a given minimum number of training points.

• maximum number of cells: limit the maximum number of cells of the partition (i.e., leaf nodes).

• maximum depth: limit the maximum depth of the tree
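In scikit-learn, these three limits correspond to the `min_samples_leaf`, `max_leaf_nodes` and `max_depth` parameters of `DecisionTreeClassifier`. The sketch below sets all three; the iris dataset and the specific limit values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Example dataset (assumption): iris
X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_leaf=5,   # minimum number of training points per leaf
    max_leaf_nodes=8,     # maximum number of leaves
    max_depth=3,          # maximum depth of the tree
    random_state=42,
)
tree.fit(X, y)

print("depth:", tree.get_depth())
print("number of leaves:", tree.get_n_leaves())
```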

How a Random Forest is trained?

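A random forest is trained by fitting many decision trees, each on a random bootstrap sample of the training data and with a random subset of features considered at each split; the forest then predicts by majority vote of its trees. The example below trains a random forest in scikit-learn: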

Example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# X_train, Y_train: training features and labels from Step 3
rfc = RandomForestClassifier(criterion='entropy', random_state=42)
rfc.fit(X_train, Y_train)

# Evaluating on Training set
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1-Score=>', f1_score(Y_train, rfc_pred_train))
```
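For a version that runs on its own, the sketch below trains a small forest on the iris dataset (an assumption, as are the parameter values). `bootstrap=True` gives each tree its own random sample of the training data, and `max_features="sqrt"` limits the number of features each split may consider.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Example dataset (assumption): iris
X, y = load_iris(return_X_y=True)

rfc = RandomForestClassifier(
    n_estimators=25,       # number of decision trees in the forest
    bootstrap=True,        # each tree is fitted on a bootstrap sample
    max_features="sqrt",   # random subset of features tried at each split
    random_state=42,
)
rfc.fit(X, y)

n_trees = len(rfc.estimators_)      # the individual fitted trees
train_accuracy = rfc.score(X, y)    # accuracy on the training data
print("trees:", n_trees, "training accuracy:", train_accuracy)
```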