
Search Results


  • Research Paper Implementation : Sequence to Sequence Learning with Neural Networks.

    ABSTRACT Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labelled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimisation problem easier. To download the full research paper, click on the link below. If you need an implementation of this research paper or any of its variants, feel free to contact us at contact@codersarts.com.
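The architecture the abstract describes maps onto only a few lines of code. Below is a minimal PyTorch sketch of the idea; the vocabulary sizes, embedding and hidden dimensions, and the 4-layer depth are illustrative assumptions rather than the paper's exact configuration, and the paper's trick of reversing the source sentence appears as a single torch.flip call.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """One LSTM encodes the (reversed) source into its final hidden state,
    and a second LSTM decodes the target sequence from that fixed-size state."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt):
        src = torch.flip(src, dims=[1])                 # reverse the source words, as in the paper
        _, state = self.encoder(self.src_emb(src))      # (h, c) = fixed-dimensional summary of the source
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)                        # logits over the target vocabulary

model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)       # toy vocabulary sizes
logits = model(torch.randint(0, 10000, (8, 15)),        # a batch of 8 source sentences
               torch.randint(0, 10000, (8, 12)))        # the shifted target sentences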

  • Predicting Cab Supply and Demand (Cab Booking System)

    About the project: Cab booking system is the process where renting a cab is automated through an app throughout a city. Using this app, people can book a cab from one location to another location. Being a cab booking app company, exploiting the understanding of cab supply and demand could increase the efficiency of their service and enhance the user experience by minimizing waiting time. Objective of this project is to combine historical usage patterns along with open data sources like weather data to forecast cab booking demand in a city. You will be provided with an hourly renting data span of two years. Data is randomly divided into train and test sets. You must predict the total count of cabs booked in each hour covered by the test set, using the information available prior to the booking period. You need to append the train_label dataset to train.csv as the ‘Total_booking’ column. Please find the descriptions of the columns present in the dataset below. datetime - hourly date + timestamp season - spring, summer, autumn, winter holiday - whether the day is considered a holiday workingday - whether the day is neither a weekend nor holiday weather - Clear , Cloudy, Light Rain, Heavy temp - temperature in Celsius atemp - "feels like" temperature in Celsius humidity - relative humidity windspeed - wind speed Total_booking - number of total booking DATASET The recommended datasets will be shared. You can download them from the LMS TASKS Following are the tasks, which need to be developed while executing the project: Task 1: 1. Visualize data using different visualizations to generate interesting insights. 2. Outlier Analysis 3. Missing value analysis 4. Visualizing Total_booking Vs other features to generate insights 5. Correlation Analysis Task 2: 1. Feature Engineering 2. Grid search 3. Regression Analysis 4. Ensemble Model Solution: Task 1: import pandas as pd # Here, I have append the rows of train dataset and test dataset df11=pd.read_csv('C:/Project_1_dataset/Dataset/train.csv') df111=pd.read_csv('C:/Project_1_dataset/Dataset/test.csv') df1=df11.append(df111,ignore_index=True) # Here, I have append the rows of train_label dataset and test_label dataset df22=pd.read_csv('C:/Project_1_dataset/Dataset/train_label.csv', header=None, names=['Total_Booking']) df222=pd.read_csv('C:/Project_1_dataset/Dataset/test_label.csv', header=None, names=['Total_Booking']) df2=df22.append(df222,ignore_index=True) # Here, I have Concatenate the columns of df1 and df2 dataset to get complete dataset. df = pd.concat([df1, df2], axis=1) df1=df you can change the path and add your own path where you can add these train and test datasets. # here, I have drop the duplicates rows from the dataset df. df=df.drop_duplicates() df Identifying Missing Values # Here, I have checked the Missing value in the dataset df.isnull().sum() Output: datetime 0 season 0 holiday 0 workingday 0 weather 0 temp 0 atemp 0 humidity 0 windspeed 0 Total_Booking 0 dtype: int64 # Here, I have checked the datatype of the given Columns. df.dtypes Output: datetime object season object holiday int64 workingday int64 weather object temp float64 atemp float64 humidity int64 windspeed float64 Total_Booking int64 dtype: object Outliers Analysis #Here, I have checked the outliers in the windspeed column through the boxplot graph. import seaborn as sns sns.boxplot(x=df['windspeed']) #Here, I have checked the outliers in the Humidity column through the boxplot graph. 
import seaborn as sns sns.boxplot(x=df['humidity']) As like above, you can get all outliers of each dataset columns: # Here, I have checked the outliers in the windspeed vs Total_Booking column through the scatter graph. import matplotlib.pyplot as plt import matplotlib fig, ax = plt.subplots(figsize=(16,8)) ax.scatter(df['windspeed'], df['Total_Booking']) ax.set_xlabel('Wind Speed') ax.set_ylabel('Total Booking') plt.show() Output: # Here, I am finding the zscore values of the numerical columns of the dataset df. from scipy import stats import numpy as np z = np.abs(stats.zscore(df[['windspeed','temp','atemp','humidity','Total_Booking']])) print(z) Output: [[0.5142603 0.24503701 0.24839078 0.78535767 1.7248125 ] [0.75963759 1.08700912 1.14227927 0.88928536 1.03002161] [1.12729319 1.85989326 2.07630932 0.61766615 0.29024652] ... [0.8819159 0.17594904 0.10975464 0.14999154 0.17983232] [0.46560752 0.38644207 0.28853234 1.66874304 0.89752458] [0.5142603 1.29750214 1.32105696 0.21375537 0.1790138 ]] # Here, I have checked where thresold is greater than 3. threshold = 3 print(np.where(z > 3)) # Here, I have drop the rows where zscore is less than 3 to remove the outliers of the dataset df df = df[(z < 3).all(axis=1)] # After removing the outliers, The dataset df is : df Output: # Here, You have to see in the boxplot graph that maximum outliers are remove from the windspeed column. Only three left import seaborn as sns sns.boxplot(x=df['windspeed']) Output: As like above, you can remove all outliers from other columns. Visualize data using different visualizations to generate interesting insights. # Show value counts for a weather Column of dataset df_o: import matplotlib.pyplot as plt import seaborn as sbn sbn.countplot(x='weather',data=df) plt.xticks(rotation=90) plt.show() Output: # Show value counts for a season Column and weather Column of dataset df_o: import seaborn as sbn sbn.countplot(x='season',data=df,hue='weather') plt.show() Output: # Show value counts for a season Column and Holiday Column of dataset df_o where 0 value represent No Holiday and 1 value represent Holiday: import seaborn as sbn sbn.countplot(x='season',data=df,hue='holiday') plt.show() Output: # Show value counts for a season Column and WorkingDay Column of dataset df_o where 0 value represent No WorkingDay and 1 value represent WorkingDay: import seaborn as sbn sbn.countplot(x='season',data=df,hue='workingday') plt.show() Output: # Here, I have draw the Pie chart between HoliDay, WorkingDay And No Holiday No Workingday to check the status in percentile. Not_Holiday_Not_workingday=df[(df.holiday==0) & (df.workingday==0)].shape[0] print('No working and No Holiday = ', Not_Holiday_Not_workingday) Holiday=df[(df.holiday==1)].shape[0] print('Total HoliDay = ', Holiday) WorkingDay=df[(df.workingday==1)].shape[0] print('Total Working Day = ', WorkingDay) plt.pie(x=[Not_Holiday_Not_workingday,WorkingDay,Holiday],labels=['No Holiday & No workingday','WorkingDay','Holiday'],explode=(.1,.1,.1),colors=['g','r','b'],autopct='%.2f',wedgeprops={'edgecolor':'k'}) plt.show() Output: Fetch specified value from dataset column value #Here, I have fetch the Months from the datetime column df1['booking_month'] = pd.to_datetime(df1.datetime, format='%m/%d/%Y %H:%M').dt.month_name() #Here, I have fetch the Days from the datetime column df1['booking_day'] = pd.to_datetime(df1.datetime, format='%m/%d/%Y %H:%M').dt.day_name() #The time of departure is in 24 hours format(22:20), we would like to bin it to get insights. 
#Here, I have decided to group hours into 4 bins. [0–5], [6–11], [12–17] and [18–23] are the 4 bins. df1['timing'] = pd.to_datetime(df1.datetime, format='%m/%d/%Y %H:%M') a = df1.assign(dept_session=pd.cut(df1.timing.dt.hour,[0,6,12,18,24],labels=['Night','Morning','Afternoon','Evening'])) df1['booking_session'] = a['dept_session'] #Here, I have fetch the Year from the datetime column l2=[] for i in range(0,df1.shape[0]): l2.append(df1.datetime[i][df1.datetime[i].rindex('/')+1:df1.datetime[i].rindex('/')+5]) df1['year']=pd.DataFrame(l2,columns=['year']) df1 Output: # Show value counts for a Months Column of dataset df. import matplotlib.pyplot as plt import seaborn as sbn sbn.countplot(x='booking_month',data=df1) plt.xticks(rotation=90) plt.show() Output: # Show value counts for a Day Column of dataset df. import matplotlib.pyplot as plt import seaborn as sbn sbn.countplot(x='booking_day',data=df1) plt.xticks(rotation=90) plt.show() Output: # Show value counts for a Year Column and booking_session column of dataset df. import matplotlib.pyplot as plt import seaborn as sbn sbn.countplot(x='year',data=df1,hue='booking_session') plt.xticks(rotation=90) plt.show() Output: Visualizing Total_booking Vs other features to generate insights # Show the Line Plot Between booking_Month vs Total_Booking column import matplotlib.pyplot as plt import seaborn as sbn sbn.lineplot(x="booking_month", y="Total_Booking", data=df) plt.xticks(rotation=90) Output: # Show the Line Plot Between booking_day vs Total_Booking column. import matplotlib.pyplot as plt import seaborn as sbn sbn.lineplot(x="booking_day", y="Total_Booking", data=df) plt.xticks(rotation=90) Output: Correlation Analysis: df.corr('spearman') Task 2 : Feature Engineering import pandas as pd # Here, I have append the rows of train dataset and test dataset df11=pd.read_csv('C:/Project_1_dataset/Dataset/train.csv') df111=pd.read_csv('C:/Project_1_dataset/Dataset/test.csv') df1=df11.append(df111,ignore_index=True) # Here, I have append the rows of train_label dataset and test_label dataset df22=pd.read_csv('C:/Project_1_dataset/Dataset/train_label.csv', header=None, names=['Total_Booking']) df222=pd.read_csv('C:/Project_1_dataset/Dataset/test_label.csv', header=None, names=['Total_Booking']) df2=df22.append(df222,ignore_index=True) # Here, I have Concatenate the columns of df1 and df2 dataset to get complete dataset. df = pd.concat([df1, df2], axis=1) df Output: # Here, I have checked the datatype of the given Columns. df.dtypes Output: datetime object season object holiday int64 workingday int64 weather object temp float64 atemp float64 humidity int64 windspeed float64 Total_Booking int64 dtype: object Contact us to get a complete solution or need any other related machine learning project help, then you can contact us at below contact detail: contact@codersarts.com
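The excerpt of the solution stops after re-checking the dtypes at the start of Task 2. As a rough illustration of what the remaining grid search, regression and ensemble steps could look like, here is a minimal sketch; it assumes the dataframe df rebuilt at the start of Task 2 (the ten original columns), and the choice of a random forest, the parameter grid and the one-hot encoding are assumptions for illustration, not the original solution.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# One-hot encode the categorical columns and drop the raw timestamp.
features = pd.get_dummies(df.drop(columns=['datetime', 'Total_Booking']),
                          columns=['season', 'weather'])
target = df['Total_Booking']

X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.3, random_state=42)

# Grid search over a small, illustrative parameter grid for an ensemble regressor.
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5,
                    scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, grid.predict(X_test))))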

  • Classification of malignant and benign cells.

    Given a data set containing features of cells, we need to determine whether a cell is malignant or benign. This is a classification problem, a type of supervised learning: the model learns to categorise the data into different classes and predicts the class of new data passed as input. We will use the sklearn module, as it provides a range of supervised and unsupervised learning algorithms and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. First, we import the various libraries that we will need along with the data from sklearn.datasets. import sklearn import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.datasets import load_breast_cancer from sklearn.metrics import roc_curve, auc from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import GridSearchCV from sklearn.pipeline import Pipeline import seaborn as sns from sklearn.preprocessing import StandardScaler We also imported pandas to deal with dataframes, NumPy to work with arrays (it also supports various mathematical functions), and matplotlib and seaborn for data visualisation. Now, we need to load the data and convert it into a dataframe, as that makes the data easier to manipulate and analyse. data = load_breast_cancer() a=np.c_[data.data, data.target] columns = np.append(data.feature_names, ["target"]) df_cancer=pd.DataFrame(a,columns=columns) df_cancer.head() The head() function shows the top 5 rows of our dataframe; we used it to see whether the data has been correctly converted into a dataframe with the right column names and to get a better understanding of our data. The dataframe looks like this: There are 10 different cell nuclei parameters: Radius: Distance from the centre to the perimeter. Perimeter: The total distance between the boundary points, giving the size of the core tumour. Area: Area of the cancer cells. Smoothness: The local variation in the radius lengths, given by the difference between each radial length and the mean length of the lines around it. Compactness: An estimate combining perimeter and area, given by (perimeter^2 / area - 1.0). Concavity: The severity of the concave portions of the contour. Smaller chords capture small concavities better, so this feature is affected by chord length. Concave points: While concavity measures the magnitude of contour concavities, concave points measures their number. Symmetry: The longest chord is taken as the major axis, and the length differences between lines perpendicular to the major axis on either side are measured; this is known as the symmetry. Fractal dimension: A measure of non-linear growth. As the ruler used to measure the perimeter increases, the precision decreases and hence the measured perimeter decreases. This data is plotted on a log scale and the downward slope gives an approximation of the fractal dimension. Texture: The standard deviation of the gray-scale values, which helps capture the variation. Higher values of the shape features imply an irregular contour, which in turn implies a malignant cell. The worst and error values are included because only a few malignant cells may be present in a given sample; these values correlate better with malignancy, and since surgery depends on the size of the tumour the worst values are necessary. The target value is zero for malignant and one for benign. We divide the data into two classes: Malignant and Benign.
Malignant=df_cancer[df_cancer['target'] ==0] Benign=df_cancer[df_cancer['target'] ==1] We divide the feature names into three categories: mean, error and worst. mean_features= ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension'] error_features=['radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error'] worst_features=['worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension'] We will create a function to plot histograms with 10 subplots. bins = 20 #Number of bins is set to 20, bins are specified to divide the range of values into intervals def histogram(features): plt.figure(figsize=(10,15)) for i, feature in enumerate(features): plt.subplot(5, 2, i+1) #subplot function: the number of rows are given as 5 and number of columns as 2, the value i+1 gives the subplot number, subplot numbers start with 1 sns.distplot(Malignant[feature], bins=bins, color='red', label='Malignant'); sns.distplot(Benign[feature], bins=bins, color='green', label='Benign'); plt.title(str(' Density Plot of: ')+str(feature)) plt.xlabel('X variable') plt.ylabel('Density Function') plt.legend(loc='upper right') plt.tight_layout() plt.show() We call the function for mean, error and worst features of the malignant and benign cells. histogram(mean_features) histogram(error_features) histogram(worst_features) We will now write a function to plot a ROC (Receiver Operating Characteristics). It is a measure of performance for classification problems at various threshold points. This curve is plotted True positive rate and False positive rate. Larger the area under the curve, better the model at distinguishing between two classes which in our case are malignant and benign. def ROC_curve(X,Y,string): X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4) # Splitting the data for training and testing in 60/40 ratio model=LogisticRegression(solver='liblinear') #Using logistic regression model model.fit(X_train,y_train) probability=model.predict_proba(X_test) #Predicting probability fpr, tpr, thresholds = roc_curve(y_test, probability[:,1]) #False positive rate, True Positive Rate and Threshold is returned using this function roc_auc = auc(fpr, tpr) #The area under the curve is given by this function plt.figure() plt.plot(fpr, tpr, lw=1, color='green', label=f'AUC = {roc_auc:.3f}') plt.plot([0,1],[0,1],linestyle='--',label='Baseline') #Plotting the baseline plt.title(string) plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate ') plt.legend() plt.show() ROC_curve(df_cancer[mean_features],df_cancer['target'],'ROC for mean features ') ROC_curve(df_cancer[error_features],df_cancer['target'],'ROC for error features') ROC_curve(df_cancer[worst_features],df_cancer['target'],'ROC for worst features') An excellent model has area under the curve near to 1 which means it has good measure of separability. In the above ROC curves we can see that mean and worst features show high accuracy. Therefore, we will not take the error features in consideration. Also, in the histograms plotted above we see that there is an overlapping between the features of malignant and benign cells. 
In order to make our model to distinguish better we need to select features with the least overlapping values. The top 5 features according to this are: worst area worst perimeter worst radius mean concave points mean concavity We will save these feature in a list called imp_features: imp_features=['worst area','worst perimeter','worst radius','mean concave points','mean concavity'] The mean of all the instances of all features for both Benign and Malignant classes are: m_feature_space=Malignant.mean(axis=0) b_feature_space=Benign.mean(axis=0) Now we, concatenate the two dataframes one with Benign and one with Malignant and calculate the mean between the values corresponding to the same features. z=pd.concat([m_feature_space,b_feature_space],axis=1) analysis_point=z.mean(axis=1) analysis_point.head() #Analysis point The output is: mean radius 14.804677 mean texture 19.759834 mean perimeter 96.720392 mean area 720.583306 mean smoothness 0.097688 dtype: float64 Creating X and Y, where X has all the features and Y contains target: X=df_cancer.drop(['target'],axis=1) Y=df_cancer['target'] Now, our data has been preprocessed and is ready to be trained. But before that, since we have an imbalanced data set due to only a few number of malignant cells, we will use the SMOTE (Synthetic Minority Oversampling TEchnique) from imbalanced-learn module. A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. This technique will create new examples from the minority class (i.e. malignant cells). SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line. We will split our data into training and test set and then we will build a pipeline from the imblearn library as it can also include oversampling technique (SMOTE in this case). Using this pipeline we will implement StandardScaler,SMOTE, and DecisionTreeClassifier on our data. Also, we will use GridSearch for the hyper-parameter search of the features max_depth and min_leaf for DecisionTreeClassifier. The program code is given below: from imblearn.over_sampling import SMOTE from imblearn.pipeline import Pipeline max_depth = list(range(1,24)) min_leaf=list(range(1,20)) params = [{'classifier__max_depth':max_depth,'classifier__min_samples_leaf':min_leaf}] #Defining parameters for the grid search X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4) pipe=Pipeline([('sc',StandardScaler()),('smt',SMOTE()),('classifier',DecisionTreeClassifier(random_state=0,min_samples_split=6,max_features=10))]) #Creating a pipeline grid_search_cv = GridSearchCV(pipe,params,scoring='accuracy',refit=True, verbose=1,cv=5) #Grid Search function which will put different combinations of the parameters grid_search_cv.fit(X_train,y_train) Specified a value for max_features, max_features gives us how many features should be taken a time when taking the best split, if we have too many features it will have computationally heavy. Taking the value of 10, as we have 30 features. The min_samples_split is used to control overfitting, the ideal value for it should be between 1 to 40.If the value is too low we see overfitting. Decison Trees don't generally require scaling but we used it here to compare the decision tree with SVM. There is no drastic change in decision trees with scaling. 
The different class sizes might result in bias, although the difference is not very huge, it still is better to have a balanced class data set. we applied oversampling using the smote function to solve this problem. The max_depth for a decision tree should be equal to or less than the square-root of the instances for most optimum case, hence we choose the range of 1 to 24. If the depth is too large we see over-fitting and if too low we see under-fitting. The min_samples_leaf gives the minimum samples to become a leaf node. Too low value will give over-fitting and too large value will make it computationally expensive, hence we take the range to be 1 to 20. The output is as follows: GridSearchCV(cv=5, error_score=nan, estimator=Pipeline(memory=None, steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('smt', SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1, out_step='deprecated', random_state=None, ratio=None, sampling_strategy='auto', svm_estimator='deprecated')), ('classifier', DecisionTreeClassi... presort='deprecated', random_state=0, splitter='best'))], verbose=False), iid='deprecated', n_jobs=None, param_grid=[{'classifier__max_depth': [1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,23], 'classifier__min_samples_leaf': [1, 2, 3, 4, 5, 6,7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}], pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='accuracy', verbose=1) Finding the best model from grid search model=grid_search_cv.best_estimator_ We will check the accuracy of the model: from sklearn.metrics import accuracy_score model.fit(X_train,y_train) #Fitting the model test_pred = model.predict(X_test) print(accuracy_score(y_test, test_pred)) #accuracy score function, to print the accuracy of the model y_test.value_counts() The output is: 0.9429824561403509 1.0 136 0.0 92 Name: target, dtype: int64 We will define a variable params to save the parameters of the model: params=model.get_params() Now, we will produce the confusion matrix and classification report. from sklearn.metrics import confusion_matrix from sklearn.metrics import classification_report matrix=np.array(confusion_matrix(y_test,test_pred,labels=[0,1])) #Creating confusion matrix pd.DataFrame(matrix,index=['Cancer','No Cancer'],columns=['Predicted_Cancer','Predicted_No_Cancer']) #Labelling the matrix print(classification_report(y_test, test_pred)) #The classification report gives precision,recall and f1 score Failing to detect a sample which has cancer means we look at the intersection of cancer and predicted no cancer. The number is 4 out of 91. When a person has cancer and it is detected as no cancer, the chance of happening so is 0.043. The weakness of the classifier is the computational overhead. The strength of the classifier shows good accuracy and chances of detecting a sample as no cancer while it is cancer is not very much which is desired in this case. We will now plot the decision tree and train our data on it. We will plot the scatter plots of imp_features. 
    from sklearn import tree plt.figure(figsize=(40,40)) tree.plot_tree(model['classifier']) #function used to plot decision tree Output: tree.plot_tree returns a list of matplotlib Text annotations, one per node, giving each node's split condition, Gini index, sample count and class counts (the root node, for example, reads 'X[7] <= 0.128, gini = 0.5, samples = 440, value = [220, 220]'); the rendered flowchart is interpreted below. It is a tree flowchart: each observation splits according to some feature, and there are two ways to go from each node, one way if the condition is true and the other if it is false. The first line (here X[7]) gives the feature and the value it is compared against. The second row gives the value of the Gini index at that node; a Gini index of 0 means the node is pure and we get a definite class. The samples row gives the number of samples being considered, and the value row gives the number of samples in each class. At every node all the features are considered, but the feature which gives the best Gini index is chosen. clf=DecisionTreeClassifier(random_state=0,min_samples_leaf=2,min_samples_split=6,max_depth=11) #Replicating the decision tree classifier; our earlier classifier had max_features=10, which cannot be applied here because only 2 features are used at a time k=1 plt.figure(figsize=(20,40)) for i in range(0,4): for j in range(1,5): inp=pd.concat([X[imp_features[i]],X[imp_features[j]]],axis=1) #Taking data from two features clf.fit(inp,Y) plt.subplot(4, 4, k) k=k+1 plt.scatter(X[imp_features[i]], X[imp_features[j]], c=Y, s=30) #Creating scatter plot ax = plt.gca() xlim = ax.get_xlim() ylim = ax.get_ylim() xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),np.linspace(ylim[0], ylim[1], 50)) #Creating a meshgrid of data points Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) #Predicting the class of every grid point so the decision regions can be shaded Z = Z.reshape(xx.shape) plt.contourf(xx, yy, Z, alpha=0.5, cmap=plt.cm.Paired) #Shading the predicted decision regions plt.title(str(imp_features[i])+' & '+str(imp_features[j])) We will now write code to print the important features according to the grid search and compare them with the important features we selected at the beginning of the program.
feat_importances = pd.Series(model['classifier'].feature_importances_, index=X.columns) #function to save the most important features feat_importances = feat_importances.nlargest(5) #as we need only 5 features nlargest() is used feat_importances.plot(kind='barh',figsize=(12,8),title='Most Important Features') #plotting bar graph imp_features=list(feat_importances.index) print(feat_importances) Output: mean concave points 0.756231 worst area 0.138760 worst texture 0.072523 worst radius 0.019084 mean texture 0.010388 dtype: float64 As we can see that the important features predicted by us and the grid search are the same. Hence, it will be safe to say that the program is running as expected. Now, let's train our data using SVM (Support Vector Machine). It is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs. Firstly, we will use GridSearch to select the best values of C and gamma. Also, we will use a pipeline to implement StandardScaler, SMOTE, and SVM Classifier on our data. from sklearn.svm import SVC from imblearn.over_sampling import SMOTE from imblearn.pipeline import Pipeline c=[0.01,0.1,1,10] gamma=[0.01,0.1,1,10] params = [{'classifier__C':c,'classifier__gamma':gamma}] #Setting the parameters X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4) pipe=Pipeline([('sc',StandardScaler()),('smt',SMOTE()),('classifier',SVC(kernel='rbf'))]) #Creating the pipeline grid_search_cv = GridSearchCV(pipe,params,refit=True, verbose=1,cv=5) grid_search_cv.fit(X_train,y_train) Ouput: GridSearchCV(cv=5, error_score=nan, estimator=Pipeline(memory=None, steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('smt', SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1, out_step='deprecated', random_state=None, ratio=None, sampling_strategy='auto', svm_estimator='deprecated')), ('classifier', SVC(C=1.0, break_ti... decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))], verbose=False), iid='deprecated', n_jobs=None, param_grid=[{'classifier__C': [0.01, 0.1, 1, 10], 'classifier__gamma': [0.01, 0.1, 1, 10]}], pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring=None, verbose=1) In the output, both the C value and gamma are in the range of 10. C is a regularization parameter so we choose the range as 0.01, 0.1,1 and 10. Similarly for gamma we choose 0.01, 0.1, 1 and 10. Value less than 0.01 would have been too low and value more than 10 would have been too high. Hence, we choose this range. Gridsearch CV gives the best combination of these two features. We will check the accuracy of the model: svc=grid_search_cv.best_estimator_ #Saving the best estimator svc.fit(X_train,y_train) test_pred = svc.predict(X_test) print(accuracy_score(y_test, test_pred)) Output: 0.9868421052631579 We will create a confusion matrix and classification report for this model as well. 
from sklearn.metrics import confusion_matrix from sklearn.metrics import classification_report matrix=np.array(confusion_matrix(y_test,test_pred,labels=[0,1])) pd.DataFrame(matrix,index=['Cancer','No Cancer'],columns=['Predicted_Cancer','Predicted_No_Cancer']) print(classification_report(y_test, test_pred)) People that have cancer but are predicted with no cancer are 3 out of 91 in this model, which is better than the decision tree classifier. The chances of failing to detect cancer is 0.03. The advantage of support vector classifier is it is relatively more efficient. The disadvantage is we need to scale the data before using it as support vector machine can show bias towards a feature if the data is not scaled. Visualising the result: k=1 plt.figure(figsize=(20,40)) for i in range(0,4): for j in range(1,5): inp=pd.concat([X[imp_features[i]],X[imp_features[j]]],axis=1) s=svc['classifier'].fit(inp,Y) decision_function = svc['classifier'].decision_function(inp) plt.subplot(4, 4, k) k=k+1 plt.scatter(X[imp_features[i]], X[imp_features[j]], c=Y, s=30, cmap=plt.cm.Paired) ax = plt.gca() xlim = ax.get_xlim() ylim = ax.get_ylim() xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),np.linspace(ylim[0], ylim[1], 50)) xy = np.vstack([xx.ravel(), yy.ravel()]).T Z = svc['classifier'].decision_function(xy).reshape(xx.shape) Z = Z.reshape(xx.shape) plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, levels=[-1, 0, 1], alpha=0.5,linestyles=['--', '-', '--']) ax.scatter(s.support_vectors_[:, 0], s.support_vectors_[:, 1], s=10,linewidth=1, facecolors='none', edgecolors='k') #Showing support vectors plt.title(str(imp_features[i])+' & '+str(imp_features[j])) The model is not showing over-fitting as it is giving good accuracy in testing data as well. Over-fitting occurs when the accuracy in training data is very high but in test data is low.
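Because the accuracy figures above come from a single train/test split, they can move around from run to run. A minimal sketch for confirming the decision tree vs. SVM comparison with cross-validation, assuming X, Y and the two fitted pipelines model and svc from the grid searches above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for each full pipeline (scaler + SMOTE + classifier).
tree_scores = cross_val_score(model, X, Y, cv=5, scoring='accuracy')
svm_scores = cross_val_score(svc, X, Y, cv=5, scoring='accuracy')

print('Decision tree: %.3f +/- %.3f' % (tree_scores.mean(), tree_scores.std()))
print('SVM:           %.3f +/- %.3f' % (svm_scores.mean(), svm_scores.std()))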

  • Working With Weather Database

    Step 1 – Import the Weather Database into your SQL Server Management Studio 1. Open SQL Server Management Studio. Note: a recorded copy of the following steps can be found at https://njit.webex.com/mw3300/mywebex/nbrshared.do 2. Right click on database in the navigation window and select New Database 3. When the screen below appears, enter Weather as shown in the circle below and press the OK button 4. Next right click the Weather database and select Tasks and then Import Data as shown below 5. The import wizard will start and the following window will open. Click the NEXT button 6. When the window below opens, select FLAT FILE as the data source and the window will expand to have the following additional selections 7. Click the Browse button and navigate to where you unzipped the zip file in step 2 and select the AQS_Sires.txt file as shown below 8. Select the COLUMNS option in the left hand navigation tab to make sure the columns have been properly identified. The screen should look like the following 9. Next select the ADVANCED option and the following screen will appear. Once it does, you need to change the OUTPUT COLUMN WIDTH to 255 by click on the original width (shown as 50) and overwriting it with 255. This must be done for the Met_Site_Type, Met_Site_Direction, Owning_Agency, Local_Site_Name, Address, City Name, CBSA_Name and Tribe_Name and then click the NEXT button 10. Next, change the Destination to SQL Server Native Client and make sure the Server and Authentication information is correct and then hit NEXT 11. Hit NEXT when the following window appears 12. Hit the FINISH button when the following windows appears 13. Click FINISH when the following screen appears 14. Hit FINISH and the following screen will show the progress and have the below information when it finishes Database Description 1. The database contains 2 tables a. Aqs_sites – contains information regarding the sites where the temperature information in the Temperature table was collected. All the column should be self explanatory. The linkage between the 2 tables use the State_Code, County_Code and Site_Number columns b. Temperature – contains the daily temperature information collected at the site for 2 decades. The important columns for the assignment are: i. State_Code, County_Code and Site_Number are used to join to the aqs_sites table ii. Date_Local – The date that the sample was collected iii. Average_Temp – The average temperature for that particular date iv. Daily_High_Temp – The highest temperature for the day v. All temperatures are in degrees Fahrenheit 2. Suggestion: Due to the large number of records in the database, your queries may take several minutes to execute. If your queries are taking a long time to run, I suggest you make a table containing a small subset of the data in the Temperature table to use for writing and debugging your queries. After they all execute successfully, you can change the queries to use the full Temperature table 3. Grading – You will receive separate grades for Part 3, Part 4 and Part 5 (if you complete Part 5 for the extra credit). Both parts 3 and 4 will be worth 100 points each and each question in the step will be worth a proportionate value (1/# of questions in the step) Parts 3 and 4 will make up 50% of the Project grade. The extra credit (Part 5) will be worth 15% extra. Creating Geospatial Data Your last concern is how long will it take to travel back home to visit friends and family after you move. 
Since the Weather database has latitude and longitude information, you have decided to convert this information into a new column with a GEOGRAPHY data type and populate the new column with a set command and one of the following formula (Depending on the data type for latitude and longitude) The example below is for when latitude and longitude are varchar. Use Weather go IF NOT EXISTS( SELECT * FROM sys.columns WHERE Name = N'GeoLocation' AND Object_ID = Object_ID(N'AQS_Sites')) BEGIN ALTER TABLE AQS_Sites ADD GeoLocation Geography NULL END go UPDATE aqs_sites SET GeoLocation = geography::STPointFromText('POINT(' + LONGITUDE + ' ' + [Latitude] + ')', 4326) where (LATITUDE is not null and Longitude is not null) and Longitude <> '0' and Longitude <>'' Submission 1 – Problems You are trying to decide where in the US to reside. The most important factor to you is temperature, you hate cold weather. Answer the following questions to help you make your decision. For all problems show all columns included in the examples. Note that the term temperature applies to the average daily temperature unless otherwise stated. 1. Determine the date range of the records in the Temperature table First Date Last Date 1986-01-01 2017-05-09 2. Find the minimum, maximum and average of the average temperature column for each state sorted by state name. State_Name Minimum Temp Maximum Temp Average Temp Alabama -4.662500 88.383333 59.328094 Alaska -43.875000 80.791667 29.146757 Arizona -99.000000 135.500000 67.039050 3. The results from question #2 show issues with the database. Obviously, a temperature of -99 degrees Fahrenheit in Arizona is not an accurate reading as most likely is 135.5 degrees. Write the queries to find all suspect temperatures (below -39o and above 105o). Sort your output by State Name and Average Temperature. State_Name state_code County_Code Site_Number average_Temp date_local Wisconsin 55 059 0002 -58.000000 2002-03-28 Washington 53 009 0013 -50.000000 2012-10-17 Texas 48 141 0050 106.041667 1991-07-28 Texas 48 141 0050 106.291667 1991-07-25 4. You noticed that the average temperatures become questionable below -39 o and above 125 o and that it is unreasonable to have temperatures over 105 o for state codes 30, 29, 37, 26, 18, 38. You also decide that you are only interested in living in the United States, not Canada or the US territories. Create a view that combines the data in the AQS_Sites and Temperature tables. The view should have the appropriate SQL to exclude the data above. You should use this view for all subsequent queries. My view returned 5,616,112 rows. The view includes the State_code, State_Name, County_Code, Site_Number, Make sure you include schema binding in your view for later problems. 5. Using the SQL RANK statement, rank the states by Average Temperature State_Name Minimum Temp Maximum Temp Average Temp State_rank Florida 35.96 88.00 73.348137 1 Texas -1.13 122.60 68.793757 2 Mississippi 22.23 91.16 68.493975 3 6. At this point, you’ve started to become annoyed at the amount of time each query is taking to run. You’ve heard that creating indexes can speed up queries. Create an index for your view. You are required to create an index with the unique and clustered parameters and the index will be on the State_Code, County_Code, Site_Number, Date_Local columns. Note: There are a couple of thousand duplicate rows that you must delete before you can create a unique index. 
I used the Rownumber parameter in a partition statement and deleted any row where the row number was greater than 1. To see if the indexing help, add print statements that write the start and stop time for the query in question #2 and run the query before and after the indexes are created. Note the differences in the times. Also make sure that the create index steps include a check to see if the index exists before trying to create it. The following is a sample of the output that should appear in the messages tab that you will need to calculate the difference in execution times before and after the indexes are created Begin Question 6 before Index Create At - 13:40:03 (777 row(s) affected) Complete Question 6 before Index Create At - 13:45:18 7. You’ve decided that you want to see the ranking of each high temperatures for each city in each state to see if that helps you decide where to live. Write a query that ranks (using the rank function) the states by averages temperature and then ranks the cities in each state. The ranking of the cities should restart at 1 when the query returns a new state. You also want to only show results for the 15 states with the highest average temperatures. Note: you will need to use multiple nested queries to get the State and City rankings, join them together and then apply a where clause to limit the state ranks shown. State_Rank State_Name State_City_Rank City_Name Average Temp 1 Florida 1 Not in a City 73.975759 1 Florida 2 Pinellas Park 72.878784 1 Florida 3 Valrico 71.729440 1 Florida 4 Saint Marks 69.594272 2 Texas 1 McKinney 76.662423 2 Texas 2 Mission 74.701098 8. You notice in the results that sites with Not in a City as the City Name are include but do not provide you useful information. Exclude these sites from all future answers. You can do this by either adding it to the where clause in the remaining queries or updating the view you created in #4 9. You’ve decided that the results in #8 provided too much information and you only want to 2 cities with the highest temperatures and group the results by state rank then city rank. State_Rank State_Name State_City_Rank City_Name Average Temp 1 Florida 1 Pinellas Park 72.878784 1 Florida 2 Valrico 71.729440 2 Louisiana 1 Baton Rouge 69.704466 2 Louisiana 2 Laplace (La Place) 68.115400 10. You decide you like the average temperature to be in the 80's. Pick 2 cities that meets this condition and calculate the average temperature by month for those 2 cities. You also decide to include a count of the number of records for each of the cities to make sure your comparisons are being made with comparable data for each city. Hint, use the datepart function to identify the month for your calculations. City_Name Month # of Records Average Temp Mission 1 620 60.794048 Mission 2 565 64.403861 Mission 3 588 69.727512 11. You assume that the temperatures follow a normal distribution and that the majority of the temperatures will fall within the 40% to 60% range of the cumulative distribution. Using the CUME_DIST function, show the temperatures for the same 3 cities that fall within the range. City_Name Avg_Temp Temp_Cume_Dist Mission 73.916667 0.400686891814539 Mission 73.956522 0.400829994275902 Mission 73.958333 0.402404121350887 12. You decide this is helpful, but too much information. You decide to write a query that shows the first temperature and the last temperature that fall within the 40% and 60% range for the 3 cities your focusing on. 
City_Name 40 Percentile Temp 60 Percentile Temp Mission 73.956522 80.083333 Pinellas Park 71.958333 78.125000 Tucson 63.750000 74.250000 13. You remember from your statistics classes that to get a smoother distribution of the temperatures and eliminate the small daily changes that you should use a moving average instead of the actual temperatures. Using the windowing within a ranking function to create a 4 day moving average, calculate the moving average for each day of the year. Hint: You will need to datepart to get the day of the year for your moving average. You moving average should use the 3 days prior and 1 day after for the moving average. City_Name Day of the Year Rolling_Avg_Temp Mission 1 59.022719 Mission 2 58.524868 Mission 3 58.812967 Mission 364 60.657749 Mission 365 61.726333 Mission 366 61.972514 We are also providing other database assignments help like, MongoDB, Oracle, PostgreSQL, MySQL, etc. If you need or looking for any database assignments or project help then you can please contact us at: contact@codersarts.com
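The window in question 13 (3 days prior plus 1 day after the current day) is easy to get wrong, so it can be worth cross-checking the SQL results outside the database. A minimal pandas sketch of the same logic, assuming the view has been exported to a hypothetical temperature_export.csv with City_Name, Date_Local and Average_Temp columns:

import pandas as pd

temps = pd.read_csv('temperature_export.csv', parse_dates=['Date_Local'])  # hypothetical export
temps['day_of_year'] = temps['Date_Local'].dt.dayofyear

# Average temperature per city per day of the year, sorted for the rolling window.
daily = (temps.groupby(['City_Name', 'day_of_year'])['Average_Temp']
              .mean().reset_index()
              .sort_values(['City_Name', 'day_of_year']))

# rolling(5) then shift(-1) averages rows d-3 .. d+1, i.e. 3 days prior, the current
# day and 1 day after -- mirroring ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING in T-SQL.
daily['Rolling_Avg_Temp'] = (daily.groupby('City_Name')['Average_Temp']
                                  .transform(lambda s: s.rolling(5, min_periods=1).mean().shift(-1)))
print(daily.head())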

  • Visualization Techniques Used in Machine Learning

    Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends, and correlations that might not otherwise be detected can be exposed. To get a little overview here are a few popular plotting libraries: Matplotlib: low level, provides lots of freedom Pandas Visualization: easy to use interface, built on Matplotlib Seaborn: high-level interface, great default styles Plotly: can create interactive plots To understand these all data visualization tools let's deep dive into the code part. For this here the Air Quality dataset from 2015 to 2020 is used. Now to implement these following libraries we have to import it first.(Make sure these are installed on your system.) Import the libraries Libraries to imported are: NumPy pandas plotly.express plotly.offline cufflinks matplotlib seaborn import numpy as np import pandas as pd import matplotlib.pyplot as plt from matplotlib import style style.use('ggplot') %matplotlib inline import seaborn as sns import plotly import plotly.express as px import plotly.graph_objects as go #plt.rcParams['figure.figsize']=17,8 import cufflinks as cf import plotly.offline as pyo from plotly.offline import init_notebook_mode,plot,iplot Read the dataset through pandas df=pd.read_csv('city_day.csv') Here is what the dataset looks like df.head() Now let's plot yearly changes of SO2 plot using Plotly Library Before that, we have to group by each column wrt to the date column. The below code is shown that SO2=df.groupby('year')['SO2'].sum().reset_index().sort_values(by='year',ascending=False) NO2=df.groupby('year')['NO2'].sum().reset_index().sort_values(by='year',ascending=False) BTX=df.groupby('year')['BTX'].sum().reset_index().sort_values(by='year',ascending=False) CO=df.groupby('year')['CO'].sum().reset_index().sort_values(by='year',ascending=False) PM=df.groupby('year')['PM2.5'].sum().reset_index().sort_values(by='year',ascending=False) O=df.groupby('year')['O3'].sum().reset_index().sort_values(by='year',ascending=False) Now let's plot for SO2:- Line plot:(Modes="Lines+markers") SO2.iplot(kind='line',mode='lines+markers',x='year',y='SO2',title='AMOUNT OF SO2 IN DIFFERENT YEARS ') Let's check for the Table + Bar plot in plotly library. trace = go.Table( domain=dict(x=[0, 0.52], y=[0, 1.0]), header=dict(values=["City","SO2"], fill = dict(color = '#119DFF'), font = dict(color = 'white', size = 14), align = ['center'], height = 30), cells=dict(values=[S['City'].head(10),S['SO2'].head(10)], fill = dict(color = ['lightgreen', 'white']), align = ['center'])) trace1 = go.Bar(x=S['City'].head(10), y=S['SO2'].head(10), xaxis='x1', yaxis='y1', marker=dict(color='lime'),opacity=0.60) layout = dict( width=830, height=420, autosize=False, title='TOP 10 Cities with Max SO2', showlegend=False, xaxis1=dict(**dict(domain=[0.58, 1], anchor='y1', showticklabels=True)), yaxis1=dict(**dict(domain=[0, 1.0], anchor='x1', hoverformat='.2f')), ) fig1 = dict(data=[trace, trace1], layout=layout) iplot(fig1) So here we have made a table with the top 10 cities with the max amount of S02. and the bar graph is drawn in the very next. Point plot using Seaborn: Let's see how to draw a Point plot using seaborn. 
plt.subplots(figsize =(15,8)) sns.pointplot(x='month', y='SO2', data=df,color='Orange') plt.xlabel('MONTHS',fontsize = 16,color='blue') plt.ylabel('SO2',fontsize = 16,color='blue') plt.title('SO2 in Different Months',fontsize = 20,color='blue') plt.savefig('loc\\SO2_monthly') So here we can plot the amount of SO2 for different months. Subplots: We can plot some plots together like in a gallery view in all. Let's see the below example where we have plotted all the pollutants changes with respect to different years. from plotly.tools import make_subplots trace1=go.Scatter(x=SO2['year'], y=SO2['SO2'], mode='lines+markers', name='NO2') trace2=go.Scatter(x=NO2['year'], y=NO2['NO2'], mode='lines+markers', name='NO2') trace3=go.Scatter(x=CO['year'], y=CO['CO'], mode='lines+markers', name='CO') trace4=go.Scatter(x=PM['year'], y=PM['PM2.5'], mode='lines+markers', name='PM2.5') fig = plotly.tools.make_subplots(rows=2, cols=2,print_grid=False, subplot_titles=('SO2 in diff. years','NO2 in diff. years','CO in diff. years', 'PM2.5 in diff. years')) fig.append_trace(trace1, 1, 1) fig.append_trace(trace2, 1, 2) fig.append_trace(trace3, 2, 1) fig.append_trace(trace4, 2, 2) fig['layout'].update(height=550, width=850,title='AIR Pollutants In different Years',showlegend=False) iplot(fig) In the above lines of code, we have drawn for 4 pollutants changes wrt year. We can also draw the all in a single graph also. Let's have a look at that. fig=go.Figure() fig.add_trace(go.Scatter(x=SO2['year'], y=SO2['SO2'], mode='lines+markers', name='SO2',line=dict(color='Blue', width=2))) fig.add_trace(go.Scatter(x=NO2['year'], y=NO2['NO2'], mode='lines+markers', name='NO2',line=dict(color='Red', width=2))) fig.add_trace(go.Scatter(x=BTX['year'], y=BTX['BTX'], mode='lines+markers', name='BTX',line=dict(color='Green', width=2))) fig.add_trace(go.Scatter(x=CO['year'], y=CO['CO'], mode='lines+markers', name='CO',line=dict(color='orange', width=2))) fig.add_trace(go.Scatter(x=PM['year'], y=PM['PM2.5'], mode='lines+markers', name='PM2.5',line=dict(color='Magenta', width=2))) fig.add_trace(go.Scatter(x=O['year'], y=O['O3'], mode='lines+markers', name='Ozone',line=dict(color='royalblue', width=2))) fig.update_layout(title='AIR POLLUTANTS PARTICLES IN DIFFERENT YEARS', xaxis_tickfont_size=14,yaxis=dict(title='TOTAL AMOUNT IN YEARS')) fig.show() PIE PLOT: Let's have a look at the below code to make a pie plot: x = df_Ahmedabad_2019 y = df_Bengaluru_2019 z = df_Hyderabad_2019 data = [go.Scatterpolar( r = [x['SO2'].values[0],x['NO2'].values[0],x['CO'].values[0],x['BTX'].values[0],x['PM2.5'].values[0]], theta = ['SO2','NO2','CO','BTX','PM2.5'], fill = 'toself', opacity = 0.8, name = "Ahmedabad"), go.Scatterpolar( r = [y['SO2'].values[0],y['NO2'].values[0],y['CO'].values[0],y['BTX'].values[0],y['PM2.5'].values[0]], theta = ['SO2','NO2','CO','BTX','PM2.5'], fill = 'toself',subplot = "polar2", name = "Bengaluru"), go.Scatterpolar( r = [z['SO2'].values[0],z['NO2'].values[0],z['CO'].values[0],z['BTX'].values[0],z['PM2.5'].values[0]], theta = ['SO2','NO2','CO','BTX','PM2.5'], fill = 'toself',subplot = "polar3", name = "Hyderbad")] layout = go.Layout(title = "Comparison Between Ahmedabad,Bengaluru,Hyderabad in the year 2019", polar = dict(radialaxis = dict(visible = True,range = [0, 120]), domain = dict(x = [0, 0.27],y = [0, 1])), polar2 = dict(radialaxis = dict(visible = True,range = [0, 60]), domain = dict(x = [0.35, 0.65],y = [0, 1])), polar3 = dict(radialaxis = dict(visible = True,range = [0, 70]), domain = dict(x = [0.75, 1.0],y = [0, 
1])),) fig = go.Figure(data=data, layout=layout) iplot(fig) Distribution plot: Let's check the AQI distribution of 5 major cities fig,ax=plt.subplots(figsize=(20, 10)) sns.despine(fig, left=True, bottom=True) sns.set_context("notebook", font_scale=2, rc={"lines.linewidth": 2}) sns.distplot(df_Delhi['AQI'].iloc[::30], color="y",label = 'Delhi') sns.distplot(df_Ahmedabad['AQI'].iloc[::30], color="b",label = 'Ahmedabad') sns.distplot(df_Hyderabad['AQI'].iloc[::30], color="black",label = 'Hyderabad') sns.distplot(df_Bengaluru['AQI'].iloc[::30], color="g",label = 'Bengaluru') sns.distplot(df_Kolkata['AQI'].iloc [::30], color="r",label = 'Kolkata') labels = [item.get_text() for item in ax.get_xticklabels()] ax.set_xticklabels(ax.get_xticklabels(labels), rotation=30,ha="left") plt.rcParams["xtick.labelsize"] = 15 ax.set_title('AQI DISTRIBUTIONS FROM DIFFERENT CITIES') ax.legend(fontsize = 14); Go Scatter Plot: Let's trace a scatter+line plot for the city Kolkata. fig=go.Figure() fig.add_trace(go.Scatter(x=df_Kolkata_2020['Date'], y=df_Kolkata_2020['SO2'], mode='lines', name='SO2',line=dict(color='Blue', width=2))) fig.add_trace(go.Scatter(x=df_Kolkata_2020['Date'], y=df_Kolkata_2020['NO2'], mode='lines', name='NO2',line=dict(color='Red', width=2))) fig.add_trace(go.Scatter(x=df_Kolkata_2020['Date'], y=df_Kolkata_2020['BTX'], mode='lines', name='BTX',line=dict(color='Green', width=2))) fig.add_trace(go.Scatter(x=df_Kolkata_2020['Date'], y=df_Kolkata_2020['CO'], mode='lines', name='CO',line=dict(color='orange', width=2))) fig.add_trace(go.Scatter(x=df_Kolkata_2020['Date'], y=df_Kolkata_2020['PM2.5'], mode='lines', name='PM2.5',line=dict(color='Magenta', width=2))) fig.add_trace(go.Scatter(x=df_Kolkata_2020['Date'], y=df_Kolkata_2020['O3'], mode='lines', name='Ozone',line=dict(color='royalblue', width=2))) fig.update_layout(title='AIR POLLUTANTS PARTICLES ON 2020 Kolkata', xaxis_tickfont_size=14,yaxis=dict(title='AIR POLLUTANTS')) fig.show() So these are all the advanced plots discussed in this blog. For code refer this link: https://github.com/mona2401/Impact-of-Air-Pollution-Before-Lockdown-vs-After-lockdown Thanks for Reading! Happy Coding ;)
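Several of the snippets above rely on columns and dataframes that the excerpt never builds: the year and month columns used in the groupby calls, the per-city frames such as df_Kolkata_2020, and the city-level aggregate S behind the table + bar figure. A minimal sketch of how they might be prepared, assuming city_day.csv has Date, City and pollutant columns (and treating BTX as the sum of Benzene, Toluene and Xylene, which is a common convention for this dataset but an assumption here):

import pandas as pd

df = pd.read_csv('city_day.csv', parse_dates=['Date'])
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['BTX'] = df[['Benzene', 'Toluene', 'Xylene']].sum(axis=1)   # assumed definition of BTX

# City-level SO2 totals used by the table + bar figure (assumed definition of S).
S = df.groupby('City')['SO2'].sum().reset_index().sort_values(by='SO2', ascending=False)

# Per-city / per-year slices such as df_Kolkata_2020 used in the later plots.
df_Kolkata = df[df['City'] == 'Kolkata']
df_Kolkata_2020 = df_Kolkata[df_Kolkata['year'] == 2020]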

  • Research Paper Implementation : Transfer Learning with Deep Convolutional Neural Network.

    ABSTRACT Tremendous progress has been made in object recognition with deep convolutional neural networks (CNNs), thanks to the availability of large-scale annotated datasets. With the ability to learn highly hierarchical image feature extractors, deep CNNs are also expected to solve Synthetic Aperture Radar (SAR) target classification problems. However, the limited labeled SAR target data becomes a handicap to training a deep CNN. To solve this problem, we propose a transfer learning based method, making knowledge learned from sufficient unlabeled SAR scene images transferrable to labeled SAR target data. We design an assembled CNN architecture consisting of a classification pathway and a reconstruction pathway, together with an additional feedback bypass. Instead of training a deep network from scratch with a limited dataset, a large number of unlabeled SAR scene images are first used to train the reconstruction pathway with stacked convolutional auto-encoders (SCAE). Then, these pre-trained convolutional layers are reused to transfer knowledge to SAR target classification tasks, with the feedback bypass simultaneously introducing the reconstruction loss. The experimental results demonstrate that transfer learning leads to better performance in the case of scarce labeled training data, and that the additional feedback bypass with reconstruction loss helps to boost the capability of the classification pathway. To download the full research paper, click on the link below. If you need an implementation of this research paper or any of its variants, feel free to contact us at contact@codersarts.com.
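The two-stage training the abstract describes (pre-train a reconstruction pathway on unlabeled scenes, then reuse its convolutional layers for classification while keeping a reconstruction term as the feedback bypass) can be sketched compactly. The sketch below is illustrative only: the layer sizes, 64x64 patch size, 10 target classes and loss weighting are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

# Stage 1: a small convolutional auto-encoder trained on unlabeled SAR scene patches.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),            # 64x64 -> 32x32
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),           # 32x32 -> 16x16
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 32x32
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # 32x32 -> 64x64
)
recon_loss = nn.MSELoss()
unlabeled = torch.randn(8, 1, 64, 64)                               # stand-in for unlabeled scenes
stage1_loss = recon_loss(decoder(encoder(unlabeled)), unlabeled)

# Stage 2: reuse the pre-trained encoder in a classifier trained on the small labeled set,
# and keep the reconstruction term alongside cross-entropy to mimic the feedback bypass.
classifier = nn.Sequential(encoder, nn.Flatten(), nn.Linear(32 * 16 * 16, 10))
labeled = torch.randn(4, 1, 64, 64)                                 # stand-in for labeled targets
labels = torch.randint(0, 10, (4,))
stage2_loss = (nn.CrossEntropyLoss()(classifier(labeled), labels)
               + 0.1 * recon_loss(decoder(encoder(labeled)), labeled))  # the 0.1 weight is an assumption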

  • A ML Solution for Combatting COVID-19 in Smart Cities from Multiple Dimensions.

    ABSTRACT The spread of COVID-19 across the world continues as efforts are being made from multiple dimensions to curtail its spread and provide treatment. COVID-19 triggered partial and full lockdowns across the globe in an effort to prevent its spread. COVID-19 causes serious fatalities, with the United States of America recording over 3,000 deaths within 24 hours, the highest in the world for a single day. In this paper, we propose a framework integrated with machine learning to curtail the spread of COVID-19 in smart cities. A novel mathematical model is created to show the spread of COVID-19 in smart cities. The proposed solution framework can generate, capture, store and analyze data using machine learning algorithms to detect and prevent the spread of COVID-19, forecast the next epidemic, enable effective contact tracing, diagnose cases, monitor COVID-19 patients, support COVID-19 vaccine development, track potential COVID-19 patients, aid in COVID-19 drug discovery, and provide a better understanding of the virus in smart cities. The study outlines case studies on the application of machine learning to help in the fight against COVID-19 in hospitals in smart cities across the world. The framework can provide a guide for real-world execution in smart cities and has the potential to help national healthcare systems curtail the COVID-19 pandemic in smart cities. To download the full research paper, click on the link below. If you need an implementation of this research paper or any of its variants, feel free to contact us at contact@codersarts.com.
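The abstract mentions a novel mathematical model of how the virus spreads in smart cities without giving its form. Purely as a reference point (this is not the paper's model), the sketch below integrates the classic SIR compartmental model with illustrative parameters, which is the usual starting point for such spread models:

def sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, days=160, dt=1.0):
    """Classic SIR dynamics integrated with a simple Euler step."""
    s, i, r = s0, i0, 0.0
    history = []
    for _ in range(int(days / dt)):
        ds = -beta * s * i              # new infections leave the susceptible pool
        di = beta * s * i - gamma * i   # ...and enter the infected pool, which also recovers
        dr = gamma * i
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
        history.append((s, i, r))
    return history

peak_infected = max(i for _, i, _ in sir())
print('Peak infected fraction: %.2f' % peak_infected)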

  • Security Attacks, Multi-Factor Authentication In PHP

    This assessment is relevant to the following Learning Outcomes: explain the range of threats to e-commerce security; explain how cryptography can be, and is, used to achieve security; describe the different standards in use for secure electronic commerce, such as certificates, MACs, etc.; and describe the different protocols in use for secure electronic commerce, such as SSL/TLS.
    Q1. Security Attacks on E-Commerce Websites. Alice owns a computer store in Melbourne city. In order to increase sales, she has developed an E-Commerce application for her computer store. Some of the well-known attacks on E-Commerce websites are: Cross-Site Scripting (XSS), SQL injection, hidden field manipulation, phishing attacks, cookie poisoning, web scraping, Layer 7 DoS attacks, parameter tampering, buffer overflow, backdoor or debug options, stealth commanding, forced browsing, and third-party misconfigurations. Alice realizes that the E-Commerce application must be secured before it goes online. From that realization, she hires you and your team as security consultants to identify the security risks of her E-Commerce application. Create an E-Commerce website (with a database as the back-end and other necessary tools such as HTML, PHP, JavaScript, CSS files, etc.) for yourself to demonstrate the chosen attacks. However, for the sake of convenience, sample code of Alice's E-Commerce application (including HTML, PHP, JavaScript, and CSS source files) and the database (as an SQL file) are uploaded to CANVAS under the Assignment-1 home page. You should add or edit pages whenever required. Create a group of 3 people. Then, you are required to configure Alice's E-Commerce application on your personal computer or on any free website (where you can host your website) using the knowledge you have learned from Tutorials 1 to 4. Once you have configured the application, you are required to demonstrate at least three types of attacks that can be performed on Alice's E-Commerce application. For each attack, you need to do the following: a) Write down all the necessary steps to launch the attack, with screenshots. b) Record the steps in a video and post it on CANVAS or YouTube (as a private video), and provide the link. You should not share the link of the video with any of your peer groups. Provide the items mentioned in (a) and (b) as a group.
    Q2. Securing the E-Commerce Website from Spam and Abuse. In the E-Commerce application that has been provided on CANVAS in relation to Q1, only registered users should be authorized to log in to the E-Commerce application and trade. A registered user can be either a seller or a buyer, who needs to create a user account. It is possible that several fake users are created by human attackers or software bots to hamper the operation of the E-Commerce application. To protect the E-Commerce application from spam and abuse, Alice requests you to integrate a CAPTCHA into her E-Commerce application. Considering the security strength of Google's reCAPTCHA service, you have decided to integrate it into Alice's application. a) From the knowledge you have learned in the tutorials, implement Google's reCAPTCHA version 2: i. Design a form similar to the one given in Figure-2.1 to create a user account with Google's reCAPTCHA version 2. ii. Show the step-by-step process, with appropriate code segments and screenshots, of how Google's reCAPTCHA version 2 can be applied in the E-Commerce application to prevent the creation of fake user accounts.
    Also, record the steps in a video and post it on CANVAS or YouTube (as a private video) and provide the link. You should not share the link of the video with any of your peer groups. Figure: Expected User Registration Page enabled with Google's reCAPTCHA version 2. b) You have found that Google has a newer version of its reCAPTCHA, reCAPTCHA version 3. When you informed Alice about reCAPTCHA version 3, she was convinced that reCAPTCHA version 3 is better. To make Alice happy: i. Design a form similar to the one shown in Figure-2.2 to create user accounts with Google's reCAPTCHA version 3. ii. Show the step-by-step process, with appropriate code segments and screenshots, of how Google's reCAPTCHA version 3 can be applied in the E-Commerce application to prevent the creation of fake user accounts. Also, record the steps in a video and post it on CANVAS or YouTube (as a private video) and provide the link. You should not share the link of the video with any of your peer groups. iii. What are the advantages of using reCAPTCHA version 3? Figure: User Registration Page enabled with Google's reCAPTCHA version 3.
    Q3. Simple Multi-Factor Authentication. Once user accounts have been created, only valid users should be allowed to log in and trade using Alice's E-Commerce application. However, attackers can still compromise the login system with the aid of some sophisticated software. So, you have decided to integrate multi-factor authentication into Alice's E-Commerce application. Develop email-based multi-factor authentication for Alice's E-Commerce application that meets the following requirements. Also, record the steps in a video and post it on CANVAS or YouTube (as a private video) and provide the link. You should not share the link of the video with any of your peer groups. Requirements: i. Create a simple login form as shown in Figure. When a user provides a valid email (your RMIT student email) and password (e.g. 1234), the user should receive a 6-digit random number at his/her email address as shown in Figure, and the page shown should be as presented in Figure. ii. Once the verification code is provided in the form shown in Figure, the code should be verified and the Success Page shown (see Figure). Otherwise, the Failure Page is shown (see Figure). A minimal sketch of this verification-code logic is given after this item. Figures: Login Form for Email-based Two Factor Authentication; Email containing the 6-digit Two Factor Authentication code; Form to Enter the Verification Code; Success Page shown if a valid code is entered; Failure Page shown if an invalid code is entered.
    Q4. Advanced Multi-Factor Authentication. Once user accounts have been created, only valid users should be allowed to log in and trade using Alice's E-Commerce application. However, attackers can still compromise the login system by performing a password-guessing attack. To prevent an attacker from getting access to the application by simply knowing the password, you have decided to integrate multi-factor authentication into Alice's E-Commerce application. a) Apply Google's 2-step verification (also called 2-Factor Authentication or 2FA) to user accounts of the E-Commerce application. You need to perform the following: i. Create a login form (as shown in Figure-4.1) that would allow you to enter an email and password. Next, provide the steps, with the necessary code segments and screenshots, of how you have integrated Google's 2FA into Alice's E-Commerce application. Also, record the steps in a video and post it on CANVAS or YouTube (as a private video) and provide the link. You should not share the link of the video with any of your peer groups.
    Figure: Login Form with Google's 2 Factor Authentication. iii. Once a user enters the correct email and password, a screen (like Figure-4.2 or 4.3) should prompt the user to enter the 2-step verification code as follows: Figures: Google's form to enter the verification code in Google's 2 Factor Authentication; another Google form to enter the verification code in Google's 2 Factor Authentication. b) Design an SMS-based two-factor authentication (2FA) framework and show the step-by-step process to implement it in Alice's E-Commerce application. In your designed 2FA framework, the E-Commerce website should send an SMS to the verified user's mobile phone number each time a user provides a valid username and password. The verification code should be a unique, short-lived code. Figure-4.4 shows an overview of the system. Show the steps with the necessary code segments and screenshots. Also, record the steps in a video and post it on CANVAS or YouTube (as a private video) and provide the link. You should not share the link of the video with any of your peer groups. Figure-4.4: Overview of the SMS-based 2 Factor Authentication system. Are you looking to add authentication to your web or mobile application to make it more secure, or do you need a complete application with authentication? Then you can contact us here: contact@codersarts.com
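The assignment itself targets PHP, but the one-time-code logic behind both the email-based (Q3) and SMS-based (Q4b) flows is language-independent. The following is a minimal, hedged sketch of that logic only, written in Python (generate a 6-digit code, give it a short lifetime, accept it once); the in-memory dictionary, the 5-minute lifetime, and the function names are illustrative assumptions, and the actual sending of the email or SMS is deliberately left out.

import secrets
import time

CODE_TTL_SECONDS = 300   # assumed lifetime of a one-time code (5 minutes)
_pending = {}            # email -> (code, expiry timestamp); a database table in practice

def issue_code(email):
    # Generate a 6-digit one-time code and remember it with an expiry.
    code = f"{secrets.randbelow(10**6):06d}"
    _pending[email] = (code, time.time() + CODE_TTL_SECONDS)
    # In the real application the code would be sent by email or SMS here.
    return code

def verify_code(email, submitted):
    # Accept the code only once and only before it expires.
    code, expires = _pending.get(email, (None, 0.0))
    if code is None or time.time() > expires:
        return False
    if not secrets.compare_digest(code, submitted):
        return False
    del _pending[email]   # single use
    return True

In the PHP application the same three steps map onto the login flow: issue and send the code after the password check, then verify it on the second form before creating the session.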

  • Research Paper Implementation : A Review on Support Vector Machine for Data Classification.

    ABSTRACT With increasing amounts of data being generated by businesses and researchers, there is a need for fast, accurate and robust algorithms for data analysis. Improvements in database technology, computing performance and artificial intelligence have contributed to the development of intelligent data analysis. Support vector machines are a specific type of machine learning algorithm that is among the most widely used for many statistical learning problems, such as spam filtering, text classification, handwriting analysis, face and object recognition, and countless others. Support vector machines have also come into widespread use in practically every area of bioinformatics within the last ten years, and their area of influence continues to expand today. The support vector machine has been developed as a robust tool for classification and regression in noisy, complex domains. The two key features of support vector machines are generalization theory, which leads to a principled way to choose a hypothesis, and kernel functions, which introduce non-linearity into the hypothesis space without explicitly requiring a non-linear algorithm. To download the full research paper, click on the link below. If you need an implementation of this research paper or any of its variants, feel free to contact us at contact@codersarts.com.
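As a minimal illustration of the two ideas highlighted above (margin-based generalization and kernel functions), the following scikit-learn snippet trains a linear and an RBF-kernel SVM on a toy dataset; the dataset and hyperparameters are illustrative choices, not taken from the paper.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy non-linear dataset (illustrative only).
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM: a maximum-margin separator in the original feature space.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
linear_svm.fit(X_train, y_train)

# RBF-kernel SVM: non-linearity enters only through the kernel function.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
rbf_svm.fit(X_train, y_train)

print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))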

  • Research Paper Implementation : A Survey on Deep Learning: Algorithms, Techniques, and Applications

    ABSTRACT The field of machine learning is witnessing its golden era as deep learning slowly becomes the leader in this domain. Deep learning uses multiple layers to represent the abstractions of data to build computational models. Some key enabling deep learning algorithms, such as generative adversarial networks, convolutional neural networks, and model transfer, have completely changed our perception of information processing. However, there exists an aperture of understanding behind this tremendously fast-paced domain, because it was never previously represented from a multiscope perspective. The lack of core understanding renders these powerful methods as black-box machines that inhibit development at a fundamental level. Moreover, deep learning has repeatedly been perceived as a silver bullet to all stumbling blocks in machine learning, which is far from the truth. This article presents a comprehensive review of historical and recent state-of-the-art approaches in visual, audio, and text processing; social network analysis; and natural language processing, followed by an in-depth analysis of pivotal and groundbreaking advances in deep learning applications. It also reviews the issues faced in deep learning, such as unsupervised learning, black-box models, and online learning, and illustrates how these challenges can be transformed into prolific future research avenues. To download the full research paper, click on the link below. If you need an implementation of this research paper or any of its variants, feel free to contact us at contact@codersarts.com.

  • Research Paper Implementation : Twitter Sentiment Analysis.

    ABSTRACT With the evolving behaviour of different types of social networking sites like Instagram, Twitter, Snapchat, etc., the data posted by people, i.e. the users of a particular social site, is increasing drastically. So much so that millions and billions of items, be they textual, video or audio, are posted per day. This is because there are millions of users of a particular site. These users intend to share their thoughts and views related to any topic of their choosing. Some of these users even post in vain. These posts are short and hence only meant to express a particular view of a particular user regarding a particular thing. In this paper we aim to derive the feelings behind these posts. For this we have chosen Twitter as the social networking site. The posts in this social networking site are known as tweets. In this paper we scrutinise methods of preprocessing and extraction of Twitter data using Python, and then train and test a classifier on this data in order to derive the sentiments behind tweets. To download the full research paper, click on the link below. If you need an implementation of this research paper or any of its variants, feel free to contact us at contact@codersarts.com.
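A minimal sketch of the kind of pipeline the abstract describes (preprocess tweet text, then train and test a classifier) using scikit-learn; the example tweets, the cleaning rules, and the choice of TF-IDF plus logistic regression are illustrative assumptions, not taken from the paper.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def preprocess(tweet):
    # Very light tweet cleaning: drop URLs, mentions, and the '#' of hashtags.
    tweet = re.sub(r"https?://\S+", "", tweet)
    tweet = re.sub(r"@\w+", "", tweet)
    tweet = tweet.replace("#", "")
    return tweet.lower().strip()

# Illustrative labelled tweets (1 = positive, 0 = negative).
tweets = [
    "I love this phone, the camera is amazing!",
    "Great match today, what a win #happy",
    "Worst service ever, totally disappointed",
    "This update ruined the app, so annoying",
]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    [preprocess(t) for t in tweets], labels, test_size=0.5, random_state=0, stratify=labels)

# TF-IDF features + logistic regression as a simple sentiment classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("prediction for a new tweet:", model.predict([preprocess("loving the new features!")]))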

  • Employee Management System Using JavaFx

    Project title: “Crud Operation using JavaFX” Project Description: In this example, we show how to develop a CRUD (Create, Read, Update, and Delete) application using JavaFX. All operations are performed on basic employee properties like employee id, employee name, department name, mobile number, and employee salary. The application's main aim is to add employee details to the database using the user interface, and to perform further operations such as updating, viewing, and deleting records. Screenshots: Home Screen; Update Employee Data; Register New Employee Details. If you need a solution or code script for this project, or another ready-made project related to JavaFX, then you can contact us at: contact@codersarts.com
