Search Results
- Apache Spark Assignment Help | Machine Learning Using PySpark
What is PySpark? PySpark is the Python API for Spark, released by the Apache Spark community to support Python with Spark. Using PySpark, one can easily integrate and work with RDDs from the Python programming language as well. There are numerous features that make PySpark an excellent framework for working with huge datasets; whether the goal is to perform computations on large datasets or simply to analyze them, data engineers are switching to this tool.
Key Features of PySpark
Real-time computations: because of in-memory processing, the PySpark framework shows low latency.
Polyglot: Spark is compatible with several languages, such as Scala, Java, Python, and R, which makes it one of the most preferred platforms for processing huge datasets.
Caching and disk persistence: the framework provides powerful caching and good disk persistence.
Fast processing: the PySpark framework is much faster than many traditional frameworks for Big Data processing.
Works well with RDDs: Python is dynamically typed, which helps when working with RDDs.
Spark with Python vs Spark with Scala
As already discussed, Python is not the only programming language that can be used with Apache Spark. Data scientists already prefer Spark because of the several benefits it has over other Big Data tools, but choosing which language to use with Spark is a dilemma they face. Python has become so popular for Big Data analytics and machine learning that it would not be shocking if it became the de facto language for evaluating and dealing with large datasets in the coming years. The most used programming languages with Spark are Python and Scala. If you are going to learn PySpark (Spark with Python), it is important to know why and when to use Spark with Python instead of Spark with Scala. This section explains the basic criteria to keep in mind while choosing between Python and Scala for working with Apache Spark.
Installation on Windows
In this section, you will learn how to install PySpark on Windows systems step by step, starting by downloading the latest version of Spark from the official Spark website.
What is SparkConf?
Before running any Spark application on a local cluster or on a dataset, you need to set some configurations and parameters. This is done with the help of SparkConf; as the name suggests, it holds the configuration for a Spark application.
Features of SparkConf and Their Uses
Here is a list of some of the most commonly used attributes of SparkConf while working with PySpark:
set(key, value): sets a configuration property.
setMaster(value): sets the master URL.
setAppName(value): sets the application name.
get(key, defaultValue=None): gets the configuration value for a key.
setSparkHome(value): sets the Spark installation path.
Code to run SparkConf:
>>> from pyspark.conf import SparkConf
>>> from pyspark.context import SparkContext
>>> conf = SparkConf().setAppName("PySpark App").setMaster("local[2]")
>>> conf.get("spark.master")
>>> conf.get("spark.app.name")
What is PySpark SparkContext? SparkContext is the entry gate for any Spark application or functionality. It is the first and foremost thing that gets initiated when you run any Spark application.
In the PySpark shell, a SparkContext is available as sc by default, so creating a new SparkContext will throw an error.
Parameters
SparkContext takes the parameters listed below:
Master: the URL of the cluster SparkContext connects to
AppName: the name of your job
SparkHome: the Spark installation directory
PyFiles: the .zip or .py files sent to the cluster and added to PYTHONPATH
Environment: worker-node environment variables
BatchSize: the number of Python objects represented as a single Java object; set it to 1 to disable batching, 0 to choose the batch size automatically based on object size, or -1 to use an unlimited batch size
Serializer: the RDD serializer
Conf: an L{SparkConf} object to set all Spark properties
profiler_cls: the class of custom profiler used for profiling; pyspark.profiler.BasicProfiler is the default
Code to run SparkContext:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
Classes of Spark SQL and DataFrames:
pyspark.sql.SparkSession: main entry point for DataFrame and SQL functionality
pyspark.sql.DataFrame: a distributed collection of data grouped into named columns
pyspark.sql.Column: a column expression in a DataFrame
pyspark.sql.Row: a row of data in a DataFrame
pyspark.sql.GroupedData: aggregation methods, returned by DataFrame.groupBy()
pyspark.sql.DataFrameNaFunctions: methods for handling missing data (null values)
pyspark.sql.DataFrameStatFunctions: methods for statistics functionality
pyspark.sql.functions: built-in functions available for DataFrames
pyspark.sql.types: available data types
pyspark.sql.Window: for working with window functions
Analyze Data using Spark SQL
Relational databases are used by almost all organizations for a range of tasks, from managing and tracking huge amounts of information to organizing and processing transactions. It is one of the first concepts we are taught in coding school, and rightly so, because it is a crucial cog in a data scientist's skillset: you simply cannot get by without knowing how databases work, and it is a key aspect of any machine learning project. Structured Query Language (SQL) is easily the most popular language for working with databases. Unlike many programming languages, it is easy to learn and helps us get started with data extraction. For most data science jobs, proficiency in SQL ranks higher than proficiency in most other programming languages.
Features of Spark SQL
Spark SQL has a ton of great features, but here are a few key ones that you will use a lot in your role:
Query structured data within Spark programs: most of you are probably already familiar with SQL, so you are not required to learn how to define a complex function in Python or Scala to use Spark; you can use the exact same query to get results for your bigger datasets.
Compatible with Hive: not only SQL, you can also run the same Hive queries on the Spark SQL engine; it offers full compatibility with existing Hive queries.
One way to access data: in typical enterprise-level projects you do not have a single common source of data; instead, you need to handle multiple types of files and databases. Spark SQL supports almost every type of file and gives you a common way to access a variety of data sources, such as Hive, Avro, Parquet, JSON, and JDBC.
Performance and scalability: while working with large datasets, there is a chance that faults occur while a query is running.
Spark SQL supports full mid-query fault tolerance, so we can work with even a thousand nodes simultaneously.
User-defined functions: UDFs are a feature of Spark SQL that define new column-based functions and extend the vocabulary of Spark SQL for transforming datasets.
Executing SQL Commands with Spark
I have created a random dataset of 25 million rows. You can download the entire dataset here. We have a text file with comma-separated values. So, first, we will import the required libraries, read the dataset, and see how Spark divides the data into partitions:
# importing required libraries
from pyspark.sql import SQLContext
from pyspark.sql import Row
# read the text data
raw_data = sc.textFile('sample_data_final_wh.txt').cache()
How to Manage Python Dependencies in PySpark
There are several methods for managing dependencies in PySpark, described below:
Using Conda: Conda is one of the most widely used Python package management systems. PySpark users can ship third-party Python packages in a Conda environment by leveraging conda-pack, a command-line tool that creates relocatable Conda environments.
Using virtualenv: virtualenv is a Python tool for creating isolated Python environments. Since Python 3.3, a subset of its features has been integrated into the standard library as the venv module. From Apache Spark 3.1, PySpark users can use virtualenv to manage Python dependencies in their clusters by using venv-pack in a similar way to conda-pack. In Apache Spark 3.0 and lower versions, it can be used only with YARN.
Using PEX: PySpark can also use PEX to ship Python packages together. PEX is a tool that creates a self-contained Python environment, similar to Conda or virtualenv, except that a .pex file is executable by itself.
Contact us for PySpark assignment help, PySpark homework help, or PySpark project help and get instant help at an affordable price; you can send your request directly to contact@codersarts.com
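Returning to the "Executing SQL Commands with Spark" snippet above, here is a minimal, hedged continuation showing how the cached text RDD could be turned into a DataFrame and queried with plain SQL. The column names and positions (age, income) are assumptions for illustration only, since the sample file's layout is not shown:
# continue from the cached RDD created above; column names/positions are assumed, and no header row is assumed
sqlContext = SQLContext(sc)
rows = raw_data.map(lambda line: line.split(',')) \
               .map(lambda p: Row(age=int(p[0]), income=p[1]))
df = sqlContext.createDataFrame(rows)
df.createOrReplaceTempView("sample")
# the same SQL that works on a small table now runs on all 25 million rows
sqlContext.sql("SELECT income, COUNT(*) AS n FROM sample GROUP BY income").show()
print(raw_data.getNumPartitions())   # how Spark divided the data into partitions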
- Apache Spark
What is Apache Spark? Apache Spark is an open-source, distributed, general-purpose cluster-computing framework which provides a number of inter-connected platforms, systems, and standards for Big Data projects. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. In simpler words, it can quickly perform processing tasks on very large data sets, and it can also distribute data processing tasks across multiple computers, either on its own or together with other distributed computing tools. It utilizes in-memory caching (i.e., RAM rather than disk space) and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing. These two qualities are key to the worlds of big data and machine learning, which require the assembling of massive computing power to crunch through large data stores. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Apache Spark Ecosystem
Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.
Apache Spark Core: Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL: Spark SQL is Apache Spark's module for working with structured data. The interfaces offered by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.
Spark Streaming: this component allows Spark to process real-time streaming data. Data can be ingested from many sources such as Kafka, Flume, and HDFS (Hadoop Distributed File System), processed using complex algorithms, and pushed out to file systems, databases, and live dashboards.
MLlib (Machine Learning Library): Apache Spark is equipped with a rich library known as MLlib. It contains a wide array of machine learning algorithms (classification, regression, clustering, and collaborative filtering) as well as other tools for constructing, evaluating, and tuning ML pipelines. All of these functionalities help Spark scale out across a cluster.
GraphX: Spark also comes with GraphX, a library for manipulating graphs and performing graph computations. GraphX unifies the ETL (Extract, Transform, and Load) process, exploratory analysis, and iterative graph computation within a single system.
Architecture
At a foundational level, an Apache Spark application comprises two main components: a driver, which converts the user's code into multiple tasks that can be distributed across worker nodes, and executors, which run on those nodes and execute the tasks assigned to them. Some form of cluster manager is necessary to mediate between the two. Spark can run in a standalone cluster mode that only requires the Apache Spark framework and a JVM on each machine in your cluster. However, it is usually better to take advantage of a more robust resource or cluster management system that allocates workers on demand for you. In the enterprise, this will normally mean running on Hadoop YARN, but Apache Spark can also run on Apache Mesos, Kubernetes, and Docker Swarm.
If you seek a managed solution, Apache Spark can be found as part of Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight. Databricks, the company that employs the founders of Apache Spark, also offers the Databricks Unified Analytics Platform, a comprehensive managed service that offers Apache Spark clusters, streaming support, integrated web-based notebook development, and optimized cloud I/O performance over a standard Apache Spark distribution. Apache Spark builds the user's data processing commands into a Directed Acyclic Graph, or DAG. The DAG is Apache Spark's scheduling layer; it determines which tasks are executed on which nodes and in what sequence.
Features
Fast processing: the most important feature of Apache Spark, and the reason the big data world chooses it over other technologies, is its speed. Big data is characterized by volume, variety, velocity, and veracity, and it needs to be processed at high speed. Spark's Resilient Distributed Dataset (RDD) saves time on read and write operations, allowing it to run almost ten to one hundred times faster than Hadoop.
Flexibility: Apache Spark supports multiple languages and allows developers to write applications in Java, Scala, R, or Python.
In-memory computing: Spark stores data in the RAM of the servers, which allows quick access and in turn accelerates analytics.
Real-time processing: Spark can process real-time streaming data. Unlike MapReduce, which processes only stored data, Spark processes data as it arrives and can therefore produce instant outcomes.
Better analytics: in contrast to MapReduce, which includes only Map and Reduce functions, Spark includes much more. Apache Spark offers a rich set of SQL queries, machine learning algorithms, complex analytics, and so on. With all these functionalities, analytics can be performed better with Spark.
Ease of use: Spark has easy-to-use APIs for operating on large datasets, including a collection of over 100 operators for transforming data and familiar DataFrame APIs for manipulating semi-structured data.
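To make the DAG and in-memory caching described above concrete, here is a minimal, hedged PySpark sketch; the file name and column name are hypothetical. Transformations only record lineage in the DAG, and nothing executes until an action is called:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# transformations: these only build up the DAG, no work happens yet
df = spark.read.csv("events.csv", header=True, inferSchema=True)  # hypothetical file
ok_rows = df.filter(df["status"] == "ok")                         # hypothetical column
ok_rows.cache()                                                   # keep the result in memory once computed

# actions: these make the scheduler run the DAG on the executors
print(ok_rows.count())   # first action computes the result and caches it
print(ok_rows.count())   # second action is served from the in-memory cache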
- Big Data Analysis with PySpark | Sample Assignment.
BACKGROUND: INCOME CLASSIFIER
Census data is one of the largest sources of a variety of statistical information related to a population. It typically includes information related to age, gender, household composition, employment details, accommodation details, and so on. Until recent years, collecting census data was a manual process involving field visits and registrations. With advances in technology, the methods of collecting this data have improved to a great extent, and the population has grown as well: with more than 7 billion people, one can imagine the volume of the census data associated with it. This data is collected from a variety of sources such as manual entries, online surveys, and data from social media and search engines, and it comes in various formats. Traditional database systems are inefficient at handling such data; this is where Big Data technologies come into the picture. As per a study by the U.S. Census Bureau, analytics on census data could have been helpful during the Great Recession in various ways, such as avoiding job loss in supply-chain businesses and reducing housing foreclosure rates. Big Data analytics refers to a set of tools and methods used to obtain knowledge from information. Applying Big Data analytics to census data can facilitate better decision making in various government and industry sectors such as healthcare, education, finance, retail, and housing. One such application is an income classifier. In this project, we take a sample of world census data and build an income classifier using the Big Data techniques described in the subsequent sections.
LEARNING OBJECTIVES
1. HDFS and Hive for data storage and management
2. Data ingestion using Sqoop
3. Machine learning using PySpark
This project is divided into three parts to cover the above learning objectives.
DATASET
The dataset named censusdata.csv is provided in your LMS. We will be using the same dataset for all three parts.
Input: the dataset contains 15 columns.
Target column: income, provided as one of two values: <=50K or >50K.
Number of other columns: 14; these are demographics and other features used to describe a person.
List of attributes:
age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt: continuous
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income: >50K, <=50K
TASKS
1. HDFS AND HIVE
Problem Statement 1
Census Analytics is a project where you need to collect the data of people along with their incomes. As the census data is usually in large volume, the analysis of the data will be a cumbersome task. To overcome this, we will be using the Hadoop Ecosystem. As a first step, you need to load the data into HDFS and create a table in Hive that can be used for querying the data. You have to create different types of tables, execute queries as mentioned below, and compare the time required for execution for the different types of tables.
Steps to be performed:
1. Download the dataset named censusdata.csv that is provided in your LMS
2. Load the downloaded data into HDFS
3. Create an internal table in Hive to store the data
a. Create the table structure
b. Load the data from HDFS into the Hive table
4. Create an internal table in Hive with partitions
a. Create a partition table in Hive using "workclass" as the partition key
b. Load data from the staging table (table created in Step 3) into this table
5. Create an external table in Hive to hold the same data stored in HDFS
6. Create an external table in Hive with partitions using "workclass" as the partition key
7. For each of the four tables created above, perform the following operations:
Find out the number of adults based on income and gender. Note the time taken for getting the result
Find out the number of adults based on income and workclass. Note the time taken for getting the result
Write your observations by comparing the time taken for executing the commands between:
a. Internal & External Tables
b. Partitioned & Non-partitioned Tables
8. Delete the internal as well as external tables. Comment on the effect on data and metadata after the deletion is performed for both internal and external tables.
2. DATA INGESTION
Problem Statement 2
In a similar scenario as above, the data is available in a MySQL database. Due to the inefficiency of RDBMS systems to store and analyze Big Data, it is recommended that we move the data to the Hadoop Ecosystem. Ingest the data from the MySQL database into Hive using Sqoop. A data pipeline needs to be created to ingest data from an RDBMS into the Hadoop cluster and then load data into Hive.
To make the analysis faster, use Spark on top of Hive after getting the data into the Hadoop cluster. Using Spark, query different tables from Hive to analyze the dataset.
Steps to be performed:
1. Create the necessary structure in a MySQL database using the steps mentioned below:
a. Create a new database in MySQL with the name midproject
b. Create a table in this database with the name census_adult to store the input dataset
c. Load the dataset into the table
d. Verify whether the data is loaded properly
e. Verify the table for unwanted data such as '?', 'Nan' and 'Null'
f. Get the counts for the columns which contain unwanted data
g. Clean the data by replacing the unwanted data with others
2. Import the above data from MySQL into a Hive table using Sqoop
3. Connect to PySpark using the web console to access the created Hive table. Perform the following queries and note the time taken for execution in each of the queries.
a. Query the table to get the number of adults based on income and gender
b. Query the table to get the number of adults based on income and workclass
Hint: to access Hive tables using the Spark console, use the following commands:
pyspark2
>>> from pyspark.context import SparkContext
>>> from pyspark.sql import HiveContext
>>> sqlContext = HiveContext(sc)
4. Access the following two tables created as part of Problem 1 (HDFS and Hive) and perform the steps as mentioned below:
a. Access the Hive external table with partitions
i. Query the table to get the number of adults based on income and gender
ii. Query the table to get the number of adults based on income and workclass
b. Access the Hive internal table with partitions
i. Query the table to get the number of adults based on income and gender
ii. Query the table to get the number of adults based on income and workclass
Make a note of the time taken for getting the result in comparison with the time taken to get results with Hive.
5. Comment on the time taken for executing these commands using Spark as compared to the time taken for execution in Hive (Problem Statement 1).
3. INCOME CLASSIFIER
Problem Statement 3
Income Classifier is an application that will be used to classify individuals based on their annual income. An individual's annual income may be influenced by various factors such as age, gender, occupation, education level, and so on. Write a program to build classification models using PySpark. Explore the possibility of classifying income based on an individual's personal information. Perform the following steps to build and compare different classifiers. Use Jupyter Notebook to write the program.
Steps to be performed:
1. Load data using PySpark
2. Perform Exploratory Data Analysis (EDA) and Data Cleaning based on the following points:
a. Find the shape and schema of the dataset
b. Obtain insights (statistics) of different columns
c. Obtain the unique values of categorical columns
d. Check if any unwanted values are present in the data, such as Null, ? or NaN
e. Remove unwanted values if present in any of the columns (numerical as well as categorical columns)
f. Obtain the relationship between different columns using covariance, which shows the degree of interdependence of the two columns
g. Obtain distinct values and their counts in categorical columns
h. Create a crosstab on two different columns (for example, age & workclass)
i. Perform an "Integer Type Check" on the columns of the Spark DataFrame and display the columns satisfying the same
j. Obtain the correlation between the above columns using a pandas scatter plot
3. Data Preprocessing
Since we are going to use classification algorithms like Logistic Regression, we will have to convert all the categorical columns in the dataset to numerical values. We can achieve this using:
1) Category Indexing: assign a numerical value to each category (e.g., Male: 0, Female: 1)
2) One-Hot Encoding: convert categories into binary vectors with at most one nonzero value (e.g., Blue: [1, 0], Green: [0, 1], Red: [0, 0])
In this step, we will be using a combination of Category Indexing and One-Hot Encoding:
a. Conversion of categorical columns into numerical columns
i. Category Indexing using string indexing for all categorical columns
ii. Label indexing for the income column as income_class
iii. One-Hot Encoding, which generates binary columns for the features
iv. Use a VectorAssembler to get a single vector column for the features
v. Collect these as an array of stages so that they can be passed to a pipeline
(Note: make sure that the output column name for income is income_class)
4. Build the Pipeline to perform multiple tasks
a. Pass the stages of data preprocessing (created in Step 3) to the pipeline to create an instance with those stages
b. The estimator can then fit on a DataFrame to produce a model
c. Transform the DataFrame with features into a DataFrame with predictions
d. Generate a DataFrame which can hold a variety of datatypes, including feature vectors
5. Split the dataset into two parts (80%-20%) as train and test datasets
a. Check the shape of the datasets
b. Check the distribution of the income class (0, 1) in the train and test datasets
6. Build the following classifiers:
a. Logistic Regression
b. Decision Tree
c. Random Forest
d. Gradient Boosted Tree
e. Naïve Bayes
Common tasks for all the classifiers: train and evaluate the model; print ROC metrics and model accuracy; tune the hyperparameters and print the improved accuracy. Compare the accuracy of the 5 models and comment on which models performed better than the others in the list (a rough PySpark sketch of this preprocessing-and-pipeline flow follows below).
If you need a solution for this assignment or have a similar assignment, you can leave us a mail at contact@codersarts.com
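As a rough illustration of steps 3-6 above, here is a minimal, hedged PySpark ML sketch. It assumes Spark 3.x, that the cleaned census DataFrame is already loaded as df, and that only a couple of the categorical columns are shown for brevity:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

categorical = ["workclass", "education"]                         # subset shown for brevity
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical],
                        outputCols=[c + "_vec" for c in categorical])
label_indexer = StringIndexer(inputCol="income", outputCol="income_class")
assembler = VectorAssembler(inputCols=[c + "_vec" for c in categorical] + ["age", "hours-per-week"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="income_class")

pipeline = Pipeline(stages=indexers + [encoder, label_indexer, assembler, lr])
train, test = df.randomSplit([0.8, 0.2], seed=42)   # 80/20 split
model = pipeline.fit(train)                          # estimator -> fitted model
predictions = model.transform(test)                  # features -> predictions
evaluator = BinaryClassificationEvaluator(labelCol="income_class", metricName="areaUnderROC")
print(evaluator.evaluate(predictions))               # area under ROC
The same pipeline can be reused for the other classifiers by swapping the final stage for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, or NaiveBayes from pyspark.ml.classification.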
- Student Scheduler | A Student Progress Tracking App | CodersArts
This app helps parents and teachers track a student's progress report. We have added features such as student information, teacher information, and student-parent details. A teacher can assign work to the student, update the student's progress day by day, and then review the information about the student.
1. When a parent or teacher opens the app, a splash activity is shown and then the home page.
2. When the student opens the navigation page, there is a list of terms, courses term-wise, a list of courses, and a list of assessments.
3. On the assessment page there is an Add New Courses screen that captures the course title, course start date, course end date, status, and mentor information (mentor mobile number and email); at the end, a Notes page holds the student's progress details.
In the term-wise view, parents and teachers can track the student's progress report and see where the student needs improvement. Hire an Android developer to get quick help for all your Android app development needs, with hands-on Android assignment help and Android project help from a Codersarts Android expert. You can contact the Android programming help expert any time; we will help you overcome all the issues and find the right solution. Want to get help right now, or want a price quote? Please send your requirement files to contact@codersarts.com and you'll get a reply as soon as the requirements are received.
- Data Pre-Processing & Visualization with Python | Sample Assignment.
Project Details
Your tasks in this project are as follows:
Data wrangling, which consists of:
Gathering data (downloadable file in the Resources tab in the leftmost panel of your classroom and linked in step 1 below)
Assessing data
Cleaning data
Storing, analyzing, and visualizing your wrangled data
Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations
Gathering Data for this Project
Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:
1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
If you decide to complete your project in the Project Workspace, note that you can upload files to the Jupyter Notebook Workspace by clicking the "Upload" button in the top right-hand corner of the dashboard.
Assessing Data for this Project
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.
Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file, with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do). Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
Reporting for this Project
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.
Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example. Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks and then downloading those notebooks as PDF or HTML files. You might prefer to use a word processor like Google Docs or Microsoft Word, however. If you need a solution to these types of problems, you can contact us at contact@codersarts.com and get instant help.
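For step 2 of the gathering phase above, a minimal sketch of the programmatic download with the Requests library might look like the following; the output filename matches the one named by the project, and pandas is used here only to verify that the file loads:
import requests
import pandas as pd

url = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")
response = requests.get(url)
response.raise_for_status()                  # fail loudly if the download did not succeed
with open("image_predictions.tsv", "wb") as f:
    f.write(response.content)                # save the raw bytes to disk

image_predictions = pd.read_csv("image_predictions.tsv", sep="\t")   # tab-separated file
print(image_predictions.shape)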
- Solve Machine Learning Mathematical Problems | Sample Assignment
1. Consider a dataset with attributes x, y, and z, where the decision attribute is z. Suppose that we have determined that there are two support vectors: the 2D point (-7, 10), which corresponds to an instance in the dataset that has x = -7, y = 10, and z = -1, and the 2D point (-6, 9), which corresponds to an instance in the dataset that has x = -6, y = 9, and z = 1. The equations for the support vector machine are shown below, where s1 = (-7, 10, 1) is the augmented support vector for (-7, 10), s2 = (-6, 9, 1) is the augmented support vector for (-6, 9), and α1 and α2 are the respective parameters for the support vectors that will be used to define the 2D hyperplane.
α1 φ(s1) · φ(s1) + α2 φ(s2) · φ(s1) = -1
α1 φ(s1) · φ(s2) + α2 φ(s2) · φ(s2) = 1
For φ, use φ(x, y) = (x + y, 10 - y)
a. Solve for each αi, showing ALL of your work! (2 pts.)
b. Using your results from part a., define the discriminating 2D hyperplane for this dataset; that is, give an equation for the 2D hyperplane. Show your work! (2 pts.)
c. Using the support vector machine you have defined, predict the value of the decision attribute (z) for an instance that has x = 2 and y = 5. Show your work! (2 pts.)
2. Write a Python function which, given a dataframe, constructs (and returns) a Naïve Bayesian network. You can assume that all of the attributes have nominal values and that the decision attribute is the last attribute in the dataframe. Apply Laplace smoothing to the conditional probabilities of the attributes (as explained in class) using a value of λ = 1. Output the conditional probability table for each node in the Bayesian network so that your work can be checked! Test your function by running it on contact-lenses.csv AND hypothyroid.csv (both of which are posted on Canvas with this assignment). Note that you can check your work by running Classify -> weka -> Classifiers -> bayes -> NaiveBayesSimple in Weka. ALSO demonstrate that you have successfully created these particular Bayesian networks by executing code that predicts the following: for contact-lenses: contact-lenses = soft, age = presbyopic, other attributes = None; for hypothyroid: class = negative, sex = U, other attributes = None. Note: you will NOT get full credit for your solution if you hard-code your code to work just for the specified test datasets!
Get help at an affordable price if you face any problem in machine learning; send your request to contact@codersarts.com
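For question 1(a), one way the kernel setup can be sketched, assuming the common convention of applying φ to the 2D point and then re-augmenting with a bias of 1 (the exercise may intend a different convention):
\phi(-7, 10) = (-7 + 10,\ 10 - 10) = (3, 0) \;\Rightarrow\; \tilde{s}_1 = (3, 0, 1)
\phi(-6, 9) = (-6 + 9,\ 10 - 9) = (3, 1) \;\Rightarrow\; \tilde{s}_2 = (3, 1, 1)
\tilde{s}_1 \cdot \tilde{s}_1 = 10, \quad \tilde{s}_1 \cdot \tilde{s}_2 = 10, \quad \tilde{s}_2 \cdot \tilde{s}_2 = 11
10\,\alpha_1 + 10\,\alpha_2 = -1, \qquad 10\,\alpha_1 + 11\,\alpha_2 = 1
Solving this 2x2 linear system gives the α values asked for in part (a); under the same assumption, the augmented weight vector for part (b) is \tilde{w} = \alpha_1 \tilde{s}_1 + \alpha_2 \tilde{s}_2, whose last component is the bias.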
- SQL Important Questions
Use the tables below to create queries to answer these questions:
1. What is the average initial margin and average actual margin (if applicable) by make and model?
2. For each month by region, on average how long does a customer take to purchase?
3. For each month by region, what percentage of test drive requests result in sales?
4. By region, what is the average number of test drives completed by a vehicle within the first week of being listed?
5. How often does a customer complete the first test drive appointment she schedules?
If you need any help, please contact us at contact@codersarts.com and get instant help at an affordable price.
- Health Informatics Database Modeling and Application
Multiple Choice Questions
For each question below, please select a single correct answer. 1 point for each question.
(1) You have a reading of a patient's temperature at 99.3 °F. Which category of data does this reading belong to?
A. Unstructured data
B. Nominal data
C. Ordinal data
D. Quantitative data
(2) CHEM-7, a basic metabolic panel, is a group of blood tests that provides information about a patient's metabolism. It has 7 components: blood urea nitrogen (BUN), carbon dioxide (CO2), creatinine, glucose, serum chloride (Cl-), serum potassium (K+), and serum sodium (Na+). Consider the following result of a CHEM-7 test:
CHEM-7
BUN: 15 mg/dl
CO2: 23 mmol/l
Creatinine: 1.1 mg/dl
Glucose: 92 mg/dl
Cl-: 108 mmol/l
K+: 4.1 mEq/l
Na+: 138 mEq/l
If you choose to use the entity-attribute-value model to represent the CHEM-7 data, which of the following statements is incorrect?
A. 108 mmol/l is the value of the data attribute Cl-
B. CHEM-7 is the data entity
C. Creatinine is the value of the data entity CHEM-7
D. Glucose is a data attribute for the CHEM-7 data entity
(3) Which of the following statements is not a benefit of using a DBMS?
A. Addressing information needs
B. Concurrent data access
C. Data integrity
D. Efficient data access
(4) Consider the following design of a table with the records populated. Please identify the problem in the data fields of this table:
A. Calculated field
B. Multipart field
C. Multivalued field
D. Unspecified field
(5) Considering the following two tables (Patient table and Diagnosis table) and the relationship in a database, what is the relationship between the two tables?
If you need any tutorial or assignment-related help, you can contact us at contact@codersarts.com
- Blowfish and ECC algorithms to secure data | Sample Assignment
Description:
Load the Heart Disease dataset from the UCI repository
Encrypt the dataset using Blowfish and ECC
Save the encrypted dataset as a CSV file
Decrypt the encrypted dataset using Blowfish and ECC
Save the decrypted dataset as a CSV file
Note: the whole process is carried out as per the description given above. It is not a real-time project, and no GUI is provided.
Language: Python
Front end: Anaconda Navigator - Spyder
To get a solution for these cryptography algorithms, contact us at contact@codersarts.com
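As a rough sketch of the Blowfish half of this workflow only (the ECC step is typically handled as a hybrid ECIES-style scheme and is omitted here), assuming the PyCryptodome package is installed and a local heart.csv copy of the UCI Heart Disease dataset is available:
import pandas as pd
from Crypto.Cipher import Blowfish
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

key = get_random_bytes(16)               # Blowfish accepts keys from 4 to 56 bytes
df = pd.read_csv("heart.csv")            # assumed local copy of the UCI dataset

def encrypt_cell(value):
    cipher = Blowfish.new(key, Blowfish.MODE_ECB)   # ECB keeps the sketch short; CBC with an IV is safer
    return cipher.encrypt(pad(str(value).encode(), Blowfish.block_size)).hex()

def decrypt_cell(token):
    cipher = Blowfish.new(key, Blowfish.MODE_ECB)
    return unpad(cipher.decrypt(bytes.fromhex(token)), Blowfish.block_size).decode()

encrypted = df.applymap(encrypt_cell)
encrypted.to_csv("heart_encrypted.csv", index=False)   # save the encrypted dataset

decrypted = encrypted.applymap(decrypt_cell)            # values come back as strings
decrypted.to_csv("heart_decrypted.csv", index=False)    # save the decrypted dataset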
- Machine Learning Important Questions
Question 1: Explain what overfitting is and describe three different techniques to help avoid it when optimizing deep neural networks.
Question 2: You have trained five different models on the same set of data and they each get 90% precision. Can you combine these models, without retraining, to get better results? If so, explain how; if not, explain why not.
Question 3: If you are using learners with regularization and AdaBoost underfits the training data, how should you adjust the parameters of AdaBoost or its learners?
Question 4: An office building with 10 floors has 3 elevators, each of which can hold up to 4 people. Every floor has a pair of call buttons to request up or down service, except the top and bottom floors, which have only one button each. When the elevator arrives, a person enters and presses the number of the floor they want. Each elevator can store the floor numbers entered and stops at each floor that is requested. Describe the state and action spaces and calculate their size. Describe a reinforcement learner (reward function and learning method) that can learn to control the elevators, delivering passengers as expected while not wasting energy. Be sure to indicate whether delayed rewards should be used.
Question 5: Given the three-unit neural network with weights as indicated, in which units form products of their weighted inputs rather than sums, write a function for the output value y of node C based on the one input, x, and the other network parameters. For example, the input to C would be the product of all incoming weights and associated activations. There are no biases. Unit C is a linear unit, whereas units A and B are sigmoids, σ(z) = (1 + e^{-z})^{-1}.
If you need a solution to the above machine learning questions, you can contact us at contact@codersarts.com
- REST API | CodersArts
REST is an acronym for Representational State Transfer. It is an architectural style for distributed hypermedia systems, first presented by Roy Fielding in 2000 in his famous dissertation. Like any other architectural style, REST has its own six guiding constraints which must be satisfied if an interface is to be referred to as RESTful.
Principles of REST:
Client-server
Stateless
Cacheable
Uniform interface
Layered system
Code on demand
REST and HTTP are not the same. A lot of people like to compare HTTP with REST, but REST != HTTP. Because REST also intends to make the web (internet) more streamlined and standard, Fielding advocates using REST principles strictly, and that is where people start comparing REST with the web (HTTP). However, Roy Fielding, in his dissertation, nowhere mentions any implementation directive, including any protocol preference such as HTTP. As long as you honor the six guiding principles of REST, you can call your interface RESTful.
In the simplest words, in the REST architectural style, data and functionality are considered resources and are accessed using Uniform Resource Identifiers (URIs). The resources are acted upon by using a set of simple, well-defined operations. The clients and servers exchange representations of resources by using a standardized interface and protocol, typically HTTP. Resources are decoupled from their representation so that their content can be accessed in a variety of formats, such as HTML, XML, plain text, PDF, JPEG, JSON, and others. Metadata about the resource is available and used, for example, to control caching, detect transmission errors, negotiate the appropriate representation format, and perform authentication or access control. Most importantly, every interaction with a resource is stateless.
For assignment-related help, contact us by email at contact@codersarts.com #codersart
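To make the resource/representation idea concrete, here is a small, hedged Python sketch using the requests library against a hypothetical API (the base URL, paths, and fields are illustrative only): each call identifies a resource by URI, asks for a JSON representation via the Accept header, and carries everything the server needs in the request itself, which is what statelessness means in practice.
import requests

BASE = "https://api.example.com"   # hypothetical REST service

# GET a representation of the resource identified by this URI
resp = requests.get(f"{BASE}/patients/42", headers={"Accept": "application/json"})
resp.raise_for_status()
patient = resp.json()              # the JSON representation, decoupled from the resource itself

# create a new resource with POST; the request is self-contained (stateless)
created = requests.post(f"{BASE}/patients",
                        json={"name": "Jane Doe"},
                        headers={"Accept": "application/json"})
print(created.status_code, created.headers.get("Location"))   # typically 201 plus the URI of the new resource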











