Module 6: Applied Data Analytics Capstone Project
Module 6: Applied Data Analytics Capstone
Project Presentation Instructions
The narrated presentation should include slides on the following:
- Introduction (one slide)
- Data (two slides)
- Methods (one slide)
- Analysis (two to three slides)
- Conclusion (one slide)
INTRODUCTION
First summarize the purpose of the presentation and the data being analyzed. Then summarize the questions posed in the analysis of the data and the conclusions formed from the analysis. Finally, briefly outline what is contained in the rest of the presentation.
DATA
Describe the most important data used for the analysis in this section. This should include the types of data, the record count, any pre-processing needed, whether there were missing values and how they were handled, and any formatting, normalization, or conversion of categorical values to quantitative variables.
METHODS
Describe the methods you used to gather the data and the setup used for analysis.
ANALYSIS AND RESULTS
Include in this section what was analyzed and the conclusions you made from the analysis. Insert any charts you created from the data in this section. The analysis should be based on techniques that have been covered in class, including descriptive statistics, correlations, ANOVA, linear regression, and multiple linear regression.
CONCLUSION
Restate the questions you raised in the Introduction, as well as the most relevant results from the analysis. If your presentation contains more than one set of data or analysis, this is the place to compare the different results as needed. Include any questions or recommendations for additional data as needed.
I have attached the files; please make slides from them according to the instructions. You will not need anything else, as the files are already attached.
Data Analytics on COVID-19 Variants Project Report Bs 2/14/2022
Table of Contents: Introduction, Data, Methods, Analysis, Conclusion, Appendix
Introduction:
In this report, we analyze a dataset of COVID-19 variants. We first pre-process the data, then use exploratory data analysis to view it in graphs, and finally apply a linear regression model to make predictions from it.
First, we import the Python libraries and load the dataset. We then apply data pre-processing, which converts raw data into a clean, usable form; the quality of the data must be checked before any machine learning is applied. We remove irrelevant data and duplicate records, and check that each column has the correct data type. Next, we standardize the data, rescaling the values so that they have a mean of 0 and a standard deviation of 1. We then check for outliers and remove any that are present. After removing the outliers, we handle the missing data and normalize the data. Finally, we encode the categorical data as numeric data using label encoding or one-hot encoding. In this way, the raw data is cleaned into an understandable format.
The next step is exploratory data analysis, in which we view the dataset through charts and graphs. We plot histograms, bar charts, a heat map, line graphs, and box plots, and build frequency tables. These help show the different types of data in the dataset.
The final step is to apply a machine learning model to the dataset. We fit a linear regression model and compare the actual and predicted values. At the end, we compute the model accuracy and draw a scatter plot.
Data:
The data describes COVID-19 variants. It consists of 100416 rows and 6 columns. It gives the location where each variant was found, the date on which it was found, and the number of sequences, the percentage of sequences, and the total number of sequences. There are 24 types of COVID variants in our dataset, as follows:
1. Alpha
2. B.1.1.277
3. B.1.1.302
4. B.1.1.519
5. B.1.160
6. B.1.177
7. B.1.221
8. B.1.258
9. B.1.367
10. B.1.620
11. Beta
12. Delta
13. Epsilon
14. Eta
15. Gamma
16. Iota
17. Kappa
18. Lambda
19. Mu
20. Omicron
21. S:677H.Robin1
22. S:677P.Pelican
23. Non who
24. Others
Here, we describe each variable in the dataset and give its data type.
Variable | Description | Data Type
Location | Location where the variant was found | String
Date | Date when it was found | Date time
Variants | COVID-19 variants | String
Number of sequences | Number of sequences in COVID-19 | Integer
Percentage of sequences | Percentage of sequences in COVID-19 | Float
Total number of sequences | Total number of sequences in COVID-19 | Integer
Properties of Data types:
Here, we discuss some properties of the variables.
Variable | Discrete/Continuous | Categorical/Ordinal/Quantitative
Location | Discrete | Categorical
Date | Discrete | Quantitative
Variants | Discrete | Categorical
Number of sequences | Discrete | Quantitative
Percentage of sequences | Continuous | Quantitative
Total number of sequences | Discrete | Quantitative
Methods:
In this project, the method we use to train the model and make predictions is linear regression. Linear regression models the relationship between data points by fitting a straight line through them, and this line is then used to predict future values. The equation of linear regression is as follows:
Y = aX + b
Here, X is the independent variable and Y is the dependent variable. The slope of the line is a and the intercept is b. We build the model on the dataset using linear regression and use it to predict future values.
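As a rough illustration of how the slope a and intercept b of the line Y = aX + b can be estimated, the sketch below applies the standard least-squares formulas to a small toy dataset. The numbers are made up for illustration and are not taken from the COVID variants dataset.

```python
import numpy as np

# Toy data (illustrative only): the points lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares estimates for the line Y = aX + b:
#   a = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - a * mean_x
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
```

Because the toy points lie on a perfect line, the estimates recover its slope and intercept; on real data the fitted line only approximates the points.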
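With a single input variable, the model is known as simple linear regression. With more than one input variable, it is known as multiple linear regression. The equation of multiple linear regression is as follows:
Y = a_1 X_1 + a_2 X_2 + ... + a_n X_n + b
Here, X_1, ..., X_n are the input variables, a_1, ..., a_n are their coefficients, and b is the intercept.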
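The multiple-input case can be sketched in the same way. The example below is a minimal illustration, assuming two made-up input variables and using NumPy's least-squares solver; the data is synthetic and the variable names are hypothetical.

```python
import numpy as np

# Toy data (illustrative only): two inputs, with y = 2*X1 + 3*X2 + 1 exactly.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 1.0

# Append a column of ones so that the intercept b is estimated
# together with the coefficients a_1 and a_2.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a1, a2, b = coef

print(f"a_1 = {a1:.2f}, a_2 = {a2:.2f}, b = {b:.2f}")
```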
We applied linear regression to our dataset to predict the total number of sequences. For this purpose, we split the data into training and test sets in a 7:3 ratio. We then build the model, fit it to the training data, and evaluate it on the test data. At the end, we generate the predictions and print them.
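The report does not show its code, so the following is only a sketch of a 7:3 split followed by a fit and prediction, written with NumPy on synthetic data. The data, the random seed, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y is roughly 0.5 * x + 10 plus noise.
x = np.arange(100, dtype=float)
y = 0.5 * x + 10 + rng.normal(0.0, 1.0, size=x.size)

# Shuffle the row indices, then take 70% for training and 30% for testing.
idx = rng.permutation(x.size)
n_train = x.size * 7 // 10
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Fit on the training rows only; polyfit returns [slope, intercept] for deg=1.
a, b = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Predict on the held-out test rows and compare with the actual values.
y_pred = a * x[test_idx] + b
for actual, predicted in list(zip(y[test_idx], y_pred))[:5]:
    print(f"actual={actual:.2f}  predicted={predicted:.2f}")
```

Fitting on the training rows only keeps the test rows unseen, so the comparison of actual and predicted values reflects how the model handles new data.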
Many other machine learning algorithms could also be applied, such as KNN, multiple linear regression, logistic regression, and decision trees. Different algorithms give different results on the same dataset, so the accuracy varies from model to model.
Analysis:
We now analyze the dataset using exploratory data analysis (EDA) and build a model on it. First, we need to pre-process the dataset; data pre-processing cleans the raw data. The steps of data pre-processing are as follows:
· Import Libraries and Read Data
· Remove irrelevant Data
· Remove Duplicate Records
· Check Data types
· Standardize the data
· Investigate the Outliers
· Handle the Missing Data
· Normalize the data
· Encoding Categorical Data
Now, we perform the exploratory data analysis in this dataset. The histogram of the COVID Variants Dataset is shown below:
The bar chart of the COVID Variants Dataset is shown below:
The heat map of the COVID Variants Dataset is shown below:
The line graph of the COVID Variants Dataset is shown below:
The box plot of the COVID Variants Dataset is shown below:
After exploratory data analysis, we build the model using linear regression. First, we split the data into training and test sets of 70% and 30% respectively. We then reshape the training and test data, build the linear regression model, and fit it to the training data. The slope and intercept of our regression model are 0.11 and 71.64 respectively.
Here, we use the model to make predictions. The predicted results are shown below:
Conclusion:
To draw a conclusion, we compare the actual and predicted values of the dataset. The actual and predicted values are shown below:
We can see that there is a large difference between the actual and predicted values. This is because the values in the dataset are ambiguous and contain many outliers, so the outliers need to be removed before the model is trained and tested again. The accuracy of the model is shown below:
The model accuracy is poor. Because the data is ambiguous, the error rate is high and we cannot achieve good accuracy. We now create the scatter plot. The scatter plot with trend line is shown below:
Hence, we conclude that most of the data contains null values. We need to remove the null values and the ambiguity in the data so that linear regression can achieve higher accuracy.
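The report does not say which measure its model accuracy refers to. One common choice for regression is the coefficient of determination (R squared), which is also what scikit-learn's LinearRegression.score returns. The sketch below assumes that measure, adds the mean absolute error for comparison, and uses made-up actual and predicted values.

```python
import numpy as np

# Made-up actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# R squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Mean absolute error: the average size of the prediction errors.
mae = np.mean(np.abs(y_true - y_pred))

print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```

An R squared close to 1 means the predictions track the actual values closely; a value near 0 means the model does little better than always predicting the mean.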
Appendix:
APPENDIX A: Linear Regression
Linear regression models the relationship between X (the independent variable) and Y (the dependent variable) by fitting a straight line between them. This line helps us estimate future values. The equation of linear regression is as follows:
Y = aX + b
The slope of the line is a and the intercept is b. We build the model on the dataset using linear regression and use it to predict future values. With a single input variable, the model is known as simple linear regression. With more than one input variable, it is known as multiple linear regression. The equation of multiple linear regression is as follows:
Y = a_1 X_1 + a_2 X_2 + ... + a_n X_n + b
Exploratory Data Analysis (EDA) and Linear Regression Bs 3/14/2022
Assignment 3
Part A:
First, we import the libraries, read the CSV file of COVID variants, and print the first five rows of the dataset, as follows:
Descriptive statistics:
Next, we analyze the descriptive statistics of the data frame. Descriptive statistics give a statistical description of the dataset. We can calculate sum(), mean(), count(), and so on for each column of the dataset. Here, we calculate the sum() of the dataset.
Here, we calculate the mean() of the dataset.
Here, we find the statistical description of the categorical data in the dataset.
Here, we find the statistical description of all the attributes of the dataset.
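The descriptive statistics above can be sketched with pandas as follows. The column names num_sequences and perc_sequences come from the report; the variant column name and all the values are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in frame (values are illustrative, not from the real dataset).
df = pd.DataFrame({
    "variant": ["Alpha", "Delta", "Delta", "Omicron"],
    "num_sequences": [10, 20, 30, 40],
    "perc_sequences": [10.0, 20.0, 30.0, 40.0],
})

numeric = df[["num_sequences", "perc_sequences"]]
totals = numeric.sum()    # column sums
means = numeric.mean()    # column means

# For a categorical column, describe() reports count, unique, top and freq.
cat_summary = df["variant"].describe()

# include="all" summarizes numeric and categorical columns together.
full_summary = df.describe(include="all")

print(totals, means, cat_summary, full_summary, sep="\n\n")
```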
Histograms:
The histogram of the COVID Variants Dataset plots the numeric data: num_sequences, perc_sequences and num_sequences_total. The histogram of the COVID Variants Dataset is shown below:
Bar charts:
The bar chart of the COVID Variants Dataset is shown below:
Heat maps:
The heat map of the COVID Variants Dataset is shown below:
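The report does not state what its heat map displays. A common choice is the correlation matrix of the numeric columns, and the sketch below assumes that. It computes the matrix with pandas on made-up values; a plotting library can then render it as a heat map.

```python
import pandas as pd

# Illustrative values only; the column names follow the report.
df = pd.DataFrame({
    "num_sequences": [1, 2, 3, 4],
    "perc_sequences": [4.0, 3.0, 2.0, 1.0],
    "num_sequences_total": [10, 20, 30, 40],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr)
```

Values near 1 or -1 indicate a strong linear relationship between two columns, which is useful to know before choosing inputs for a regression.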
Line graphs:
The line graph of the COVID Variants Dataset is shown below:
Box plots:
The box plot of the COVID Variants Dataset is shown below:
Frequency tables:
This is the frequency table of the variants in the COVID Variants Dataset. It gives the frequency of each variant.
This frequency table covers the locations where variants were found. It gives the frequency of each location.
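Frequency tables like these can be built with the pandas value_counts method. The sketch below uses a small made-up frame; the column names location and variant are assumptions.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "location": ["India", "India", "Brazil", "India", "Brazil", "Kenya"],
    "variant": ["Delta", "Alpha", "Delta", "Delta", "Gamma", "Alpha"],
})

# value_counts() returns how many rows carry each distinct value,
# sorted from most to least frequent.
variant_freq = df["variant"].value_counts()
location_freq = df["location"].value_counts()

print(variant_freq)
print(location_freq)
```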
Data Preprocessing Assignment Kunal Shah 3/7/2022
Data Preprocessing:
Data preprocessing is the technique used to convert raw data into an understandable format. We need to check the quality of the data before applying machine learning algorithms. Here, we discuss the steps of data preprocessing.
Import Libraries and Read Data:
First, we import the libraries, read the .csv file of COVID variants, and print the first five rows of the dataset, as follows:
Remove irrelevant Data:
Now, we need to remove the irrelevant data. This is also known as feature selection. We keep the features that carry information about the data and drop the others. The selected features are shown below:
Remove Duplicate Records:
Next, we remove duplicate records from the dataset. You can see that there are no duplicate records in the dataset. The length of the dataset is 100416.
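A minimal sketch of the duplicate check, assuming pandas and a made-up frame that contains one repeated row (the real dataset is reported to contain none):

```python
import pandas as pd

# Illustrative rows only; the second row repeats the first.
df = pd.DataFrame({
    "location": ["India", "India", "Brazil"],
    "variant": ["Delta", "Delta", "Gamma"],
    "num_sequences": [5, 5, 7],
})

# duplicated() marks every row that repeats an earlier row.
n_dupes = int(df.duplicated().sum())

# drop_duplicates() keeps the first copy of each repeated row.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

If the length is the same before and after drop_duplicates, the data contains no duplicate records.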
Check Data types:
After this, we check whether the data type of each variable is correct. You can see that each variable has the correct data type.
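The data type check can be sketched as below, assuming pandas. A date column read from a CSV file often arrives as text, so the example converts it explicitly; the column names and values are illustrative.

```python
import pandas as pd

# Illustrative rows only; the date starts out as plain text.
df = pd.DataFrame({
    "location": ["India", "Brazil"],
    "date": ["2021-06-01", "2021-06-15"],
    "num_sequences": [5, 7],
    "perc_sequences": [12.5, 30.0],
})

# Convert the text dates to a proper datetime type.
df["date"] = pd.to_datetime(df["date"])

# dtypes lists the type pandas holds for each column.
print(df.dtypes)
```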
Standardize the data:
Now, we standardize the data. This rescales all the values so that they have a mean of 0 and a standard deviation of 1. The standardized data is shown below:
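Standardization subtracts the mean from each value and divides by the standard deviation. A minimal sketch with NumPy on made-up values, assuming the population standard deviation is used:

```python
import numpy as np

# Illustrative values only.
values = np.array([2.0, 4.0, 6.0, 8.0])

# z = (value - mean) / standard deviation.
# NumPy's std() defaults to the population form (ddof=0).
standardized = (values - values.mean()) / values.std()

print(standardized)
print(standardized.mean(), standardized.std())
```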
Investigate the Outliers:
Now, we check for outliers in the dataset. An outlier is a data point that lies far from the other values in the dataset. To find the outliers, we draw a box plot.
Next, we apply the z-score technique to find the indices of the outliers. These are the indices of the outliers present in our dataset, as identified by z-score.
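The z-score technique can be sketched as follows. The report does not give the cutoff it used, so the threshold of 3 below is an assumption (a common convention), and the data is made up.

```python
import numpy as np

# Illustrative values: nineteen ordinary points and one extreme point.
values = np.array([10.0] * 19 + [1000.0])

# z-score of each point: its distance from the mean in standard deviations.
z = (values - values.mean()) / values.std()

# Flag the points whose absolute z-score exceeds the assumed threshold.
threshold = 3.0
outlier_idx = np.where(np.abs(z) > threshold)[0]

print(outlier_idx)
```

Only the extreme point is flagged here; the ordinary points sit well within the threshold.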
Missing Data:
Now, we handle the missing data in the dataset. You can see that there is no missing data in the dataset.
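A minimal sketch of the missing-data check, assuming pandas. The frame below deliberately contains gaps to show what the check reports; dropping the incomplete rows is one possible way to handle them, not necessarily what the report did.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, with one gap in each column.
df = pd.DataFrame({
    "num_sequences": [5.0, np.nan, 7.0],
    "variant": ["Delta", "Alpha", None],
})

# Count the missing entries in each column.
missing_per_column = df.isna().sum()

# One option: drop every row that has at least one missing entry.
complete_rows = df.dropna()

print(missing_per_column)
print(len(complete_rows))
```

A count of zero in every column corresponds to the case described above, where no missing data is present.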
Normalize the data:
Then, we normalize the data. Normalization is a technique that converts the numeric columns to a common scale. It is needed in machine learning because some columns have values many times larger than others. Here, you can see the normalized data.
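The report does not name the normalization method. The sketch below assumes min-max scaling, which maps each column onto the range 0 to 1, and uses made-up values.

```python
import numpy as np

# Illustrative values only.
values = np.array([10.0, 20.0, 30.0, 50.0])

# Min-max scaling: (value - min) / (max - min).
normalized = (values - values.min()) / (values.max() - values.min())

print(normalized)
```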
Encoding Categorical Data:
At the end, we use Label Encoder to encode the data, converting the categorical data into numeric data. We then split the data into x and y. The x value contains the data frame of features, as follows:
The y value contains the target value that is as follows:
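Label encoding replaces each category with an integer code. The sketch below uses pandas category codes, which, like scikit-learn's LabelEncoder, number the categories in sorted order. The choice of feature columns for x is an assumption; the target follows the report's stated goal of predicting the total number of sequences.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "variant": ["Delta", "Alpha", "Omicron", "Alpha"],
    "num_sequences": [5, 3, 9, 1],
    "num_sequences_total": [50, 30, 90, 10],
})

# Encode each variant name as an integer (Alpha=0, Delta=1, Omicron=2).
df["variant_code"] = df["variant"].astype("category").cat.codes

# x holds the input features, y holds the target value.
x = df[["variant_code", "num_sequences"]]
y = df["num_sequences_total"]

print(x)
print(y)
```

Integer codes impose an order on the categories, which is why one-hot encoding is sometimes preferred for inputs to a linear model.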