HW2 Part A Overview:
This is Part A of HW2, focusing on conceptual knowledge of data transformation and ggplot2. Part A has four sections. These questions target your understanding of the functions and concepts covered in these lectures. Please reference the ?yourfunction() help pages in RStudio, the scripts your lecturer used, and, when applicable, the R for Data Science (R4DS) book, specifically the chapters on data transformation and data visualization.
Grading Rubric:
The complete assignment is worth 15 points: Part A is worth 5 points and Part B is worth 10 points. Each section heading lists its point value, and each question is worth the points specified in its section. The breakdown for full credit is as follows:
Section 1: Data Transformation Concepts (1 pt)
In this section, you will be asked to explain the purpose of various data transformation functions and their syntax. Each question has two parts, A and B. We want to see whether you understand the use and syntax of the functions. Please keep your explanations concise: at most 2-3 sentences for each part.
1.1 A) Explain the purpose of data transformation in data analysis. B) List three common data transformation tasks that are often performed during data analysis.
1.2 A) Explain the difference between wide and long data formats. B) What is the advantage of using long format data in data visualization with ggplot2?
1.3 A) Explain the concept of tidy data. B) List three principles of tidy data.
1.4 A) What is the purpose of the pipe operator %>% in R? B) Provide a short example of using the pipe operator with the filter() function.
1.5 What are the different use cases for pivot_wider() and pivot_longer()? Please provide an example of when you would use each on a dataset.
1.6 (Bonus) A) Why are pivot_longer() and pivot_wider() not perfectly symmetrical in the example below? B) Can you fix it?
Carefully consider the following example:
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
(Hint: look at the variable types and think about column names. pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?)
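Not a solution, but a way to see the asymmetry for yourself. This sketch assumes the tidyverse is loaded and stocks is defined as above:
# Compare the type of `year` before and after the round trip:
str(stocks)                          # year starts out as a double
roundtrip <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
str(roundtrip)                       # what type is year now?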
Section 2: ggplot2 Functions (2 pts)
In this section, you will be asked to explain the purpose of various ggplot2 functions and their syntax. Each question has two parts, A and B. We want to see whether you understand the use and syntax of the functions. Please keep your explanations concise: at most 2-3 sentences for each part.
2.1 A) Briefly explain the concept of the “Grammar of Graphics” as it relates to ggplot2. B) List three main components of a ggplot2 plot.
2.2 A) Explain the role of aesthetics (aes) in ggplot2. B) Provide an example of using aesthetics to “map” a variable to a visual property of a plot.
2.3 A) Explain the purpose of using layers in ggplot2. B) Provide an example of adding a layer to a basic ggplot2 plot.
2.4 A) Explain the role of scales in ggplot2. B) Provide an example of customizing the scale of an axis in a ggplot2 plot.
2.5 A) Explain the purpose of using themes in ggplot2. B) Provide an example of applying a theme to a ggplot2 plot.
2.6 A) Explain the concept of faceting in ggplot2. B) Provide an example of using faceting to create a grid of plots based on a categorical variable.
2.7 Would the following code run? If not, why not?
Section 3: Geoms (1 pt)
In this section, you will be asked to identify and briefly explain various ggplot2 geoms. We want to see whether you understand the use and purpose of these geoms. Please keep your explanations concise: at most 1-2 sentences for each geom.
Table 1: ggplot2 Geoms
Section 4: Joins (1 pt)
In this section, you will be asked to explain the purpose of various join functions and their syntax. Each question has two parts, A and B. We want to see whether you understand the use and syntax of the functions. Please keep your explanations concise: at most 2-3 sentences for each part.
4.1 A) Please explain what the function inner_join() does and B) explain its syntax. (Hint: use ?inner_join())
4.2 A) Please explain what the function left_join() does and B) explain its syntax. (Hint: use ?left_join())
4.3 A) Please explain what the function right_join() does and B) explain its syntax. (Hint: use ?right_join())
4.4 A) Please explain what the function full_join() does and B) explain its syntax. (Hint: use ?full_join())
4.5 A) Please explain what the function semi_join() does and B) explain its syntax. (Hint: use ?semi_join())
4.6 A) Please explain what the function anti_join() does and B) explain its syntax. (Hint: use ?anti_join())
Extra Credit: 4.7 A) Please explain what the function nest_join() does and B) explain its syntax. (Hint: use ?nest_join())
Part B
In Part B of HW2 we will continue to build on your data transformation and visualization skills. Please start your own .Rmd file, include the section and question number in addition to your coded answer, and turn in your knitted R Markdown PDF or HTML file. Note: the PDF must include your code and output for full points.
Libraries
You'll need to use the following packages in your R script.
library(tidyverse)
library(nycflights13)
library(gapminder)
data(flights)
data(diamonds)
data(gapminder)
HW 2 Part B Overview
Welcome to the second part of HW2. In this part we will continue to build on your data transformation and visualization skills. Please start your own .Rmd file, include the section and question number in addition to your coded answer, and turn in your knitted PDF or HTML on Canvas to the assignment named HW2 Part B. Note: your R Markdown file must include your code and output.
Grading Rubric
Section 1: Joins
1.1 Perform a left join on the origin column in the flights dataset and the faa column in the airports dataset. Show the first 10 rows of the resulting dataset.
1.2 Perform an inner join on the flights and weather datasets using the origin, year, month, day, and hour columns. Show the first 10 rows of the resulting dataset.
1.3 A) Using the planes and flights datasets, perform a semi join on the tailnum column to find all the planes that have taken at least one flight. B) Show the first 10 rows of the resulting dataset.
1.4 Perform an anti join on the planes and flights datasets using the tailnum column to find all the planes that have not taken any flights. Show the first 10 rows of the resulting dataset.
1.5 Using the airports dataset, create a subset containing only airports in the US that have unique FAA codes present in the flights dataset. (Hint: you might want to use a filtering join.)
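As a reminder of the filtering-join shape, here is a toy sketch on made-up tibbles (not the airports answer): semi_join() keeps matching rows of x without adding any columns from y.
library(dplyr)
x <- tibble(key = c("a", "b", "c"))
y <- tibble(key = c("a", "c"))
semi_join(x, y, by = "key")   # keeps rows "a" and "c" of x; no columns from y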
Section 2: Strings
2.1 Using the airports dataset, create a new column called name_initials containing the initials of the airport name. Show the first 10 rows of the resulting dataset. (Hint: you will want to use str_extract() and a regular expression character class, i.e. a "[...]" pattern.)
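If the bracket notation in the hint is unfamiliar: it denotes a regular expression character class. A toy illustration on made-up strings (not the airports answer):
library(stringr)
x <- c("La Guardia", "John F Kennedy Intl")
str_extract(x, "[A-Z]")       # first capital letter of each string: "L" "J"
str_extract_all(x, "[A-Z]")   # all capital letters, returned as a list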
2.2 Using the airports dataset, find all airports that have the word “International” in their name. Show the first 10 rows of the resulting dataset.
2.3 In the planes dataset, create a new variable called manufacturer_model that combines the manufacturer and model columns, separated by a hyphen and a space (e.g., “Boeing- 787”). Show the first 10 rows of the resulting dataset.
Section 3: Formatting Time
3.1 Using the flights dataset, create a new column called dep_date that combines the year, month, and day columns into a single date column, then properly format the column to be a <date> data type. Show the first 10 rows of the resulting dataset. (Hint: you can use either paste() or paste0().)
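A minimal sketch of the hinted approach on loose toy values (you will need to adapt it to the flights columns yourself):
as.Date(paste(2013, 1, 15, sep = "-"))   # "2013-01-15", stored as a Date
as.Date(paste0(2013, "-", 1, "-", 15))   # same result using paste0()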
3.2 Calculate the average departure delay (dep_delay) in the flights dataset for each hour of the day. Plot the results using a line graph with hour labels on the x-axis, average delay on the y-axis, a descriptive title, and a caption containing your name.
3.3 Using the flights dataset, calculate the average departure delay (dep_delay) for each day of the week. Plot the results using a bar graph with days of the week on the x-axis and average delay on the y-axis. Order the x-axis based on the days of the week (i.e., Monday, Tuesday, Wednesday, etc.).
Extra Credit: Using the flights dataset, calculate the average departure delay (dep_delay) for each day of the week and each carrier (carrier). Plot the results using a grouped bar graph with days of the week on the x-axis, average delay on the y-axis, and different colors for each carrier. Order the x-axis based on the days of the week (i.e., Monday, Tuesday, Wednesday, etc.).
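For the weekday ordering in 3.3 and the extra credit, one general technique is to convert the day names to a factor with explicit levels. A toy sketch (adapt to your summarized data):
days <- c("Wednesday", "Monday", "Friday")
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
factor(days, levels = day_levels)   # ggplot2 plots factor levels in this order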
Section 4: Pivots
4.1 Using the diamonds dataset, create a new data frame that shows the total price (price * carat) of each cut and clarity combination. Pivot the data frame so that the cut values become the columns and the clarity values become the rows. Plot the resulting data frame as a heatmap using ggplot2. (Hint: you can use geom_tile() for your visual.)
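If geom_tile() is new to you, here is a skeletal heatmap on a toy data frame (the column names are placeholders, not the diamonds answer):
library(ggplot2)
toy <- expand.grid(row_var = letters[1:3], col_var = LETTERS[1:3])
toy$value <- runif(nrow(toy))
ggplot(toy, aes(x = col_var, y = row_var, fill = value)) +
  geom_tile()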
Extra Credit: Using the gapminder dataset, create a new data frame that shows the average life expectancy (lifeExp) for each year and continent combination. Change the data frame so that the year values become the rows and the continent values become the columns. Plot the resulting data frame as a stacked area chart using ggplot2. The plot should look like the reference below; choose your own colors using scale_fill_manual().
[Reference plot: Average Life Expectancy by Year and Continent]
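A skeletal stacked area chart with manual fill colors, on toy data (your gapminder version will differ):
library(ggplot2)
toy <- data.frame(
  year  = rep(2000:2002, each = 2),
  group = rep(c("A", "B"), times = 3),
  value = c(1, 2, 2, 2, 3, 1)
)
ggplot(toy, aes(x = year, y = value, fill = group)) +
  geom_area() +                     # areas stack by default
  scale_fill_manual(values = c(A = "steelblue", B = "tomato"))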