Please make sure that it is your own work and not copied and pasted. Please watch out for spelling and grammar errors. Please read the study guide. Please use APA 7th edition.
Book Reference: Fox, J. (2017). Using the R Commander: A point-and-click interface for R. CRC Press. https://online.vitalsource.com/#/books/9781498741934
Discuss how you would use the various types of summarizing and graphing to present your data. Make sure you discuss the type of data you would have and the type of display you would select.
Summarizing and Graphing Data
This chapter explains how to use the R Commander to compute simple numerical summaries of data, to construct and analyze contingency tables, and to draw common statistical graphs. Most of the statistical content of the chapter is covered in a typical basic statistics course, although a few topics, such as quantile-comparison plots (in Section 5.3.1) and smoothing scatterplots (in Section 5.4.1), are somewhat more advanced.
Although most of the graphs produced by the R Commander use color, most of the figures in this chapter are rendered in monochrome.
5.1 Simple Numerical Summaries
The R Commander Statistics > Summaries menu (see Figure A.4 on page 202) contains several items for summarizing data. I’ll use the Canadian occupational prestige data (introduced in Section 4.2.3) to illustrate. This data set is most conveniently available in the Prestige data frame in the car package, which is one of the packages loaded when the R Commander starts up. I read the data via Data > Data in packages > Read data set from an attached package (as described in Section 4.2.4). Because the default alphabetic order of the levels of the type factor in the data set—“bc” (blue-collar), “prof” (professional and managerial), “wc” (white-collar)—is not the natural order, I reorder the levels of the factor with Data > Manage variables in active data set > Reorder factor levels (see Section 3.4).
Selecting Statistics > Summaries > Active data set produces the brief summary in Figure 5.1. There’s a “five-number summary” for each numeric variable—reporting the minimum, first quartile, median, third quartile, and maximum of the variable—plus the mean, and the frequency distribution of the factor type, including a count of NAs.
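As a rough illustration outside the R Commander, the five-number summary plus the mean described above can be sketched in Python with only the standard library. The values are made-up toy numbers, not the Prestige data, and the function name five_number_summary is hypothetical. The "inclusive" quantile method is used on the assumption that it corresponds to R's default quantile definition; other definitions give slightly different quartiles.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Return min, Q1, median, Q3, max, and mean of a numeric sample."""
    data = sorted(values)
    # n=4 splits the data into quarters, giving three cut points.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": data[-1],
        "mean": mean(data),
    }


# Toy values for illustration only (NOT the Prestige data).
toy = [2, 4, 4, 5, 7, 9, 11, 12, 15]
summary = five_number_summary(toy)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

For a factor such as type, the analogous summary is a simple count per level, with missing values counted separately.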
Statistics > Summaries > Numerical summaries brings up the dialog box in Figure 5.2. I select the variables education, income, prestige, and women in the Data tab and retain the default choices in the Statistics tab. Clicking OK results in the output in Figure 5.3. Were I to press the Summarize by groups button in the Data tab, I could compute summary statistics separately for each level of type.
Choosing Statistics > Summaries > Table of statistics allows you to calculate a statistic for one or more numeric variables within levels or combinations of levels of one or more factors. To illustrate, I’ll use the Adler data set from the car package. The data are from a social-psychological experiment, reported by Adler (1973), on “experimenter effects” in psychological research—that is, how researchers’ expectations can influence the data that they collect. Adler recruited “research assistants,” who showed photographs of individuals’ faces to respondents; the respondents were asked by the research assistants to rate the apparent “successfulness” of the individuals in the photographs. In fact, Adler chose photographs that were average in their appearance of success, and the true subjects in the study were the research assistants. Adler manipulated two factors, named expectation and instruction in the data set.
FIGURE 5.1: Summary output for the Prestige data set.
FIGURE 5.2: Numerical Summaries dialog box: Data tab (top) and Statistics tab (bottom).
FIGURE 5.3: Numerical summaries for several variables in the Prestige data set.
FIGURE 5.4: The Table of Statistics dialog box.
• expectation: Some assistants were told to expect high ratings, while others were told to expect low ratings.
• instruction: In addition, the assistants were given different instructions about how to collect data. Some were instructed to try to collect “good” data, others were instructed to try to collect “scientific” data, and still others were given no special instruction of this type.
Adler randomly assigned 18 research assistants to each of six experimental conditions, formed by combining the two levels of the factor expectation (“HIGH” or “LOW”) with the three levels of the factor instruction (“GOOD”, “SCIENTIFIC”, or “NONE”). I deleted 11 of the 108 subjects at random to produce the “unbalanced” Adler data set. After reading the data into the R Commander in the usual manner, I reorder the levels of the factor instruction from the default alphabetic ordering.
The Table of Statistics dialog box appears in Figure 5.4. I select both expectation and instruction in the Factors list box; because there’s just one numeric variable in the data set—rating—it’s preselected in the Response variables list box. The dialog includes radio buttons for calculating the mean, median, standard deviation, and interquartile range, along with an Other button, which allows you to enter any R function that computes a single number for a numeric variable. I retain the default Mean, and press the Apply button. Then, when the dialog reappears, I select Standard deviation and press OK. The output is shown in Figure 5.5. I’ll defer interpreting the Adler data to Section 5.4 on graphing means and Section 6.1 on hypothesis tests for means.
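The idea behind a table of statistics, one number per combination of factor levels, can be sketched in a few lines of Python. This is only an illustration: the records are invented toy values rather than the Adler data, only two of the three instruction levels are included to keep it short, and the helper name table_of_statistics is hypothetical. Any function that reduces a list of numbers to a single number can be passed in, which mirrors the role of the Other button in the dialog.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy records: (expectation, instruction, rating).
# These are NOT the Adler data.
records = [
    ("HIGH", "GOOD", 4), ("HIGH", "GOOD", 6),
    ("HIGH", "NONE", 1), ("HIGH", "NONE", 3),
    ("LOW", "GOOD", -2), ("LOW", "GOOD", 0),
    ("LOW", "NONE", 2), ("LOW", "NONE", 5),
]


def table_of_statistics(rows, statistic):
    """Apply `statistic` to the ratings in each (expectation, instruction) cell."""
    cells = defaultdict(list)
    for expectation, instruction, rating in rows:
        cells[(expectation, instruction)].append(rating)
    return {cell: statistic(ratings) for cell, ratings in cells.items()}


means = table_of_statistics(records, mean)
sds = table_of_statistics(records, stdev)
for cell in means:
    print(cell, "mean =", means[cell], "sd =", round(sds[cell], 3))
```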
FIGURE 5.5: Tables of means and standard deviations for rating in the Adler data set, classified by expectation and instruction.
Several of the Statistics > Summaries menu items and associated dialogs are very straightforward, and so, in the interest of brevity, I won’t demonstrate their use here:
• The Frequency Distributions dialog produces frequency and percentage distributions for factors, along with an optional chi-square goodness-of-fit test with user-supplied hypothesized probabilities for the levels of a factor.
• The Count missing observations menu item simply reports the number of NAs for each variable in the active data set.
• The Correlation Matrix dialog calculates Pearson product-moment correlations, Spearman rank-order correlations, or partial correlations for two or more numeric variables, along with optional pairwise p-values, computed with and without correction for simultaneous inference.
5.2 Contingency Tables
The Statistics > Contingency tables menu (see Figure A.4 on page 202) has items for constructing two-way and multi-way tables from the active data set. I demonstrated the Two-Way Table dialog in Section 3.5, and there is no need to repeat that demonstration here. Moreover, the Multi-Way Table dialog is similar, except that, in addition to selecting row and column factors for the contingency table, you can pick one or more “control” factors: A separate two-way partial table, optionally percentaged by rows or columns, is reported for each combination of levels of the control factors.
In contrast, the Enter Two-Way Table dialog (in Figure 5.6), selected via Statistics > Contingency tables > Enter and analyze two-way table, is unusual for the R Commander, in that it doesn’t use the active data set. The dialog allows you to enter frequencies (counts) from an existing two-way contingency table, typically from a printed source such as a textbook. The sliders at the top of the Table tab control the number of rows and columns in the table. Initially, the table has 2 rows and 2 columns, and the cells of the table are empty.
Setting the sliders to 3 rows and 2 columns, I enter a frequency table taken from The American Voter, a classic study of electoral behavior by Campbell et al. (1960). The data originate in a panel study of the 1956 U.S. presidential election. During the campaign, survey respondents were asked how strongly (weak, medium, or strong) they preferred one candidate to the other, and after the election they were asked whether or not they had voted.
FIGURE 5.6: The Enter Two-Way Table dialog: Table tab (top) and Statistics tab (bottom).
FIGURE 5.7: Output produced by the Enter Two-Way Table dialog, having entered a contingency table from The American Voter.
The Statistics tab appears at the bottom of Figure 5.6. I check the box for Row percentages because the row variable in the table, intensity of preference, is the explanatory variable; the Chi-square test of independence checkbox is selected by default. I also check Print expected frequencies, which is not selected by default.
The output from the dialog is shown in Figure 5.7. Reported voter turnout increases with intensity of partisan preference, and the relationship between the two variables is highly statistically significant, with a very small p-value for the chi-square test of independence. All of the expected counts are much larger than necessary for the chi-square distribution to be a good approximation to the distribution of the test statistic; had that not been the case, a warning would have appeared, whether or not expected frequencies are printed.
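To make the calculations behind this output concrete, here is a minimal Python sketch of row percentages, expected frequencies, and the chi-square test of independence for a 3-by-2 table. The counts are hypothetical and are not the frequencies from The American Voter, and the function names are invented for illustration. The p-value shortcut used here holds only for 2 degrees of freedom, which is what a 3-by-2 table has.

```python
from math import exp

# Hypothetical counts, NOT the table from The American Voter.
# Rows: weak, medium, strong preference; columns: voted, did not vote.
observed = [
    [60, 40],
    [70, 30],
    [80, 20],
]


def row_percentages(table):
    """Percentage of each row total falling in each column."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


def chi_square_independence(table):
    """Return the chi-square statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected


stat, df, expected = chi_square_independence(observed)
# For df == 2 only, the upper-tail chi-square probability is exp(-x / 2).
p_value = exp(-stat / 2) if df == 2 else None
percents = row_percentages(observed)
print("row percentages:", percents)
print("expected counts:", expected)
print("chi-square =", round(stat, 3), "df =", df, "p =", p_value)
```

Row percentages are the natural choice here because the row variable is the explanatory variable: each row sums to 100, so the columns can be compared across rows.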
5.3 Graphing Distributions of Variables
I’ll use the Canadian occupational prestige data, read from the car package earlier in this chapter, to illustrate graphing distributions. There are, at this point in the chapter, two data sets in memory (the Prestige data set and the Adler data set), and the latter is the active data set. To change the active data set, I click on the Data set button in the R Commander toolbar and select Prestige in the resulting dialog.
The R Commander Graphs menu (see Figure A.6 on page 203) is divided into several groups of items, the second of which leads to dialogs for constructing graphs of the distribution of a numerical variable: Index plot, Dot plot, Histogram, nonparametric Density estimate, Stem-and-leaf display, Boxplot, and theoretical Quantile-comparison plot. Many of these graphs (specifically, dot plots, histograms, density estimates, and boxplots) can also show the distribution of a numeric variable within levels of (i.e., conditional on) a factor, and stem-and-leaf displays can be drawn back-to-back for the two levels of a dichotomous factor (see the example in Section 6.1.1).
Selecting Graphs > Histogram produces the dialog box in Figure 5.8. The Data tab, at the top of the figure, allows you to choose a numeric variable; I select income. Clicking the Plot by groups button brings up the Groups sub-dialog shown at the center of the figure; because there is only one factor in the data set, type, it is preselected. Clicking OK in the Groups sub-dialog returns to the main dialog, and now the Plot by button reads Plot by: type. The Options tab is at the bottom of Figure 5.8. Leaving all of the options at their defaults and clicking OK produces the vertically aligned histograms in Figure 5.9.
If you don’t like the default number of bins, which results from leaving the Number of bins text box at <auto>, you can type a target number of bins. As a general matter, as you increase the number of bins, the width of each bin decreases. You can conveniently experiment with the number of bins by pressing the Apply button rather than the OK button in the dialog.
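The trade-off between the number of bins and the bin width can be seen in a small Python sketch. It is a simplification: it cuts the range of the data into exactly n_bins equal-width bins, whereas the number typed into the dialog is only a target. The data values and the function name histogram_counts are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Split the data range into n_bins equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return width, counts


# Toy values for illustration only.
toy = [1, 2, 2, 3, 5, 8, 9, 13]
width_3, counts_3 = histogram_counts(toy, 3)
width_6, counts_6 = histogram_counts(toy, 6)
print("3 bins: width", width_3, "counts", counts_3)
print("6 bins: width", width_6, "counts", counts_6)
```

Doubling the number of bins halves the width, so each bar summarizes fewer cases and the histogram shows more detail but also more noise.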
The dialogs for the other distributional displays differ only in their Options tabs and whether or not (as noted above) they support plotting by groups. Figure 5.10 shows the default distributional displays for education in the Canadian occupational prestige data set. There is also a “rug plot” at the bottom of the density estimate (center-right panel), showing the location of the data values. By default the quantile-comparison plot (lower-right) compares the distribution of the data to the normal distribution, but you can also plot against other theoretical distributions.
FIGURE 5.8: Histogram dialog, showing the Data tab (top), Groups sub-dialog (center), and Options tab (bottom).
FIGURE 5.9: Histograms of average income by type of occupation, for the Canadian occupational prestige data.
In the index plot (at the upper-left) and quantile-comparison plot (at the lower-right), the two most extreme values are automatically identified by default, but because these values are close to each other in the graphs, the labels for the points are over-plotted. The case labels are also displayed, however, in the R Commander Output pane (not shown), and they are university.teachers and physicians.
The default stem-and-leaf display for education appears in Figure 5.11; it is text output and so is printed in the Output pane.
FIGURE 5.10: Various default distributional displays for average education in the Canadian occupational prestige data. From top to bottom and left to right: index plot, dot plot, histogram, nonparametric density estimate with rug plot, boxplot, and quantile-comparison plot comparing the distribution of education to the normal distribution.
FIGURE 5.11: Default “Tukey-style” stem-and-leaf display for education in the Canadian occupational prestige data. The column of numbers to the left of the stems represents “depths”—counts in to the median from both ends of the distribution—with the parenthesized value (4) giving the count for the stem containing the median. Note the divided stems, with x. stems containing leaves 0–4 and x * stems leaves 5–9. Five-part stems are similarly labelled x. with leaves 01, x t with leaves 23, x f with leaves 45, x s with leaves 67, and x * with leaves 89.
FIGURE 5.12: Bar Graph dialog, showing the Data tab (top) and Options tab (bottom). I previously pressed the Plot by button and selected the factor vote.
5.3.2 Graphing Categorical Data
I’ll demonstrate graphing the distribution of a categorical variable by using the Chile data set from the car package. This data set is from a poll conducted about six months before the 1988 Chilean plebiscite on the continuation of military rule: voting “yes” in the plebiscite represented support for Pinochet’s military government, while “no” represented support for a return to electoral democracy. Two of the variables in the Chile data set are the factors vote, with levels “N” (no), “Y” (yes), “U” (undecided), and “A” (abstain), and education, with levels “P” (primary), “S” (secondary), and “PS” (post-secondary). In both cases, the default alphabetic ordering of the factor levels isn’t the natural ordering, and so, after reading the data, I change the orderings via Data > Manage variables in active data set > Reorder factor levels (see Section 3.4).
The Graphs menu includes two simple distributional plots for factors: frequency bar graphs and pie charts. Because it allows for dividing bars by the value of a second factor, the Bar Graph dialog, shown in Figure 5.12, is the more complex of the two. In the Data tab, at the top of the figure, I select the factor education to define the bars. I previously pressed the Plot by button and chose vote in the resulting Groups sub-dialog, and so the button displays Plot by: vote. I retain all of the default choices in the Options tab at the bottom of Figure 5.12. Clicking OK produces the graph in Figure 5.13. It’s apparent that relative support for the military government declined with education, but that overall the plebiscite appeared close (visually summing and comparing the “N” and “Y” areas across the bars).
FIGURE 5.13: Bar graph for education in the Chilean plebiscite data, with bars divided by vote. A color version of this figure appears in the insert at the center of the book.
Overall voting intentions are displayed in the pie chart in Figure 5.14. The Pie Chart dialog, not shown, simply allows you to pick a factor and, optionally, provide axis labels and a graph title.
FIGURE 5.14: Pie chart for vote in the Chilean plebiscite data. A color version of this figure appears in the insert at the center of the book.
The third section of the Graphs menu is for graphing relationships between and among variables. It includes scatterplots, scatterplot matrices, and 3D scatterplots for numeric variables; line plots, which are typically for time series data; plots of means of a numeric variable classified by one or more factors; strip charts, which are similar to conditional dot plots (discussed in Section 5.3.1); and conditioning plots, which are capable of representing the relationships between one or more numeric response variables and explanatory variables that are both numeric and factors. I’ll focus here on scatterplots for two numeric variables, scatterplot matrices for several numeric variables, 3D scatterplots for three numeric variables, and plots of means of a numeric variable classified by one or two factors.
In addition, and as mentioned previously, some of the distributional graphs discussed in Section 5.3.1 can be used to examine the relationship between a numeric response variable and a factor. These include dot plots, histograms, stem-and-leaf displays (with a dichotomous factor), and boxplots.
To illustrate the construction of scatterplots, scatterplot matrices, and 3D scatterplots, I return to the Canadian occupational prestige data in the previously read Prestige data set. Choosing Graphs > Scatterplot from the R Commander menus brings up the dialog box in Figure 5.15. As you can see, there are many options in the dialog, some of which I’ll describe presently. In the Data tab, I select income (which is the explanatory variable) as the x-variable and prestige (the response variable) as the y-variable. I retain all of the defaults in the Options tab, clicking Apply to draw the simple scatterplot in Figure 5.16. Occupational prestige apparently increases with income, but the relationship is nonlinear, with the rate of increase declining with income.
To draw the scatterplot in Figure 5.17, I click on the Plot by groups button in the Data tab; because it’s the only factor in the data set, type is preselected in the resulting Groups variable sub-dialog (not shown). The sub-dialog also has a checkbox for plotting lines by group, which is selected by default. In the Options tab, I check the boxes for Least-squares line, Smooth line, and Plot concentration ellipses. I also change the Legend Position from the default Above plot to Bottom right.
The smooth line is produced by a method of nonparametric regression called loess, an acronym for local regression, which traces how the average value of y changes with x without assuming that the relationship between y and x takes a specific form. The span of the loess smoother is the percentage of the data used to compute each smoothed value: The larger the span, the smoother the resulting loess regression. The default span is 50%, a value that I increase to 100% because of the small number of cases in each level of occupational type. As a general matter, you want to select the smallest span that produces a reasonably smooth regression, a value that you can determine by trial and error, pressing the Apply button in the dialog each time you adjust the Span slider.
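The role of the span can be illustrated with a stripped-down local-regression smoother in Python. This is a sketch under stated assumptions, not the loess implementation that the R Commander uses: it fits a weighted straight line around each x-value with tricube weights and omits refinements such as robustness iterations. The function names and the zigzag toy data are hypothetical.

```python
from math import ceil


def loess_sketch(xs, ys, span=0.5):
    """Simplified local linear smoother: one weighted line fit per x-value."""
    n = len(xs)
    q = max(2, ceil(span * n))  # number of nearest points in each window
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[q - 1]  # window half-width: distance to the q-th nearest point
        if h == 0:
            # All q nearest points share x0: fall back to their mean response.
            same = [y for x, y in zip(xs, ys) if x == x0]
            fitted.append(sum(same) / len(same))
            continue
        # Tricube weights: near 1 close to x0, falling to 0 at distance h.
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        xbar = sum(w * x for w, x in zip(ws, xs)) / sw
        ybar = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted


def roughness(values):
    """Sum of squared second differences: smaller means smoother."""
    return sum((values[i - 1] - 2 * values[i] + values[i + 1]) ** 2
               for i in range(1, len(values) - 1))


# Toy data: a zigzag, so that the effect of the span is easy to see.
xs = list(range(10))
zigzag = [i % 2 for i in xs]
narrow = loess_sketch(xs, zigzag, span=0.3)
wide = loess_sketch(xs, zigzag, span=1.0)
print("roughness, span 0.3:", round(roughness(narrow), 4))
print("roughness, span 1.0:", round(roughness(wide), 4))
```

With the narrow span each fit uses only a handful of neighbouring points, so the fitted values follow the zigzag closely; with the full span every fit draws on nearly all of the data and the curve flattens out.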
Concentration ellipses are summaries of the variational and correlational structure of the points. For bivariately normally distributed data, concentration ellipses enclose specific fractions of the data—50% and 90% by default; the ellipses are computed robustly, however, to reduce the impact of outliers. To avoid an overly cluttered graph, I set the Concentration levels to 0.5, to draw only one ellipse for each occupational type.
The scatterplot in Figure 5.17 suggests that the apparently nonlinear relationship between prestige and income is due to occupational type: Within levels of type, the relationship is reasonably linear, but with the slope changing across levels.
FIGURE 5.15: Scatterplot dialog: Data tab (top) and Options tab (bottom).
FIGURE 5.16: Simple scatterplot of prestige vs. income for the Prestige data.
FIGURE 5.17: Enhanced scatterplot of prestige vs. income by occupational type, showing 50% concentration ellipses, least-squares lines, and loess lines. A color version of this figure appears in the insert at the center of the book.
A scatterplot matrix displays the pairwise relationships among several numeric variables; it is the graphical analog of a correlation matrix. The Scatterplot Matrix dialog, shown in Figure 5.18, is similar in most respects to the Scatterplot dialog. I select several variables in the Data tab and leave all of the choices in the Options tab at their defaults. Each off-diagonal panel in the resulting scatterplot matrix in Figure 5.19 displays the pairwise scatterplot for two variables, while the diagonal panels show the marginal distributions of the variables. The plots in the first row, for example, have education on the vertical axis, while those in the first column have education on the horizontal axis, and similarly for the other variables in the graph. Thus, the scatterplot in the second row, first column