Article Writing Question
Instructions:
Read Chapter 11 of the textbook. Write a one-page summary of Chapter 11 (Times New Roman, 12 pt font, single-spaced). To submit your assignment, click “Add Content” below and click the paperclip to attach your file.
CHAPTER 11: Predictive Machine Learning

LEARNING OBJECTIVES
After completing this chapter, you will be able to:
Explain the term “machine learning.”
Discuss various predictive data models.
Identify which models are applicable for which types of predictive scenarios.

In Chapter 10 we discussed a scenario involving Nina, the sales manager at GB. Nina is considering increasing the price of off-road bicycles. Before she takes this step, however, she wants to create a forecast of the impact of a price increase on sales and profits. She is aware that the price increase could have unintended consequences, such as reducing the quantity of bicycles GB will sell. Put simply, Nina needs to know how many bicycles customers will buy at the new, higher price. Another key question is how any change in bicycle purchases will affect companion sales that usually occur when a person buys a bicycle. We learned how to create a forecast for Nina in Chapter 10. In this chapter we will learn how to make predictions to answer the other questions involved in Nina’s analysis. When you have completed this chapter, you should be able to help Nina with the predictions that will go into her sales and profits forecast.

PREDICTIVE DATA MODELS
In Chapter 8 we explained that predictive data mining involves the partitioning of datasets with known target variables into three subsets to train, validate, and test a model. This process is known as supervision of the model. Supervised data models are also known as “machine learning.” This term highlights the capability of the model to “learn” and adapt to new data feeds. Machine learning is ideal for large, complex problems, and it is at the heart of artificial intelligence (AI). The relationship between machine learning and other analysis techniques is illustrated in Figure 11-1.

FIGURE 11-1: MACHINE LEARNING

Predictive models are used for estimation and classification. Estimation models attempt to approximate or otherwise determine outcomes based on multiple parameters and known relationships. These outcomes are expressed as mathematical algorithms or parametric equations; that is, equations that express a set of dependent variables as functions of independent variables. A very common estimation example is a budgeting model that an analyst employs to predict sales, profits, cash flows, and other financial indicators. Analysts typically carry out estimations using regression analysis.

The second group of predictive models is classification models, or classifiers. As the name suggests, analysts use these models to classify or categorize data, entities, and events to identify patterns that explain how different predictor variables in a model contribute to an outcome. The classifier groups new data into categories of the target variable based on past trends or patterns. It therefore provides a probability or confidence for new classifications. As an example, based on other customers’ purchasing patterns, Nina could use a classification model to predict the probability that a particular customer will purchase an off-road bike. The following two sections discuss some commonly used predictive estimation and classification models.

Estimation Models
Estimation models are used to predict a specific value of a numeric dependent, or target, variable. For example, Nina may wish to predict the following: (a) revenue for electric bikes for next year, (b) the number of customers who will not make any purchases next month, and (c) revenue per customer for the next year.
She will use an estimation model to predict each of these values. To determine how well her model has worked, she will compare the actual values to the predicted values, using a validation partition. Figure 11-2 displays the actual revenue and the predicted revenue by customer using Nina’s estimation model. The closer the predicted revenues are to the actual revenues, the more accurate the model is. Notice that some predicted values are higher than the actual value, and others are lower. For this reason we can’t simply add up all of these variances (differences) to evaluate how well the model performed. Instead, we use statistical methods such as root mean square error (RMSE) to estimate the disparity between the predicted revenue and the actual revenue. For more information on RMSE, you might dust off your stats book or perhaps refer to this web link: https://www.statisticshowto.datasciencecentral.com/rmse/. In this example, the RMSE is $62,045.43. This amount is fairly large, which suggests that Nina’s model is not particularly accurate.

FIGURE 11-2: ACTUAL VS PREDICTED REVENUE BY CUSTOMER

Probably the most commonly known and easiest-to-understand estimation model is simple linear regression. Simple linear regression creates an arithmetic equation to explain the relationship between independent and dependent variables. It is important to note that both variables have to be numeric. The independent variable is also called the predictor or explanatory variable, and the dependent variable is also called the target variable. The goal of simple linear regression is to fit a straight line through the points on a chart between the dependent and independent variables, as illustrated in Figure 11-3. The chart displays a natural phenomenon of the snowy tree cricket’s chirps. The number of chirps is a decent measure of air temperature within the bounds of applicability, called the relevant range. The scatter plot in the figure was created by counting the number of chirps per 15 seconds and plotting that number against the actual temperature when the chirps occurred. A regression line shows the relationship between the two variables. Note that this line should be extrapolated only within a certain range of temperatures because the cricket stops chirping outside its temperature of habitation.

The regression line is expressed as a linear equation of the form

Y = a + bX
Temperature = a + b * chirps

where a is the y-intercept and b is the slope of the line. In this example the fitted line is

Temperature = 23 + 3.4 * chirps

If we wanted to predict the temperature based on a chirp rate of 19, we would multiply 19 by 3.4 and add 23 to get 87.6° F.

FIGURE 11-3: SIMPLE LINEAR REGRESSION LINE

Figure 11-4 presents sample data that may be used to develop a simple linear regression model to forecast revenue from GB customers. The table in the figure summarizes the revenue, location, and city population data for GB customers.

FIGURE 11-4: GB CUSTOMER DATA

Nina believes that city population by itself is a good predictor of revenue. She used simple linear regression to fit a straight line through a scatter plot of revenue and population, as illustrated in Figure 11-5. By using the formula of the line or by simply finding the intersection of population and revenue on the chart, Nina can predict revenue for a new customer. As an example, a new customer in a city with a population of 6.2 million people has started purchasing bicycles from GB. Nina wishes to predict sales from this customer for the upcoming year for her annual sales forecast.
Because she does not have historical sales data for this customer, she will have to use her linear regression model to estimate sales. Based on the chart in Figure 11-5, the new customer should make annual purchases of approximately $1,600,000. Nina derived this figure by finding the intersection of the line based on the population and tracing the intersection back to the y-axis. The intersection is traced in blue on the figure.

FIGURE 11-5: REVENUE BY POPULATION

As you can imagine, many straight lines can be fitted to the data. We choose the line that minimizes the errors, or differences, between the line and the data points. In fact, we use the sum of squares of the errors. The distance between the predicted value and the observed value is called the residual or error. Recall our example of crickets and temperature, which is plotted in Figure 11-6.

FIGURE 11-6: TEMPERATURE VS. CHIRPS

Using the cricket example, these are the variables involved in computing the sum of squared errors: ŷi is the ith predicted value (the value on the line); yi is the ith observed value; and n is the number of data points. The sum of squared errors (SSE) is

SSE = Σ (yi - ŷi)²

Because SSE is a squared quantity, we take the square root of the average squared error so that the result is in the same units as temperature. Figure 11-7 shows how root mean square error is defined:

RMSE = √(SSE / n)

FIGURE 11-7: ROOT MEAN SQUARE ERROR DEFINED

The process of calculating RMSE raises some questions. What value of RMSE is acceptable? Should it be a percentage of the average temperature? How “good” is this simple regression? To what extent can we trust the prediction? To answer these questions, let’s compute some more errors and see how well they help us determine how good the regression is. In the absence of regression, our best prediction would simply be the mean of all observations, which we call ȳ. The sum of squared errors without regression is called the total sum of squares (SST):

SST = Σ (yi - ȳ)²

FIGURE 11-8: TOTAL SUM OF SQUARES (SST)

The sum of squares explained by the regression (SSR) is the portion of SST that the regression accounts for, so the three quantities are related by

SST = SSR + SSE

R-squared (R²), also known as the coefficient of determination, is a statistical measure of goodness-of-fit; that is, how closely the regression line fits the data. R-squared varies between 0 and 1 (0% and 100%). A higher value indicates that the linear regression line fits the data well. The value indicates the fraction of the total error that the regression can explain. Thus, an R-squared of 0.50 means that half of the total error can be explained by the regression. An R-squared of 0.70 means that 70% of the total error can be explained by the regression, and so on. The formula for R-squared is

R² = SSR / SST = 1 - SSE / SST

FIGURE 11-9: R-SQUARED

Although a higher R-squared value means that the model fits the data well, there are limitations to this method. For example, R-squared cannot determine whether our prediction is biased. Therefore, we cannot put too much emphasis on high R-squared values and conclude that our regression is good. Further, simple regression cannot fit a nonlinear curve even though its R-squared could be high! For this reason, R-squared should be used in conjunction with other statistics, a plot of the residual values, and domain knowledge of the data being analyzed.

Although simple linear regression works for the temperature example, in most real-life cases more than one predictor influences the target variable. To make predictions in the case of multiple independent variables, we would use multiple linear regression.
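To make these calculations concrete, the short sketch below fits a simple regression line and computes SSE, RMSE, and R-squared. It is a minimal illustration only: the chirp and temperature numbers are invented for the example (they are not the textbook's dataset), and NumPy's polyfit is just one of several ways to fit the line.

```python
import numpy as np

# Hypothetical chirp counts (per 15 seconds) and observed temperatures (F).
# These values are invented for illustration; they are not the book's data.
chirps = np.array([14, 15, 16, 17, 18, 19, 20])
temp   = np.array([70, 73, 77, 80, 84, 88, 91])

# Fit Temperature = a + b * chirps (simple linear regression).
b, a = np.polyfit(chirps, temp, deg=1)   # polyfit returns [slope, intercept]
predicted = a + b * chirps

# Error measures discussed in the text.
residuals = temp - predicted
sse  = np.sum(residuals ** 2)               # sum of squared errors
rmse = np.sqrt(sse / len(temp))             # root mean square error
sst  = np.sum((temp - temp.mean()) ** 2)    # total sum of squares
r_squared = 1 - sse / sst                   # coefficient of determination

print(f"Temperature = {a:.1f} + {b:.2f} * chirps")
print(f"RMSE = {rmse:.2f} F, R-squared = {r_squared:.3f}")

# Predict the temperature for a chirp rate of 19, as in the text.
print("Prediction at 19 chirps:", a + b * 19)
```

With these invented numbers the fitted line comes out close to the book's Temperature = 23 + 3.4 * chirps, and the same few lines of arithmetic apply to Nina's revenue-by-population model.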
In our previous example for GB, we might find that in addition to population, per capita income and the number of clear weather days influence GB bicycle sales (Z). Multiple linear regression expresses the dependent variable as an equation of the independent variables. All variables must be numeric. The equation looks like this:

Z = a + bX + cY + …

Z is the value we want to predict, X and Y are independent variables, a is the Z intercept, and b and c are the slopes of the line with respect to X and Y.

An example of multiple linear regression is predicting someone’s weight based on age and height. Age and height are independent variables, and weight is the dependent variable. The multiple linear regression model for weight would be the following equation, where a, b, and c are coefficients (factors or numbers) determined by the regression method:

weight = a + b × age + c × height

By adding more independent variables, perhaps gender or ethnicity, we may be able to predict weight more accurately. Although linear regression models are easy to understand and to use to estimate future numerical values of a variable, more complex models such as decision trees, neural networks, and support vector machines may be appropriate in cases where the data cannot be fit into an arithmetic formula. We discuss these models later in the chapter.

Classification Models
In the previous section we considered problems that analysts can address utilizing estimation models. In this section we use the following example to illustrate a business problem that can be solved using classification models. Donna Vasant, a GB business analyst, has been instructed to send out brochures and emails to high-end bicycle stores around the nation based on a list she purchased from a marketing company. Donna wants to identify potential customers, but she does not want to spend money and time sending out brochures to stores that are unlikely to respond. Her strategy is to utilize data that consist of store responses to prior marketing efforts to predict whether a store will respond to her new brochures. She will use classification modeling to group the customers accordingly.

Donna has collected the following data about the stores in her prior marketing campaign: annual revenue, city, state, number of brands carried, years in existence, number of clear weather days, number of sunny hours, per capita income in the city, and response to the marketing campaign. Figure 11-10 displays sample data from 10 stores in various cities.

FIGURE 11-10: DATA ABOUT BICYCLE STORES

In all, 250 stores nationwide had received an email and a brochure about GB. The response column shows whether a store responded to Donna’s marketing efforts. A “Yes” response indicates the store called GB for more information or signed up for more information on the GB website. At both touch points, GB captured the store’s interest for follow-up. A “No” response was recorded if the store sent an explicit “not interested” response or simply failed to respond within 30 days. As a matter of policy, GB will not solicit these customers again for at least a year.

The target variable here is Response. In classification modeling, the target variable has to be categorical. In other words, the customers are placed into one of several possible categories such as Yes/No, Buyer/Non-Buyer, or Accept/Reject/Follow-up. The independent (predictor) variables need to be numeric. If they are not, then they should be converted to numeric. We address this topic in greater detail later in the chapter.
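Before moving on to the classification workflow, here is a minimal sketch of the multiple linear regression equation discussed above (weight = a + b × age + c × height). The age, height, and weight values are invented purely for illustration, and scikit-learn's LinearRegression is one convenient way to estimate the coefficients; it is not the only option.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations: columns are age (years) and height (inches);
# the target is weight (lbs). All numbers are invented for the example.
X = np.array([[25, 66], [32, 70], [41, 68], [29, 72], [55, 64], [38, 69]])
y = np.array([150, 180, 172, 195, 160, 176])

model = LinearRegression().fit(X, y)
a = model.intercept_          # the intercept "a" in the equation
b, c = model.coef_            # slopes with respect to age and height

print(f"weight = {a:.1f} + {b:.2f} * age + {c:.2f} * height")

# Estimate the weight of a hypothetical 45-year-old who is 67 inches tall.
print(model.predict([[45, 67]]))
```

Adding further predictors (Nina's per capita income or clear weather days, for instance) simply means adding more columns to X; the fitting step does not change.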
Recall from earlier in the chapter that classification is the process of identifying the particular category or class to which a new record—that is, a new row of data—belongs. A classification model classifies a categorical target variable based on a set of independent variables that may be either numerical or categorical. In this case, GB wishes to classify those customers who responded to Donna’s marketing efforts and those who did not—two possible responses. The company will use the results of the classification to help predict how new customers will respond to the campaign.

Recall that in Chapter 9 we discussed clustering methods, a type of unsupervised analytics that also groups data. You might be tempted to recommend that Donna use clustering to help with her marketing campaign question. However, clustering and classification models are not the same, and they do not provide the same results. There are two primary differences between them. First, clustering is an unsupervised technique for descriptive analysis, whereas classification is a supervised model for prediction. Second, clustering creates groups composed of similar entities without prior knowledge of what the clusters or categories will be. In contrast, the categories are already known in classification modeling; for example, Yes or No, or Spam or Not Spam.

Classification Modeling Steps
Classification is performed using the following steps:

1. Model Building (Training): This is the step in which the model is trained. The classification algorithm is trained on a training partition of the complete dataset. In this case, Donna selects the data for 200 random stores to serve as a training set.

2. Model Validation: After the classification algorithm is trained, it can be tested for accuracy against a validation partition of the 50 remaining stores. The results of the validation dataset are presented as a confusion matrix, which is used to evaluate the performance of the classification model. The matrix makes it easy to determine whether the model confused the two classes—in Donna’s case, responses; hence the name “confusion matrix.” The results of Donna’s analysis are presented in Figure 11-11. The total number of stores is 50. The classification algorithm predicted that 35 stores would respond Yes and 15 stores would respond No. Because this was a validation dataset, we know the actual outcome as well. Of the 35 stores that were predicted to say Yes, 32 (91%) did, in fact, say Yes. That is a rather accurate prediction. In contrast, of the 15 stores predicted to say No, 8 (53%) actually responded Yes. That prediction clearly was not nearly as accurate.

3. Model Testing: The third partition that is set aside is used to assess the accuracy of the trained model. The test partition should not be used to fine-tune the model because this testing is supposed to be an unbiased assessment of the model’s accuracy. This test may also be used to compare different classification models to determine which one works best on unseen cases.

FIGURE 11-11: CONFUSION MATRIX

The predicted and actual outcomes can be generalized as depicted in Figure 11-12. True positives are called hits (32), and false negatives are called misses (8). False positives are called false alarms (3), and true negatives are called correct rejections (7).

FIGURE 11-12: CONFUSION MATRIX – TRUE AND FALSE
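These four counts are all that is needed to reproduce the accuracy measures discussed next. A minimal sketch, assuming the counts from Donna's validation results in Figures 11-11 and 11-12:

```python
# Counts from Donna's validation results (Figures 11-11 and 11-12).
hits               = 32   # true positives:  predicted Yes, actually Yes
misses             = 8    # false negatives: predicted No,  actually Yes
false_alarms       = 3    # false positives: predicted Yes, actually No
correct_rejections = 7    # true negatives:  predicted No,  actually No

total = hits + misses + false_alarms + correct_rejections   # 50 stores

# Accuracy: (true positives + true negatives) / all cases.
accuracy = (hits + correct_rejections) / total
print(f"Accuracy: {accuracy:.0%}")                           # 78%

# Share of predicted-No stores that actually responded Yes (8 of 15).
predicted_no = misses + correct_rejections
print(f"Misses among predicted No: {misses / predicted_no:.0%}")
```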
The objective of the classification algorithm is to maximize the true positives and true negatives and to minimize the false negatives and false positives. Therefore, the accuracy of the algorithm can be measured as the total of true positives and true negatives divided by the total number of records. In this case the calculation is (32 + 7) / 50 = 39/50 = 78%. Several other ratios can be computed to further describe the accuracy of the model. For example, dividing the false negatives by the predicted negative cases shows how often a predicted “No” was actually a “Yes.” In this case, 8/15 = 53%, which is quite high.

Refer back to Figure 11-10 and note the 10 fields. The first 9 are the independent variables. The last one, “Response,” is the dependent variable. You might wonder which independent variables influence the store’s response. Does the age of the store matter? What about the number of clear days in that city? Clearly, the store number is not an influencing factor. What if there were 100 variables? How do we pick the most important set of variables for classification? The process of picking the most important variables is called feature selection. There are several techniques available to conduct feature selection. One common technique is singular value decomposition (SVD). SVD reduces the dimensionality of a dataset—that is, the number of dimensions within the data—to the most important dimensions.

We mentioned earlier that all independent variables need to be numeric. Donna is confronted with the issue that the City and State of the customer are not numeric. How can she use classification modeling? She can use dummy variables to convert City and State into numeric variables. For each City in the dataset, she creates a new numeric variable for that city name. Every customer/store that belongs to that city gets a 1 in its row for that city’s column. For example, in Figure 11-13, Store #1, which is located in New York City, gets a 1 in the New York City column; the other city columns for Store #1 get a 0. Effectively, Donna has converted the categorical variable City into several numeric variables, each with a unique city name.

FIGURE 11-13: USING DUMMY VARIABLES TO CONVERT CATEGORICAL VARIABLES TO NUMERIC VARIABLES

Similarly, if the dependent variable, or target, were numerical, there is a clever way to make it categorical. This strategy is called binning. As the name suggests, binning places numeric values into discrete bins, and the bins have labels. Converting the values to labels effectively transforms a numeric variable into a categorical variable. For example, suppose Donna’s target variable is profit margin, as illustrated in Figure 11-14.

FIGURE 11-14: PROFIT MARGIN BY STORE

To bin the margin, Donna might create the table in Figure 11-15. Notice that the numerical data are now defined categorically by bins. Specifically, 0% to 10% margins are shown as customer category “Iron,” 10% to 20% margin customers are shown as “Bronze,” and so on.

FIGURE 11-15: AN EXAMPLE OF BINNING PROFIT MARGIN

Now that margin has been converted into a categorical variable, we can train the model using one or more classification models.
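Both conversions are easy to carry out in code. The sketch below is illustrative only: the store values are invented, the bin labels beyond "Bronze" are assumed continuations of the book's "and so on," and pandas' get_dummies and cut are one common way to create dummy variables and bins.

```python
import pandas as pd

# Hypothetical stores; the cities and margins are invented for illustration.
stores = pd.DataFrame({
    "store":  [1, 2, 3],
    "city":   ["New York", "Chicago", "New York"],
    "margin": [0.08, 0.17, 0.26],
})

# Dummy variables: one 0/1 column per city, as in Figure 11-13.
dummies = pd.get_dummies(stores["city"], prefix="city")
stores = pd.concat([stores, dummies], axis=1)

# Binning: convert the numeric margin into labeled categories (Figure 11-15).
# "Silver" is an assumed label for the 20%-30% bin.
bins   = [0.0, 0.10, 0.20, 0.30]
labels = ["Iron", "Bronze", "Silver"]
stores["margin_bin"] = pd.cut(stores["margin"], bins=bins, labels=labels)

print(stores)
```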
Indeed, you may be wondering how we trained the model in the first place. Validation was straightforward, but how was the classification model trained? There are numerous techniques, or classifiers, for building classification models. We discuss the following:

Naïve Bayes
K-nearest neighbors (KNN)
Decision trees
Logistic regression
Neural networks
Genetic algorithms
Support vector machines (SVM)

Naïve Bayes Classifier
A simple yet powerful classification technique is the Naïve Bayes classifier. This technique is based on Bayes’ theorem from statistics. The theorem focuses on conditional probability, which is the probability that an event A will occur given that another event B has occurred. For example, what is the probability that a customer responds Yes given that the customer is located in New York City? We can take this question further and ask, “What is the probability that a customer responds Yes given the values of all independent variables from Figure 11-10 except store number?” Store number is just an identifier for the customer, so it has no influence on customer response.

To understand how the classifier works, consider the data in Figure 11-16. There are four predictors and one target. The target is Response; specifically, whether a customer would respond Yes to the marketing campaign.

FIGURE 11-16: RESPONSES FROM CUSTOMERS

We want to compute the following probability:

P(Response = Yes | Annual revenue = High, City = New York, State = NY, Years in Business = 10)

In other words, what is the probability that a customer will respond Yes if that customer has high revenue, is located in New York City, and has been in business for 10 years? We could create a list of all possible combinations of the predictor variables. That would be a very long list of all possible levels of annual revenue (3), all cities in all states (100s), and the number of years in business (0-20). Then we would tabulate the responses from our existing customers corresponding to each combination of variables in the equation above. The naïve thing to do would be to calculate all such probabilities and then use them to predict the probability that a new customer would respond Yes. This technique is called Exact Bayes. The problem with this approach is that the probabilities are not reliable because the sample size for each combination may be too small or even nonexistent. Therefore, we need to find some way to compute the probability of the target for a combination of predictors for which we have no data.

The Naïve Bayes model assumes that the impact of the value of one predictor is independent of the value of the other predictors; for instance, the impact of city on Response is independent of the impact of revenue on Response. Because of this assumption, Bayes’ theorem can be used to estimate the probability of every combination of independent attributes. This means that instead of looking for every combination of predictor variables, we compute the probability of individual predictors and then combine them to estimate the probability of the target. Bayes’ theorem is

P(A|B) = P(B|A) P(A) / P(B)

where P(A|B) is the conditional probability of A given B, P(B|A) is the conditional probability of B given A, and P(A) and P(B) are the probabilities of A and B, which are assumed to be non-zero.

How does this help us? Naïve Bayes classification, even in the situation where we have no data for a particular combination of predictor values, can classify the target based on the individual predictor variables. It does this by decomposing the overall probability into individual probabilities. To understand how this process works, consider T as the target and p1, p2, p3, … as the set of predictors. Extending Bayes’ theorem from the single event B to the set of predictors, we get
P(T | p1, p2, p3, …) = P(p1, p2, p3, … | T) × P(T) / P(p1, p2, p3, …)

Because Naïve Bayes assumes the predictors are independent of one another, the numerator factors into individual conditional probabilities:

P(T | p1, p2, p3, …) = P(p1 | T) × P(p2 | T) × P(p3 | T) × … × P(T) / P(p1, p2, p3, …)

where p1, p2, p3, … are the predictors and T is the target. The denominator is the same for every class, so only the numerator matters when comparing classes. So, while we may not have the probability of T given a particular combination of predictors, Naïve Bayes does a clever separation into the conditional probability of each predictor given T! After the probabilities are estimated, any new case can be predicted for its outcome. As an example, if we have 5,000 potential new customers, then we use the Naïve Bayes model to compute a probability for each potential customer to respond positively. We can then sort the potentials in decreasing order of probability and choose the top N (say, 100) for marketing purposes.

K-Nearest Neighbors
Another method for classification is K-nearest neighbors (KNN). This model is based on a simple observation: a case is most likely to be similar to its nearest neighbors. For example, the price of a house is most likely to be similar to the prices of other homes in its immediate vicinity. The question arises: How many nearby homes should we consider? Also, are there other factors we should consider, such as number of rooms, square footage, and school district? Is there a way to compute “nearness” when factors like square footage are to be included?

GB, for instance, would like to classify a new customer as either a premier or a standard customer based on known attributes such as the city in which the customer is located, the population of the city, and other predictor factors. A premier customer is likely to be a high-revenue customer; a standard customer is not. The number of other customers we compare to our new customer is called K. K in KNN represents the number of neighbors we choose to consider when classifying new cases.

Consider a simple example where the only factor we consider for classification is the city and K = 1. Based on this model, if we evaluate a new customer in San Diego and our only existing customer in San Diego is a premier customer, then we will classify the new customer as a premier customer. We could try K = 2, but then we risk the possibility of a tie. Instead, we choose K = 3 or other odd numbers. As we increase the value of K, our new San Diego customer could end up with a different classification.

What happens if we include other factors such as the number of clear weather days and city population? We then use a “distance” formula to compute the distance of the attributes of the new customer from the attributes of its neighbors. We calculate this distance after we have scaled each factor to the same scale, often 0 to 1, so that each factor is weighted equally. That is, we scale both the number of clear weather days and the population to a value between 0 and 1. After we have scaled all of the predictors, we measure the distance of the new customer from all other customers. We then note the classifications of the K nearest neighbors. If the majority of the neighbors are premier customers, then we classify the new customer as such. This, in essence, is the KNN model. Note that all predictor variables must be numeric in order to compute a “distance.” Therefore, for KNN to work, a predictor variable such as city would have to be converted to dummy variables, as discussed earlier.

For KNN, we use two training partitions, called training A and training B, and a third partition, the validation partition. For every row in training B, we find the K nearest neighbors in training A.
We use the majority response of the K neighbors in A and compare it to the known response in training B. We proceed along these lines until all training B rows have been classified. Because we know both the actual response and the KNN response, we can build a confusion matrix, and its accuracy can be computed as previously discussed. The next step is to vary the value of K (K = 1, 3, 5, 7, etc.) and choose the K that maximizes the accuracy. Finally, using the best K, we validate the chosen model’s performance with the validation partition.

Notice that when we vary the values of K, we choose only odd values. Even values of K pose a problem: when we compare a row in training B to, say, two nearest neighbors in training A, we could get a tied vote; that is, the response from one neighbor is the opposite of the other’s. To avoid ties in KNN, we choose odd values of K, so we test only K = 1, 3, 5, 7, and so on.

Decision Trees
Recall from Chapter 9 that a decision tree is a tree-like diagram of rules that consist of decisions and their outcomes. There are two types of trees: classification trees and regression trees. As the name suggests, classification trees are used to classify cases into categories: the outcome is a categorical variable. In contrast, in regression trees the target variable is a continuous variable, such as the price of a product or manufacturing time. In this chapter we discuss only classification trees.

The simplest classification tree has a binary category such as True/False or Yes/No. However, categories do not have to be binary. Figure 11-18 displays a simple decision tree used to determine the “survivability” of passengers on the Titanic. A total of 1,309 passengers are included in the classification. The classification is a binary categorical variable: either Survived or Not Survived. Several predictor variables are available from historical records of the Titanic passengers. We list them in Figure 11-17.

FIGURE 11-17: LIST OF VARIABLES FOR THE TITANIC DATA SET (https://www.kaggle.com/c/titanic/data)

A decision tree to classify the Titanic data has been trained using the standard training and validation data partitions. The training dataset has 891 passengers; the validation dataset has 418 passengers. The resulting tree is diagrammed in Figure 11-18. Here is how you read a tree: Start at the top, where the value of a predictor variable leads you to the next variable by following one branch of the tree. Then go to the next branch, and so on. The tree tips are classification points where the case—in our example, an individual passenger—is classified into Survived or Not Survived.

FIGURE 11-18: SIMPLE DECISION TREE FOR TITANIC SURVIVABILITY (RESULTS FROM R – RATTLE)

Each decision point within the tree is called a node. The decision is made on a value of a predictor variable, and the same predictor variable can appear more than once in a tree. In Figure 11-18, “SibSp >= 2.5” and “Fare >= 18” are nodes. The node at the top of a decision tree is called the root node. The root in this figure is “Sex = Male?” The nodes at the tips of the tree are called leaf nodes. Leaf nodes are the categories we are trying to classify; the bottom row of nodes are all leaf nodes. The rest of the nodes are called interior nodes. The nodes are all numbered in this example, which was built using the R programming language.
The number at the top of each node is either 0 or 1, corresponding to the survivability category into which the decision tree classified it. In this tree, 1 represents Survived, and 0 represents Not Survived. The percentage at the bottom of the node is the percentage of all passengers who belong to that node. Finally, the two decimal numbers listed in each node denote the proportions of passengers in that node who did not survive (left) and who survived (right).

For example, consider node 27, which is reproduced in Figure 11-19. It includes 6% of all passengers. They were all classified as Survived (1) because a majority—70%—of the passengers in that node survived. The remaining 0.30 did not survive.

FIGURE 11-19: NODE DETAILS

The rule for node 27 is: If Sex = Male, and Pclass >= 2.5, and Fare < 23, and Embarked ≠ S. Refer to the tree in Figure 11-18 and follow the branches to verify the rule. This rule is accurate 70% of the time. In other words, 30% of the passengers covered by this rule were misclassified as Survived; that is, with the same rule, 30% of the passengers did not survive. If you add all of the leaf node percentages, the sum will be 100%, which represents all of the passengers. Node 4 has the largest number of passengers, with the rule Sex = Male and Age >= 6.5. These passengers are classified as Not Survived, and the classification is 83% accurate.

The intensity of a node’s color in this diagram is proportional to the value predicted at the node. The greens are Not Survived; the blues are Survived. The more intense the color, the more coherent the node is; that is, the higher the percentage of passengers who belong to the majority classification. To illustrate this point, let’s compare nodes 12 and 52. In node 12, 89% of passengers did not survive. In node 52, 59% of passengers did not survive. Thus, node 12 is more coherent than node 52, meaning it has fewer errors, or misclassified passengers (only 11%). For this reason, node 12 is a darker green than node 52.

What percentage of all passengers were misclassified by this decision tree? You can calculate this value by taking, for each leaf node, the proportion of its passengers that do not belong to the node’s classification, weighting it by the share of all passengers in that node, and adding the results. For example, in node 4, 17% of the passengers were misclassified, and this node includes 62% of all passengers. We can then compute the overall misclassification as

Percentage misclassified = 0.62 × 0.17 + 0.01 × 0.11 + 0.02 × 0.00 + 0.03 × 0.11 + 0.04 × 0.41 + 0.01 × 0.03 + 0.02 × 0.19 + 0.06 × 0.30 + 0.19 × 0.05 = 0.16

So, 16% of all passengers were misclassified, and the decision tree therefore has an accuracy of 84% on the training dataset. As was the case with our other predictive models, a classification decision tree is trained on a training dataset and then validated on a partition set. After the tree is trained and tested, it is used to classify new cases. A case can represent an individual customer, a passenger, a store, or many other entities.

Logistic Regression
Logistic regression is a classification model in which the dependent, or predicted, variable is categorical; that is, there are groupings into which a case is classified. This method is used to classify a case into one of several categories based on a number of predictor variables or attributes. It differs from simple linear regression, where the dependent (or predicted) variable is numeric and continuous. GB could use logistic regression to predict whether a certain customer will purchase a certain product. For example, will the customer Bavaria Bikes purchase the Professional Touring Bike?
The predicted outcome is either Yes or No. The independent variables can be either discrete or continuous. In this case, a discrete variable is whether a customer is wholesale or retail, and a continuous variable is sales revenue from the previous year.

Another application of logistic regression is to predict customer churn. Churn, or customer attrition, is the loss of a customer to a competitor. In addition to the loss of potential revenue from a customer, churn leads to the costs of marketing to and acquiring new customers to replace those who left. Based on the past history of customer churn, logistic regression can help classify a customer as either likely or unlikely to churn.

Neural Networks
Neural networks, or, more precisely, artificial neural networks (ANN), are a type of machine learning that is based on biological neural networks such as the human (or animal) brain. The brain is made up of a vast network of interconnected neurons that transmit messages to one another. This interconnectedness enables the brain to learn and adapt over time. ANN try to mimic this capability using computer programs. They are applied in classification tasks, such as image recognition, that are easy for humans to perform but very difficult for computers. As mentioned in the previous section on estimation, ANN may also be applied to complex estimation models.

The simple diagram in Figure 11-20 illustrates how ANN work. The circles represent nodes, or neurons; that is, interconnected programs. Each node receives an input. The program in the node processes the input and generates an output. Each connection has a weight associated with it, with higher weights indicating greater importance. The network has the capability to learn over time by adjusting the weights of the connections. If the network of neurons generates a desirable overall output, then the weights are left alone. If the output is undesirable, then the weights are modified. Over time this supervised learning leads to a smarter ANN. ANN are particularly adept at pattern recognition. Examples of pattern recognition are handwriting, fingerprints, faces within an image, and voice recognition.

FIGURE 11-20: AN ARTIFICIAL NEURAL NETWORK WITH ONE HIDDEN LAYER

A common application of ANN is to predict credit risk. GB could use ANN to evaluate the credit risk of new customers to reduce the likelihood of extending credit to a customer that will not repay. The company could classify new customers into good-risk and bad-risk categories based on attributes such as income, number of credit cards, location, and others of the company’s choosing.

Neural networks with multiple hidden layers are part of a family of machine learning algorithms called deep learning. Deep learning can create more accurate predictions because the model has more neurons with which to adapt to the data. Although deep learning models are enticing because of their predictive capabilities, they have certain limitations. First, they require massive computing power to run. They also have a tendency to overfit the data; that is, the model fits the original data too closely and is therefore unable to predict accurately on data dissimilar to the original dataset. We discuss overfitting in more detail in Chapter 12. Due to these shortcomings, deep learning models are not always the ideal solution for prediction. However, due to the complex nature of AI, deep learning models and multilayer neural networks are critical components of AI.
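As a rough illustration of how such classifiers are applied, the sketch below fits a logistic regression and a small one-hidden-layer neural network to invented churn data. The predictor values are made up, and the use of scikit-learn's LogisticRegression and MLPClassifier is just one convenient way to demonstrate the idea; it is not part of the GB scenario.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Hypothetical churn data: [prior-year revenue ($000), wholesale flag 0/1].
# Target: 1 = customer churned, 0 = customer stayed. Values are invented.
X = np.array([[120, 1], [450, 0], [80, 1], [300, 0],
              [60, 1], [500, 0], [95, 1], [410, 0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Logistic regression: predicts the probability of the categorical outcome.
logit = LogisticRegression(max_iter=1000).fit(X, y)

# A small artificial neural network with one hidden layer of 5 neurons.
# (In practice the predictors would be scaled first, as discussed for KNN.)
ann = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                    random_state=0).fit(X, y)

new_customer = [[200, 1]]   # a hypothetical new case
print("Logistic P(churn):", logit.predict_proba(new_customer)[0, 1])
print("ANN prediction:   ", ann.predict(new_customer)[0])
```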
Genetic Algorithms
“Nature is the best teacher,” it is often said. We can learn from nature. Specifically, we can learn by studying how natural evolution has led to the complex variety of species that are best suited to their surrounding conditions. Evolutionary theory is based on the concepts of variation and natural selection. Variation in species occurs over time due to random mutations that are introduced when parents mate to create offspring, a process known as recombination. These mutations introduce random modifications into the offspring. Mutations that are beneficial to the species’ survival are more likely to be passed on to future generations; in other words, the environment “selects” these traits. This process of variation and natural selection has led to the complex diversity of living beings on our planet.

What does evolution have to do with classification or predictive data mining? Computer scientists who work in artificial intelligence or machine learning have devised methods to mimic evolution in their programs. These methods are called genetic algorithms (GA). GA use the same process of selection, recombination, and mutation to find solutions to complex problems. As an example, the engineers on the research and design team at GB have been experiencing challenges with a new material used to fabricate an experimental racing bicycle that they expect will revolutionize competitive biking. They are not certain whether the problems are due to the material or to the design itself. Although they have tried several iterations of the design, they have yet to resolve the problems. By feeding the data from both the failed and the earlier successful design attempts into a GA model, the engineers expect that the GA will be able to isolate the problem areas and recommend solutions. The evolution of the design should lead to another successful bicycle innovation at GB.

Support Vector Machines
Support vector machines (SVM) are a type of machine learning that can be used for both estimation and classification. When we use SVM for classification, it groups new cases into one of two classes. For example, GB could categorize customers into premier and standard customers based on factors such as the number of years they have been with GB and the amount of revenue they bring in. SVM then classifies a new GB customer as either a premier or a standard customer. The mathematics behind SVM is beyond the scope of this book. However, it is sufficient for you to know that SVM creates a boundary between two classes or categories so that it can classify new cases into one of those categories.

A very successful application of SVM is filtering email spam. Unwanted email is a reality of today’s electronic communication. Over the years, computer scientists have tackled the issue of identifying spam by using SVM. The SVM algorithm analyzes an email and its contents and then determines whether the email is spam (Figure 11-21). Over time, as the number of classified cases increases, the accuracy of the model also increases. That is, more spam emails are classified as spam, and more non-spam emails are classified as non-spam. Email users can help the algorithm along by manually correcting the classification of incorrectly classified emails.

FIGURE 11-21: SVM FOR SPAM DETECTION

SUMMARY
In this chapter, we examined various data models that enable us to make predictions. We considered the two basic types of predictive models: estimations and classifications.
It is important to note that predictive data models are supervised; that is, they need to be trained, validated, tested, and run in real-world scenarios. Not all models predict with a high degree of accuracy. Therefore, all models should be evaluated for their reliability. Predictive models are periodically evaluated and retrained.
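As a closing illustration of the supervised workflow summarized above (partition, train, validate, and test), the sketch below builds a decision tree and a KNN classifier on a synthetic dataset, compares them on a validation partition, and then checks the chosen model on the held-out test partition. Every value in the data is invented; only the workflow mirrors the chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic "store" data: three numeric predictors, binary response (1 = Yes).
rng = np.random.default_rng(0)
X = rng.random((250, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 250) > 1.0).astype(int)

# Partition into training, validation, and test sets (roughly 60/20/20).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# Train two classifiers and compare them on the validation partition.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
knn  = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

for name, model in [("decision tree", tree), ("KNN (k=5)", knn)]:
    pred = model.predict(X_valid)
    print(name, "validation accuracy:", accuracy_score(y_valid, pred))
    print(confusion_matrix(y_valid, pred))

# The untouched test partition gives an unbiased estimate for the chosen model.
print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```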