Machine Learning Question
https://embios.quarto.pub/example/
2024/4/24 14:51 BIOS 534 Homework 3

Important Information

Last Update: 04/09/24
30 Points (see individual sections for specific point values)
Due by 11:59 PM (Canvas Clock Time) on April 24, 2024
No extensions will be granted (please let us know of medical emergencies ASAP)
Read problem descriptions carefully. Make note of any random_state values that need to be set.
Send questions to TAs [email protected] and [email protected]

Please use Python scikit-learn and Jupyter / Jupyter Lab to perform your work. Compile all results into a single PDF file and submit via Emory Canvas. Code must be included in the submission. In Jupyter or Jupyter Lab you can use File -> Save and Export Notebook As -> PDF to create a PDF file. If you have problems with this procedure, please contact the TAs for advice. If you cannot generate the PDF, submit your Jupyter Notebook, but inform the TAs that you were unable to create a PDF.

Academic Honesty policies apply. You may not share, accept, or co-develop code for any of the problems. This includes the use of AI technology (ChatGPT) or Copilot to arrive at a solution. All code is to be your own. You may not share or accept code from current or former students of this class or any other source.

1. Regression (15 Total Points)

Introduction

Please access the URL below to obtain a data set related to a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland around the year 1888. This is a data frame with 47 observations on 6 variables, each of which represents a percentage. All variables but Fertility give proportions of the population.
https://raw.githubusercontent.com/steviep42/bios_534/main/data/swiss.csv

We are interested in predicting Infant_Mortality.

Variable Name      Description
Fertility          Common standardized fertility measure
Agriculture        Percentage of males involved in agriculture as occupation
Examination        Percentage of draftees receiving highest mark on army examination
Education          Percentage of education beyond primary school for draftees
Catholic           Percentage of 'catholic' (as opposed to 'protestant')
Infant_Mortality   Percentage of live births who live less than 1 year

The first 5 rows should look like the following:

   Fertility  Agriculture  Examination  Education  Catholic  Infant_Mortality
0       80.2         17.0           15         12      9.96              22.2
1       83.1         45.1            6          9     84.84              22.2
2       92.5         39.7            5          5     93.40              20.2
3       85.8         36.5           12          7     33.77              20.3
4       76.9         43.5           17         15      5.16              20.6

a) (1-point) Training Data

1. Create a training and testing pair of data sets with a 75/25 Training / Test ratio
2. Specify a random_state of 0 when creating the train_test_split
3. We will be predicting Infant_Mortality
4. All other columns will be considered as predictors until otherwise instructed

X_train shape: (35, 5)
X_test shape: (12, 5)

b) (1-point) Decision Tree Regressor

Create a Decision Tree Regressor estimator with a random state of zero and then fit the estimator using the training data.

c) (1-point) Tree Depth

Present the depth of the tree. Your answer does not need to be identical to my answer below.

d) (1-point) RMSE

Present the RMSE (Root Mean Squared Error) for both training and testing data. Your answer does not need to be identical to the output below, but it shouldn't be far off.
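As an editorial illustration only (not a solution to submit), the workflow in parts a) through e) can be sketched as follows. The data frame here is a synthetic stand-in with the same shape and column names as swiss.csv, so the printed depth and RMSE values will not match the assignment's expected output:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in frame with the same shape/columns as swiss.csv.
# For the real assignment you would instead read the URL above, e.g.
# df = pd.read_csv("https://raw.githubusercontent.com/steviep42/bios_534/main/data/swiss.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 100, size=(47, 6)),
                  columns=["Fertility", "Agriculture", "Examination",
                           "Education", "Catholic", "Infant_Mortality"])

# a) 75/25 split with random_state=0; Infant_Mortality is the target
X = df.drop(columns="Infant_Mortality")
y = df["Infant_Mortality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print("X_train shape:", X_train.shape)  # (35, 5)
print("X_test shape:", X_test.shape)    # (12, 5)

# b) Fit a Decision Tree Regressor with random_state=0
dt = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# c) Depth of the fitted tree
print("Observed Tree Depth:", dt.get_depth())

# d) RMSE for training and testing data
def rmse(model, X_, y_):
    return round(float(np.sqrt(mean_squared_error(y_, model.predict(X_)))), 3)

print("Train RMSE:", rmse(dt, X_train, y_train))
print("Test RMSE:", rmse(dt, X_test, y_test))

# e) Depth sweep: refit with max_depth = 1..20, tracking both RMSE lists
train_rmse, test_rmse = [], []
for depth in range(1, 21):
    m = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_rmse.append(rmse(m, X_train, y_train))
    test_rmse.append(rmse(m, X_test, y_test))
print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)
```

For question 1i), adding a constraint such as min_samples_leaf=5 to the same constructor inside the loop keeps the training RMSE from collapsing toward 0.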
Observed Tree Depth: 11
Train RMSE: 0.0
Test RMSE: 4.375

e) (3-points) Tree Depth Experiment

Decision Tree Regressors in scikit-learn have an argument max_depth that allows you to specify the maximum depth of any tree you build. Using the training and testing data from above, build a series of single Decision Tree estimators with specified maximum depths varying from 1, 2, 3, ... 20. A for loop will be helpful here. As you loop through the fitting process of the Decision Tree Regressor with varying maximum depth values, keep track of the training and test RMSE values emerging from each prediction. You could keep this information in lists or a data frame. In my case I have the following RMSEs:

Train RMSE: [2.231, 1.955, 1.731, 1.428, 0.918, 0.655, 0.337, 0.154, 0.065, 0.024, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Test RMSE: [3.521, 3.304, 3.258, 4.033, 3.84, 4.436, 3.712, 3.805, 4.443, 3.886, 3.805, 3.756, 3.818, 3.847, 3.817, 4.365, 3.834, 3.886, 3.921]

f) (2-points) Boxplot

Create boxplots of the Training and Testing RMSE.

g) (1-point) RMSE Plot

Create a scatterplot of the training and testing RMSE lists vs the max_depth values.

h) (2-points) Overfitting vs Underfitting?

Describe what is happening in the plot in terms of fitting (over or under). Does the model generalize well to the test data? How closely is the model following the training data?

i) (3-points) Preventing Overtraining

There are various arguments for a Decision Tree estimator that might keep an estimator from being overfit to the training data. For example, consider the following Decision Tree Regressor plot. Note that this is just an example.
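(The example tree plot itself did not survive into this text.) As a hedged sketch on synthetic stand-in data, a plot of this kind can be produced with scikit-learn's plot_tree, whose node boxes display the split rule, error, sample count, and predicted value for each node:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Synthetic regression data standing in for the swiss training set
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(35, 5))
y = rng.uniform(10, 30, size=35)

# A shallow tree keeps the plot readable; leaf counts such as
# "samples = 2" appear in the bottom-row boxes of a plot like this
dt = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
fig, ax = plt.subplots(figsize=(14, 7))
annotations = plot_tree(dt, filled=True, ax=ax)
fig.savefig("example_tree.png")
plt.close(fig)
```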
Focus on the bottom row, which contains leaf nodes, a term used to describe the terminal nodes of the tree where no further splits are made. These nodes represent the final decision or prediction that the model makes. You'll notice some leaf nodes have only 2 samples, suggesting the tree may be overly tailored to some of the training data. A more general model would usually have more than 2 samples in a terminal node. Check the documentation for Decision Tree regression for a parameter that allows you to set a minimum number of samples for leaf nodes. Here are some of the parameters:

min_impurity_decrease
min_samples_leaf
min_samples_split
min_weight_fraction_leaf

You will set this argument to obtain a more realistic (not perfect) training RMSE. Repeat the experiment in question 1e) with the additional parameter you identify, set to a specific value. As in 1e), plot the RMSE for both training and testing. The new plot should show a higher Training RMSE, though it may not be identical to that below. But it should definitely NOT be 0.0 or close to it.

2. Classification (15 Points)

Next we'll work with classification as it relates to some appendicitis data. This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children's Hospital St. Hedwig in Regensburg, Germany. There are missing values in this dataset which will need to be addressed.

Introduction

Let's start by reading the file into a data frame called app. Here is the URL:
https://raw.githubusercontent.com/steviep42/bios_534/main/data/append.csv

The first 5 rows will look like the following:

     Age  Height  Alvarado_Score  Weight  WBC_Count  CRP  Body_Temperature   RDW  Diagnosis
0  12.68   148.0             4.0    37.0        7.7  0.0              37.0  12.2          1
1  14.10   147.0             5.0    69.5        8.1  3.0              36.9  12.7          0
2  14.14   163.0             5.0    62.0       13.2  3.0              36.6  12.2          0
3  16.37   165.0             7.0    56.0       11.4  0.0              36.0  13.2          0
4  11.08   163.0             5.0    45.0        8.1  0.0              36.9  13.6          1

a) (1-point) Missing Values

Determine the number of missing values per column.

Age                  0
Height              25
Alvarado_Score      50
Weight               2
WBC_Count            4
CRP                  9
Body_Temperature     5
RDW                 24
Diagnosis            0
dtype: int64

b) (2-points) Imputation

1. Next, use the SimpleImputer package to replace missing values in each column with the most frequently occurring value in that column.
2. Save the resulting data frame to a new data frame called app_imputed
3. Verify that all missing values have been replaced.
4. You can also now check the mean value of app_imputed.Alvarado_Score to make sure you used the right replacement strategy.

Age                 0
Height              0
Alvarado_Score      0
Weight              0
WBC_Count           0
CRP                 0
Body_Temperature    0
RDW                 0
Diagnosis           0
dtype: int64

app_imputed Alvarado_Score mean value: 5.863

c) (1-point) Training Data

Using app_imputed create new X and y objects and create a 75/25 training and testing split with a random_state=0

X_train shape: (585, 8)
X_test shape: (195, 8)

d) (1-point) Random Forest

Build a Random Forest using the above created training and testing data.

1. Make sure that the minimum number of samples in a leaf node is 5.
2. Print the classification report for both the training and testing data.
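As an illustrative sketch only, using a small synthetic stand-in frame rather than the real append.csv (so the printed reports will not match the expected output), parts a) through e) can be outlined like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, recall_score

# Small stand-in for the appendicitis frame; the real app frame has 780
# rows and the columns shown above, read with pd.read_csv on the URL
rng = np.random.default_rng(0)
app = pd.DataFrame({
    "Age": rng.uniform(5, 18, 60).round(2),
    "Alvarado_Score": rng.choice([4.0, 5.0, 6.0, 7.0, np.nan], size=60),
    "WBC_Count": rng.uniform(5, 15, 60).round(1),
    "Diagnosis": rng.integers(0, 2, 60),
})

# a) Missing values per column
print(app.isna().sum())

# b) Replace missing values with each column's most frequent value
imp = SimpleImputer(strategy="most_frequent")
app_imputed = pd.DataFrame(imp.fit_transform(app), columns=app.columns)
print(app_imputed.isna().sum())  # all zeros after imputation

# c) 75/25 split with random_state=0
X = app_imputed.drop(columns="Diagnosis")
y = app_imputed["Diagnosis"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# d) Random Forest with at least 5 samples per leaf
rf = RandomForestClassifier(min_samples_leaf=5, random_state=0)
rf.fit(X_train, y_train)
print(classification_report(y_train, rf.predict(X_train)))
print(classification_report(y_test, rf.predict(X_test)))

# e) Per-class recall directly, without parsing the report text
print("Training:", recall_score(y_train, rf.predict(X_train), average=None).round(2))
print("Testing:", recall_score(y_test, rf.predict(X_test), average=None).round(2))
```

For question 2f), rf.feature_importances_ paired with np.argsort gives the ordering needed for the horizontal bar plot.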
Classification Report for Training Data

              precision    recall  f1-score   support

           0       0.86      0.77      0.81       238
           1       0.85      0.92      0.88       347

    accuracy                           0.86       585
   macro avg       0.86      0.84      0.85       585
weighted avg       0.86      0.86      0.86       585

Classification Report for Testing Data

              precision    recall  f1-score   support

           0       0.70      0.49      0.58        79
           1       0.71      0.85      0.78       116

    accuracy                           0.71       195
   macro avg       0.70      0.67      0.68       195
weighted avg       0.71      0.71      0.70       195

e) (2-points) Recall

The classification_report function produces a lot of information, which might be too much if we are interested only in the recall for each class we are predicting (0 or 1). In the above example, focus on the Training Data report and you will see a recall for each class (0 and 1) of approximately 0.77 and 0.92 respectively. In the Testing Data the recall for each class (0 and 1) is 0.49 and 0.85 respectively. We want just this information for both the training and testing data. Check the sklearn documentation for a metric method that can provide this information more directly. Do NOT attempt to parse the text returned by the classification_report function.

Relative to the above Random Forest estimator:

1. Print out only the recall for classes 0 and 1 for the Training and Testing data sets
2. Round the result to 2 decimal places
3. Check your result for this question against the classification reports in 2d)

Training:
Class 0: 0.77
Class 1: 0.92

Testing:
Class 0: 0.49
Class 1: 0.85

f) (1-point) Variable Importance

Create a horizontal bar plot of feature importance in decreasing order, which should resemble the following:

g) (2-points) Dropping Variables

Based on the above plot:

1. Create a new version of X (call it Xnew) that drops the 3 lowest variables in terms of magnitude.
2. Using Xnew and the existing y, regenerate a new training and testing split 75/25 (random_state=0)
3.
Repeat the random forest creation in 2d) (make sure min_samples_leaf=5 and random_state=0)
4. Print out the class recall values for the training and testing data.

Training:
Class 0: 0.76
Class 1: 0.91

Testing:
Class 0: 0.48
Class 1: 0.84

h) (2-points) Effect

What impact did dropping the three lowest variables have on the recall for both the training and testing data? In particular, what effect did dropping have on the model, if any?

i) (2-points) Scaling

Let's use a different method to see if we can observe a comparable or even better result. This might not be possible, but we'll never know unless we try. Support Vector Machines are not "scale invariant," so we have to scale the training and test data. Look at the notebooks for guidance; basically you can use the StandardScaler to do this. Using the X_train and X_test data from above:

1. Create a scaler transformer
2. Do a fit_transform on X_train and save it into a new object called X_train_scaled
3. Do a transform (not fit_transform) on X_test and save it into a new object called X_test_scaled

X_train_scaled – First 5 rows
[[-1.98581692  0.53260711 -1.59371777  0.3733249  -0.3929034 ]
 [ 1.05686742 -0.40870757  0.19414116 -0.73954024 -0.553926  ]
 [ 0.02505154  1.47392179 -0.82168778  1.14667187  2.30869799]
 [-2.43430507  0.06194977 -1.77946935  0.54308399 -0.4107948 ]
 [ 0.68459294  0.53260711  0.60047273 -0.30571145 -0.464469  ]]

X_test_scaled – First 5 rows
[[-0.7539286   1.47392179 -0.59530304  0.78829157 -0.285555  ]
 [-0.57175092  1.47392179 -0.1309241  -0.02277964 -0.4286862 ]
 [ 0.52044042 -1.82067959  0.8326622  -0.47547054 -0.4286862 ]
 [ 1.32361528  1.00326445  0.54242537  0.84487793 -0.464469  ]
 [-1.36948052 -0.87936491 -1.1409483  -1.17336902 -0.553926  ]]

j) (1-point) Support Vector Machine

1. Fit the SVC classifier with a random_state=0
2. Get the recall info for both classes (0 and 1) for the Training and Testing Data
3.
Print it out.

Training Recall:
Class 0: 0.64
Class 1: 0.8

Testing Recall:
Class 0: 0.58
Class 1: 0.84
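The scaling and SVC steps in 2i) and 2j) can be sketched as follows, again on synthetic stand-in data via make_classification, so the printed recalls will differ from the expected output above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Synthetic stand-in for Xnew / y from question 2g)
X, y = make_classification(n_samples=780, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform, NOT fit_transform

# Fit the SVC and report per-class recall for both data sets
svc = SVC(random_state=0).fit(X_train_scaled, y_train)
print("Training Recall:",
      recall_score(y_train, svc.predict(X_train_scaled), average=None).round(2))
print("Testing Recall:",
      recall_score(y_test, svc.predict(X_test_scaled), average=None).round(2))
```

Fitting the scaler only on the training data (and merely transforming the test data) avoids leaking test-set statistics into the model.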