This case-study examines the patterns, symmetries, associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis (ALS). A major clinically relevant question in this biomedical study is: What patient phenotypes can be automatically and reliably identified and used to predict the change of the ALSFRS slope over time? This problem explores the dataset with unsupervised learning (only k-means is required in this assignment).
- Load and prepare the data.
- Perform summary and preliminary visualization (i.e. show the clustering with selected features).
- Train a k-means model on the data with selected features (3 or more), experiment with at least two different k values, and explain which k value is the better choice.
- Evaluate the model performance by reporting the cluster centers.
- Visualize the final clustering result.
Submit the Python code and a report that explains the k experiment, the performance evaluation, and the visualizations.
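The k-means steps listed above can be sketched as follows. This is a minimal NumPy-only illustration of Lloyd's algorithm, not a prescribed solution: the helper names `standardize` and `kmeans` are hypothetical, the demo data are synthetic placeholders rather than ALS values, and in practice a library implementation such as scikit-learn's `KMeans` would usually be used instead. The within-cluster sum of squares (WCSS) is one common way to compare k values; the assignment does not mandate it.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centers, labels, wcss)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(X)), labels].sum()
    return centers, labels, wcss


if __name__ == "__main__":
    # Synthetic stand-in for three selected features (NOT real ALS data).
    rng = np.random.default_rng(42)
    demo = np.vstack([
        rng.normal(loc=m, scale=0.5, size=(50, 3)) for m in (0.0, 5.0, 10.0)
    ])
    Xs = standardize(demo)
    for k in (2, 3):
        centers, labels, wcss = kmeans(Xs, k)
        print(f"k={k}  WCSS={wcss:.2f}")
        print(centers)  # cluster centers, in standardized units
```

Because WCSS generally decreases as k grows, a lower value alone does not make a k "better"; the report should look for the point where the improvement levels off and check whether the cluster centers are interpretable. A single random start can also land in a poor local optimum, so trying several seeds is advisable.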
Case-Study 15
Amyotrophic Lateral Sclerosis (ALS)
Overview: This case-study examines the patterns, symmetries, associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis (ALS). ALS demands conducting clinical trials and collecting big, multi-source and heterogeneous datasets that can be interrogated to derive potential biomarkers. Overcoming many scientific, technical and infrastructure barriers is required to establish complete, efficient, and reproducible protocols (pipelines/workflows) starting with acquiring raw data, preprocessing, aggregation, harmonization, analysis, visualization and result interpretation.
The clinical data show that the rate of ALS progression varies significantly among patients. The majority of patients die within 3 to 5 years after ALS onset; however, a few are able to survive for over 10 years. This heterogeneity of disease course hinders demonstration of its biological mechanism and development of effective treatment. We need to develop reliable predictive models of ALS progression to understand the pathophysiology of the disease.
Driving Challenges:
· What patient phenotypes can be automatically and reliably determined?
· Predict the change of the ALSFRS slope using the holistic patient-specific data.
· Predict survival of patients at a given time-point (post diagnosis).
Meta-Data
· There are 2 datasets:
· training (N1=2,223): ALS_TrainingData_2223.csv, and
· testing (N2=78): ALS_TestingData_78.csv
· Each dataset includes the following 131 variables:
ID; Age_mean; Albumin_max; Albumin_median; Albumin_min; Albumin_range; ALSFRS_slope; ALSFRS_Total_max; ALSFRS_Total_median; ALSFRS_Total_min; ALSFRS_Total_range; ALT.SGPT._max; ALT.SGPT._median; ALT.SGPT._min; ALT.SGPT._range; AST.SGOT._max; AST.SGOT._median; AST.SGOT._min; AST.SGOT._range; Basophils_max; Basophils_median; Basophils_min; Basophils_range; Bicarbonate_max; Bicarbonate_median; Bicarbonate_min; Bicarbonate_range; Bilirubin..total._max; Bilirubin..total._median; Bilirubin..total._min; Bilirubin..total._range; Blood.Urea.Nitrogen..BUN._max; Blood.Urea.Nitrogen..BUN._median; Blood.Urea.Nitrogen..BUN._min; Blood.Urea.Nitrogen..BUN._range; BMI_max; bp_diastolic_max; bp_diastolic_median; bp_diastolic_min; bp_diastolic_range; bp_systolic_max; bp_systolic_median; bp_systolic_min; bp_systolic_range; Calcium_max; Calcium_median; Calcium_min; Calcium_range; Chloride_max; Chloride_median; Chloride_min; Chloride_range; Creatinine_max; Creatinine_median; Creatinine_min; Creatinine_range; Eosinophils_max; Eosinophils_median; Eosinophils_min; Eosinophils_range; Gender_mean; Glucose_max; Glucose_median; Glucose_min; Glucose_range; hands_max; hands_median; hands_min; hands_range; Hematocrit_max; Hematocrit_median; Hematocrit_min; Hematocrit_range; Hemoglobin_max; Hemoglobin_median; Hemoglobin_min; Hemoglobin_range; leg_max; leg_median; leg_min; leg_range; Lymphocytes_max; Lymphocytes_median; Lymphocytes_min; Lymphocytes_range; Monocytes_max; Monocytes_median; Monocytes_min; Monocytes_range; mouth_max; mouth_median; mouth_min; mouth_range; onset_delta_mean; onset_site_mean; Platelets_max; Platelets_median; Platelets_min; Potassium_max; Potassium_median; Potassium_min; Potassium_range; pulse_max; pulse_median; pulse_min; pulse_range; Red.Blood.Cells..RBC._max; Red.Blood.Cells..RBC._median; Red.Blood.Cells..RBC._min; Red.Blood.Cells..RBC._range; respiratory_max; respiratory_median; respiratory_min; respiratory_range; Sodium_max; Sodium_median; Sodium_min; Sodium_range; 
SubjectID; trunk_max; trunk_median; trunk_min; trunk_range; Urine.Ph_max; Urine.Ph_median; Urine.Ph_min; Urine.Ph_range; White.Blood.Cell..WBC._max; White.Blood.Cell..WBC._median; White.Blood.Cell..WBC._min; White.Blood.Cell..WBC._range
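As a hedged sketch of the "load and prepare" step, the snippet below reads a few selected columns from one of the files into a NumPy array using only the standard `csv` module. It assumes the CSV is comma-delimited with a header row matching the variable names above; the helper `load_features` is hypothetical, and the three features in the usage comment are an arbitrary illustration, not a recommended selection.

```python
import csv

import numpy as np


def load_features(source, feature_names):
    """Read selected numeric columns from an open CSV file.

    Returns an (n_rows, n_features) float array. Rows with a missing or
    non-numeric value in any selected column are skipped. A column name
    that is absent from the header raises KeyError.
    """
    reader = csv.DictReader(source)
    rows = []
    for record in reader:
        try:
            rows.append([float(record[name]) for name in feature_names])
        except (TypeError, ValueError):
            continue  # incomplete or non-numeric row
    return np.array(rows, dtype=float).reshape(-1, len(feature_names))


# Example usage (assumes the training file is in the working directory):
# with open("ALS_TrainingData_2223.csv", newline="") as f:
#     X = load_features(f, ["ALSFRS_slope", "Age_mean", "onset_delta_mean"])
```

Identifier columns such as ID and SubjectID carry no clinical information and should normally be left out of the feature list before clustering.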
References:
· Tang, M., Gao, C, Goutman, SA, Kalinin, A, Mukherjee, B, Guan, Y, and Dinov, ID. (2018) Model-Based and Model-Free Techniques for Amyotrophic Lateral Sclerosis Diagnostic Prediction and Patient Clustering, Neuroinformatics, 1-15, DOI: 10.1007/s12021-018-9406-9.
· https://scholar.google.com/scholar?hl=en&as_sdt=1%2C23&q=%22proact%22+%22als%22&btnG=
ID | Age_mean | Albumin_max | Albumin_median | Albumin_min | Albumin_range | ALSFRS_slope | ALSFRS_Total_max | ALSFRS_Total_median | ALSFRS_Total_min | ALSFRS_Total_range | ALT.SGPT._max | ALT.SGPT._median | ALT.SGPT._min | ALT.SGPT._range | AST.SGOT._max | AST.SGOT._median | AST.SGOT._min | AST.SGOT._range | Basophils_max | Basophils_median | Basophils_min | Basophils_range | Bicarbonate_max | Bicarbonate_median | Bicarbonate_min | Bicarbonate_range | Bilirubin..total._max | Bilirubin..total._median | Bilirubin..total._min | Bilirubin..total._range | Blood.Urea.Nitrogen..BUN._max | Blood.Urea.Nitrogen..BUN._median | Blood.Urea.Nitrogen..BUN._min | Blood.Urea.Nitrogen..BUN._range | BMI_max | bp_diastolic_max | bp_diastolic_median | bp_diastolic_min | bp_diastolic_range | bp_systolic_max | bp_systolic_median | bp_systolic_min | bp_systolic_range | Calcium_max | Calcium_median | Calcium_min | Calcium_range | Chloride_max | Chloride_median | Chloride_min | Chloride_range | Creatinine_max | Creatinine_median | Creatinine_min | Creatinine_range | Eosinophils_max | Eosinophils_median | Eosinophils_min | Eosinophils_range | Gender_mean | Glucose_max | Glucose_median | Glucose_min | Glucose_range | hands_max | hands_median | hands_min | hands_range | Hematocrit_max | Hematocrit_median | Hematocrit_min | Hematocrit_range | Hemoglobin_max | Hemoglobin_median | Hemoglobin_min | Hemoglobin_range | leg_max | leg_median | leg_min | leg_range | Lymphocytes_max | Lymphocytes_median | Lymphocytes_min | Lymphocytes_range | Monocytes_max | Monocytes_median | Monocytes_min | Monocytes_range | mouth_max | mouth_median | mouth_min | mouth_range | onset_delta_mean | onset_site_mean | Platelets_max | Platelets_median | Platelets_min | Potassium_max | Potassium_median | Potassium_min | Potassium_range | pulse_max | pulse_median | pulse_min | pulse_range | Red.Blood.Cells..RBC._max | Red.Blood.Cells..RBC._median | Red.Blood.Cells..RBC._min | Red.Blood.Cells..RBC._range | respiratory_max | 
respiratory_median | respiratory_min | respiratory_range | Sodium_max | Sodium_median | Sodium_min | Sodium_range | SubjectID | trunk_max | trunk_median | trunk_min | trunk_range | Urine.Ph_max | Urine.Ph_median | Urine.Ph_min | Urine.Ph_range | White.Blood.Cell..WBC._max | White.Blood.Cell..WBC._median | White.Blood.Cell..WBC._min | White.Blood.Cell..WBC._range |
3 | 65.90684932 | 46 | 44 | 43 | 0.024590164 | -1.767329256 | 33 | 5 | 2 | 0.028518859 | 93 | 26 | 22 | 0.581967213 | 72 | 24 | 21 | 0.418032787 | 0.8 | 0.5 | 0.3 | 0.004098361 | 27 | 25 | 23 | 0.032786885 | 9 | 7 | 3 | 0.049180328 | 7.9 | 7.1 | 5.7 | 0.018032787 | 0.002969349 | 91 | 76 | 69 | 0.180327869 | 134 | 125 | 103 | 0.254098361 | 2.53 | 2.45 | 2.35 | 0.00147541 | 103 | 101 | 98 | 0.040983607 | 71 | 62 | 44 | 0.221311475 | 5.1 | 3.9 | 3.5 | 0.013114754 | 2 | 7.2 | 4.4 | 3.6 | 0.029508197 | 7 | 0 | 0 | 0.006439742 | 45.4 | 44.5 | 44.1 | 0.010655738 | 149 | 146 | 141 | 0.06557377 | 8 | 1 | 0 | 0.007359706 | 24.1 | 19.6 | 17.4 | 0.054918033 | 6.3 | 5.8 | 5.4 | 0.007377049 | 8 | 4 | 2 | 0.005519779 | -617 | 1 | 275 | 275 | 275 | 4.3 | 4.2 | 3.9 | 0.003278689 | 90 | 77 | 61 | 0.237704918 | 4700 | 4640 | 4450 | 2.049180328 | 4 | 0 | 0 | 0.003679853 | 139 | 138 | 137 | 0.016393443 | 55888 | 7 | 0 | 0 | 0.006439742 | 6.5 | 6 | 6 | 0.004098361 | 8.57 | 7.68 | 6.6 | 0.016147541 |
4 | 54 | 39 | 36 | 33 | 0.013100437 | -1.351851852 | 32 | 23 | 14 | 0.03930131 | 47 | 35.5 | 21 | 0.056768559 | 49 | 33 | 20 | 0.063318777 | 1.2 | 0.7 | 0.3 | 0.001965066 | 26.7 | 25 | 21 | 0.012445415 | 19 | 9.5 | 5 | 0.030567686 | 5.7 | 4.3 | 3.4 | 0.005021834 | 0.002907331 | 106 | 96 | 75 | 0.06768559 | 160 | 135 | 110 | 0.109170306 | 2.47 | 2.345 | 2.26 | 0.000458515 | 104 | 101 | 96 | 0.017467249 | 69 | 47 | 21 | 0.104803493 | 5.4 | 2.15 | 1.1 | 0.009388646 | 2 | 7.2 | 4.6 | 3.1 | 0.008951965 | 6 | 4 | 0 | 0.013100437 | 52 | 47 | 42 | 0.021834061 | 170 | 158 | 139 | 0.06768559 | 5 | 1 | 0 | 0.010917031 | 22.2 | 15.9 | 11.9 | 0.022489083 | 8.1 | 5.35 | 3.8 | 0.009388646 | 12 | 12 | 11 | 0.002183406 | -328 | 4 | 349 | 270 | 215 | 4.6 | 4.2 | 3.8 | 0.001746725 | 104 | 80 | 68 | 0.07860262 | 5800 | 5100 | 4700 | 2.401746725 | 4 | 4 | 3 | 0.002183406 | 144 | 140.5 | 135 | 0.019650655 | 61505 | 6 | 3 | 0 | 0.013100437 | 6.5 | 5.5 | 5 | 0.003275109 | 8.04 | 6.62 | 4.97 | 0.006703057 |
5 | 56.39452055 | 46 | 43 | 39 | 0.009735744 | -0.412429379 | 15 | 10 | 2 | 0.017173052 | 42 | 22 | 11 | 0.043115438 | 37 | 22 | 14 | 0.031988873 | 1.4 | 0.7 | 0.5 | 0.001251739 | 27 | 24 | 20 | 0.009735744 | 5 | 3 | 2 | 0.004172462 | 8.2 | 5.4 | 2.9 | 0.007371349 | 0.00228116 | 85 | 72.5 | 65 | 0.026420079 | 140 | 103 | 90 | 0.066050198 | 2.53 | 2.43 | 2.33 | 0.000278164 | 104 | 101 | 97 | 0.009735744 | 53 | 35 | 18 | 0.04867872 | 5.5 | 3.25 | 0.4 | 0.007093185 | 1 | 7.3 | 5.3 | 4.7 | 0.003616134 | 0 | 0 | 0 | 0 | 48.7 | 42.35 | 39.2 | 0.013212796 | 143 | 133.5 | 122 | 0.029207232 | 0 | 0 | 0 | 0 | 32.7 | 19.15 | 14.3 | 0.025591099 | 9.8 | 6.85 | 3.4 | 0.008901252 | 11 | 6 | 2 | 0.011889036 | -953 | 2 | 391 | 391 | 391 | 5.2 | 4.5 | 3.8 | 0.001947149 | 123 | 103.5 | 70 | 0.07001321 | 5130 | 4590 | 4190 | 1.307371349 | 4 | 4 | 0 | 0.005284016 | 141 | 139 | 136 | 0.006954103 | 63255 | 0 | 0 | 0 | 0 | 7.5 | 6.75 | 6 | 0.003456221 | 8.9 | 7.16 | 5.01 | 0.005410292 |
6 | 72.61917808 | 50 | 42.5 | 41 | 0.092783505 | -0.383403361 | 34 | 24 | 21 | 0.033591731 | 109 | 39.5 | 20 | 0.917525773 | 83 | 42 | 23 | 0.618556701 | 0.9 | 0.7 | 0.3 | 0.006185567 | 27 | 26 | 25 | 0.020618557 | 9 | 5 | 3 | 0.06185567 | 7.5 | 5.2 | 3.9 | 0.037113402 | 0.002408304 | 67 | 59 | 54 | 0.134020619 | 148 | 130 | 120 | 0.288659794 | 2.55 | 2.43 | 2.38 | 0.001752577 | 108 | 103.5 | 102 | 0.06185567 | 80 | 71 | 71 | 0.092783505 | 5.4 | 3.9 | 1.3 | 0.042268041 | 1 | 6.9 | 5.7 | 5.2 | 0.017525773 | 8 | 7 | 5 | 0.007751938 | 39.8 | 37.55 | 35.7 | 0.042268041 | 127 | 119 | 112 | 0.154639175 | 6 | 4 | 3 | 0.007751938 | 38.4 | 20.5 | 11.1 | 0.281443299 | 10.8 | 6.15 | 4.5 | 0.064948454 | 8 | 4 | 2 | 0.015503876 | -490 | 1 | 383 | 383 | 383 | 4.8 | 4.5 | 4.4 | 0.004123711 | 76 | 73 | 63 | 0.134020619 | 4190 | 3950 | 3780 | 4.226804124 | 4 | 4 | 3 | 0.002583979 | 143 | 140 | 138 | 0.051546392 | 70641 | 8 | 5.5 | 5 | 0.007751938 | 7.5 | 7 | 6 | 0.024193548 | 12.38 | 7.905 | 4.96 | 0.076494845 |
9 | 65 | 45 | 42 | 36 | 0.021327014 | 0 | 37 | 37 | 37 | 0 | 48 | 16.5 | 13 | 0.082938389 | 272 | 25 | 22 | 0.592417062 | 1.1 | 0.8 | 0.6 | 0.001184834 | 27.4 | 22.9 | 19.3 | 0.019194313 | 12 | 9.5 | 5 | 0.016587678 | 7.5 | 6.1 | 4.5 | 0.007109005 | 0.002730997 | 102 | 85 | 69 | 0.078199052 | 179 | 140 | 118 | 0.144549763 | 2.57 | 2.43 | 2.34 | 0.000545024 | 107 | 105 | 102 | 0.011848341 | 106 | 91.5 | 84 | 0.052132701 | 3.7 | 1.6 | 1.1 | 0.006161137 | 2 | 7.7 | 5.65 | 4.5 | 0.007582938 | 6 | 6 | 6 | 0 | 50 | 45 | 42 | 0.018957346 | 159 | 149 | 140 | 0.045023697 | 8 | 8 | 8 | 0 | 39 | 32.9 | 25.7 | 0.031516588 | 6.9 | 5.3 | 4.3 | 0.006161137 | 12 | 12 | 12 | 0 | -329 | 4 | 357 | 258 | 229 | 5.1 | 4.45 | 4.1 | 0.002369668 | 84 | 67.5 | 59 | 0.059241706 | 5000 | 4700 | 4400 | 1.421800948 | 4 | 4 | 4 | 0 | 146 | 144 | 140 | 0.014218009 | 108342 | 7 | 7 | 7 | 0 | 6 | 5.5 | 5 | 0.002369668 | 11.53 | 9.29 | 7.94 | 0.008507109 |
10 | 67 | 40 | 35 | 25 | 0.038860104 | -1.40717675 | 32 | 24 | 9 | 0.059585492 | 22 | 15 | 8 | 0.03626943 | 25 | 22 | 17 | 0.020725389 | 1.5 | 0.85 | 0.6 | 0.002331606 | 34.5 | 26.85 | 24.2 | 0.026683938 | 10.26 | 7.8475 | 5 |