Assignment 2: Machine Learning Model Training 1
• Two multi-part, multiple-choice questions.
• AI in Healthcare with Phase 2 data set (HTML file)
• Details of the Q1 and Q2 multiple-choice questions are shown in the attached question files.
• Lecture notes on Machine Learning in Healthcare for your reference
Phase 2: Model Training, Part 1
Welcome to Phase 2 of the capstone project. This section is the first of two parts concerning the model training stage of the model development cycle. You continue to play the role of a bioinformatics professor. The questions relate to the various challenges faced by the teams working on the two projects introduced in the first section.
Your two research teams have begun working on the projects, and have some preliminary results. Both teams have e-mailed you summaries of progress thus far, which are shown below.
Project 1: CXR-based COVID-19 Detector
Hi,

We are super excited to get this project kicked off! We have implemented the data pipeline and trained a few preliminary models, but there is still lots of room for improvement. Here is what we’ve done so far:

We split the data randomly into a training set and a test set, placing 90% of the data in the training set and 10% in the test set. The images were initially massive, on the order of 3000 by 3000 pixels, so we resized them to 224 by 224 pixels. We are using the ResNet-50 CNN architecture.

During training, we apply data augmentation. Concretely, for a given image, with 50% probability we zoom in on a small, randomly selected region before feeding it to the model. Here is an example:
So far, we have seen the following training curves from our model. Neither the training loss nor the test loss decreases very much.
As you can see, there is plenty of room for improvement. We’ll keep working on it, but let us know if you have any suggestions. Thanks.
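The split and augmentation steps described in the Project 1 e-mail could be sketched roughly as follows. This is only an illustration under stated assumptions, not the team's actual pipeline: the function names `random_split` and `random_zoom`, the `crop_frac` setting (the e-mail does not say how small the zoomed region is), and the nearest-neighbour upsampling are all hypothetical choices.

```python
import numpy as np

def random_split(n_items, train_frac=0.9, seed=0):
    """Shuffle indices 0..n_items-1 and split them into train/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(train_frac * n_items))
    return order[:n_train], order[n_train:]

def random_zoom(image, rng, p=0.5, crop_frac=0.3):
    """With probability p, crop a random region and scale it back to full size.

    image is a 2-D array (H, W). crop_frac is an assumed value for how
    'small' the zoomed region is; the e-mail does not give a number.
    """
    if rng.random() >= p:
        return image  # leave the image unchanged
    h, w = image.shape
    ch, cw = max(1, int(h * crop_frac)), max(1, int(w * crop_frac))
    top = rng.integers(0, h - ch + 1)    # random top-left corner of the crop
    left = rng.integers(0, w - cw + 1)
    crop = image[top:top + ch, left:left + cw]
    # Nearest-neighbour upsampling of the crop back to (H, W).
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[np.ix_(rows, cols)]
```

Note that with a small `crop_frac`, the zoomed image keeps only a small fraction of the original field of view, which is worth keeping in mind when judging whether this augmentation preserves the information the model needs.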
Project 2: EHR-based Intubation Predictor
Hello,

We are in the process of cleaning up the COVID EHR data and expect to get a model training soon. We attempted to train a set of preliminary models but ran into some data issues. We were wondering if you could take a look at some of the problems we’ve found in the data and let us know what you think.

First, we noticed that the EHR data is actually much sparser than we expected. We only have about 3,000 EHR records, not 30,000 as we originally thought. This leaves us with about 300 COVID-positive and 2,700 COVID-negative exams. We might not be able to train a model on this data alone.

We are also noticing some very strange patterns in the data, particularly in the lab values. For example, see the following histogram of D-DIMER lab values found for each exam across the entire dataset. The x-axis is the D-DIMER lab value, and the y-axis is the number of exams with that value. We use a log scale on the y-axis to improve readability.
We saw this in several CSV columns, including the Ferritin and Procalcitonin lab values. **We suspect that there is some underlying phenomenon affecting all three lab values.** Another issue we ran into was missing column values. We can’t create a feature vector for logistic regression if some values are missing. How do you suggest we proceed regarding both the large outlier values and the missing values?

Below is an example of the data once again, this time a sample of 30 exams (with the observed symptoms excluded). Note that NaN in the CSV means that the value is missing. Please take a look and let us know if you see something that we might have missed.
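One way to start investigating both concerns is to measure how much of each column is missing and to compare the suspicious lab values across groups such as `clinic`. The sketch below is an assumption-laden illustration rather than the team's method: it uses six rows copied from the sample table in this section (indices 5, 6, 7, 8, 10, and 12, with most columns dropped), and the choice to group by `clinic` is only one hypothesis worth checking.

```python
import numpy as np
import pandas as pd

# Six rows copied from the sample table (indices 5, 6, 7, 8, 10, 12);
# only a handful of columns are kept to keep the example short.
sample = pd.DataFrame({
    "clinic":   ["Clinic B", "Clinic B", "Clinic C", "Clinic C", "Clinic B", "Clinic C"],
    "DDIMER":   [400000.0, 700000.0, 600.0, 1100.0, 600000.0, 500.0],
    "FERRITIN": [503900.0, 579600.0, 562.3, 877.6, 615900.0, 459.5],
    "PROCTL":   [0.0, 100.0, 0.2, 0.3, 100.0, 0.0],
    "PT":       [np.nan, 10.3, 12.8, 13.5, 11.1, 12.8],
})

# Fraction of missing (NaN) values in each column.
missing_fraction = sample.isna().mean()

# Median of each suspicious lab value, computed separately per clinic.
lab_cols = ["DDIMER", "FERRITIN", "PROCTL"]
per_clinic_median = sample.groupby("clinic")[lab_cols].median()

print(missing_fraction)
print(per_clinic_median)
```

If the per-group medians differ by a roughly constant factor, that would suggest the extreme values come from how a site records the measurement rather than from the patients themselves. On the full dataset the same two calls would apply unchanged; whether to rescale, clip, impute, or drop values is a separate decision that depends on what the comparison shows.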
In [1]:
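import pandas as pd
# Columns to display (the observed-symptom columns are excluded).
columns = ['pat_deid', 'intubation_date', 'IP_admission_date', 'IP_discharge_date', 'clinic', 'birth_date', 'death_date', 'gender', 'ethnicity', 'race_new', 'LYMAB', 'CK', 'CR', 'LDH', 'TNI', 'DDIMER', 'FERRITIN', 'PROCTL', 'PT', 'BUN', 'CRP', 'SPO2', 'FIO2', 'NA']
# Show the 30 exams at row positions 5 through 34.
pd.read_csv('COVID_19_sample_data.csv')[columns].iloc[5:35]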
Out[1]:
| | pat_deid | intubation_date | IP_admission_date | IP_discharge_date | clinic | birth_date | death_date | gender | ethnicity | race_new | LYMAB | CK | CR | LDH | TNI | DDIMER | FERRITIN | PROCTL | PT | BUN | CRP | SPO2 | FIO2 | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 8f9539f2-e6ad-4e00-ad45-4abc2bff2214 | NaN | 2020-03-04 | 2020-03-14 | Clinic B | 1965-11-11 | NaN | F | nonhispanic | white | 1.0 | 51.6 | NaN | 232.4 | 0.0020 | 400000.0 | 503900.0 | 0.0 | NaN | 16.4 | 0.88 | 88.1 | NaN | 136.9 |
6 | d5dd13c4-c31e-419c-8c02-47e4ca1ac5e2 | NaN | 2020-03-02 | 2020-03-20 | Clinic B | 2018-08-16 | NaN | F | nonhispanic | white | 1.1 | 52.9 | NaN | 300.7 | 0.0030 | 700000.0 | 579600.0 | 100.0 | 10.3 | 16.2 | 0.83 | 80.6 | 32.8 | 140.0 |
7 | 91369e11-b944-4132-be0f-af46e880936b | NaN | 2020-03-02 | 2020-03-21 | Clinic C | 1972-09-22 | NaN | M | nonhispanic | white | 1.2 | 45.0 | NaN | NaN | 0.0047 | 600.0 | 562.3 | 0.2 | 12.8 | 10.7 | 0.85 | 84.4 | NaN | 141.4 |
8 | c70992c9-ff13-467b-9032-1901506edeef | NaN | 2020-02-29 | 2020-03-05 | Clinic C | 1959-06-17 | 2020-03-11 | M | nonhispanic | white | 0.8 | 84.2 | NaN | 321.8 | 0.0092 | 1100.0 | 877.6 | 0.3 | 13.5 | 7.6 | NaN | 88.6 | 88.3 | 141.0 |
9 | c70992c9-ff13-467b-9032-1901506edeef | 2020-03-05 | 2020-03-05 | 2020-03-12 | Clinic B | 1959-06-17 | 2020-03-11 | M | nonhispanic | white | 0.5 | 29.7 | NaN | 391.9 | 0.0553 | 17900000.0 | 1786400.0 | 200.0 | 12.7 | 11.5 | 0.94 | 75.6 | 88.3 | 143.2 |
10 | 9ec7d743-96e7-47c8-b2ee-6336633beb39 | NaN | 2020-03-10 | 2020-03-23 | Clinic B | 1969-03-22 | NaN | M | nonhispanic | white | 1.1 | 35.7 | NaN | 312.5 | 0.0045 | 600000.0 | 615900.0 | 100.0 | 11.1 | 15.6 | 0.98 | 83.7 | NaN | 143.8 |
11 | 9ec7d743-96e7-47c8-b2ee-6336633beb39 | NaN | 2020-03-23 | 2020-03-25 | Clinic C | 1969-03-22 | NaN | M | nonhispanic | white | 1.2 | 24.4 | NaN | 254.3 | 0.0021 | 600.0 | 406.2 | 0.1 | 11.8 | 15.6 | 0.80 | NaN | NaN | 143.8 |
12 | a527bcf0-3746-476c-90f2-dbab8868385e | NaN | 2020-03-03 | 2020-03-17 | Clinic C | 1978-11-27 | NaN | F | nonhispanic | white | 1.1 | 25.7 | NaN | 228.1 | 0.0038 | 500.0 | 459.5 | 0.0 | 12.8 | 13.5 | 1.14 | 77.7 | NaN | 138.4 |
13 | 7f4ef129-1511-47a9-a9b7-8b0b2d02ad50 | NaN | 2020-03-12 | 2020-03-25 | Clinic C | 1952-05-06 | NaN | F | nonhispanic | white | 1.0 | 29.7 | NaN | 321.5 | 0.0038 | 700.0 | 450.1 | 0.2 | 12.1 | 15.0 | 1.20 | 75.8 | NaN | 139.6 |
14 | 7078ae9a-4c79-4b30-b127-f76aabb6763e | NaN | 2020-02-17 | 2020-03-07 | Clinic B | 1968-04-26 | NaN | F | hispanic | white | 0.9 | 48.7 | NaN | 306.5 | NaN | 600000.0 | 509000.0 | 100.0 | 12.0 | 18.4 | 1.14 | 77.3 | NaN | 139.6 |
15 | a5c39700-6bf3-4984-af46-31344695e21b | NaN | 2020-03-05 | 2020-03-13 | Clinic A | 1940-01-09 | 2020-03-15 | M | nonhispanic | white | 0.7 | 85.5 | NaN | 312.9 | 0.0257 | NaN | 1552.8 | NaN | NaN | 14.9 | 1.23 | 79.2 | 64.5 | 143.8 |
16 | a5c39700-6bf3-4984-af46-31344695e21b | 2020-03-12 | 2020-03-12 | 2020-03-16 | Clinic C | 1940-01-09 | 2020-03-15 | M | nonhispanic | white | 0.5 | 170.6 | NaN | 390.5 | 0.0238 | 15000.0 | 1755.9 | 0.2 | 14.0 | 14.9 | 1.15 | 75.6 | 40.9 | 143.8 |
17 | ddb2d5e2-643e-4374-ac19-f6ca3c0d16f5 | NaN | 2020-02-25 | 2020-03-09 | Clinic C | 1967-12-24 | NaN | M | nonhispanic | white | NaN | 51.5 | NaN | 234.4 | 0.0029 | 500.0 | 527.4 | 0.1 | 11.5 | 12.0 | 0.90 | 88.9 | 42.0 | 138.8 |
18 | 21505aac-f219-43a8-ab3c-f57c6d8f1d1f | NaN | 2020-03-08 | 2020-03-21 | Clinic B | 1940-05-03 | NaN | F | nonhispanic | white | 1.1 | 38.7 | NaN | 229.3 | 0.0029 | 600000.0 | 536500.0 | NaN | 12.6 | 8.4 | 1.07 | 77.0 | NaN | 137.4 |
19 | 7992bf94-feee-4728-9187-2c911df2819b | NaN | 2020-03-03 | 2020-03-17 | Clinic C | 2004-07-04 | NaN | F | nonhispanic | white | 1.0 | 27.3 | NaN | 238.9 | 0.0022 | 600.0 | NaN | 0.1 | 11.0 | 10.9 | 0.98 | NaN | NaN | 138.4 |
20 | d2f6d528-39db-4b7e-8389-abd27af9a710 | NaN | 2020-02-29 | 2020-03-12 | Clinic B | 1996-06-26 | NaN | F | nonhispanic | white | 1.1 | 31.5 | NaN | 254.3 | 0.0034 | 700000.0 | 455300.0 | 0.0 | 10.3 | 14.6 | 1.08 | 76.6 | NaN | 138.4 |
21 | fa0b58e6-6817-4d49-8211-1dd34abf0c15 | NaN | 2020-03-11 | 2020-03-28 | Clinic C | 2008-11-21 | NaN | M | nonhispanic | white | 0.9 | 23.0 | NaN | 232.0 | 0.0030 | 600.0 | 542.0 | 0.1 | 10.8 | 10.6 | 0.88 | 86.3 | NaN | 141.6 |
22 | b83237f3-9ff5-491e-aab4-d63ccff85f85 | NaN | 2020-03-13 | 2020-03-30 | Clinic C | 2012-11-17 | NaN | M | nonhispanic | white | 1.1 | 47.2 | NaN | 329.8 | 0.0038 | 700.0 | 535.0 | 0.0 | 12.4 | 10.1 | 1.13 | 75.3 | NaN | 137.3 |
23 | 46988a9c-9c86-429a-bc4a-b3d14ff321b0 | NaN | 2020-03-11 | 2020-03-21 | Clinic B | 1957-03-13 | NaN | M | nonhispanic | asian | NaN | 37.0 | NaN | 235.7 | 0.0030 | NaN | 535000.0 | 100.0 | 10.2 | 19.1 | 0.86 | 79.1 | NaN | 137.5 |
24 | 46988a9c-9c86-429a-bc4a-b3d14ff321b0 | 2020-03-20 | 2020-03-21 | 2020-03-24 | Clinic B | 1957-03-13 | NaN | M | nonhispanic | asian | 1.2 | 29.6 | NaN | 304.2 | 0.0044 | 500000.0 | 553400.0 | 0.0 | 11.8 | 9.2 | 1.13 | 98.7 | 86.8 | 143.5 |
25 | 785b484d-7060-4d17-bf18-ef8bbafc6f04 | NaN | 2020-02-28 | 2020-03-10 | Clinic B | 1942-08-24 | NaN | F | nonhispanic | white | 0.8 | 21.3 | NaN | 226.3 | 0.0023 | 500000.0 | 468600.0 | NaN | 12.6 | 17.2 | 0.91 | 81.7 | NaN | 138.2 |
26 | edad31f3-5a08-4678-8d31-271a41a2aad5 | NaN | 2020-03-05 | 2020-03-13 | Clinic C | 1940-01-09 | 2020-03-19 | M | nonhispanic | white | 0.6 | 78.6 | NaN | 306.4 | 0.0256 | 2600.0 | 1764.2 | 0.2 | 13.8 | 18.0 | 1.14 | 83.0 | 62.0 | 141.2 |
27 | edad31f3-5a08-4678-8d31-271a41a2aad5 | 2020-03-12 | 2020-03-12 | 2020-03-20 | Clinic C | 1940-01-09 | 2020-03-19 | M | nonhispanic | white | 0.3 | 184.4 | NaN | 370.1 | 0.0639 | 20600.0 | 1804.0 | 0.3 | 11.5 | 7.7 | 1.23 | 84.4 | NaN | 142.2 |
28 | 4607a669-4a97-4f0a-9661-856569905047 | NaN | 2020-03-09 | 2020-03-21 | Clinic C | 1993-11-26 | NaN | F | nonhispanic | white | 1.1 | 48.7 | NaN | NaN | 0.0037 | 600.0 | 590.9 | 0.1 | 12.5 | 13.7 | 0.97 | 81.1 | NaN | 142.4 |
29 | c1800ba1-7cba-45d7-bdc4-0e0b583932e4 | NaN | 2020-02-23 | 2020-03-08 | Clinic A | 2018-01-20 | NaN | M | hispanic | white | NaN | 25.7 | NaN | 306.4 | 0.0030 | 600.0 | 464.0 | 0.0 | 10.9 | 7.6 | NaN | 77.8 | NaN | 141.7 |
30 | d2718050-2e9c-4d5b-842e-52d910c1563f | NaN | 2020-03-04 | 2020-03-17 | Clinic C | 1997-06-01 | NaN | M | nonhispanic | white | 1.2 | 50.6 | NaN | 339.8 | 0.0047 | 600.0 | 634.5 | 0.2 | 10.2 | 13.6 | 1.08 | 81.5 | NaN | 137.9 |
31 | d2718050-2e9c-4d5b-842e-52d910c1563f | NaN | 2020-03-17 | 2020-03-22 | Clinic A | 1997-06-01 | NaN | M | nonhispanic | white | 1.3 |