Given a year of EHR data for patients without diabetes, we predict which of those patients will be diagnosed with diabetes in the following year.
Data Scientist A
The company wants to predict which patients have diabetes using EMR data. The company believes this will reduce costs significantly.
We are going to use log loss as the metric.
===================================================================================================
Data Scientist B
Background
Business objectives
Business success criteria
The log loss is $$-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i})\right]$$, where:
* $$N$$ is the number of patients
* $$\log$$ is the natural logarithm
* $$\hat{y}_{i}$$ is the posterior probability that the $$i$$th patient has diabetes
* $$y_{i}$$ is the ground truth ($$y_{i}=1$$ means the patient has diabetes, $$y_{i}=0$$ means the patient does not)
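
As a minimal sketch of how this metric could be computed from the symbols defined above, the function below takes the ground-truth labels and the predicted probabilities. The function name `log_loss`, the clipping constant `eps`, and the sample values are assumptions made for illustration only; they are not part of the project data.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss over N patients, using the natural logarithm.

    y_true: ground-truth labels (1 = has diabetes, 0 = does not).
    y_prob: predicted probabilities that each patient has diabetes.
    eps:    assumed clipping constant that keeps log() away from 0.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("at least one patient is required")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Illustrative (made-up) labels and probabilities for four patients.
y_true = [1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.1, 0.6]
print(round(log_loss(y_true, y_prob), 4))
```

Lower values are better: a confident wrong prediction is penalized much more heavily than an uncertain one, which is why the metric rewards well-calibrated probabilities rather than only correct class labels.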
Inventory of resources
Personnel Sources:
Sources of Data and Knowledge:
Identified data sources:
Identified types of data sources:
Identified knowledge sources:
Identified types of knowledge sources:
Describe the relevant background knowledge:
Type 2 diabetes is one of the two major types of diabetes. In this type, the beta cells of the pancreas produce insulin, but the body cannot use it effectively because its cells are resistant to the action of insulin. Although type 2 diabetes may not carry the same risk of death from ketoacidosis, it otherwise involves many of the same risks of complications as type 1 diabetes (in which there is a lack of insulin).
The aim of treatment is to normalize blood glucose in order to prevent or minimize complications. People with type 2 diabetes may experience marked hyperglycemia, but many do not require insulin injections and can be treated with diet, exercise, and oral hypoglycemic agents (drugs taken by mouth to lower blood sugar).
Type 2 diabetes requires good dietary control, including restriction of calories and lower consumption of simple carbohydrates and fat, with increased consumption of complex carbohydrates and fiber. Regular aerobic exercise is also an important part of treating type 2 diabetes, since it decreases insulin resistance and helps burn excess glucose. Regular exercise may also help lower blood lipids and reduce some effects of stress, both important factors in treating diabetes and preventing complications.
Type 2 diabetes is also known as insulin-resistant diabetes, non-insulin dependent diabetes, and adult-onset diabetes.
Computing Resources:
Model | vCPU | Mem (GB) | Storage | Dedicated EBS Bandwidth (Mbps) |
---|---|---|---|---|
c4.4xlarge | 16 | 30 | EBS-Only | 2,000 |
* Software:
  * R
  * Python
  * Jupyter
Constraint | Result |
---|---|
All usernames/passwords tested for data access? | YES |
Verified all legal constraints on data usage? | YES |
Financial constraints covered in the project budget? | YES |
Task | Duration | Running Total |
---|---|---|
Describe Data | 10 Days | 10 Days |
Explore Data | 15 Days | 25 Days |
Inferential Statistics | 10 Days | 35 Days |
Modelling | 20 Days | 55 Days |
Write-Up | 5 Days | 60 Days |
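
The running totals in the schedule can be checked with a short cumulative sum. This is only a sketch; the task names and durations are copied from the table, and the variable names are illustrative.

```python
from itertools import accumulate

# (task, duration in days), copied from the schedule table.
tasks = [
    ("Describe Data", 10),
    ("Explore Data", 15),
    ("Inferential Statistics", 10),
    ("Modelling", 20),
    ("Write-Up", 5),
]

# Cumulative sum of the durations gives the running total per task.
running = list(accumulate(days for _, days in tasks))

for (name, days), total in zip(tasks, running):
    print(f"{name}: {days} days, running total {total} days")
```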
* Comprehensibility and quality of results
* Security
* Legal issues
* Assumptions
* Constraints
Describe the intended plan for achieving the data science goals and thereby the business goals. Specify the steps to be performed during the rest of the project, including the initial selection of tools and techniques.
Stage | Duration | Resource Required | Inputs | Outputs | Dependencies |
---|---|---|---|---|---|
Describe Data | 10 Days | Data Scientist | SQL Tables from EMR | Summary Statistics; Visualizations | Data Architect |
Exploratory Data Analysis | 15 Days | Data Scientist, Clinician | Data | Outliers; Visualizations | Data Description |
Inferential Statistics | 10 Days | Data Scientist, Clinician | Data | Imputation of Values | Clean Data Set |
Feature Engineering | 1 Day | Data Scientist | Data | Enhanced Features | Inferential Statistics |
Feature Selection | 1 Day | Data Scientist | Data | Important Features | Feature Engineering |
Model Training | 15 Days | Data Scientist | Data | Highest Accuracy Models | Feature Selection |
Evaluation | 4 Days | Data Scientist | Models | Identify Successful Models | Model Training |
Ensemble | 4 Days | Data Scientist | Evaluation; Models | Stable Model | Evaluation |
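
The plan does not say how the ensemble in the final stage would be built. One common option, shown here purely as an assumption, is to average the predicted probabilities of several models and compare the result against each model using log loss. The helper names `log_loss` and `average_ensemble` and all probability values below are hypothetical and chosen only for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss (natural logarithm), with clipped probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def average_ensemble(prob_lists):
    """Average the per-patient probabilities across models (assumed strategy)."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]


# Made-up ground truth and predictions from two hypothetical models.
y_true = [1, 0, 1, 0]
model_a = [0.9, 0.4, 0.6, 0.1]
model_b = [0.6, 0.1, 0.9, 0.4]

ensemble = average_ensemble([model_a, model_b])

for name, probs in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(name, round(log_loss(y_true, probs), 4))
```

Because log loss is convex in the predicted probability, the averaged prediction can never score worse than the average of the individual models' log losses, which is one reason simple averaging is often tried first when a stable model is the goal.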