Finding Type II Diabetes In A Patient Population¶

Given a year of EHR data for patients without Diabetes, we predict which patients will be diagnosed with Diabetes in the next year.

Exercise 3¶

Choose between A and B.

Data Scientist A

PHASE III: Data Preparation¶

Select Data¶

Our aim is to decide on the data to be used for analysis. Criteria for this include relevance to the data science goals, quality, and technical constraints such as limits on data volume or data types.

Rationale for inclusion/exclusion
- List the data to be included and the reasons:
- List the data to be excluded and the reason:

Clean Data¶

Raise the data quality to the level required by the selected analysis techniques.

Data cleaning report
- Describe what decisions and actions were taken to address the data quality problems reported during the Verify Data Quality and Data Understanding phase:
- Transformations of the data for cleaning purposes and the possible impact on the analysis should be considered.

Construct Data¶

This task includes constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes.

Derived attributes
- New attributes that are constructed from one or more existing attributes:

Generated records
- Describe the creation of completely new records:

Integrate Data¶

These are methods whereby information is combined from multiple tables or records to create new records or values.

Merged data
- Joining together various tables:
- Aggregations:

Format Data¶

Formatting transformations refer to the syntactic modifications made to the data that do not change the meaning, but might be required by the modeling tool.

Reformatted data
- What was required to get the data into a form the modeling tool accepts:

===================================================================================================

Data Scientist B

PHASE III: Data Preparation¶

Select Data¶

Our aim is to decide on the data to be used for analysis. Criteria for this include relevance to the data science goals, quality, and technical constraints such as limits on data volume or data types.

We included the data records that appeared normal to our team's subject matter experts. Records that appeared abnormal were excluded from the training dataset.

Clean Data¶

We aimed to raise the data quality to the level required by our selected analysis techniques.

We addressed typos and outliers to resolve the data quality problems that we encountered during the Verify Data Quality and Data Understanding phase. We spent a considerable amount of time cleaning the transcript measures.

Construct Data¶

The following attributes were derived:

- The median height was calculated for each patient.
- BMI was recalculated with this constant height for each patient, eliminating noise from fluctuations in the measurements.
- For weight, height, BMI, systolic blood pressure, diastolic blood pressure, temperature, and respiratory rate, the median and the truncated maximum and minimum were calculated.
- Because the years 2009 and 2012 are incomplete, a weight was used when calculating ratio-based features for those years.
- Other features were created for missing data: medications without prescriptions, diagnoses without transcripts, and patients without lab observations.
- The medication families covered were ACEI (Angiotensin Converting Enzyme Inhibitors), AIIRA (Angiotensin II Receptor Antagonists), antifungals, benzodiazepines, beta blockers, fibrates, glucocorticoids, L-type calcium channel blockers, statins, thiazides, antilipemics, and loop diuretics.
- For states with more than 450 patients, a binary feature was used.
- 'Smoking Status' and 'Previous Smoking Situation' were created. Other possible features, such as allergy or immunization, were not taken into account because of the low number of patients.
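
Two of the derived attributes above can be illustrated with a minimal Python sketch. This is not the code from FeatureCreation.R; the function names, the dictionary keys (`patient_id`, `height_m`, `weight_kg`), the `State_` prefix, and the units (metres and kilograms) are all assumptions made for the example.

```python
from collections import Counter
from statistics import median


def recompute_bmi(measurements):
    """Recompute BMI per visit using each patient's median height.

    `measurements` is a list of dicts with the hypothetical keys
    'patient_id', 'height_m' and 'weight_kg' (units are assumed).
    """
    # Collect every recorded height per patient.
    heights = {}
    for row in measurements:
        heights.setdefault(row["patient_id"], []).append(row["height_m"])
    median_height = {pid: median(values) for pid, values in heights.items()}

    # Use the constant (median) height so that height noise does not leak into BMI.
    result = []
    for row in measurements:
        h = median_height[row["patient_id"]]
        result.append({**row, "height_median_m": h, "bmi": row["weight_kg"] / h ** 2})
    return result


def state_indicators(patient_states, min_patients=450):
    """Create one 0/1 feature per state that has more than `min_patients` patients.

    `patient_states` maps a patient id to a state code (hypothetical layout).
    """
    counts = Counter(patient_states.values())
    frequent = sorted(s for s, n in counts.items() if n > min_patients)
    return {
        pid: {f"State_{s}": int(state == s) for s in frequent}
        for pid, state in patient_states.items()
    }
```

Patients from states below the threshold simply receive 0 in every state column, which keeps rare states from producing sparse features.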

We did not generate any new records for this data.

Integrate Data¶

We merged the data using various methods, implemented in ~1,800 lines of code in the FeatureCreation.R file, to create new values for each patient record:

New Values Created	Features Generated
	Smoker
	L2_AbdominalHernia
	L2_AbdominalPain
	L2_AbuseMonotoring
	L2_Acne
	L2_AcuteBronchitis
	L2_AcuteCystitis
	L2_Alcohol
	L2_Allergy
	L2_AMI
	L2_AnginaPectoris
	L2_Anxiety
	L2_Arthropathy
	L2_Asthma
	L2_AtherosclerosisCoronary
	L2_AtherosclerosisPeripheral
	L2_BackPain
	L2_Bladder
	L2_BlindDeficiency
	L2_BloodAnormal
	L2_BoneCartilage
	L2_BoneDeformity
	L2_Calculus
	L2_Candida
	L2_CaProstate
	L2_CardiacInsufficiency
	L2_CardiacOther
	L2_CardiacValve
	L2_CarpalSyndrome
	L2_CaSkin
	L2_Cataract
	L2_Cellulitis
	L2_CerebralDegeneration
	L2_CerebroVascular
	L2_Cerumen
	L2_Cervical
	L2_ChestPain
	L2_ChronicCystitis
	L2_ChronicPainSynd
	L2_ChronicRenalFailure
	L2_Coagulation
	L2_ColitisNoninfectious
	L2_Conjunctivitis
	L2_Constipation
	L2_COPD
	L2_Corns
	L2_Cough
	L2_CRP
	L2_DeficiencyAnemia
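
As an illustration of how per-patient columns such as the `L2_` features listed above could be produced by aggregating a diagnosis table, here is a minimal Python sketch. It assumes the features are 0/1 indicators and that each diagnosis record has already been mapped to a group name; neither assumption is confirmed by the source, and `diagnosis_flags` is a hypothetical helper, not part of FeatureCreation.R.

```python
def diagnosis_flags(patient_ids, diagnoses, groups):
    """Build one row per patient with a 0/1 column per diagnosis group.

    patient_ids: iterable of patient identifiers (one output row each).
    diagnoses:   iterable of (patient_id, group_name) pairs, i.e. the
                 diagnosis table after mapping codes to groups (assumed).
    groups:      group names to turn into columns, e.g. ["Asthma", "COPD"].
    """
    # Start with every flag at 0 so patients without diagnoses still get a row.
    rows = {pid: {f"L2_{g}": 0 for g in groups} for pid in patient_ids}
    for pid, group in diagnoses:
        # Ignore records for unknown patients or groups that are not modeled.
        if pid in rows and group in groups:
            rows[pid][f"L2_{group}"] = 1
    return rows


# Example usage with made-up records:
flags = diagnosis_flags(
    patient_ids=[1, 2],
    diagnoses=[(1, "Asthma"), (1, "Asthma"), (2, "COPD"), (3, "Acne")],
    groups=["Asthma", "COPD"],
)
```

Repeated diagnoses for the same patient collapse into a single flag, which is the aggregation step that turns many diagnosis rows into one patient record.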

Format Data¶

We did not change the meaning of the data, but we did reformat many of the continuous values in the patient data.

Reformatted data: We normalized the continuous features in the Patient ABT that cover very different ranges, since this is known to cause difficulty for some machine learning algorithms. Range normalization rescales a continuous feature to fall within a specified range while maintaining the relative differences between the values of the feature.

$$a'_{i}=\frac{a_{i}-\min(a)}{\max(a)-\min(a)} \times (high-low)+low$$

where

- $a'_{i}$ is the normalized feature value,
- $a_{i}$ is the original value,
- $\min(a)$ is the minimum value of feature $a$,
- $\max(a)$ is the maximum value of feature $a$,
- $low$ and $high$ are the minimum and maximum values of the desired range.
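
The range normalization formula can be sketched in a few lines of Python. The function name `range_normalize` and the handling of a constant feature (where max equals min and the formula would divide by zero) are assumptions made for this example, not part of the original analysis.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Rescale `values` linearly so they fall within [low, high]."""
    a_min, a_max = min(values), max(values)
    if a_max == a_min:
        # Assumption: a constant feature is mapped to `low` to avoid
        # dividing by zero; the source does not say how this case is handled.
        return [low for _ in values]
    return [(a - a_min) / (a_max - a_min) * (high - low) + low for a in values]


# Example usage with made-up values:
scaled_unit = range_normalize([10, 20, 30])            # → [0.0, 0.5, 1.0]
scaled_sym = range_normalize([10, 20, 30], -1.0, 1.0)  # → [-1.0, 0.0, 1.0]
```

Note that the minimum and maximum should be taken from the training data and reused on new data, otherwise the same raw value maps to different normalized values in different datasets.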
