Given a year of EHR data for patients without Diabetes, we predict which patients will be diagnosed with Diabetes in the next year.
Data Scientist A
Our aim is to decide on the data to be used for analysis. Criteria for this include relevance to the data science goals, quality, and technical constraints such as limits on data volume or data types.
Raise the data quality to the level required by the selected analysis techniques.
This task includes constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes.
These are methods whereby information is combined from multiple tables or records to create new records or values.
Formatting transformations refer to the syntactic modifications made to the data that do not change the meaning, but might be required by the modeling tool.
===================================================================================================
Data Scientist B
Our aim is to decide on the data to be used for analysis. Criteria for this include relevance to the data science goals, quality, and technical constraints such as limits on data volume or data types.
We included data records that appeared normal to our team's subject matter experts. Any data records that appeared abnormal were excluded from the training dataset.
We aimed to raise the data quality to the level required by our selected analysis techniques.
We addressed typos and outliers to resolve data quality problems that we encountered during the verify data quality and data understanding phase. We spent a considerable amount of time cleaning the transcript measures.
Below are the features that we created as derived attributes:
We did not generate any new records for this data.
We merged data using various methods, implemented in ~1,800 lines of code in the FeatureCreation.R file, to create new values for each patient record:
New Values Created | Features Generated |
---|---|
Smoker | |
L2_AbdominalHernia | |
L2_AbdominalPain | |
L2_AbuseMonotoring | |
L2_Acne | |
L2_AcuteBronchitis | |
L2_AcuteCystitis | |
L2_Alcohol | |
L2_Allergy | |
L2_AMI | |
L2_AnginaPectoris | |
L2_Anxiety | |
L2_Arthropathy | |
L2_Asthma | |
L2_AtherosclerosisCoronary | |
L2_AtherosclerosisPeripheral | |
L2_BackPain | |
L2_Bladder | |
L2_BlindDeficiency | |
L2_BloodAnormal | |
L2_BoneCartilage | |
L2_BoneDeformity | |
L2_Calculus | |
L2_Candida | |
L2_CaProstate | |
L2_CardiacInsufficiency | |
L2_CardiacOther | |
L2_CardiacValve | |
L2_CarpalSyndrome | |
L2_CaSkin | |
L2_Cataract | |
L2_Cellulitis | |
L2_CerebralDegeneration | |
L2_CerebroVascular | |
L2_Cerumen | |
L2_Cervical | |
L2_ChestPain | |
L2_ChronicCystitis | |
L2_ChronicPainSynd | |
L2_ChronicRenalFailure | |
L2_Coagulation | |
L2_ColitisNoninfectious | |
L2_Conjunctivitis | |
L2_Constipation | |
L2_COPD | |
L2_Corns | |
L2_Cough | |
L2_CRP | |
L2_DeficiencyAnemia |
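
As a rough illustration of how information from multiple records can be combined into new per-patient values, the sketch below collapses a many-rows-per-patient diagnosis table into one row per patient with a 0/1 flag per diagnosis category. This is not the code from FeatureCreation.R, which is not shown in this report. The input tables, the `build_indicator_features` function, and the assumption that the listed features are binary flags are all hypothetical; the `L2_` column names are borrowed from the table above only to make the output recognizable.

```python
# Hypothetical inputs: a base table with one row per patient and a
# diagnosis table with zero or more rows per patient.
patients = [{"patient_id": 1}, {"patient_id": 2}, {"patient_id": 3}]
diagnoses = [
    {"patient_id": 1, "category": "Asthma"},
    {"patient_id": 1, "category": "Cough"},
    {"patient_id": 1, "category": "Asthma"},  # repeated diagnosis
    {"patient_id": 3, "category": "COPD"},
]
categories = ["Asthma", "COPD", "Cough"]


def build_indicator_features(patients, diagnoses, categories):
    """Return one record per patient with a 0/1 flag per category."""
    # Collect the set of categories seen for each patient.
    seen = {}
    for row in diagnoses:
        seen.setdefault(row["patient_id"], set()).add(row["category"])

    records = []
    for patient in patients:
        record = dict(patient)  # copy so the input table is not modified
        patient_categories = seen.get(patient["patient_id"], set())
        for category in categories:
            # Assumed naming scheme: "L2_" prefix plus the category name.
            record["L2_" + category] = 1 if category in patient_categories else 0
        records.append(record)
    return records


features = build_indicator_features(patients, diagnoses, categories)
for record in features:
    print(record)
```

Patients with no diagnosis rows still receive a record, with every flag set to 0, so the merged table keeps exactly one row per patient.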
We did not change the meaning of the data, but we did reformat many of the continuous values in the patient data.
Reformatted data: We normalized the continuous features in the Patient ABT that cover very different ranges, since this is known to cause difficulty for some machine learning algorithms. We applied this technique so that each continuous feature falls within a specified range while maintaining the relative differences between the values for the feature.
$$a'_{i}=\frac{a_{i}-\min(a)}{\max(a)-\min(a)} \times (high-low)+low$$

where $$a'_{i}$$ is the normalized feature value, $$a_{i}$$ is the original value, $$\min(a)$$ is the minimum value of feature $$a$$, $$\max(a)$$ is the maximum value of feature $$a$$, and $$low$$ and $$high$$ are the minimum and maximum values of the desired range.
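
The normalization formula can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `range_normalize`, the sample values, and the handling of constant features (where the formula would divide by zero) are assumptions made here.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Linearly rescale values into [low, high] using the formula above."""
    if not values:
        return []
    min_a = min(values)
    max_a = max(values)
    span = max_a - min_a
    if span == 0:
        # Assumption: a constant feature has no spread, so the formula is
        # undefined; map every value to the lower bound instead.
        return [low for _ in values]
    return [(a_i - min_a) / span * (high - low) + low for a_i in values]


# Hypothetical continuous feature with a wide range.
sample = [20.0, 35.0, 50.0, 80.0]
print(range_normalize(sample))             # default range [0, 1]
print(range_normalize(sample, -1.0, 1.0))  # custom range [-1, 1]
```

The minimum of the feature maps to `low` and the maximum to `high`, and the spacing between values is preserved proportionally, which is the property the text relies on.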