Reference: https://en.wikipedia.org/wiki/Association_rule_learning
Association mining is a form of UNSUPERVISED LEARNING. It is used to find patterns, or rules, among 2 or more items in a dataset. In this sample, associations are calculated as follows:-
- Association from Diagnoses to Services,
- Association from Diagnosis and Provider to Service.
Screenshots towards the bottom of this page show that even if one does not have a medical background, one can get a pretty good understanding of associated diagnoses and services.
An association may be POSITIVE, i.e. presence of an item makes presence of another item more likely, or NEGATIVE, i.e. presence of an item makes presence of another item less likely. Whether an association is positive or negative is read from LIFT, a ratio of two components:-
- LIFT = ACTUAL / EXPECTED. It tells us: "How much more than expected is our association?"
- When LIFT = 1, there is neither positive nor negative association, i.e. the items compared are independent.
- When LIFT > 1, there is positive association, i.e. the items occur together more often than expected if they were independent.
- When LIFT < 1, there is negative association, i.e. the items occur together less often than expected if they were independent.
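The ACTUAL and EXPECTED metrics are calculated using concepts called SUPPORT and CONFIDENCE. ACTUAL is the confidence of the rule, and EXPECTED is the expected confidence, which equals the support of the RHS item.
- SUPPORT represents the frequency of an item in the dataset, as a fraction of all transactions.
- CONFIDENCE represents conditional probability, i.e. the probability of finding the RHS item given that the LHS item is present.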
- Support A = (No. of transactions containing A) / (Total No. of transactions)
- Support A to B = (No. of transactions containing A and B) / (Total No. of transactions)
- Confidence A to B = (No. of transactions containing A and B) / (No. of transactions containing A)
- Expected Confidence A to B = (No. of transactions containing B) / (Total No. of transactions)
- Lift A to B = (Confidence A to B) / (Expected Confidence A to B)
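The formulas above can be checked on a small example. The following is a minimal sketch, not the code in AssociationMining.py: the function name `association_metrics` and the toy transactions are made up for illustration, and each transaction is assumed to be a set of items.

```python
def association_metrics(transactions, a, b):
    """Compute support, confidence, expected confidence and lift for the rule A to B.

    `transactions` is a list of sets of items (an assumed representation).
    """
    total = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    if total == 0 or n_a == 0 or n_b == 0:
        raise ValueError("A and B must each occur in at least one transaction")

    confidence = n_ab / n_a              # Confidence A to B
    expected_confidence = n_b / total    # Expected Confidence A to B (same as Support B)
    return {
        "support_a": n_a / total,        # Support A
        "support_ab": n_ab / total,      # Support A to B
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": confidence / expected_confidence,
    }


# Toy data (hypothetical): 5 transactions, A in 3, B in 3, A and B together in 2.
transactions = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]
metrics = association_metrics(transactions, "A", "B")
print(metrics)
```

With this toy data, Confidence A to B is 2/3 and Expected Confidence A to B is 3/5, so Lift A to B is 10/9, slightly above 1, which indicates a weak positive association.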
(1) Clean raw data:-
- Python program to clean raw csv files: ~/association_mining/step01_clean_raw_data/CleanRawData.py
- INPUT: Raw input csv files at ~/association_mining/step01_clean_raw_data/raw_csv_files/*.csv
- OUTPUT: Clean csv files at ~/association_mining/step02_association_mining/clean_csv_files/
(2) Association Mining:-
- Python program to find associations: ~/association_mining/step02_association_mining/AssociationMining.py
- INPUT: Clean csv files at ~/association_mining/step02_association_mining/clean_csv_files/*.csv
- OUTPUT: ~/association_mining/step02_association_mining/clean_csv_files/tran_df.csv
Data is a sample of claims data. Columns explained below:-
RAW CSV FILES
(1) raw_csv_files/tran.csv
- tid: Transaction ID. This is equivalent to a claim id. A claim is submitted by a provider for receiving payment. This tid is the metric counted for finding associations.
- servprov: Servicing Provider ID. This is just an ID column.
- diagcode: Diagnosis ID. This is just an ID column, not the actual diagnosis code.
- servcode: Service Code. This is just an ID column, not the actual service code. The claim tells us which provider rendered what service against which diagnoses.
(2) raw_csv_files/diag.csv
- dimDiagnosisID: same as diagcode in the transactions file. This is just an ID column.
- DiagnosisCode: Diagnosis code present on the claim.
- DiagnosisShortDesc: Short description of the diagnosis.
- DiagnosisLongDesc: Long description of the diagnosis.
(3) raw_csv_files/prov.csv
- dimProviderID: same as servprov in the transactions file. This is just an ID column.
- ProviderName: Provider's name
(4) raw_csv_files/serv.csv
- dimServiceCodeID: same as servcode in the transactions file. This is just an ID column.
- ServiceCode: Service code present on the claim.
- ServiceCodeShortDesc: Short description of the service rendered.
- ServiceCodeLongDesc: Long description of the service rendered.
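Since tid is the metric counted, the counts needed for support and confidence can be pictured as numbers of distinct claims. The sketch below illustrates this on rows shaped like tran.csv. The rows and the helper `count_claims` are hypothetical, and counting each claim at most once per item is an assumption, not necessarily what the repository's program does.

```python
from collections import defaultdict

# Hypothetical rows in the column order of tran.csv: (tid, servprov, diagcode, servcode)
rows = [
    (1, 10, 100, 500),
    (1, 10, 101, 500),
    (2, 11, 100, 500),
    (3, 10, 100, 501),
    (3, 10, 100, 501),  # repeated line on the same claim; counted once below
]


def count_claims(rows):
    """Count distinct tids overall, per diagnosis, per service, and per (diagnosis, service) pair."""
    all_tids = set()
    diag_tids = defaultdict(set)
    serv_tids = defaultdict(set)
    pair_tids = defaultdict(set)
    for tid, _servprov, diagcode, servcode in rows:
        all_tids.add(tid)
        diag_tids[diagcode].add(tid)
        serv_tids[servcode].add(tid)
        pair_tids[(diagcode, servcode)].add(tid)
    return (
        len(all_tids),
        {k: len(v) for k, v in diag_tids.items()},
        {k: len(v) for k, v in serv_tids.items()},
        {k: len(v) for k, v in pair_tids.items()},
    )


total, diag_counts, serv_counts, pair_counts = count_claims(rows)
print(total, diag_counts, serv_counts, pair_counts)
```

These four counts are the numerators and denominators of the support, confidence and lift formulas, with a diagnosis as the LHS item and a service as the RHS item.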
CLEAN CSV FILES
(1) clean_csv_files/clean_tran.csv. This is generated by cleaning the raw csv file.
- tid: Transaction ID. This is equivalent to a claim id. A claim is submitted by a provider for receiving payment. This tid is the metric counted for finding associations.
- servprov: Servicing Provider ID. This is just an ID column.
- diagcode: Diagnosis ID. This is just an ID column, not the actual diagnosis code.
- servcode: Service Code. This is just an ID column, not the actual service code. The claim tells us which provider rendered what service against which diagnoses.
(2) clean_csv_files/clean_diag.csv. This is generated by cleaning the raw csv file.
- dimDiagnosisID: same as diagcode in the transactions file. This is just an ID column.
- DiagnosisCode: Diagnosis code present on the claim.
- DiagnosisShortDesc: Short description of the diagnosis.
- DiagnosisLongDesc: Long description of the diagnosis.
(3) clean_csv_files/clean_prov.csv. This is generated by cleaning the raw csv file.
- dimProviderID: same as servprov in the transactions file. This is just an ID column.
- ProvName: Provider's name randomly scrambled.
(4) clean_csv_files/clean_serv.csv. This is generated by cleaning the raw csv file.
- dimServiceCodeID: same as servcode in the transactions file. This is just an ID column.
- ServiceCode: Service code present on the claim.
- ServiceCodeShortDesc: Short description of the service rendered.
- ServiceCodeLongDesc: Long description of the service rendered.
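Because diagcode and servcode are only ID columns, readable results require looking them up in the diagnosis and service files. The following is a minimal sketch of that lookup using only the standard library. The in-memory CSV contents are invented placeholders, the helper `load_lookup` is hypothetical, and the files are assumed to have header rows with the column names listed above.

```python
import csv
import io

# Placeholder contents standing in for the clean csv files (values are made up).
diag_csv = io.StringIO(
    "dimDiagnosisID,DiagnosisCode,DiagnosisShortDesc,DiagnosisLongDesc\n"
    "100,D-001,Example diagnosis,Example diagnosis long description\n"
)
serv_csv = io.StringIO(
    "dimServiceCodeID,ServiceCode,ServiceCodeShortDesc,ServiceCodeLongDesc\n"
    "500,S-001,Example service,Example service long description\n"
)
tran_csv = io.StringIO(
    "tid,servprov,diagcode,servcode\n"
    "1,10,100,500\n"
)


def load_lookup(handle, key_column, value_column):
    """Build a dict mapping an ID column to a description column."""
    return {row[key_column]: row[value_column] for row in csv.DictReader(handle)}


diag_names = load_lookup(diag_csv, "dimDiagnosisID", "DiagnosisShortDesc")
serv_names = load_lookup(serv_csv, "dimServiceCodeID", "ServiceCodeShortDesc")

# Replace the ID columns of each transaction row with short descriptions.
labelled = [
    (
        row["tid"],
        diag_names.get(row["diagcode"], "UNKNOWN"),
        serv_names.get(row["servcode"], "UNKNOWN"),
    )
    for row in csv.DictReader(tran_csv)
]
print(labelled)
```

To try this on the real files, the `io.StringIO` objects could be replaced with handles from `open(..., newline="")` on the clean csv files.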
OUTPUT CSV FILE: clean_csv_files/tran_df.csv. This file is the final output with all association mining metrics calculated.
(1) Services associated with diagnosis TYPE 2 DIABETES MELLITUS PDR MACULAR EDEMA BILATERAL
(2) Diagnoses associated with service Treatment of extensive or progressive retinopathy (eg, diabetic retinopathy), photocoagulation
(3) Services associated with diagnosis Primary osteoarthritis, right hand
(4) Diagnoses associated with service APPLICATION CAST ELBOW FINGER SHORT ARM
(5) Services associated with diagnosis Osteonecrosis in diseases classified elsewhere, left thigh
(6) Services associated with diagnosis Eyelid retraction left upper eyelid
These associations have also been deployed as an API Web Application. See https://github.com/nsb700/association-mining-webapp