Predicting Insurance fraud using a simulated dataset from Insurance Australia Group.
log = pd.read_csv('Dataset/data.csv')
log.head(20)
message | timestamp | |
---|---|---|
0 | 8f70c7577be8483 - mobile_browser - Quote Started for customer: 99ccf1 | 1483192800.00 |
1 | 1368d40a4f6e455 - mobile_browser - Quote Completed for customer: 99ccf1 with json payload {'name': 'Nicole Berry', 'email': 'Nicole [email protected]', 'gender': 'male', 'age': 29, 'home': {'type': 1, 'square_footage': 311.80361967382737, 'number_of_bedrooms': 2, 'number_of_floors': 1}, 'household': [{'name': 'Oscar Berry', 'age': 25, 'gender': 'female'}, {'name': 'Mark Berry', 'age': 10, 'gender': 'female'}, {'name': 'Jacqueline Berry', 'age': 14, 'gender': 'male'}], 'address': '66 Lake Jamieview,PSC '} | 1483193676.51 |
2 | 90527688b31d445 - mobile_browser - Claim Started for customer: 99ccf1 | 1483193794.69 |
3 | c4013f44ea6d40c - mobile_browser - Payment Completed for customer: 99ccf1 | 1483193794.69 |
4 | 8045614075e7466 - pc_browser - Quote Started for customer: 9bae09 | 1483196400.00 |
5 | 6859e40fdc3f40d - pc_browser - Quote Completed for customer: 9bae09 with json payload {'name': 'Brandi Harris', 'email': 'Brandi [email protected]', 'gender': 'male', 'age': 62, 'home': {'type': 1, 'square_footage': 523.4329572865342, 'number_of_bedrooms': 2, 'number_of_floors': 1}, 'household': [{'name': 'Michael Harris', 'age': 12, 'gender': 'male'}, {'name': 'Michael Harris', 'age': 7, 'gender': 'male'}], 'address': '60 West Lisaside, Jamie Port Suite '} | 1483197036.47 |
6 | 4c9ab2942b484f2 - pc_browser - Claim Started for customer: 9bae09 | 1483197184.51 |
7 | 07ba7defa6444ba - pc_browser - Payment Completed for customer: 9bae09 | 1483197184.51 |
8 | cb6d5db3f2ce478 - pc_browser - Quote Started for customer: 12fdce | 1483199674.68 |
9 | 0ddb305e024d49f - pc_browser - Quote Started for customer: b7aab4 | 1483200000.00 |
10 | be4398c940284fe - pc_browser - Quote Completed for customer: b7aab4 with json payload {'name': 'Christopher Moody', 'email': 'Christopher [email protected]', 'gender': 'male', 'age': 40, 'home': {'type': 1, 'square_footage': 221.63326677793447, 'number_of_bedrooms': 3, 'number_of_floors': 1}, 'household': [], 'address': '120 Danielmouth,Unit Box '} | 1483200799.03 |
11 | cf1d5d9af6d54ef - pc_browser - Claim Started for customer: b7aab4 | 1483200904.55 |
12 | 7a1836f6aacf4ec - pc_browser - Payment Completed for customer: b7aab4 | 1483200904.55 |
13 | 697990436eb5484 - pc_browser - Quote Incomplete for customer: 12fdce with json payload {'name': 'Latoya Johnston', 'email': 'Latoya [email protected]', 'gender': 'male', 'age': 28, 'home': {'type': 1}, 'household': [], 'address': '136 Jonesberg,Unit Box '} | 1483203274.68 |
14 | e6cd8ce31a1d4d6 - mobile_browser - Claim Denied for customer: 99ccf1 - reason : fraud | 1483203294.05 |
15 | f21cd7a34cbc412 - pc_browser - Quote Started for customer: 2bc68b | 1483203600.00 |
16 | 84618ef8bc28479 - mobile_browser - Claim Started for customer: 983092 | 1483203691.28 |
17 | e67b69c9b4554c0 - pc_browser - Claim Denied for customer: b7aab4 - reason : fraud | 1483204320.48 |
18 | 0baaef67fe8a458 - pc_browser - Quote Completed for customer: 2bc68b with json payload {'name': 'Loretta Steele', 'email': 'Loretta [email protected]', 'gender': 'female', 'age': 48, 'home': {'type': 1, 'square_footage': 301.854949906065, 'number_of_bedrooms': 3, 'number_of_floors': 1}, 'household': [{'name': 'Zachary Steele', 'age': 48, 'gender': 'male'}, {'name': 'Nicholas Steele', 'age': 9, 'gender': 'male'}, {'name': 'Michael Steele', 'age': 12, 'gender': 'female'}], 'address': '73 Edwardfurt, Michelle Crossing Suite '} | 1483204668.62 |
19 | a79b052cd8f2480 - pc_browser - Payment Completed for customer: 2bc68b | 1483204804.68 |
# the log is sorted in icreasing order
log['timestamp'].is_monotonic
True
split the message column into multiple columns, transaction_id, action, customer_id, action, message
#%%timeit
start_time = time.time()
# create msgLog for manipulating message data
msgLog = pd.DataFrame()
#msgLog['message'] = log['message']
msgLog['transaction_id'], msgLog['device'], msgLog['message'] = log['message'].str.split('-',2).str
msgLog['action'], msgLog['message'] = msgLog['message'].str.split(':',1).str
msgLog['customer_id'], msgLog['message'] = msgLog['message'].str.split(n=1).str
msgLog['action'] = msgLog['action'].str.strip()
#msgLog['message'] = msgLog['message'].str.split('json payload',1).str[0]
msgLog['message'], msgLog['json_payload'] = msgLog['message'].str.split('json payload',1).str
msgLog['timestamp'] = log['timestamp']
print("%s seconds" %(time.time() - start_time))
65.64751410484314 seconds
msgLog.head()
transaction_id | device | message | action | customer_id | json_payload | timestamp | |
---|---|---|---|---|---|---|---|
0 | 8f70c7577be8483 | mobile_browser | NaN | Quote Started for customer | 99ccf1 | NaN | 1483192800.00 |
1 | 1368d40a4f6e455 | mobile_browser | with | Quote Completed for customer | 99ccf1 | {'name': 'Nicole Berry', 'email': 'Nicole [email protected]', 'gender': 'male', 'age': 29, 'home': {'type': 1, 'square_footage': 311.80361967382737, 'number_of_bedrooms': 2, 'number_of_floors': 1}, 'household': [{'name': 'Oscar Berry', 'age': 25, 'gender': 'female'}, {'name': 'Mark Berry', 'age': 10, 'gender': 'female'}, {'name': 'Jacqueline Berry', 'age': 14, 'gender': 'male'}], 'address': '66 Lake Jamieview,PSC '} | 1483193676.51 |
2 | 90527688b31d445 | mobile_browser | NaN | Claim Started for customer | 99ccf1 | NaN | 1483193794.69 |
3 | c4013f44ea6d40c | mobile_browser | NaN | Payment Completed for customer | 99ccf1 | NaN | 1483193794.69 |
4 | 8045614075e7466 | pc_browser | NaN | Quote Started for customer | 9bae09 | NaN | 1483196400.00 |
Drop the transaction column cos its fucking useless, can count number of transactions by actions
# all transaction id's are unique, not very useful
# msgLog.transaction_id.value_counts()
msgLog = msgLog.drop(columns=['transaction_id'])
# 621123 unique ids, note that some customer ids are repeated
len(msgLog.customer_id.value_counts())
621123
Sort values by customer id (thereby grouping by customer id, but not reducing rows)
msgLog = msgLog.sort_values(by=['customer_id', 'timestamp'], kind = 'mergesort')
msgLog = msgLog.reset_index(drop=True)
# add paid dataframe to msgLog
paid = pd.DataFrame()
msgLog['message'], msgLog['paid'] = msgLog['message'].str.split('$').str
msgLog['paid'] = pd.to_numeric(msgLog['paid'])
msgLog.message.value_counts()
- paid 816301
with 632998
- reason : fraud 10494
Name: message, dtype: int64
print(np.nanmin(msgLog['paid']), np.nanmax(msgLog['paid']), np.nanmean(msgLog['paid']))
2642.61 27097.44 8710.884242638436
# backup good msgLog
#1*
backupMsgLog = msgLog.copy()
msgLog.head()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
device | message | action | customer_id | json_payload | timestamp | paid | |
---|---|---|---|---|---|---|---|
0 | mobile_app | NaN | Quote Started for customer | 000012 | NaN | 1520515553.72 | nan |
1 | mobile_app | with | Quote Completed for customer | 000012 | {'name': 'Kimberly Mathews', 'email': 'Kimberly [email protected]', 'gender': 'male', 'age': 53, 'home': {'type': 1, 'square_footage': 346.20595281589794, 'number_of_bedrooms': 5, 'number_of_floors': 1}, 'household': [{'name': 'Kristen Mathews', 'age': 14, 'gender': 'female'}, {'name': 'Randy Mathews', 'age': 9, 'gender': 'male'}], 'address': '16 Reidville, Richard Pine Suite '} | 1520516303.52 | nan |
2 | mobile_app | NaN | Quote Started for customer | 000023 | NaN | 1496195647.93 | nan |
3 | mobile_app | with | Quote Completed for customer | 000023 | {'name': 'James Terrell', 'email': 'James [email protected]', 'gender': 'male', 'age': 30, 'home': {'type': 1, 'square_footage': 404.2134657047327, 'number_of_bedrooms': 2, 'number_of_floors': 1}, 'household': [], 'address': '114 Lake Matthewmouth,PSC '} | 1496196488.27 | nan |
4 | phone_call | NaN | Quote Started for customer | 00003a | NaN | 1496494950.52 | nan |