Wednesday, August 3, 2016

The Smart Recruit

AnalyticsVidhya organised a weekend hackathon called The Smart Recruit, which was held on 23rd-24th July, 2016.

I won the previous hackathon, The Seer's Accuracy, and was hoping to do well in this one too.

The problem was to identify which agents would be successful in making sales of financial products. So, it was a binary classification problem.

Like the previous hackathon, the data seemed quite good and promising.

Train and test data consisted of agent applications with data about the application, the manager and a few related features about them.

I'm sure most participants just went ahead and dumped the data into XGB-type models, with a lot of scores hovering in the 0.63 - 0.66 AUC range.

I tried to set up a robust, stable validation framework, like I mentioned in AV's article on Winning Tips, but it didn't seem to help. The CV and LB scores were all over the place in my first few submissions.

That's when I decided to take a step back and inspect the data in detail. It was evident to me that there would be a huge LB shake-up, given the variance between the CV and LB scores. Hence, it didn't make much sense to spend too much time trying to optimize models. Instead, I looked for some pattern or feature which could boost my score beyond the expected error margins.

And that's exactly what happened. A simple plot of the target variable showed a pattern, which seemed too good to be true. I tried a feature using this and my CV jumped to 0.8... and that was the feature that ultimately proved to be the winning one.

Here's the plot that changed everything:

This is the plot of the target variable for the first four days. A clear pattern exists where you see most of the 1's at the beginning of a day and most of the 0's at the end of the day. You can plot the target variable of any single day and observe a similar trend.
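For anyone who wants to check a pattern like this without a plot, here is a rough sketch in plain Python. It is not taken from my solution: the function name, the (day, target) row layout and the toy data are all made up for illustration, and it assumes the rows are kept in their original file order. It splits each day's rows into equal position bins and averages the target per bin, pooled over all days.

```python
from collections import defaultdict

def target_rate_by_day_position(rows, n_bins=4):
    """rows: iterable of (day, target) pairs in original file order.
    Returns n_bins mean target rates, from start of day to end of day,
    pooled over all days. (Hypothetical helper, for illustration only.)"""
    by_day = defaultdict(list)
    for day, target in rows:
        by_day[day].append(target)

    sums = [0] * n_bins
    counts = [0] * n_bins
    for targets in by_day.values():
        n = len(targets)
        for i, t in enumerate(targets):
            # Map the row's position within its day to a bin index.
            b = min(i * n_bins // n, n_bins - 1)
            sums[b] += t
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy data, made up: 1's early in the day, 0's late.
toy_rows = [
    ("d1", 1), ("d1", 1), ("d1", 0), ("d1", 0),
    ("d2", 1), ("d2", 1), ("d2", 1), ("d2", 0),
]
rates = target_rate_by_day_position(toy_rows, n_bins=2)
print(rates)
```

If the first bins have a much higher rate than the last ones, you are looking at the same kind of trend as in the plot.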

Leakage? Possible. Hidden trend? Possible. At first I was convinced it was leakage and a data preparation issue, but later I felt there was a possibility that applications received towards the end of the day are more likely to be rejected than ones received early.

Either way, I polished this feature into Order_Percentile in my code, and it turned out to be the most important feature.
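To give an idea of what such a feature can look like, here is a minimal sketch, assuming the rows are in original file order and each row carries a day label. Only the name Order_Percentile comes from my code; the function below, its inputs and the toy data are illustrative assumptions, and the complete solution may compute it differently.

```python
from collections import defaultdict

def order_percentile(days):
    """days: list of day labels, one per row, in original file order.
    Returns one value in (0, 1] per row: the row's position within its
    day divided by the number of rows on that day.
    (Illustrative sketch, not the exact competition code.)"""
    totals = defaultdict(int)
    for d in days:
        totals[d] += 1

    seen = defaultdict(int)
    out = []
    for d in days:
        seen[d] += 1  # 1-based position of this row within its day
        out.append(seen[d] / totals[d])
    return out

# Toy data, made up: four applications on d1, two on d2.
toy_days = ["d1", "d1", "d1", "d1", "d2", "d2"]
feature = order_percentile(toy_days)
print(feature)
```

Values near 0 mean the application came early in its day and values near 1 mean it came late, so a tree model can split on it directly.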

My final model was a single XGBoost with 14 features, the other 13 being cleaned-up features from the raw variables. I achieved a CV of 0.887, which was in the same range as the LB. I'd have liked to try out some more parameter tuning and ensembling, but with the limited duration of a hackathon, there wasn't any time left.

View My Complete Solution

I stood 1st on the public LB with 0.885. Good friend and rival competitor SRK, who teamed up with Kaggler Mark Landry, was 2nd with 0.876, and another team of Kanishk Agarwal and Yaasna Dua was 3rd with 0.839. No other team figured out the winning feature, and their scores were below 0.71.

The rankings held the same on the private LB, but it was much closer, with SRK-Mark scoring 0.7647 to my 0.7658.

My username is 'vopani'.

View Complete Results

This is my 2nd AV win on the trot, and while it's not the best way to win, I'm happy I could find a useful winning feature in the data.

Congrats to the ever consistent SRK, who also happens to be someone I'm chasing on Kaggle :-)

Fun weekend, bonus to win it, and looking forward to the next hackathon, where I'll be on a hat-trick!

An interesting coincidence: I got the exact same score on the public LB (0.8856) in the previous hackathon too, The Seer's Accuracy!

External Links
View AV article on the winners
View 2nd place solution by SRK
View 3rd place solution by Kanishk Agarwal