Search This Blog

Monday, November 23, 2015

Black Friday Data Hack


AnalyticsVidhya organized a weekend hackathon called Black Friday Data Hack, which was held on 20th-22nd November, 2015.

Black Friday is actually the following weekend, but that's when we've to relax and enjoy :-)

The last hackathon was quite disappointing due to the randomness in the data and the evaluation metric. I was hoping this one would be better.
And it was. Much better.

Problem
The challenge was to predict the purchase amount of various products by users across categories given historic data of purchase amounts.

Data
In general, when you have more data, its always better. The train data had ~ 5.5 lakh observations and the test data had ~ 2.3 lakh observations. The data was very very clean and it feels wonderful to work on such datasets.

The data was of users who purchased products with the amounts. The products had data on three types of categories. The users had data about their age, gender, city, occupation, locality and marital status.

We were to build our models on the train data and score the test data which had pairs of user-product not present in the train data. The evaluation metric was RMSE, which also seemed a very appropriate choice for this problem.

Approach
I spent the first few hours just exploring the data, summarizing variables, plotting graphs, playing around with pivots and in parallel building base models (of course, XGBoost).

On the first day, I was able to go below 2500 with an optimized XGBoost model on raw features. It got me into the Top-3 and since then I've managed to maintain a position in the Top-5.

While checking the variable importance of my XGBoost, I found Product_ID was the most important variable and intuitively it made sense. So, I just submitted the average purchase amount of each product and voila! it scored 2682, which didn't seem like a very bad score. So, all those of you who couldn't cross 2682, here's a simple solution you missed.

Usually ensembles win competitions, but since I couldn't get any model close to the performance of XGB, so I decided to challenge myself to build a single powerful model. Which means, feature engineering.
These two days gave me some wonderful insights on how powerful feature engineering is. With some analysis, gut, trying, cross-validating, here are my final set of features that I used:

Model
User_ID: Used as a raw feature

User_Count: Number of observations of the user

Gender: Converted to binary

Age: Converted to numeric

Marital Status: Used as raw feature

Occupation: Used as raw feature

City Category: One-hot encoded features

Stay In Current City: Converted to numeric

Product Category 1, 2, 3: Used as raw feature

Product_Count: Number of observations of the product

Product_Mean: Average purchase amount of product

User_High: Proportion of times the user purchases products at a higher amount than the average purchase amount of the product

I built an XGBoost with these features, and the code is open-sourced on GitHub, the link is given below.

One very interesting feature I built was
F_Prop: Average purchase amount of product by female users / Average purchase amount of product by male users

This was among the top-3 important variables and gave a CV of ~ 2419 but the LB remained very similar ~ 2430, so I wasn't sure about it. I decided to go without this.

GitHub
View GitHub Repository

Results
This model gave me CV score of ~ 2425 and public LB score of 2428. I was 4th on the public LB, with Jeeban, Nalin and Sudalai in the Top-3. And we finished in the same positions with my final rank being 4th in the private LB.

Views
This is one of the best data-sets I've worked on in a while. The CV and LB scores were perfectly in sync and it was very satisfying to build features and improve the CV as well as LB scores. I'm happy with my performance as I managed to squeeze quite a lot of from the data with a single model.

I might have done better with an ensemble, but just couldn't get anything to work well. And after a while, was just too tired.

Overall, a great weekend, mostly spent on my laptop. For those of you who had memory issues, I worked on my 4GB MacBook Air throughout the weekend. Algorithms and models will advance and become optimized every day, but the power of building good features is still in the hands of Data Scientists like us.
Make the most of it until the machines come and take over ;-)

Thanks to all the folks at AnalyticsVidhya for organizing this hackathon. A big thumbs up from me.

Looking forward to the next Hackathon, and hope it gets better and more competitive.

External Links
View Other Players' Approaches on AnalyticsVidhya
View 3rd place solution code on GitHub by Sudalai Raj Kumar
View 5th place solution code on GitHub by Aayush Agrawal
 

11 comments:

  1. Thanks Rohan for being part of the hackathon. Glad you liked the competition and the data set. We will strive every bit to make the next one bigger and better!

    A weekend well spent!

    ReplyDelete
    Replies
    1. hi Kunal. can you please explain what do the product categories explain in this dataset? do they signify different products in their respective categories or just the number of products in that?

      Delete
  2. Thanks for posting this blog. You have given a good insight on data science, more so in feature engineering.

    ReplyDelete
  3. Thanks. Hope for more codes in Python

    ReplyDelete
  4. Thank you Rohan for sharing your approach. It helps as a guide for budding Data Scientists like me :-)..

    ReplyDelete
  5. Thanks rohan for posting it . Its really nice for people like us who are starting career in data science .

    ReplyDelete
  6. Hi Rohan , can you please tell me how you tuned the model ?
    My laptop had 2GB ram and its slow . I am unable to tune .Have you used Amazon EC2 if yes , can you just guide me to use that.

    ReplyDelete
  7. hi Rohan Thank you for sharing such an intuitive blog about the modelling process. i am trying to figure out what do the observations in Product category1,2,3 mean. are they categorical or should we treat them as continuous? can you pls help me out?

    ReplyDelete
  8. Can anyone explain why 19 and 20 are dropped from product category 1 ?

    ReplyDelete
  9. In the code i see that User Count, count of product have been calculated only for the training data and it has been applied to the test data as it is.

    Shouldn't it be calculated for both train and test data put together and then applied to the train and test data sets.

    ReplyDelete
  10. Hello Rohan,

    How did you utlize User ID. Did you do a one hot encoding ?

    Thanks

    ReplyDelete