Wednesday, November 12, 2014

Tradeshift Text Classification

The Tradeshift Text Classification competition was held on Kaggle from 2nd October, 2014 to 10th November, 2014.

Tradeshift has a dataset of thousands of documents in which groups of words are assigned certain labels, e.g. Date, Address, Name, etc. The challenge was to build an automated model that predicts which label(s) a given group of words belongs to.

The training data consisted of 1.7 million rows, each with its correct label(s). There were 145 variables describing the group of words and some of the surrounding words/text, and 33 different possible labels.
The test data consisted of ~0.5 million rows for which we had to predict the labels.

I started off with an absolutely beautiful and elegant online logistic regression code by tinrtgu. So wonderful was this model that most of the top competitors started off with it and ended up using it as part of their best models. Here's the kicker: at the time the code was shared, it was powerful enough to get into 1st place!
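For readers who haven't seen the idea before, here is a minimal sketch of online logistic regression: the model sees one row at a time, predicts, and nudges its weights straight away. This is not tinrtgu's code. The class name, the fixed learning rate and the toy data are my own assumptions, purely for illustration.

```python
import math


class OnlineLogisticRegression:
    """Illustrative sketch only: plain SGD on log loss, one example at a time."""

    def __init__(self, num_buckets, learning_rate=0.1):
        # One weight per hashed one-hot feature slot.
        self.weights = [0.0] * num_buckets
        self.learning_rate = learning_rate  # fixed rate: an assumption

    def predict(self, active):
        # 'active' lists the indices of the features that are "on" for this row.
        score = sum(self.weights[i] for i in active)
        score = max(min(score, 20.0), -20.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, active, label):
        # Gradient of log loss w.r.t. each active weight is (p - y).
        error = self.predict(active) - label
        for i in active:
            self.weights[i] -= self.learning_rate * error


# Toy stream: feature 0 goes with label 1, feature 1 goes with label 0.
model = OnlineLogisticRegression(num_buckets=4)
for _ in range(50):
    model.update([0], 1)
    model.update([1], 0)
```

Since each row is used once and then thrown away, memory use depends on the number of feature slots and not on the number of rows, which matters with 1.7 million of them.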

The online model one-hot encodes all the variables, so every distinct numeric value becomes its own feature. I therefore rounded all the numeric variables to one decimal place, since 'similar' values should ideally be treated the same.
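Here is a rough sketch of what the rounding buys you. The helper names, the token format and the use of `zlib.crc32` as the hash are assumptions for illustration, not the actual competition code.

```python
import zlib


def to_token(column_index, raw_value):
    """Build the string that gets hashed for one cell of a row."""
    try:
        value = float(raw_value)
    except ValueError:
        # Non-numeric cells (hash strings, YES/NO flags, blanks) stay verbatim.
        return f"{column_index}_{raw_value}"
    # Numeric cells are rounded so nearby values share one token.
    return f"{column_index}_{round(value, 1)}"


def hashed_features(row, num_buckets=2 ** 20):
    """Map each cell of a row to a one-hot slot via the hashing trick."""
    return [
        zlib.crc32(to_token(i, v).encode("utf-8")) % num_buckets
        for i, v in enumerate(row)
    ]


# 0.4449 and 0.41 in the same column now land on the same feature.
same = hashed_features(["0.4449", "YES"]) == hashed_features(["0.41", "YES"])
```

Without the rounding, 0.4449 and 0.41 would be two unrelated features and the model would have to learn each one separately. One caveat with this sketch: any string that happens to parse as a number gets treated as numeric.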

I built a Random Forest for y33, the most important label. I took some of the most important variables and tried some interactions with the hash variables in the online code.
Then I built RFs for all the labels individually. Too intensive! It took me over a day across two PCs!

Added the RF-predictions into the online code, and bang! that was it. This submission got me into the top-10.
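One way to picture "adding predictions into the online code": treat each model's predicted probability as just another column, bin it, and let the online model one-hot encode it like everything else. The sketch below assumes that approach. The function name, the `rf_y33` key and the binning to one decimal are hypothetical choices for illustration.

```python
def augment_row(row, model_predictions, decimals=1):
    """Append binned model predictions to a row as extra categorical cells.

    row: list of raw cell values for one example.
    model_predictions: dict mapping a (hypothetical) model name to the
        probability that model predicted for this example.
    """
    extra = [
        f"{name}={round(prob, decimals)}"
        for name, prob in sorted(model_predictions.items())
    ]
    return list(row) + extra


# A row with two original cells plus one Random Forest prediction.
augmented = augment_row(["YES", "0.4"], {"rf_y33": 0.8731})
```

The online model can then learn a separate weight for each probability bin, which lets it pick up non-linear signal from the other models.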

Tried some other models, but without much success. XGBoost gave promising results, and I added the XGB-predictions into the online code.

Final Model
My final model was M1-M2-M3-M4 fed into the online code, which got me a score of 0.0049356 / 0.0050160 on the LB and would've been ranked in the twenties.

'would've been'? That's right. Abhishek Thakur teamed up with me and we tried some ideas and models together. I don't remember the last ditch ideas that Abhishek tried (I'll update soon), but the final ensemble with a score of 0.0048200 / 0.0048783 certainly helped us secure 13th rank out of the 375 teams.

We tried some other models like GBM, SGD, etc. and some other features, variables, tweaking/tuning of parameters, without much success.

This was the first competition where I came up with a very competitive result, and it's given me the confidence to come up with more in the future.

The CV results and LB scores were very close and consistent, and it was an absolutely perfect data-set. The online code is what I'm taking away from this competition; it is now one of my favourite models :-) (Thank You tinrtgu).

Congrats to the Chinese team of rcarson and Xueer Chen, and also to the second place team of three French guys, who all did a fabulous job of entertaining us to the last day when they had the exact same score on the Public LB! I mean, seriously, how close can you get? It's unfortunate this competition has only one prize, but hey, that's life.

Now my overall Kaggle rank is 206th! Yay! My career-best. You can View My Kaggle Profile.
I'm hoping to get into the top-100 early next year, and hopefully the top-50 (or top-25?) by the end of 2015.

So, lots more to come!

Check out My Best Kaggle Performances