At Polychart we are passionate about visualizing data. This blog features product news and articles, as well as our own analyses and visualizations of datasets we find interesting.

Tuesday 29 January 2013

A Scoreboard for a 21st Century Sport - Kaggle Data Science Competitions

Success in the new sport of data science is a different kind of race, with the finish line marked not by touchdowns or goals. Instead, it's won by posting the best evaluation metric, such as the highest AUC or mean average precision. The game is played over who can build the best model, leveraging past data to make the most accurate future predictions. It's competitive - head over to kaggle.com and see for yourself. One contest aimed at predicting future hospital patients has a 3 million dollar grand prize. Another contest offers its winners coveted jobs at Facebook.

To win, competitors must understand their data better than anyone else. Beyond knowing what type of algorithm to use for modeling, being able to find insights such as hidden correlations among the variables can mean the difference between being a winning contender and just another entry.

The game starts with the release of a dataset and an attached problem summary. Competitors have anywhere from weeks to months to build their predictive model. Multiple submissions can be made - each one is given a score, which can be compared against the current scoreboard. It's a dash to the best results.

The mix of team scores and submission times reveals a fascinating story. Some teams submit often, each time making small improvements. Other teams get stuck - after a run of improved submissions their score levels off and no more improvements are made. Brilliant flashes of insight can happen at any time, as shown by huge jumps in score from one submission to the next. Many teams have only one or two submissions, preferring to wait until they have the perfect model before placing their bets.
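To make those patterns concrete, here is a minimal Python sketch of how a team's progress could be derived from raw submission records. Everything in it is an assumption for illustration: the `(team, date, score)` tuple layout, the function names `running_best` and `largest_jump`, and the sample numbers are hypothetical, not Kaggle's data format or any Polychart JS API.

```python
from datetime import date


def running_best(submissions, higher_is_better=True):
    """Map each team to a list of (date, best score so far) pairs.

    `submissions` is an iterable of (team, date, score) tuples; this layout
    is a hypothetical one chosen for the example.
    """
    better = max if higher_is_better else min
    history = {}
    # Walk the submissions in chronological order.
    for team, when, score in sorted(submissions, key=lambda s: s[1]):
        trail = history.setdefault(team, [])
        best = score if not trail else better(trail[-1][1], score)
        trail.append((when, best))
    return history


def largest_jump(trail):
    """Largest change in best score between two consecutive submissions."""
    scores = [best for _, best in trail]
    return max((abs(b - a) for a, b in zip(scores, scores[1:])), default=0.0)


# Made-up sample data, for illustration only.
submissions = [
    ("team_a", date(2013, 1, 2), 0.61),
    ("team_b", date(2013, 1, 3), 0.70),
    ("team_a", date(2013, 1, 5), 0.64),   # small improvement
    ("team_a", date(2013, 1, 9), 0.63),   # worse than before: best stays flat
    ("team_b", date(2013, 1, 12), 0.70),  # stuck on a plateau
    ("team_a", date(2013, 1, 15), 0.78),  # a big jump
]

history = running_best(submissions)
for team, trail in sorted(history.items()):
    print(team, [best for _, best in trail], "largest jump:", largest_jump(trail))
```

A flat stretch in a team's trail corresponds to the plateau described above, and a large value from `largest_jump` marks a flash of insight. For a metric where lower is better, pass `higher_is_better=False`.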

This week we are recreating the Kaggle leaderboard using Polychart JS for the Leaping Leaderboard Leapfrogs challenge. The old scoreboard is a simple ranking that fails to capture the spirit of the competitions. We are visualizing the struggle to be the best data scientist and the accumulation of thousands of hours of hard work. The contest we are visualizing is the Predict HIV Progression Challenge, where contestants aim to find markers in the HIV sequence that predict a change in the severity of the infection.
