Crowdsourcing Feature Discovery

Posted: Mar 18 2014, 1:16pm CDT | Updated: Mar 18 2014, 3:04pm CDT, in Technology News


This story may contain affiliate links.

By Ben Lorica

Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one reason he started CrowdFlower was his frustration, as a data scientist, with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans(1) take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi-structured or unstructured data into structured data).
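The division of labor in active learning can be pictured with a minimal Python sketch. It assumes a binary classifier that returns a probability for the positive class; the function name route_by_confidence, the 0.8 threshold, and the toy scores are all hypothetical and are not taken from CrowdFlower or Locu.

```python
from typing import Callable, List, Tuple


def route_by_confidence(
    items: List[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.8,  # assumed cutoff; tune it for the task at hand
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Split items into model-labeled cases and cases sent to human workers."""
    auto_labeled: List[Tuple[str, int]] = []
    needs_human: List[str] = []
    for item in items:
        p = predict_proba(item)        # estimated probability of the positive class
        confidence = max(p, 1.0 - p)   # confidence in the more likely label
        if confidence >= threshold:
            auto_labeled.append((item, int(p >= 0.5)))  # routine case: model decides
        else:
            needs_human.append(item)                    # uncertain case: ask the crowd
    return auto_labeled, needs_human


# Hypothetical scores standing in for a trained classifier.
toy_scores = {"menu_a": 0.97, "menu_b": 0.55, "menu_c": 0.08, "menu_d": 0.62}
auto, human = route_by_confidence(list(toy_scores), toy_scores.get)
print(auto)
print(human)
```

In a real pipeline, the labels that come back from human workers would typically be added to the training set so the model handles more cases on its own over time.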

Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as important as (if not more important than) the choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies.”

CrowdAnalytix breaks projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (dependent variable(s)) and are asked to propose features (predictors), along with brief explanations of why they might prove useful. A panel of judges evaluates(2) features based on the accompanying evidence and explanations. Typically 100+ teams enter this phase of the project, and 30+ teams propose reasonable features.
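One way to picture a feature proposal is as a small record that pairs the feature with its rationale. The Python sketch below is an illustration only: the FeatureProposal class, the enrich helper, and the loan-risk columns are hypothetical and do not describe CrowdAnalytix's actual submission format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class FeatureProposal:
    name: str                        # column name for the new feature
    rationale: str                   # explanation the judges would read
    compute: Callable[[Row], float]  # derives the feature from a raw row


def enrich(rows: List[Row], accepted: List[FeatureProposal]) -> List[Row]:
    """Return copies of the rows with one extra column per accepted proposal."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy so the raw data set is left untouched
        for proposal in accepted:
            new_row[proposal.name] = proposal.compute(row)
        enriched.append(new_row)
    return enriched


# Hypothetical loan-risk rows and a hypothetical proposal.
rows = [
    {"debt": 20000.0, "income": 50000.0},
    {"debt": 45000.0, "income": 30000.0},
]
debt_to_income = FeatureProposal(
    name="debt_to_income",
    rationale="Borrowers with high debt relative to income may default more often.",
    compute=lambda r: r["debt"] / r["income"],
)
enriched_rows = enrich(rows, [debt_to_income])
print(enriched_rows)
```

Keeping the rationale next to the computation mirrors the idea that each feature is judged on its explanation as well as its definition.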

The modeling phase is a traditional machine-learning competition (entries compete on standard quantitative metrics), using data sets that incorporate features culled from the earlier phase. More than algorithms(3), companies gain access to models that incorporate ideas generated by teams of data scientists. CrowdAnalytix enriches data sets with features proposed by those teams, surfacing (potentially unconventional) ideas that may prove useful for its customers' models.
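Scoring in a competition of this kind can be sketched in a few lines of Python. The post does not say which metrics are used, so accuracy on held-out labels is an assumption here, and the team names and predictions are made up for illustration.

```python
from typing import Dict, List, Tuple


def accuracy(predicted: List[int], actual: List[int]) -> float:
    """Fraction of held-out examples an entry labels correctly."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need non-empty prediction and label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def leaderboard(
    entries: Dict[str, List[int]], actual: List[int]
) -> List[Tuple[str, float]]:
    """Rank entries from best to worst score on the held-out labels."""
    scores = [(team, accuracy(preds, actual)) for team, preds in entries.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical held-out labels and submissions.
holdout_labels = [1, 0, 1, 1, 0]
submissions = {
    "team_a": [1, 0, 0, 1, 0],  # 4 of 5 correct
    "team_b": [1, 0, 1, 1, 0],  # 5 of 5 correct
    "team_c": [0, 1, 1, 1, 0],  # 3 of 5 correct
}
ranking = leaderboard(submissions, holdout_labels)
print(ranking)
```

Because every entry is scored against the same held-out labels, differences in rank reflect the models and the features they use rather than the evaluation setup.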

(1) The key question that I pointed out in my earlier post was: can this approach scale? Panos Ipeirotis recently noted: “… Google Books and ReCAPTCHA project are really testing the scalability limits of this approach.”
(2) Judging is subjective, and is based on the “explanation and rationale” that accompany each feature.
(3) In the end, many teams that enter machine-learning competitions coalesce around a few algorithms (Random Forest is a favorite). Winners tend to distinguish themselves through feature engineering.

This post originally appeared on O’Reilly Data (“Crowdsourcing Feature discovery”). It’s republished with permission.

Source: Forbes


