The Internet is a valuable source of labeled data. Some websites explicitly call on the general public to perform simple labeling tasks, while others passively collect logs of user actions. In either case, the result is a dataset labeled by many different individuals ("teachers"). The quality of each teacher depends on their expertise, competence, and motivation, and some teachers may even be malicious. Consequently, we cannot assume that the training data represents the distribution we actually care about. Instead, we would like to learn a cleansed version of the training distribution, and our goal is to identify bad teachers and reduce the damage they cause. We call this setting "Learning from a Crowd".
In this talk, I will present two new theoretically-motivated algorithms designed to deal with datasets labeled by crowds. These algorithms require no prior knowledge about the individual teachers, nor do they rely on repeated labeling (where each example is labeled by multiple teachers) or on the existence of a subset of examples for which the correct labels are known. The first algorithm is a data cleaning algorithm that identifies low-quality teachers and removes their labels, even when each teacher labels only a handful of examples. The second algorithm is an SVM variant designed to learn in the presence of malicious teachers, by identifying them and reducing their influence on the final hypothesis. I will present the theoretical motivation behind the algorithms, as well as promising experimental results on a real dataset collected online.
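To make the "identify and down-weight" idea concrete, here is a minimal toy sketch, not the talk's actual algorithms: it estimates each teacher's quality as agreement with an initial consensus hypothesis, then retrains with the bad teacher's examples down-weighted. All names, the synthetic data, and the reweighting rule are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: the true label of x is sign(x).
X = rng.uniform(-1, 1, size=200)
true_y = np.sign(X)
teacher = rng.integers(0, 4, size=200)  # four teachers share the labeling work
y = true_y.copy()
y[teacher == 3] *= -1  # teacher 3 is malicious and flips every label

def fit_weighted(X, y, w):
    # Stand-in for a weighted SVM fit: pick the orientation of the
    # threshold classifier sign(x) that maximizes weighted agreement.
    score = np.sum(w * y * np.sign(X))
    return np.sign(X) if score >= 0 else -np.sign(X)

# Round 1: train with uniform example weights.
w = np.ones(len(X))
pred = fit_weighted(X, y, w)

# Estimate each teacher's quality as their agreement with the round-1
# hypothesis, then use that quality as the weight of their examples.
teacher_weight = np.array(
    [np.mean(pred[teacher == t] == y[teacher == t]) for t in range(4)]
)
w = teacher_weight[teacher]

# Round 2: retrain; the malicious teacher's labels now carry weight ~0.
pred = fit_weighted(X, y, w)
accuracy = np.mean(pred == true_y)
```

On this toy data the honest teachers agree almost perfectly with the consensus while the label-flipping teacher does not, so the second round effectively ignores the corrupted labels, without repeated labeling or any gold-standard examples.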
Joint work with Ohad Shamir (The Hebrew University).