Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning

We study the potential benefits to classification prediction that arise from having access to unlabeled samples. We compare learning in the semi-supervised model to the standard, supervised PAC (distribution free) model, considering both the realizable and the unrealizable (agnostic) settings. Roughly speaking, our conclusion is that access to unlabeled samples cannot provide sample size guarantees that are better than those obtainable without access to unlabeled data, unless one postulates very strong assumptions about the distribution of the labels. In particular, we prove that for basic hypothesis classes over the real line, if the distribution of unlabeled data is "smooth", knowledge of that distribution cannot improve the labeled sample complexity by more than a constant factor (e.g., 2). We conjecture that a similar phenomenon holds for any hypothesis class and any unlabeled data distribution. We also discuss the utility of semi-supervised learning under the common cluster a...
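To make the notion of labeled sample complexity for a basic class over the real line concrete, here is a minimal sketch of the purely supervised baseline only: learning a threshold on [0, 1] from labeled points in the realizable setting. It is not the paper's construction or proof. The function names `erm_threshold` and `mean_error`, the uniform distribution, the true threshold 0.5, and the midpoint rule are all assumptions made for illustration.

```python
import random


def erm_threshold(sample):
    """Return a threshold consistent with a realizable labeled sample.

    Points are labeled 1 when x >= t* and 0 otherwise. We pick the midpoint
    between the largest 0-labeled point and the smallest 1-labeled point
    (falling back to the interval ends 0.0 and 1.0 when one side is empty).
    """
    lo = max((x for x, y in sample if y == 0), default=0.0)
    hi = min((x for x, y in sample if y == 1), default=1.0)
    return (lo + hi) / 2.0


def mean_error(m, true_t=0.5, trials=500, seed=0):
    """Average error of the learner over `trials` labeled samples of size m.

    Assumption: x is uniform on [0, 1], so the error of a threshold t is
    exactly |t - true_t| (the probability mass between the two thresholds).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = []
        for _ in range(m):
            x = rng.random()
            sample.append((x, int(x >= true_t)))
        total += abs(erm_threshold(sample) - true_t)
    return total / trials


if __name__ == "__main__":
    # The average error should shrink as the labeled sample size m grows.
    for m in (10, 100, 1000):
        print(m, mean_error(m))
```

Running the script prints the average error for each labeled sample size. The learner here never looks at the unlabeled distribution; the abstract's claim concerns how much a learner that does know it could reduce the number of labeled points needed.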

Shai Ben-David, Tyler Lu, Dávid Pál

COLT 2008 | Machine Learning | Unlabeled Data | Unlabeled Data Distribution | Unlabeled Samples

Post Info

Added	18 Oct 2010
Updated	18 Oct 2010
Type	Conference
Year	2008
Where	COLT
Authors	Shai Ben-David, Tyler Lu, Dávid Pál
