For a dozen or so diverse languages:
  Text corpus (~1-10 MW)
    Tagged subcorpus (0.1-1 MW) for training and/or testing
  Broad-coverage analyzer/synthesizer
    generating data for (semi-)supervised learning
    oracle for active learning
  Tagger
    generating approximately correct tagged data
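
The "oracle for active learning" role of the analyzer can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_analyzer` is a hypothetical stand-in for a real broad-coverage analyzer, `SuffixTagger` is a deliberately tiny learner, and the query strategy (ask about the word the learner is least confident on) is one common choice, not necessarily the one intended here.

```python
from collections import Counter, defaultdict


def toy_analyzer(word):
    """Hypothetical oracle: a rule-based stand-in for a real analyzer."""
    if word.endswith("ing"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"


class SuffixTagger:
    """Tiny learner that predicts a tag from a word's last two characters."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, word, tag):
        self.counts[word[-2:]][tag] += 1

    def confidence(self, word):
        # Share of the most frequent tag for this suffix; 0.0 if unseen.
        seen = self.counts.get(word[-2:])
        if not seen:
            return 0.0
        return max(seen.values()) / sum(seen.values())

    def predict(self, word, default="NOUN"):
        seen = self.counts.get(word[-2:])
        return seen.most_common(1)[0][0] if seen else default


def active_learning(pool, oracle, budget):
    """Pool-based loop: query the oracle on the least-confident word."""
    tagger = SuffixTagger()
    pool = list(pool)
    queried = []
    for _ in range(min(budget, len(pool))):
        word = min(pool, key=tagger.confidence)  # most uncertain item
        pool.remove(word)
        tagger.train(word, oracle(word))  # the analyzer supplies the label
        queried.append(word)
    return tagger, queried


words = ["running", "jumping", "quickly", "slowly", "table", "chair"]
tagger, queried = active_learning(words, toy_analyzer, budget=3)
print(queried)
print(tagger.predict("jumping"), tagger.predict("slowly"))
```

With a budget of three queries the loop spends one query per unseen suffix instead of labeling both "-ing" words first, which is the point of letting the learner choose what the oracle annotates.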