Speech recognition systems produce a word sequence from an acoustic signal, but many applications require the word sequence to be additionally annotated for such things as emphasis, punctuation, or dialog acts. This annotation can be accomplished by statistical classifiers trained on hand-labeled data, but it is impractical to hand-label training data for every new style and language. In this work, we investigate the use of existing out-of-domain speech corpora and textual data from the Web in order to annotate speech in new target domains. We also investigate domain adaptation methods that use unlabeled data from the new domain together with the labeled out-of-domain data.

In the first part, we investigate a set of domain adaptation methods via analysis, simulation, and experiments on document classification tasks. We analyze a "feature restriction" approach that uses only features found in the target domain, and we compare it with the feature learning methods structural correspondence learning (SCL) (Blitzer et al., 2006) and latent semantic analysis (LSA). We show that these methods can be justified by similar assumptions. We then investigate instance weighting, analyzing its effect under regularized learning and comparing weight estimation methods for document classification.

In the second part, we consider several spoken language annotation problems. We first investigate prosodic event detection across different speaking styles; degradation due to mismatched style is small, but the out-of-the-box adaptation methods we investigate yield no substantial improvement. Next, we consider dialog act tagging across different languages, using machine translation. We find that feature restriction and SCL both improve recall of one type of dialog act (backchannels) by exploiting correlations between domain-specific words and utterance length.
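The "feature restriction" idea mentioned above can be illustrated with a minimal sketch: keep only features (here, words) that occur in both the labeled source-domain data and the unlabeled target-domain data, so the classifier cannot rely on domain-specific cues. The function name and the toy documents below are illustrative assumptions, not the dissertation's actual data or implementation.

```python
# Hedged sketch of feature restriction for domain adaptation:
# restrict the feature set to words observed in BOTH domains.
# All documents below are made-up toy examples.

def restricted_vocab(source_docs, target_docs):
    """Return the set of word features shared by the two domains."""
    src = {w for d in source_docs for w in d.split()}
    tgt = {w for d in target_docs for w in d.split()}
    return src & tgt

# Labeled source domain (e.g., movie reviews) vs. unlabeled target
# domain (e.g., product reviews): only domain-general words survive.
source = ["great movie plot", "terrible movie acting"]
target = ["great phone battery", "terrible phone screen"]

print(sorted(restricted_vocab(source, target)))  # ['great', 'terrible']
```

A classifier trained on the source data would then use only this shared vocabulary, dropping domain-specific words like "movie" that do not transfer to the target domain.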
Finally, we investigate the use of Web-based textual conversations for detecting questions and sentence boundaries in spoken conversations. We show that adaptation methods such as bootstrapping and SCL can use unlabeled speech data to incorporate acoustic features and can improve the performance of the text-trained model. Our work suggests approaches for using Web text to annotate speech without hand-annotated speech training data.
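The bootstrapping step described above can be sketched roughly as self-training: a model trained on text alone labels unlabeled speech, and its confident predictions become training examples that carry acoustic features. Everything below (the toy question detector, the confidence values, the pitch-rise feature, and the data) is an illustrative assumption, not the dissertation's actual setup.

```python
# Hedged sketch of bootstrapping (self-training) over unlabeled speech:
# a text-only model auto-labels utterances; confident labels are kept,
# together with an acoustic feature, for retraining a richer model.

WH_WORDS = {"what", "where", "who", "how", "why"}

def text_model(utterance):
    """Toy text-only question detector with a made-up confidence score."""
    first = utterance["words"].split()[0].lower()
    is_question = first in WH_WORDS
    confidence = 0.9 if is_question else 0.8
    return is_question, confidence

def self_train(unlabeled, threshold=0.85):
    """Keep only confidently auto-labeled examples, now with acoustics."""
    train = []
    for utt in unlabeled:
        label, conf = text_model(utt)
        if conf >= threshold:
            # The acoustic feature (final pitch rise) rides along so the
            # retrained model can learn from it.
            train.append((utt["words"], utt["pitch_rise"], label))
    return train

unlabeled_speech = [
    {"words": "what time is it", "pitch_rise": True},
    {"words": "it is noon", "pitch_rise": False},
]
print(self_train(unlabeled_speech))  # only the confident example survives
```

Retraining on these auto-labeled examples lets acoustic evidence such as final pitch rise enter a model that was originally trained only on Web text.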