Conference: Workshop on Storytelling; Annual Meeting of the Association for Computational Linguistics
A Simple Approach to Classify Fictional and Non-Fictional Genres

Abstract

In this work, we use a logistic regression classifier to determine whether a given document belongs to the fiction or non-fiction genre. For genre identification, previous work proposed three classes of features: low-level (character-level and token counts), high-level (lexical and syntactic information), and derived features (type-token ratio, average word length, or average sentence length). Using the recursive feature elimination with cross-validation (RFECV) algorithm, we perform feature selection experiments on an exhaustive set of nineteen features (spanning all three classes) extracted from Brown corpus text. Two simple features turn out to be the most significant: the ratio of the number of adverbs to adjectives and the ratio of the number of adjectives to pronouns. Our subsequent classification experiments on genre identification of documents from the Brown and Baby BNC corpora show that a classifier using just these two features performs on par with a classifier using the exhaustive feature set.
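
The two selected features can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the document is already POS-tagged with Universal-style tags ("ADV", "ADJ", "PRON"), the function name `genre_features` is hypothetical, and the fallback to 0.0 for a zero denominator is a choice made for this sketch, since the abstract does not say how such cases are handled.

```python
from collections import Counter


def genre_features(tagged_tokens):
    """Return (adverb/adjective ratio, adjective/pronoun ratio) for one document.

    tagged_tokens: iterable of (token, tag) pairs.
    Assumption: tags follow a Universal-style tagset ("ADV", "ADJ", "PRON").
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    adv, adj, pron = counts["ADV"], counts["ADJ"], counts["PRON"]
    # Assumption for this sketch: a zero denominator yields 0.0.
    adv_adj = adv / adj if adj else 0.0
    adj_pron = adj / pron if pron else 0.0
    return adv_adj, adj_pron


# Toy, hand-tagged example (not taken from the Brown or Baby BNC corpora).
doc = [
    ("She", "PRON"), ("quickly", "ADV"), ("opened", "VERB"),
    ("the", "DET"), ("old", "ADJ"), ("door", "NOUN"),
    ("and", "CONJ"), ("he", "PRON"), ("quietly", "ADV"), ("left", "VERB"),
]
features = genre_features(doc)  # 2 adverbs, 1 adjective, 2 pronouns
```

Each document's feature pair could then serve as the input vector to a logistic regression classifier, as the abstract describes.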