Conference on Storage and Retrieval for Media Databases 2001, Jan 24-26, 2001, San Jose, USA

Fusion of Visual and Audio Features for Person Identification in Real Video

Abstract

In this research, we studied the joint use of visual and audio information for identifying persons in real video (i.e., TV programs). We constructed a person identification system that identifies characters in TV shows by fusing audio and visual information, based on two different fusion strategies. In the first strategy, speaker identification is used to verify the face recognition result. In the second strategy, face recognition and tracking are used to supplement the speaker identification results. To evaluate the system's performance, we generated an information database by manually labeling the speaker (audio part) and the main person's face (from images) in every I-frame of a video segment of the TV show Seinfeld. By comparing the output of our system with this information database, we evaluated the performance of each analysis channel and of their fusion. The results show that while the first fusion strategy has a slightly lower recall than the original face recognition system, it achieves the best identification precision among the algorithms compared. This suggests that such a strategy is suitable for applications where precision is much more critical than recall (e.g., security systems). The second fusion strategy, on the other hand, yields the best overall identification performance. It substantially outperforms either analysis channel alone in both precision and recall and is applicable to more general applications, such as, in our case, identifying persons in TV programs.
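
The abstract does not state the decision rules behind the two fusion strategies, so the following is only a minimal sketch under stated assumptions: each I-frame carries at most one label per channel, "verify" is taken to mean that a face label is kept only when the speaker label agrees, and "supplement" is taken to mean that the speaker label is used when present and the face label fills the gaps. Face tracking, which the second strategy also uses, is left out. All function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple

# None means "no person identified in this frame".
Label = Optional[str]


def fuse_verify(face: Label, speaker: Label) -> Label:
    """Strategy 1 (assumed rule): keep the face label only if the speaker label agrees."""
    return face if face is not None and face == speaker else None


def fuse_supplement(face: Label, speaker: Label) -> Label:
    """Strategy 2 (assumed rule): use the speaker label, fall back to the face label."""
    return speaker if speaker is not None else face


def precision_recall(predicted: Sequence[Label], truth: Sequence[Label]) -> Tuple[float, float]:
    """Per-frame precision and recall against manually labeled ground truth."""
    correct = sum(1 for p, t in zip(predicted, truth) if p is not None and p == t)
    n_predicted = sum(1 for p in predicted if p is not None)
    n_truth = sum(1 for t in truth if t is not None)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_truth if n_truth else 0.0
    return precision, recall


# Toy per-frame labels, invented purely for illustration (not data from the paper).
truth = ["A", "A", "B", "B", "A", "B", "A", "B"]
face = ["A", "B", "B", None, "A", "A", "A", None]
speaker = ["A", "A", None, "B", "B", "B", None, "A"]

verified = [fuse_verify(f, s) for f, s in zip(face, speaker)]
supplemented = [fuse_supplement(f, s) for f, s in zip(face, speaker)]

for name, predicted in [
    ("face only", face),
    ("speaker only", speaker),
    ("strategy 1 (verify)", verified),
    ("strategy 2 (supplement)", supplemented),
]:
    p, r = precision_recall(predicted, truth)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

Under these assumed rules, the agreement check can only discard face labels, so it trades recall for precision, while the fallback rule labels more frames than either channel alone. The numbers printed for the toy labels illustrate that trade-off only and say nothing about the magnitudes reported in the paper.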
