Fusion of Visual and Audio Features for Person Identification in Real Video

Abstract

In this research, we studied the joint use of visual and audio information for identifying persons in real video (i.e., TV programs). A person identification system, which identifies characters in TV shows by fusing audio and visual information, is constructed based on two different fusion strategies. In the first strategy, speaker identification is used to verify the face recognition result. The second strategy uses face recognition and tracking to supplement speaker identification results. To evaluate our system's performance, an information database was generated by manually labeling the speaker (audio part) and the main person's face (from images) in every I-frame of a video segment of the TV show Seinfeld. By comparing the output of our system with this information database, we evaluated the performance of each analysis channel and of their fusion. The results show that while the first fusion strategy has a slightly lower recall than the original face recognition system, it achieves the best identification precision among the algorithms compared. This suggests that such a strategy is suitable for applications where precision is much more critical than recall (e.g., security systems). The second fusion strategy, on the other hand, yields the best overall identification performance. It greatly outperforms either analysis channel in both precision and recall and is applicable to more general tasks, such as, in our case, identifying persons in TV programs.
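
The two fusion strategies and the precision/recall evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the decision rules, so the agreement rule in `verify_fusion`, the fallback rule in `supplement_fusion`, and the per-frame label format are all assumptions made for illustration. The toy labels are made up and do not reproduce any result from the paper.

```python
from typing import Optional, Sequence, Tuple

Label = Optional[str]  # None means "no identification for this frame"


def verify_fusion(face: Label, speaker: Label) -> Label:
    """Strategy 1 (assumed rule): keep the face label only if the speaker label agrees."""
    return face if face is not None and face == speaker else None


def supplement_fusion(face: Label, speaker: Label) -> Label:
    """Strategy 2 (assumed rule): use the speaker label, fall back to the face label."""
    return speaker if speaker is not None else face


def precision_recall(predicted: Sequence[Label], truth: Sequence[Label]) -> Tuple[float, float]:
    """Precision = correct / frames with a prediction; recall = correct / labeled frames."""
    correct = sum(p is not None and p == t for p, t in zip(predicted, truth))
    n_predicted = sum(p is not None for p in predicted)
    n_truth = sum(t is not None for t in truth)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_truth if n_truth else 0.0
    return precision, recall


# Hypothetical per-frame labels (toy data, not taken from the paper).
truth = ["jerry", "jerry", "elaine", "george", "kramer"]
face = ["jerry", "elaine", "elaine", None, "kramer"]
speaker = ["jerry", "jerry", None, "george", "jerry"]

verified = [verify_fusion(f, s) for f, s in zip(face, speaker)]
supplemented = [supplement_fusion(f, s) for f, s in zip(face, speaker)]

for name, labels in [("face", face), ("speaker", speaker),
                     ("verify", verified), ("supplement", supplemented)]:
    p, r = precision_recall(labels, truth)
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")
```

On this toy data the verification rule only outputs a label when both channels agree, so it trades recall for precision, while the fallback rule outputs a label whenever either channel does. That is the same qualitative trade-off the abstract reports, but the numbers here carry no evidential weight.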

Bibliographic Information

Source
Conference on Storage and Retrieval for Media Databases 2001, Jan 24-26, 2001, San Jose, USA | 2001 | pp. 180-187 | 8 pages
Conference location: San Jose, CA (US)
Authors
Dongge Li; Gang Wei; Ishwar K. Sethi; Nevenka Dimitrova;
Author affiliation

Department of Computer Science Wayne State University, Detroit, MI 48202;

Original format: PDF
Language: English
Chinese Library Classification: Radio electronics, telecommunication technology
Keywords
audio-visual analysis; speaker identification; face recognition; MAP estimator;

Similar Literature

1. Robust Multimodal Person Recognition Using Low-Complexity Audio-Visual Feature Fusion Approaches [J]. Dhaval Shah, Kyu J. Han, Shrikanth S. Narayanan. International Journal of Semantic Computing. 2010, No. 2
2. Semantic analysis based on fusion of audio/visual features for soccer video [J]. Zengkai Wang. Procedia Computer Science. 2021, No. 1
3. Video-based person re-identification using a novel feature extraction and fusion technique [J]. Wanru Song, Jieying Zheng, Yahong Wu, et al. Multimedia Tools and Applications. 2020, No. 17-18
4. Fusion of Visual and Audio Features for Person Identification in Real Video [C]. Dongge Li, Gang Wei, Ishwar K. Sethi, et al. Conference on Storage and Retrieval for Media Databases. 2001
5. Real-time Video Alignment and Fusion Using Feature Detection on FPGA Devices [D]. Robert Haywood Taglang. 2017
6. Perceptual Doping: An Audiovisual Facilitation Effect on Auditory Speech Processing From Phonetic Feature Extraction to Sentence Identification in Noise [O]. Shahram Moradi, Björn Lidestam, Elaine Hoi Ning Ng, et al.
7. Combination of SVM and Score Normalization for Person Identification based on audio-visual feature fusion [O]. 2015
