We present an automatic tool for compact representation and cross-referencing of long video sequences, which is based on a novel visual abstraction of semantic content. Our highly compact hierarchical representation results from the non-temporal clustering of scene segments into a new conceptual form grounded in the recognition of real-world backgrounds. We represent shots and scenes using mosaics and employ a novel method for the comparison of scenes based on these representative mosaics. We then cluster scenes together into a higher level of abstraction - the physical setting. We demonstrate our work using situation comedies (sitcoms), where each half-hour episode is well structured by rules governing background use. Consequently, browsing, indexing and comparison across videos by physical setting is very fast. Further, we show that physical settings lead to a higher-level contextual identification of the main plots in each video. We demonstrate these contributions with a browsing tool whose top-level single page displays the settings of several episodes. This page expands to display windows for each episode, and each episode menu summary is further expanded into scenes and shots, all by mouse-clicking on appropriate plots and settings according to user interests.
展开▼