With the growing popularity of video publishing on the web, much popular content is being mirrored, reformatted, modified and republished, resulting in excessive content duplication. While such redundancy provides fault tolerance for continuous availability of information, it can create problems for multimedia search engines: the search results for a given query may become repetitious and cluttered with a large number of duplicates. Developing techniques for detecting similarity and duplication is therefore important to multimedia search engines. In addition, content providers might be interested in identifying duplicates of their content for legal, contractual or other business-related reasons. In this paper, we propose an efficient algorithm called video signature to detect similar video sequences in large databases such as the web. The idea is to first form a "signature" for each video sequence by selecting a small number of its frames that are most similar to a number of randomly chosen seed images. The similarity between any two video sequences can then be reliably estimated by comparing their respective signatures. Using this method, we achieve 85% recall and precision ratios on a test database of 377 video sequences. As a proof of concept, we have applied our proposed algorithm to a collection of 1800 hours of video, corresponding to around 45000 clips from the web. Our results indicate that, on average, every video in our collection from the web has around five similar copies.
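The signature idea described above can be illustrated with a minimal sketch. It is not the implementation from the paper: the abstract does not specify how frames are represented, how frame similarity is measured, or how signatures are compared. The sketch assumes each frame is already reduced to a fixed-length feature vector, uses Euclidean distance, and scores two videos by the fraction of seeds whose selected frames lie within a threshold `eps` of each other. All function names and parameters here are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_seeds(num_seeds, dim, rng):
    """Draw random seed vectors; the same seeds must be shared by all videos."""
    return [[rng.random() for _ in range(dim)] for _ in range(num_seeds)]


def video_signature(frames, seeds):
    """For each seed, keep the frame of this video that is closest to it."""
    return [min(frames, key=lambda f: distance(f, s)) for s in seeds]


def signature_similarity(sig_a, sig_b, eps):
    """Fraction of seeds whose two selected frames are within eps (assumed rule)."""
    matches = sum(1 for fa, fb in zip(sig_a, sig_b) if distance(fa, fb) <= eps)
    return matches / len(sig_a)


if __name__ == "__main__":
    rng = random.Random(0)
    seeds = make_seeds(num_seeds=8, dim=4, rng=rng)

    # Synthetic "videos": lists of 4-dimensional frame features.
    video = [[rng.random() for _ in range(4)] for _ in range(20)]
    reordered = list(reversed(video))                      # same frames, new order
    unrelated = [[x + 10.0 for x in f] for f in video]     # far from every frame

    sig_video = video_signature(video, seeds)
    print(signature_similarity(sig_video, video_signature(reordered, seeds), eps=0.05))
    print(signature_similarity(sig_video, video_signature(unrelated, seeds), eps=0.05))
```

Because each signature holds only one frame per seed, comparing two videos costs a number of distance computations equal to the number of seeds, regardless of video length. In this sketch the signature also ignores frame order, so a reordered copy selects the same frames as the original.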