In this half-day CVPR 2010 course we discuss the problems of video search, present methods for achieving state-of-the-art performance, and indicate how to obtain improvements in the near future. We give an overview of the developments and future trends in the field on the basis of the TRECVID competition -- the leading competition for video search engines run by NIST -- where we have consistently scored a top-three performance over the last five years.
The scientific topic of video search is dominated by five major challenges:
- the semantic gap between a visual concept and its lingual representation;
- the sensory gap between an object and its many appearances due to the accidental sensing conditions;
- the model gap between the number of notions in the world and the capacity to learn them;
- the query-context gap between the information need and the possible retrieval solutions;
- the interface gap between the tiny window the screen offers and the amount of data.
We integrate the features and machine learning aspects into a complete concept-based video search engine, which has successfully competed in TRECVID. The system includes computer vision, machine learning, information retrieval, and human-computer interaction. We follow the video data as they flow through the computational processes, starting from fundamental visual features, covering local shape, texture, color, motion, and the crucial need for invariance. Then, we explain how invariant features can be used in concert with kernel-based supervised learning methods to arrive at a concept detector. We discuss the important role of fusion on a feature, classifier, and semantic level to improve the robustness and general applicability of detectors. We end our component-wise decomposition of video search engines by explaining the complexities involved in delivering a limited set of uncertain concept detectors to an impatient user. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits.
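As a rough illustration of fusion at the classifier level, the sketch below combines the scores of several concept detectors with a simple unsupervised average rule and ranks shots by the fused score. It is not the MediaMill implementation: the function names, the min-max normalization, the averaging rule, and the toy scores are all assumptions chosen for brevity.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale one detector's scores over all shots to [0, 1].

    A detector that gives every shot the same score carries no ranking
    information, so it is mapped to all zeros.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_classifier_scores(scores_per_detector: Sequence[Sequence[float]]) -> List[float]:
    """Average rule (hypothetical choice): normalize each detector, then average per shot."""
    normalized = [min_max_normalize(s) for s in scores_per_detector]
    n_detectors = len(normalized)
    # zip(*normalized) regroups the scores so each tuple belongs to one shot.
    return [sum(per_shot) / n_detectors for per_shot in zip(*normalized)]


def rank_shots(shot_ids: Sequence[str], fused_scores: Sequence[float]) -> List[str]:
    """Return shot identifiers ordered from most to least likely to show the concept."""
    ordered = sorted(zip(shot_ids, fused_scores), key=lambda pair: pair[1], reverse=True)
    return [shot_id for shot_id, _ in ordered]


if __name__ == "__main__":
    # Toy scores from two hypothetical detectors on three shots; the detectors
    # use different output ranges, which is why normalization precedes fusion.
    detector_a = [0.2, 0.9, 0.4]
    detector_b = [10.0, 30.0, 50.0]
    fused = fuse_classifier_scores([detector_a, detector_b])
    print(rank_shots(["shot_a", "shot_b", "shot_c"], fused))
```

Averaging is only one of many possible combination rules; supervised alternatives would instead learn how to weight each detector from annotated examples.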
Comparative evaluation of methods and systems is imperative to appreciate progress. We discuss the data, tasks, and results of TRECVID, the leading benchmark. In addition, we discuss the many derived community initiatives in creating annotations, baselines, and software for repeatable experiments. We conclude the course with our perspective on the many challenges and opportunities ahead for the computer vision and pattern recognition community.
Lecture Topics
The technical content of our short course on video search engines is organized as follows:
- Problem statement: social, business, and scientific,
- Course organization: fundamentals, fusion, retrieval, evaluation.
- Invariance: the sensory and semantic gap,
- Local shape: Gaussians, Gabors, and Loweans,
- Texture: natural image statistics, gradients, Weibulls,
- Color: light source, reflection, and representation,
- Motion: optic flow, tracking.
- Concept detection: compact feature representations, kernel-based supervised learning, the model gap,
- Feature fusion: synchronization, normalization, transformation, and concatenation,
- Classifier fusion: supervised and unsupervised methods,
- Semantic fusion: graphical models, data mining, and ontologies,
- Search engine architectures: component optimization, process optimization.
- Large-scale concept detection: annotation efforts, detector performance,
- Translating queries to detectors: textual, visual, semantic, and their combination,
- Interacting with the user through the interface gap: browsing and learning.
- NIST TRECVID Benchmark: data, tasks, and results,
- Benchmark criticism: broad-domain applicability, repeatability, VideOlympics showcase,
- Resources: annotations, baselines, and software,
- Demonstration of the MediaMill Semantic video search engine.
- Concluding remarks: achievements and discussion,
- Future work: challenges and opportunities for the computer vision and pattern recognition community.
The lecture slides, including pointers to data sets, software, and videos, as well as several general references, are available here.
Several relevant papers are listed on our publication server.
Cees G.M. Snoek received the M.Sc. degree in business information systems (2000) and the Ph.D. degree in computer science (2005) both from the University of Amsterdam, The Netherlands, where he is currently a senior researcher at the Intelligent Systems Lab Amsterdam. He was a Visiting Scientist at Informedia, Carnegie Mellon University, USA in 2003. His research interests focus on multimedia signal processing and analysis, statistical pattern recognition, content-based information retrieval, social media retrieval, and large-scale benchmark evaluations, especially when applied in combination for video retrieval. He has published over 70 refereed book chapters, journal and conference papers in these fields, and serves on the program committee of several conferences. Dr. Snoek is a lead researcher of the award-winning MediaMill Semantic Video Search Engine, which is a consistent top performer in the yearly NIST TRECVID evaluations. He is initiator and co-organizer of the annual VideOlympics, and was the local chair of the 2007 ACM International Conference on Image and Video Retrieval. He is a lecturer of post-doctoral courses given at international conferences and European summer schools. He is a member of ACM and IEEE. Dr. Snoek received a young talent (VENI) grant from the Netherlands Organization for Scientific Research in 2008.
Arnold W.M. Smeulders graduated from the Technical University of Delft in physics in 1977 (M.Sc.) and in 1982 from Leyden University in medicine (Ph.D.) on the topic of visual pattern analysis. In 1994, he became full professor in multimedia information analysis at the University of Amsterdam. He has an interest in cognitive vision, content-based image retrieval, the picture-language question, as well as in systems for the analysis of video. He has written over 250 papers in refereed journals and conferences. He received a Fulbright grant at Yale University in 1987, and a visiting professorship at the City University Hong Kong in 1996, and again at Tsukuba, Japan in 1998. In 2000, he was elected fellow of the International Association for Pattern Recognition. He was associate editor of IEEE Transactions on PAMI. Currently he is associate editor of the International Journal of Computer Vision as well as the IEEE Transactions on Multimedia. He is a member of the steering committee of the IEEE's International Conference on Multimedia and Expo series. He participates in the DELOS and MUSCLE networks of excellence of the EU. He was keynote speaker and chairman of the program committee of conferences including the IEEE Multimedia conference in Florence in 1999, ICIP 2000, CVPR in 2001, and CIVR in 2004 in Dublin. He was general chair of ICME 2005 in Amsterdam. In 1996, he was treasurer of the Faculty and director of the Informatics Institute at the University of Amsterdam. Currently, he is scientific director of the Intelligent Systems Lab Amsterdam of 65 staff members, the MultimediaN national public-private partnership of 30 institutions and companies, and of the national research school ASCI. He has graduated 32 Ph.D. students.