Tutorial ICCV09

Human-centered Vision Systems

Kyoto University Clock Tower Centennial Hall, room TBA

September 27, 2009, 14:00-17:15

SLIDES

Nicu Sebe, University of Trento, Italy

sebe@disi.unitn.it
Hamid Aghajan, Stanford University, USA

aghajan@stanford.edu


 
SYNOPSIS
This tutorial will take a holistic view of the research issues and applications of Human-Centered Vision Systems, focusing on three main areas:
multimodal interaction: visual (body, gaze, gesture) and audio (emotion) analysis;
smart environments;
distributed and collaborative fusion of visual information.
 
MOTIVATION
Human-computer interaction lies at the crossroads of many research areas (computer vision, multimedia, psychology, artificial intelligence, 
pattern recognition, etc.) and is used in a wide range of applications. In particular, we are aiming at developing human-centered information 
systems. The most important issue here is how to achieve synergism between man and machine. The term “human-centered” is used to 
emphasize the fact that although all existing vision systems were designed with human users in mind, many of them are far from being 
user friendly. What can the scientific/engineering community do to effect a change for the better?
 
On the one hand, the fact that computers are quickly becoming integrated into everyday objects (ubiquitous and pervasive computing) implies 
that effective natural human-computer interaction is becoming critical (in many applications, users need to be able to interact naturally with 
computers, much as in face-to-face human-human interaction). On the other hand, the wide range of applications that use multimedia, 
and the amount of multimedia content currently available, imply that building successful computer vision and multimedia applications requires 
a deep understanding of multimedia content. The success of human-centered vision systems, therefore, depends heavily on two related aspects: 
(1) the way humans interact naturally with such systems (using speech and body language) to express emotion, mood, attitude, and attention, 
and (2) the human factors that pertain to multimedia data (human subjectivity, levels of interpretation).
 
In this tutorial, we take a holistic approach to the human-centered vision systems problem. We aim to identify the important research issues, 
and to ascertain potentially fruitful future research directions in relation to the two aspects above. In particular, we introduce key concepts and 
discuss technical approaches and open issues in three areas: (1) multimodal interaction: visual (body, gaze, gesture) and audio (emotion) 
analysis; (2) smart environments; (3) distributed and collaborative fusion of visual information.
 
The tutorial sets forth application design examples in which a user-centric methodology is adopted across the different stages from feature 
and pose estimation in early vision to user behavior modeling in high-level reasoning. The role of querying the user for feedback will be discussed 
with examples in smart home applications. Several implemented applications based on the notion of user-centric design will be introduced 
and discussed. The focus of the short course, therefore, is on technical analysis and interaction techniques formulated from the perspective 
of key human factors in a user-centered approach to developing Human-Centered Vision Systems.
 
BENEFITS & LIST OF TOPICS
This tutorial will enable the participants to understand key concepts, state-of-the-art techniques, and open issues in the areas described below. 
In relation to the conference, the tutorial will cover parts of the following topic areas:
New paradigms for HCI: smart environments, smart networked objects, augmented and mixed reality, ubiquitous computing, pervasive 
computing, tangible computing, intelligent interfaces, and wearable computing.
Vision for smart environments: overview of techniques and state of the art in body tracking and pose, gaze detection, etc.
Multi-camera networks: user activity and behavior modeling, smart homes, occupancy-based services, distributed and collaborative processing.
Multimodal emotion recognition for affective retrieval and in affective interfaces: approaches to multimedia content analysis and 
interaction that use speech and facial expression recognition.
Machine learning: adaptive multimodal interfaces and learning of visual concepts from user input for automatic detection and recognition 
(detection of scenes, objects, or events of interest).
Multimodal fusion: technical approaches and issues in combining multiple media (e.g., audio-visual) for multimodal interaction and 
multimedia analysis.
Interfaces between the vision processing module and high-level reasoning: the role of feedback to vision, knowledge accumulation, user 
behavior modeling, and environment discovery.
Applications: traditional and emerging application areas will be described with specific examples in smart conference room research, 
arts, interaction for people with disabilities, entertainment, and others.
 
INTENDED AUDIENCE
The short course is intended for PhD students, scientists, engineers, application developers, computer vision specialists and others interested in the 
areas of information retrieval and human-computer interaction. A basic understanding of image processing and machine learning is a prerequisite.