Biometric Authentication: Multicue data fusion—Part I

S.Y. Kung, M.W. Mak, S.H. Lin - October 04, 2012

10.1 Introduction

Various research studies have suggested that no single modality can provide an adequate solution for high-security applications. These studies agree that it is vital to use multiple modalities, drawing on visual, infrared, acoustic, chemical, and other sensors.

The problem of combining the power of several classifiers is of great importance to various applications. In many remote sensing, pattern recognition, and multimedia applications, it is common for different channels or sensors to facilitate the recognition of an object. In addition, for many applications with very high-dimensional feature data, fusing multiple modalities provides some computational relief by dividing feature vectors into several lower-dimensional vectors before integrating them for the final decision (i.e., the divide-and-conquer principle). Consequently, it is very important to develop intelligent and sophisticated techniques for combining information from different sensors [1].
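
The divide-and-conquer idea above can be sketched in a few lines: a high-dimensional feature vector is split into lower-dimensional subvectors, each scored by a local expert, and the local scores are then integrated. The splitting scheme, the distance-based scoring, and all numeric values below are hypothetical illustrations, not the chapter's actual classifiers.

```python
# Sketch: divide-and-conquer fusion of a high-dimensional feature vector.
# The 3-way split, the toy distance score, and all numbers are hypothetical.

def split_features(vec, n_parts):
    """Split a feature vector into n_parts lower-dimensional subvectors."""
    size = len(vec) // n_parts
    return [vec[i * size:(i + 1) * size] for i in range(n_parts)]

def local_score(sub, template):
    """Toy local expert: negative squared distance to a stored template."""
    return -sum((x - t) ** 2 for x, t in zip(sub, template))

def fused_score(vec, templates):
    """Combine local expert scores by averaging (a simple fusion rule)."""
    subs = split_features(vec, len(templates))
    scores = [local_score(s, t) for s, t in zip(subs, templates)]
    return sum(scores) / len(scores)

# A 6-dimensional probe split into three 2-dimensional subvectors.
probe = [0.9, 1.1, 2.0, 2.1, 3.0, 2.9]
templates = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
score = fused_score(probe, templates)
```

Each local expert only ever sees a 2-dimensional vector, which is the computational relief the divide-and-conquer principle refers to.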

To cope with the limitations of individual biometrics, researchers have proposed using multiple biometric traits concurrently for verification. Such systems are commonly known as multimodal verification systems [183]. By using multiple biometric traits, systems gain greater immunity to intruder attacks. For example, it is more difficult for an impostor to impersonate another person using both audio and visual information simultaneously. Multicue biometrics also helps improve system reliability. For instance, while background noise has a detrimental effect on the performance of voice biometrics, it has no influence on face biometrics.

Conversely, although the performance of face recognition systems depends greatly on lighting conditions, lighting has no effect on voice quality (see Problem 1 for a numerical example). As a result, audio and visual (AV) biometrics has attracted a great deal of attention in recent years. However, multiple biometrics should be used with caution because catastrophic fusion may occur if the biometrics are not properly combined; such fusion occurs when the performance of an ensemble of combined classifiers is worse than that of any of the individual classifiers.
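
Catastrophic fusion is easy to demonstrate numerically. In the contrived example below, four genuine trials are scored by two classifiers, and naive equal-weight score averaging ends up less accurate than either classifier on its own. All scores and the 0.5 accept threshold are invented for illustration.

```python
# Hypothetical genuine-trial scores from two classifiers; a score above
# 0.5 means "accept". All numbers are illustrative, not from the chapter.
THRESHOLD = 0.5
scores_a = [0.6, 0.1, 0.9, 0.6]
scores_b = [0.1, 0.6, 0.9, 0.1]
truth = [True, True, True, True]  # every trial is a genuine client

def accuracy(scores):
    """Fraction of trials where the thresholded decision matches the truth."""
    return sum((s > THRESHOLD) == t for s, t in zip(scores, truth)) / len(truth)

# Naive equal-weight score fusion.
fused = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

acc_a, acc_b, acc_fused = accuracy(scores_a), accuracy(scores_b), accuracy(fused)
# acc_a = 0.75, acc_b = 0.5, but acc_fused = 0.25: each classifier's
# confident mistakes drag the other's correct decisions below threshold.
```

This is precisely the failure mode the text warns about: the combined ensemble performs worse than any individual classifier.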

Biometric pattern recognition systems must be computationally efficient for real-time processing, and VLSI DSP architectures have made it economically feasible to support such intelligent processing in real time. Neural networks (NNs) are particularly suitable for such real-time sensor fusion and recognition because they can easily adapt to incoming data and take the special characteristics of individual sensors into consideration.

This chapter proposes and evaluates a novel neural network architecture for the effective, efficient fusion of signals from multiple modalities. By treating the information pertaining to each sensor as a local expert, hierarchical NNs offer a very attractive architectural solution for multi-sensor information fusion. A hierarchical NN comprising many local classification experts and an embedded fusion agent (i.e., a gating network) will be developed for the architecture and algorithm design. In this context, the notion of mixture-of-experts (MOE) offers an instrumental tool for combining information from multiple local experts (see Figure 6.3). For effective, efficient local experts, the decision-based neural network (DBNN) is adopted as the local expert classification module. The proposed hierarchical NN can effectively incorporate the powerful expectation-maximization (EM) algorithm for adaptive training of (1) the discriminant functions in the classification modules and (2) the gating parameters in the fusion network.
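
The MOE combination step described above can be sketched as follows: each local expert emits a vector of class scores, and a gating function produces softmax weights that blend them into one fused score vector. This is only a minimal forward-pass sketch; the chapter's DBNN experts and EM-trained gating parameters are not reproduced, and the expert outputs and gating logits below are placeholders.

```python
import math

def softmax(z):
    """Numerically stable softmax producing gating weights that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_combine(expert_outputs, gating_logits):
    """Weighted sum of expert class-score vectors using softmax gates."""
    gates = softmax(gating_logits)
    n_classes = len(expert_outputs[0])
    return [sum(g * out[c] for g, out in zip(gates, expert_outputs))
            for c in range(n_classes)]

# Two hypothetical experts (e.g., audio and visual), two classes
# (client, impostor); the gate happens to favor the audio expert here.
audio_scores = [0.8, 0.2]
visual_scores = [0.4, 0.6]
gating_logits = [1.0, 0.0]
combined = moe_combine([audio_scores, visual_scores], gating_logits)
```

In a trained system, the gating logits would themselves be a function of the input, letting the fusion agent shift trust between modalities (e.g., away from the audio expert in noisy conditions).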

This chapter also shows why such a hierarchical NN is not only cost-effective in terms of computation but also functionally superior in terms of recognition performance. In addition to the fusion of data collected from multiple sensors, it is possible to fuse the scores of multiple samples from a single sensor. This chapter details a novel approach to computing the optimal weights for fusing scores, based on the score distribution of independent samples and prior knowledge of the score statistics. Evaluations of this multi-sample fusion technique on speaker verification, face verification, and audio and visual (voice plus face) biometric authentication are reported.
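
As a rough illustration of multi-sample score fusion, the sketch below fuses the scores of several independent samples using weights derived from their score statistics. Inverse-variance weighting is used here only as a familiar stand-in; the chapter's actual optimal-weight derivation is not reproduced, and all scores and variances are hypothetical.

```python
# Sketch: fusing scores of multiple samples from one sensor with weights
# derived from score statistics. Inverse-variance weighting is an
# assumption for illustration, not the chapter's derivation.

def inverse_variance_weights(variances):
    """Normalize inverse variances so the weights sum to 1."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return [w / total for w in inv]

def fuse_scores(scores, variances):
    """Weighted combination: low-variance (reliable) samples count more."""
    weights = inverse_variance_weights(variances)
    return sum(w * s for w, s in zip(weights, scores))

# Three utterance scores (hypothetical) and their estimated variances.
scores = [1.2, 0.8, 1.0]
variances = [0.5, 2.0, 1.0]
fused = fuse_scores(scores, variances)
```

The key idea carried over from the text is that the fusion weights come from the score distribution rather than being fixed in advance.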


10.2 Sensor Fusion for Biometrics

Sensor fusion is an information processing technique (see [66, 125]) through which information produced by several sources can be optimally combined. The human brain is a good example of a complex, multi-sensor fusion system: it receives five kinds of signals (sight, hearing, taste, smell, and touch) from five different sensors (the eyes, ears, tongue, nose, and skin). Typically, it fuses signals from these sensors for decision-making and motor control. The human brain also fuses signals at different levels for different purposes. For example, humans recognize objects by both seeing and touching them; humans also communicate by watching the speaker's face and listening to his or her voice at the same time. All of these phenomena suggest that the human brain is a flexible and sophisticated fusion system.

Research in sensor fusion can be traced back to the early 1980s [17, 348]. Sensor fusion can be applied in many ways, such as detection of the presence of an object, recognition of an object, tracking an object, and so on. This chapter focuses on sensor fusion for verification purposes. Information can be fused at two different levels: feature and decision. Decision-level fusion can be further divided into abstract fusion and score fusion. These fusion techniques are discussed in the following two subsections.
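
The distinction between the two decision-level styles named above can be made concrete: abstract fusion sees only the hard accept/reject labels (e.g., a majority vote), while score fusion combines the underlying scores before thresholding once. The modality scores and the 0.5 threshold below are illustrative assumptions.

```python
# Sketch contrasting abstract fusion (majority vote on hard decisions)
# with score fusion (sum rule on raw scores). Numbers are illustrative.

THRESHOLD = 0.5
scores = [0.45, 0.48, 0.9]  # three modality scores for one identity claim

# Abstract fusion: each modality decides first, then majority vote.
votes = [s > THRESHOLD for s in scores]
abstract_accept = sum(votes) > len(votes) / 2

# Score fusion: average the raw scores, then threshold once.
score_accept = (sum(scores) / len(scores)) > THRESHOLD
```

Here the two schemes disagree: two modalities are barely below threshold, so the vote rejects, but the one very confident modality pulls the average above threshold, so score fusion accepts. This is why score fusion generally preserves more information than abstract fusion.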
