In computer vision, person re-identification is the task of identifying and monitoring people as they move across a network of non-overlapping cameras. Large changes in viewing angle, lighting, background clutter, and occlusion cause a person's features to vary considerably from camera to camera. The talk addresses the following research questions toward scalable and more accurate person re-identification.
The first question was: can we model the way features are transformed between cameras? Can we also learn the way features do not get transformed, and thereby tell whether an image pair from two separate cameras shows the same person? The similarity between feature histograms and time-series data motivated us to apply the principle of Dynamic Time Warping (DTW) and study the transformation of features by warping the feature space. After capturing the feature warps, we modeled the variability of the warp functions as a function space of feature warps. This function space not only models feasible transformations between pairs of instances of the same target, but also separates them from infeasible transformations between instances of different targets.
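As a rough illustration of the warping idea, the sketch below aligns two feature histograms with classic dynamic time warping and returns the optimal warp path between their bins. The absolute-difference bin cost, the three-step recursion, and the function name `dtw` are assumptions for illustration; the sketch captures a single warp and does not model the function space of warps described above.

```python
from typing import List, Tuple


def dtw(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW between two 1-D sequences (here: feature histograms).

    Returns the accumulated alignment cost and the warp path as (i, j) bin pairs.
    Local cost |a[i] - b[j]| is an assumption made for this sketch.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = minimal cost of aligning the first i bins of a with the first j bins of b
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])

    # Backtrack from (n, m) to (0, 0) to recover the warp function
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {
            (i - 1, j - 1): D[i - 1][j - 1],
            (i - 1, j): D[i - 1][j],
            (i, j - 1): D[i][j - 1],
        }
        i, j = min(steps, key=steps.get)
    path.reverse()
    return D[n][m], path


# A histogram and a shifted copy of it align with zero cost once warped
cost, path = dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])
# cost == 0.0; path == [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```

The recovered path plays the role of one "feature warp"; collecting such paths over many same-person and different-person pairs is what the function-space model would then be learned from.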
Existing person re-identification methods are camera-pairwise: they focus on finding similarities between persons across a pair of cameras. While this works well for a two-camera network, it leads to inconsistent re-identification when the network contains three or more cameras. The inconsistency arises because directly matching a person from one camera (say camera A) to another camera (say camera B) can give a different result from a series of sequential matches that starts with the same person in camera A, passes through a set of intermediate cameras, and ends in camera B. We asked two questions here: can the results be made consistent, and will enforcing consistency improve re-identification performance? We addressed the problem by posing re-identification as an optimization that minimizes the global cost of associating pairs of targets over the entire camera network, subject to a set of consistency criteria.
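To make the notion of inconsistency concrete, here is a minimal sketch that represents each pairwise result as a dictionary from a person's index in one camera to the matched index in another (a hypothetical representation) and lists the people whose direct A-to-B match disagrees with the chained match through an intermediate camera C. The helper names `compose` and `inconsistent_people` are illustrative only.

```python
from typing import Dict, List

# Hypothetical representation: person index in one camera -> matched index in another
Match = Dict[int, int]


def compose(first: Match, second: Match) -> Match:
    """Chain two pairwise matchings, e.g. A -> C followed by C -> B gives A -> B."""
    return {x: second[y] for x, y in first.items() if y in second}


def inconsistent_people(ab: Match, ac: Match, cb: Match) -> List[int]:
    """People in camera A whose direct A -> B match differs from the path A -> C -> B."""
    via_c = compose(ac, cb)
    return sorted(x for x in ab if via_c.get(x) != ab[x])


ac = {0: 1, 1: 0, 2: 2}
cb = {0: 1, 1: 0, 2: 2}  # chained A -> C -> B is the identity: {0: 0, 1: 1, 2: 2}
print(inconsistent_people({0: 0, 1: 1, 2: 2}, ac, cb))  # []
print(inconsistent_people({0: 0, 1: 2, 2: 1}, ac, cb))  # [1, 2]
```

Each pairwise matching can look reasonable on its own; the inconsistency only appears when the results are chained, which is why a network-level formulation is needed.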
Most traditional multi-camera re-identification systems learn a static model from tediously labeled training data. For the large multi-sensor data typically encountered in person re-identification, labeling many samples is not only an overhead but also does not always add information, because much of the labeling is redundant. The next questions we asked were therefore: is it possible to select a manageable set of training images for annotation while maintaining good re-identification performance? Moreover, is it possible to select these examples progressively in an online setting, where not all of the training data may be available a priori? We propose an iterative framework based on convex optimization that progressively and judiciously chooses a sparse but informative set of samples for labeling, with minimal overlap with previously labeled images. The framework not only reduces the labeling effort but can also handle new unlabeled data that arrives continuously.
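The selection goal above (few, diverse samples with little overlap with what is already labeled, chosen again as new data arrives) can be illustrated with a much simpler stand-in. The sketch below swaps in a greedy farthest-point (k-center) heuristic; it is not the convex-optimization framework described above, and the function name `select_for_labeling` and the Euclidean feature space are assumptions.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def select_for_labeling(unlabeled: List[Vec], labeled: List[Vec], k: int) -> List[int]:
    """Greedy farthest-point selection (a k-center heuristic, NOT the convex formulation).

    Picks up to k indices from `unlabeled`, each time taking the sample farthest from
    everything already labeled or already picked, so new picks overlap little with
    previous labels. Euclidean distance on feature vectors is an assumption.
    """
    # Distance from each unlabeled sample to its nearest already-labeled sample
    nearest = [min((math.dist(u, l) for l in labeled), default=math.inf) for u in unlabeled]
    chosen: List[int] = []
    for _ in range(min(k, len(unlabeled))):
        idx = max((i for i in range(len(unlabeled)) if i not in chosen), key=lambda i: nearest[i])
        chosen.append(idx)
        # The newly chosen sample now "covers" its neighbors as well
        for i, u in enumerate(unlabeled):
            nearest[i] = min(nearest[i], math.dist(u, unlabeled[idx]))
    return chosen


labeled = [(0.0, 0.0)]
unlabeled = [(0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-4.0, 0.0)]
print(select_for_labeling(unlabeled, labeled, 2))  # [2, 3]
```

In the example, sample 0 is skipped because it nearly duplicates an already-labeled image, and sample 1 is skipped once its near-twin 2 has been picked. For an online setting, the same call can be repeated on each newly arrived batch with `labeled` grown by the previous picks.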
The talk concludes with some insights into possible future directions that leverage the strengths of both active sample selection and consistency enforcement in a camera network.
Re-Identification in the Function Space of Feature Warps
A. Das, N. Martinel, C. Micheloni, A. Roy-Chowdhury; IEEE Trans. on Pattern Analysis and Machine Intelligence, 2015.
Active Image Pair Selection for Continuous Person Re-identification
A. Das, R. Panda, A. Roy-Chowdhury; IEEE International Conference on Image Processing, 2015.
The main idea is to model feature transformation between cameras.
WARD is a 3-camera dataset with 70 people appearing in each camera. Cumulative matching characteristic (CMC) curves were computed for each camera pair.
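For reference, a CMC curve for one camera pair can be computed from a probe-gallery distance matrix as sketched below. The sketch assumes a single-shot setting with the same people in both cameras, indexed so that probe i's true match is gallery i; this indexing, the tie handling, and the name `cmc` are assumptions for illustration.

```python
from typing import List


def cmc(dist: List[List[float]], max_rank: int) -> List[float]:
    """Cumulative matching characteristic for a single-shot camera pair.

    dist[i][j] is the distance between probe i (first camera) and gallery j
    (second camera); probe i's true match is assumed to be gallery i.
    Returns, for r = 1..max_rank, the fraction of probes whose true match
    appears within the top r ranked gallery entries.
    """
    n = len(dist)
    hits = [0] * max_rank
    for i, row in enumerate(dist):
        # Rank = 1 + number of gallery entries strictly closer than the true match
        # (ties are resolved in favor of the true match).
        rank = 1 + sum(1 for j, d in enumerate(row) if j != i and d < row[i])
        for r in range(rank - 1, max_rank):
            hits[r] += 1
    return [h / n for h in hits]


dist = [
    [0.1, 0.5, 0.9],
    [0.2, 0.3, 0.8],
    [0.4, 0.6, 0.7],
]
print(cmc(dist, 3))  # rank-1: 1/3, rank-2: 2/3, rank-3: 1.0
```

With three cameras, as in WARD, this would be evaluated once per camera pair (1-2, 1-3, 2-3).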
Existing person re-identification strategies are camera-pair specific.
High performance in camera pairwise person re-identification does not always mean consistent re-identification.
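Cost function: $\arg\max_{x} \sum_{p,q=1,\, p<q}^{m} \sum_{i,j=1}^{n} c_{i,j}^{p,q}\, x_{i,j}^{p,q}$
$c_{i,j}^{p,q}$: similarity score between person $i$ in camera $p$ and person $j$ in camera $q$; $x_{i,j}^{p,q} \in \{0,1\}$ is 1 if the two are associated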
Association Constraint: A person from any camera can have one and only one match from another camera
$\sum_{j=1}^{n} x_{i,j}^{p,q} = 1 \;\forall i$ and $\sum_{i=1}^{n} x_{i,j}^{p,q} = 1 \;\forall j$, with $i, j = 1, \dots, n$ and $p, q = 1, \dots, m$, $p < q$ …
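The objective and association constraint can be made concrete for a toy network. The sketch below assumes three cameras with the same n people in each, so every pairwise association satisfying the one-to-one constraint is a permutation; it enforces consistency by construction (the 1-to-3 match is the chained 1-to-2-to-3 match) and maximizes the summed similarity by exhaustive search. The brute-force search, the function names, and the similarity values are illustrative assumptions, not the solver used in the work; it is exponential in n and only suited to tiny examples.

```python
from itertools import permutations
from typing import List, Tuple

Sim = List[List[float]]  # c[i][j]: similarity of person i in camera p and person j in camera q


def score(c: Sim, perm: Tuple[int, ...]) -> float:
    """Sum of c[i][j] over associated pairs, i.e. x[i][j] = 1 iff perm[i] == j."""
    return sum(c[i][perm[i]] for i in range(len(perm)))


def consistent_assignment(c12: Sim, c23: Sim, c13: Sim):
    """Maximize total similarity over camera pairs (1,2), (2,3), (1,3).

    Each permutation satisfies the one-to-one association constraint; the (1,3)
    assignment is forced to equal the chained 1 -> 2 -> 3 match (consistency).
    Returns (total score, p12, p23, p13).
    """
    n = len(c12)
    best = None
    for p12 in permutations(range(n)):
        for p23 in permutations(range(n)):
            p13 = tuple(p23[p12[i]] for i in range(n))  # consistency constraint
            total = score(c12, p12) + score(c23, p23) + score(c13, p13)
            if best is None or total > best[0]:
                best = (total, p12, p23, p13)
    return best


c12 = [[0.9, 0.1], [0.1, 0.9]]
c23 = [[0.9, 0.1], [0.1, 0.9]]
c13 = [[0.4, 0.6], [0.6, 0.4]]
print(consistent_assignment(c12, c23, c13))  # total is about 4.4, all identity matches
```

Taken alone, pair (1,3) would prefer swapping the two people (0.6 + 0.6 > 0.4 + 0.4), which contradicts the chained identity matches from (1,2) and (2,3). The consistent global optimum keeps the identity association on every pair.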
accuracy = (# true matches + # true no-matches) / # unique people in the test set
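A worked version of this accuracy measure is sketched below. It assumes each person in the probe camera's test set maps to the identity matched in the other camera, or to `None` when that person does not appear there (a "no-match"); this representation, the choice of the probe camera's people as the denominator, and the name `reid_accuracy` are assumptions.

```python
from typing import Dict, Optional


def reid_accuracy(predicted: Dict[str, Optional[str]], truth: Dict[str, Optional[str]]) -> float:
    """accuracy = (# true matches + # true no-matches) / # unique people in the test set.

    `truth` maps every test-set person in the probe camera to their identity in the
    other camera, or None if they do not appear there. A person missing from
    `predicted` is treated as a predicted no-match.
    """
    true_match = sum(1 for p, t in truth.items() if t is not None and predicted.get(p) == t)
    true_no_match = sum(1 for p, t in truth.items() if t is None and predicted.get(p) is None)
    return (true_match + true_no_match) / len(truth)


truth = {"a": "a", "b": "b", "c": None, "d": "d"}
predicted = {"a": "a", "b": "d", "c": None, "d": None}
print(reid_accuracy(predicted, truth))  # 0.5: one true match (a) + one true no-match (c), out of 4
```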
Variation of training accuracy with k. Camera pairs (1-2 and 3-4) with all the same persons show a decreasing trend…
Re-identification with Big Data
Re-identification is addressed as querying a gallery set with a probe to retrieve its match, as in related recognition problems, e.g., face recognition and scene classification.
- Challenge 1: Uncontrolled multisensor data
- Challenge 2: Too large to be fully labeled.
Humans are good at recognizing people, so involve a human in the loop.
But manual labor is costly! So,
involve the human only for difficult cases, and do not ask them to recognize the same persons repeatedly.
Unlabeled images -> Representative Selection -> Representatives -> Human labeler -> Labels of Representatives -> Restrict redundancy
We want to label a small set of k ≪ n representatives
whose labels are the most informative for the least effort
i-LIDS-VID: 2 cameras in an airport arrival hall, 300 people