Notes on “FaceNet: A Unified Embedding for Face Recognition and Clustering”

Muhammed Abdullah

July 24, 2021

4 min read

tags: Face-Recognition, Triplet-Loss, Computer-Vision, Face-Verification

Authors: Florian Schroff, Dmitry Kalenichenko, James Philbin

Brief Outline

Face recognition is the problem where system identifying a person’s face from given photograph or video.

  • Face verification vs. face recognition:

    • Verification: Input: image, name/ID. (1 : 1) Output: whether the input image is that of the claimed person. “is this the claimed person?”

    • Recognition: Has a database of K persons Get an input image Output ID if the image is any of the K persons (or not recognized) “who is this person?”

Face Recognition is genarlly a one-shot learning task. One-shot learning is a classification task where the model should learn from one or few example of given class and be able to recognize it in the future. To make this work the model learns a similarity function. \(d(img1,img2)\) = degree of difference between images.

\[\begin{split}d(img1,img2) \begin{cases} \leq \tau, & \text{Same} \\ > \tau, & \text{Different} \end{cases} \end{split}\]

Introduction

  • FaceNet presented a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces) using the method based on learning a Euclidean embedding per image with a deep convolutional network.

  • The similarity of faces corresponds to the squared \(L_2\) distances of the embeddings of 128 dimensions learned using triplet loss function.

  • Once these embedding has been produced, then the next tasks become straight-forward:

    • face verification simply involves thresholding the distance between the two embeddings;

    • recognition becomes a k-NN classification problem;

    • and clustering can be achieved using off-theshelf techniques such as k-means or agglomerative clustering.

Model

  • FaceNet uses deep convolutional network. They have discussed two different core architechtures:

    1. Zeigler&Fergus Style Network

    2. Inception Network

  • The model is trained to get embedding \(f(x)\) from an image \(x\) into a feature space \(\mathbb{R}^d\), such that the squared distance (\(L^2_2\)) between the faces is small for same identity, whereas large for different identities.

  • Additionally, they constrain these embeddings (represented by \(f(x) ∈ \mathbb{R}^d\)) to live in \(d\)-dimensional hypersphere, i.e. \(||f(x)||_2 = 1\).

  • They employed Triplet Loss as the loss function.

  • Training Details:

    • Loss function: Triplet Loss

    • Batch-Size: 1,800

    • Optimizers: SGD with standard backprop and AdaGrad.

    • Learning rate: start with 0.05 then lowered.

    • \(\alpha\) = 0.2

    • Embedding dimensionality: 128-D float vector and can be easily quantized to 128 bytes. Larger embedding will require more training for same accuracy. Smaller embeddings may cause minor loss of accuracy can could be used on mobile devices.

Triplet Loss

  • The Triplet Loss minimizes the distance between an Anchor image (\(x^a_i\)) and a Positive image (\(x^p_i\)), both of which have the same identity, and maximizes the distance between the Anchor image(\(x^a_i\)) and a Negative image (\(x^n_i\)) of a different identity.

  • \(\therefore\) We want \(d(x^a_i, x^p_i) + \alpha < d(x^a_i, x^n_i)\) where \(\alpha\) is a margin enforced between positive and negative pairs to make sure the model won’t get an output of zeros easily.

        \(||f(x^a_i) - f(x^p_i)||_2^2 + \alpha < ||f(x^a_i) - f(x^n_i)||_2^2,\;\; \forall \;\; (f(x^a_i), f(x^p_i), f(x^n_i)) ∈ \tau\), \(\tau\) is a set of all possible triplets in the training set.

The Loss Function that is being minimized is         \(L = \sum_{i}^{N} [\; ||f(x^a_i) - f(x^p_i)||_2^2 \;-\; ||f(x^a_i)- f(x^n_i)||_2^2 + \alpha\;]_+\)

Triplet Selection

  • Generating all possible triplets would result in many triplets that easily satisfy the the distance condition mentioned above, thus not contributing to learning and would also slow down the convergence. These triplets are said to be Easy Triplets and will always give loss \(0\) as \(d(x^a_i, x^p_i) + \alpha < d(x^a_i, x^n_i)\)

  • Thus for faster convergence and better learning those triplets are need which violate the triplet constraint (refered as Hard Triplets) i.e. \(d(x^a_i, x^n_i) < d(x^a_i, x^p_i)\)

  • Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning, we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains.

  • The two methods discussed were:

  1. Generate triplets offline

    • For every n epochs compute the embeddings on the subset of data and and only select the hard triplets i.e \(argmax_{x^a_i}(d(x^a_i, x^p_i))\) and \(argmin_{x^a_i}(d(x^a_i, x^n_i))\).

    • Exploring offline triplet mining resulted in inconclusive results and is not very efficient.

  2. Generate triplets online

    • This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

    • Instead of just the hardest positive (\(argmax_{x^a_i}(d(x^a_i, x^p_i))\)), all anchor-positive pairs in a mini batch are picked, while still still selecting the hard negatives.

      We don’t have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchorpositive method was more stable and converged slightly faster at the beginning of training.

    • Also instead of selecting just hardest negative (\(argmin_{x^a_i}(d(x^a_i, x^n_i))\)), the negative exemplars are selected such that: \(d(x^a_i, x^p_i) < d(x^a_i, x^n_i)\). These are refered as Semi-Hard Triplets and lie inside a margin \(\alpha\).

      Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. \(f(x) = 0\)).


types of triplets
Image Credits: Omoindrot

Further Reading:

Evaluation

\(\mathcal{P}_{same}\): All pairs with same identity.

\(\mathcal{P}_{diff}\): All pairs of different identity.

Given a threshold \(h\) for image distance, the correctly classified as same are called true accepts

\(\text{TA}(h) = {(i, j) ∈ \mathcal{P}_{same}, with\; d(x_i , x_j ) ≤ h}\) and incorrectly classified as same are called false accepts

\(\text{FA}(h) = {(i, j) ∈ \mathcal{P}_{diff}, with\; d(x_i , x_j ) ≤ h}\) .

Validation Rate \(\text{VAL}(h) = \frac{|TA(h)|}{|\mathcal{P}_{same}|}\)

False accept Rate \(\text{FAR}(h) = \frac{|FA(h)|}{|\mathcal{P}_{diff}|}\)

Conclusion

  • They achieved higher accuracy than the previous state-of-the-art methods DeepFace (error reduced by factor of 7), DeepId2 (error reduced by 30%).

  • They used the same embeddings from face verification to cluster faces.

  • The important part was how the triplets were chosen for training.