MobileFaceSwap

arXiv | PaperWithCode | GitHub [PaddlePaddle]

Title

MobileFaceSwap: A Lightweight Framework for Video Face Swapping

Lightweight video face swapping.

Abstract

Advanced face swapping methods have achieved appealing results. However, most of these methods have many parameters and computations, which makes it challenging to apply them in real-time applications or deploy them on edge devices like mobile phones.

Current face swapping models already work well, but they have too many parameters to run in real-time scenarios on mobile devices.

In this work, we propose a lightweight Identity-aware Dynamic Network (IDN) for subject-agnostic face swapping by dynamically adjusting the model parameters according to the identity information.

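  • Identity-aware Dynamic Network (IDN): a lightweight network whose parameters are adjusted according to the identity information.
  • subject-agnostic: not tied to one specific person, i.e. one-shot face swapping.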

In particular, we design an efficient Identity Injection Module (IIM) by introducing two dynamic neural network techniques, including the weights prediction and weights modulation. Once the IDN is updated, it can be applied to swap faces given any target image or video.

An efficient Identity Injection Module (IIM), built on two dynamic neural network techniques:

  • weights prediction
  • weights modulation

The presented IDN contains only 0.50M parameters and needs 0.33G FLOPs per frame, making it capable for real-time video face swapping on mobile phones.

Our IDN is small: 0.50M parameters and 0.33G FLOPs per frame.

In addition, we introduce a knowledge distillation-based method for stable training, and a loss reweighting module is employed to obtain better synthesized results.

  • knowledge distillation: used for stable training.
  • loss reweighting module: used to obtain better synthesized results.

Finally, our method achieves comparable results with the teacher models and other state-of-the-art methods.

The results are on par with the teacher models and other state-of-the-art methods.

Conclusion

In this work, we propose MobileFaceSwap for real-time subject-agnostic face swapping. We design an efficient Identity Injection Module (IIM) to adjust the parameters of the Identity-aware Dynamic Network (IDN) adaptively.

The IIM is what dynamically adjusts the parameters of the IDN.

Then, we use the knowledge distillation and design a loss reweighting module to obtain better swapped results. Our method can be deployed on mobile phones, perform real-time face swapping.

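Knowledge distillation and a loss reweighting module yield better swapped results. The model can be deployed on mobile phones and swaps faces in real time.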

Besides, we can generate some forgery samples by MobileFaceSwap and hope these will have a little impact on forgery detection, as some forgery techniques are likely to be abused for malicious purposes.

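Vocabulary: forgery (n.), a fake or counterfeit. The safety concerns, and the claim that these samples could help forgery detection, read to me as the authors excusing themselves.

Figures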

image-20220731111207496

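The authors say this frame rate was measured on a Dimensity 1100, but it is not clear what they used for acceleration. Judging by the parenthesized names such as SimSwap, is the proposed model actually pruned from existing models like SimSwap? And what does "Teacher" mean here?

Answer: once knowledge distillation enters the picture, "Teacher" means the teacher model used to train MobileFaceSwap in that process. The model is not pruned from it.

Answer: Paddle Lite ships with some GPU/NPU support, so they probably used the MediaTek GPU for acceleration.

Introduction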

Target vs. Source

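The source image is Iron Man's face, i.e. the face whose appearance the final result takes on.

The target image is my face. It is the driving image, which supplies the motion and expression.

First idea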

A natural idea to address the challenges above is using model compression techniques to produce a lightweight network for face swapping.

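We tried to reduce the cost through model compression, but the quality dropped sharply. Vocabulary: artifact (n.), literally a man-made object; here, an unwanted visual flaw in a generated image.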

Inspired by the subject-aware face swapping technique (Perov et al. 2020)

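Inspired by DeepFaceLab, a model trained for one specific person. If our model is trained for one particular target person, it gets good results with few parameters.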

Therefore, to achieve subject-agnostic and real-time face swapping, an intuitive idea is to adjust the parameters of a neural network according to the identity information.

But our goal is one-shot swapping, so we need to change the network's parameters from that single image to achieve the same effect.

Network design

Inspired by the dynamic neural network techniques, we propose a lightweight Identity-aware Dynamic Network (IDN) for real time face swapping.

(Skipped.)

To efficiently inject identity information, we also design an Identity Injection Module (IIM) using weights prediction (De Brabandere et al. 2016) and weights modulation (Karras et al. 2020) to adjust the parameters of IDN. In this way, the IDN can be updated given the needed identity information,

These are the two terms mentioned above.

Without further optimization such as quantization,

(Skipped.)

Knowledge distillation

Generally, training a neural network for face swapping is unstable, and the generated images may have obviously artifacts.

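We use knowledge distillation to make training more stable. Knowledge distillation transfers knowledge from a large model to a small one.

Knowledge distillation is a teacher-student scheme. The teacher models are large, already trained models. The student model is our small model.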

However, the teacher model may also produce some failure cases, such as the generated image having a low identity similarity with the source image. The student model can be misled by these failure cases and produce suboptimal results.

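Vocabulary: mislead (vt.), to lead in the wrong direction; alleviate (vt.), to lessen; distillation (n.), the process of distilling.

The teacher models are not perfect either, and their wrong outputs mislead the student model. The paper adds a loss reweighting module so that the student is less affected by those wrong outputs.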

Main contributions

The main contributions of the paper are:

  • We propose a real-time framework for video face swapping. It contains only 0.50M parameters and 0.33G FLOPs, and arrives at 26 FPS on the mobile phone.
  • We present an Identity Injection Module (IIM), which utilizes the weights prediction and weights modulation for more efficient identity information injection to build an Identity-aware Dynamic Network (IDN).
  • To stabilize the learning process, we train the proposed network using a knowledge distillation framework and propose a loss reweighting module to improve the generated results qualitatively and quantitatively.

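Face Swapping

  • subject-aware: tied to a specific person, trained per subject
    • DeepFaceLab: encoder-decoder architecture
  • subject-agnostic: not tied to a specific person, one-shot
    • source-oriented: first transfer my expression onto Iron Man's face (probably by warping it), then cut out Iron Man's face and paste it onto my head. Yuval et al. belong to this category. The results of these methods look somewhat fake.
    • target-oriented: use a neural network to blend Iron Man's features into my face. FaceShifter, SimSwap, FaceController and HifiFace belong to this category. These methods are too expensive to run.

Vocabulary: blend (vt.), to mix; reenactment (n.), acting something out again (here, reproducing an expression or pose); prone to, tending to.

It turns out FaceController is the authors' own work. I couldn't keep a straight face.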

Dynamic Neural Networks

A dynamic neural network refers to one that adapts its structure or parameters to the input during inference, which can result in greater computational efficiency.

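Even at inference time, the network can change its parameters according to the input, which gives higher computational efficiency.

Knowledge Distillation

First, it is used for model compression. Second, it can make unpaired training more stable.

Method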

image-20220731120438348

Figure 2: MobileFaceSwap framework: (a) The overall training process. (b) The Identity Injection Network (IIN) of MobileFaceSwap, which contains several Identity Injection Modules (IIM) and utilizes the identity information to predict or modulate the weights of the IDN. (c) The architecture of the Identity-aware Dynamic Network (IDN) and a weakly semantic fusion module for face swapping, that contain only 0.50M parameters and 0.33G FLOPs in total.

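Panel (a) describes the training process; no need to read it yet.

Panel (b) is the IIN, the operator that changes the IDN's parameters. It extracts features from Iron Man (the source) and modifies the IDN's parameters layer by layer, like tightening screws.

Panel (c) is the IDN. This is the final generator network, apparently a stack of convolutional layers.

Network Architecture

The identity information is extracted by ArcFace.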

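Identity-aware Dynamic Network (IDN)

The IDN is a slimmed-down U-Net. Regular convolutions are replaced by cheaper depthwise and pointwise convolutions (see Zhihu for details). Correspondingly, weights prediction is designed for the depthwise convolutions and weights modulation for the pointwise convolutions.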
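To get a feel for why swapping regular convolutions for depthwise plus pointwise ones saves parameters, here is a rough count. The channel width 64 and kernel size 3 are made-up numbers for illustration, not values from the paper, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a regular k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
regular = conv_params(64, 64, 3)         # 64 * 64 * 9 = 36864
separable = separable_params(64, 64, 3)  # 576 + 4096 = 4672
print(regular, separable, round(regular / separable, 2))
```

For this toy layer the separable version needs roughly one eighth of the weights, which is the kind of saving that makes a 0.50M-parameter generator plausible.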

Depthwise Convolution
Note

Weights prediction is designed for the depthwise convolutions.

We utilize the identity embedding...

Pointwise Convolution
Note

Also, we observe that it can obtain better results...
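A minimal sketch of the two injection mechanisms, in plain Python so the arithmetic is visible. Everything here is an assumption for illustration: the function names, the toy shapes, the linear projection standing in for a learned layer, and the demodulation step, which follows the StyleGAN2-style formulation of Karras et al. 2020 and may not match the paper's IIM exactly.

```python
import math


def predict_depthwise_kernels(z_id, proj, channels, k):
    """Weights prediction: map the identity embedding to one k*k kernel per channel.

    proj is a (channels*k*k) x len(z_id) matrix standing in for a learned linear layer.
    """
    flat = [sum(p * z for p, z in zip(row, z_id)) for row in proj]
    return [flat[c * k * k:(c + 1) * k * k] for c in range(channels)]


def modulate_pointwise(weight, scales, eps=1e-8):
    """Weights modulation: scale each input channel of a 1x1 conv, then demodulate.

    weight[o][i] is the weight from input channel i to output channel o;
    scales[i] would be derived from the identity embedding.
    """
    out = []
    for row in weight:
        scaled = [w * s for w, s in zip(row, scales)]
        norm = math.sqrt(sum(v * v for v in scaled) + eps)
        out.append([v / norm for v in scaled])  # each output filter gets unit L2 norm
    return out


z_id = [1.0, 0.5]                          # toy identity embedding
proj = [[0.1 * i, 0.2] for i in range(9)]  # toy stand-in for learned weights
kernels = predict_depthwise_kernels(z_id, proj, channels=1, k=3)
modulated = modulate_pointwise([[1.0, 2.0], [3.0, 4.0]], scales=[2.0, 0.5])
```

The point of both functions is the same: the convolution weights become a function of the source identity, so the IDN can be "updated" once per source face and then reused for every target frame.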

Semantic Fusion Module

It keeps the regions that should not change, such as the background, consistent with my image (the target image).
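One common way to keep the background is a soft mask blend between the generated image and the target image. The sketch below assumes that form; the notes do not spell out the module's exact formula, so treat the blend and the name `fuse` as illustrative only. Images are flattened to lists of pixel values to keep the example short.

```python
def fuse(generated, target, mask):
    """Blend per pixel: mask = 1 keeps the generated face, mask = 0 keeps the target."""
    return [m * g + (1.0 - m) * t for g, t, m in zip(generated, target, mask)]


generated = [1.0, 1.0, 1.0]  # toy swapped-face pixels
target = [0.0, 0.0, 0.0]     # toy target-image pixels
mask = [1.0, 0.0, 0.5]       # face, background, boundary
fused = fuse(generated, target, mask)
```

Where the mask is 0 the target pixel passes through untouched, which is what keeps the background identical to the target image.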

Training Objectives

Knowledge distillation

Generally, training a face swapping network requires many loss functions to guarantee that the generated result meets the definition of face swapping. The competition of these different losses makes the training process unstable and easier to generate artifacts as there is no paired ground truth for the constraint.

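There are too many kinds of losses, and they fight each other during optimization, which makes training unstable. To address this, we turn the originally unpaired training objective into a paired one by means of knowledge distillation: the teacher's output serves as the ground truth for the student.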
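With a teacher in place, the student has a concrete target for every input pair. A pixel-wise L1 term is one natural way to express that; the sketch below assumes it, and the paper's actual distillation loss may include further terms. The function name and the flattened-list image format are illustrative.

```python
def l1_distillation_loss(student_out, teacher_out):
    """Mean absolute difference between student and teacher pixels."""
    if len(student_out) != len(teacher_out):
        raise ValueError("student and teacher outputs must have the same size")
    total = sum(abs(s - t) for s, t in zip(student_out, teacher_out))
    return total / len(student_out)


# Toy outputs for the same (source, target) pair.
loss = l1_distillation_loss([0.2, 0.5, 0.9], [0.0, 0.5, 1.0])
```

Because the target is a single image rather than a mix of competing constraints, the student's gradient points in one consistent direction, which is where the stability comes from.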

Loss reweighting

However, the teacher is not perfect and some bad cases can be found in the teacher outputs.

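The teacher model makes mistakes too. There are two failure cases: first, it fails to preserve Iron Man's identity; second, the output looks unnatural or obviously fake. So we introduce a loss reweighting module to reduce the influence of failed samples on the student's learning.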

Specifically, we use the square of the cosine distance between the identity representations of the teacher output and source image to measure the identity similarity.

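How alike they are is measured with cosine similarity. The more the output looks like Iron Man, the closer the cosine similarity between the two identity embeddings is to 1; the less alike they are, the closer its square gets to 0.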

Finally, we employ this model Q to evaluate the image quality of the teacher outputs.

The quality of the output is scored by a ResNet-based model. The higher the score, the better the result.

The final per-sample weight is:

\[ \alpha = \mathrm{cosim}(z_{id}, z_{id}')^2 \times Q(I_g') \]
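The weight formula is easy to check by hand. The sketch below computes it for toy two-dimensional embeddings; the embeddings and the quality score 0.8 are made-up values, and `sample_weight` is an illustrative name, not one from the released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sample_weight(z_id_source, z_id_teacher, quality):
    """alpha = cosim(z_id, z_id')^2 * Q(I_g')."""
    return cosine_similarity(z_id_source, z_id_teacher) ** 2 * quality


good = sample_weight([1.0, 0.0], [1.0, 0.0], 0.8)  # identity kept
half = sample_weight([1.0, 0.0], [1.0, 1.0], 0.8)  # identity partly lost
bad = sample_weight([1.0, 0.0], [0.0, 1.0], 0.8)   # identity lost
```

A teacher output that loses the identity, or that scores poorly on quality, ends up with a weight near zero and so contributes little to the student's training signal.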

Experiments

Qualitative Results

The paper shows a number of example images; see them there.

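Quantitative Results

How the metrics are computed

  • Id: cosine similarity computed with CosFace; higher means more alike.
  • Pose: L2 distance computed from the outputs of a pose estimator, measuring how similar the head poses are; lower is better.
  • FID: Fréchet Inception Distance; lower is better. Read here as a sign of how well semantic information such as the background is preserved.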
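As a small illustration of the Pose metric, the L2 distance between two pose vectors can be computed as below. Representing a pose as (yaw, pitch, roll) in degrees is an assumption made for this example; the notes only say the distance comes from a pose estimator's output.

```python
import math


def pose_l2(pose_a, pose_b):
    """Euclidean (L2) distance between two head-pose vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose_a, pose_b)))


# Hypothetical (yaw, pitch, roll) estimates for the target and the swapped result.
distance = pose_l2([10.0, 0.0, 0.0], [7.0, 4.0, 0.0])
```

A distance of zero would mean the swapped face reproduces the target's head pose exactly, which is why lower is better.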

Ablation Study

image-20220731141828734

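The upper half shows the advantage of the network architecture; the lower half shows the advantage of the chosen training objectives (loss functions and so on).

A few examples. After removing the Semantic Module, FID gets markedly worse, which means semantic information such as the background is damaged. And after removing the Id loss, training only cares about preserving the background, so the background is kept well but the face is barely swapped. All of this is as expected.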