【CVPR2022 oral】MixFormer: Mixing Features across Windows and Dimensions
Date: 2022-08-14 03:30:01
Paper: https://arxiv.org/pdf/2204.02557.pdf
Code: https://github.com/PaddlePaddle/PaddleClas
Qiang Chen, one of the paper's authors, posted a walkthrough on ReadPaper that is worth reading: https://readpaper.com/paper/669120545282228224
1. Research Motivation
This paper targets two problems of local-window self-attention in vision tasks: (1) a limited receptive field; (2) weak modeling ability along the channel dimension.
Several works have addressed the limited receptive field of local-window attention: HaloNet and Focal Transformer expand the windows (Expanding); Swin Transformer shifts the windows (Shift); and Shuffle Transformer introduces a Shuffle operation to exchange information between local windows.
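For context, the "limited receptive field" issue arises because the feature map is split into non-overlapping windows and attention is computed only inside each window. A minimal sketch of this window partitioning (function name and shapes are my own, not from the paper):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws*ws, C). Self-attention is then
    computed independently inside each window, so within one layer no
    information flows between windows.
    """
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    # Reorder to (h_block, w_block, h_in, w_in, C), then flatten the blocks.
    windows = x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)
    return windows

x = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
w = window_partition(x, 4)
print(w.shape)  # (4, 16, 3)
```

Expanding, shifting, or shuffling these windows are all different ways of letting information cross the window boundaries that this partition creates.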
The paper makes two contributions:
- A parallel design that combines local-window self-attention with depth-wise convolution in two branches, fusing information across local windows and enlarging the receptive field.
- A bi-directional interaction module between the parallel branches: because the two operations share weights along different dimensions, the module exchanges information across those dimensions and strengthens the model's ability to capture each of them.
2. Method
The core of the paper is the proposed mixing block, whose structure is shown in the figure below. The depth-wise convolution and local-window attention branches run in parallel, with bi-directional feature interaction between them.
The details of the bi-directional interaction are shown in the next figure. In essence, it applies SENet-style squeeze-and-excitation to one branch to produce weights for the other: the DwConv branch supplies channel-dimension weights, while the W-Attention branch supplies spatial (HW) weights.
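As a rough sketch of that idea (my own simplification with random stand-in weights, not the authors' code): the channel interaction squeezes the DwConv branch over HW to produce per-channel gates for the attention branch, while the spatial interaction squeezes the attention branch over channels to produce per-position gates for the DwConv branch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bidirectional_interaction(conv_feat, attn_feat, rng):
    """SE-style bi-directional interaction (simplified sketch).

    conv_feat, attn_feat: (H, W, C) outputs of the DwConv branch and the
    local-window attention branch. The projection matrices below stand in
    for learned parameters.
    """
    H, W, C = conv_feat.shape
    # Channel interaction: DwConv branch -> per-channel gates for attention.
    squeezed = conv_feat.mean(axis=(0, 1))                   # (C,) global avg pool
    w1 = rng.standard_normal((C, C // 2))
    w2 = rng.standard_normal((C // 2, C))
    chan_gate = sigmoid(np.maximum(squeezed @ w1, 0) @ w2)   # (C,), values in (0, 1)
    attn_out = attn_feat * chan_gate                         # broadcast over H, W
    # Spatial interaction: attention branch -> per-position gates for conv.
    w3 = rng.standard_normal((C, 1))
    spat_gate = sigmoid(attn_feat @ w3)                      # (H, W, 1)
    conv_out = conv_feat * spat_gate
    return conv_out, attn_out

rng = np.random.default_rng(0)
conv_feat = rng.standard_normal((8, 8, 16))
attn_feat = rng.standard_normal((8, 8, 16))
conv_out, attn_out = bidirectional_interaction(conv_feat, attn_feat, rng)
print(conv_out.shape, attn_out.shape)  # (8, 8, 16) (8, 8, 16)
```

Because each branch is gated by a signal derived from the other, channel statistics reach the attention branch and spatial statistics reach the convolution branch, which is exactly the cross-dimension exchange the paper argues each operation lacks on its own.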
The overall network architecture is shown below. At the end, a fully-connected layer projects the H/32 x W/32 feature map to 1280 dimensions, which is then fed to the classification head mapping to the output classes.
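A hedged sketch of such a head, with the 1280 width taken from the text and the last-stage channel count, projection order, and pooling chosen by me as plausible assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 1000       # assumed (ImageNet-style classification)
C_last = 512             # assumed channel width of the last stage
feat = rng.standard_normal((7, 7, C_last))   # H/32 x W/32 map for a 224x224 input

proj = rng.standard_normal((C_last, 1280))   # FC projection to 1280 dims
fc = rng.standard_normal((1280, num_classes))  # classification head

x = feat @ proj                 # (7, 7, 1280) projected features
x = x.mean(axis=(0, 1))         # global average pool -> (1280,)
logits = x @ fc                 # (num_classes,) class scores
print(logits.shape)  # (1000,)
```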
The experiments are not covered here; please refer to the paper.
3. Some Thoughts
The idea of the paper is simple, but whether the DwConv and Attention modules should be combined in parallel or in series is worth pondering; see the supplementary material.
Also, the "discussion with related works" in the supplementary material is interesting and mentions many similar works:
In MixFormer, we consider two types of information exchanges: (1) across dimensions, (2) across windows.
- For the first type, Conformer [39] also performs information exchange between a transformer branch and a convolution branch. However, its motivation is different from ours. Conformer aims to couple local and global features across convolution and transformer branches. MixFormer uses channel and spatial interactions to address the weak modeling ability issues caused by weight sharing on the channel (local-window self-attention) and the spatial (depth-wise convolution) dimensions [17].
- For the second type, Twins (strided-convolution global sub-sampled attention) [6] and Shuffle Transformer (neighbor-window connection (NWC) and random spatial shuffle) [26] construct local and global connections to achieve information exchange, while MSG Transformer (channel shuffle on extra MSG tokens) [12] applies a global connection. Our MixFormer achieves this goal by concatenating the parallel features: the non-overlapped window feature and the locally-connected feature (the output of the dwconv3x3).