A First Look at Scene Graphs and Visual Relations

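Scene graphs and visual relations are mainly used for image understanding, and can serve as the basis for captioning, text-to-image generation, and retrieval. The basic idea is to first use a purely visual detection method to extract proposals and their labels from the image, and then build the scene graph from that information.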
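To make the target structure concrete, here is a minimal sketch of how a scene graph could be stored: a list of detected objects (label plus box) and a list of relation triples that index into it. The class names `SceneObject` and `SceneGraph`, and the example boxes and labels, are assumptions made up for illustration and do not come from any of the papers below.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    label: str                       # object category, e.g. "man"
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels


@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    # Each relation is (subject index, predicate, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Return readable (subject, predicate, object) triples."""
        return [
            (self.objects[s].label, p, self.objects[o].label)
            for s, p, o in self.relations
        ]


# Hypothetical detections for an image of a man riding a horse.
graph = SceneGraph(
    objects=[
        SceneObject("man", (10, 20, 120, 300)),
        SceneObject("horse", (60, 150, 400, 380)),
    ],
    relations=[(0, "riding", 1)],
)
print(graph.triples())  # [('man', 'riding', 'horse')]
```

Storing relations as index triples keeps the detection output (objects) separate from the graph structure (edges), which mirrors the two-stage approach described above.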

Below is a summary of the scene graph generation papers I have read recently.

1. Graph R-CNN for Scene Graph Generation

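The current SOTA model in this area, an ECCV 2018 submission. The model includes a Relation Proposal Network (RePN), which efficiently handles the fact that the number of pairwise relations between objects grows quadratically with the number of objects. The paper also proposes an attentional graph convolutional network (aGCN), which efficiently captures the interactions between objects and relations. The last contribution is a new evaluation metric, which the authors argue is more comprehensive and more realistic than the existing ones. The authors summarize that their model achieves state-of-the-art results in the experiments under both the existing metrics and the newly proposed one.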
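The quadratic growth mentioned above can be illustrated with a small sketch: enumerate all n(n-1) ordered object pairs, score each pair, and keep only the top K. The scoring function here (a sigmoid over the dot product of the two class-probability vectors) is only a toy stand-in for the learned relatedness score in RePN, and the names `relatedness` and `prune_pairs` are hypothetical.

```python
import math
from itertools import permutations


def relatedness(p_subj, p_obj):
    """Toy stand-in for a learned relatedness score (assumption)."""
    dot = sum(a * b for a, b in zip(p_subj, p_obj))
    return 1.0 / (1.0 + math.exp(-dot))


def prune_pairs(class_probs, top_k):
    """Score all ordered pairs (i, j) with i != j and keep the top_k."""
    n = len(class_probs)
    pairs = list(permutations(range(n), 2))  # n * (n - 1) candidates
    scored = [
        (relatedness(class_probs[i], class_probs[j]), i, j) for i, j in pairs
    ]
    scored.sort(reverse=True)  # highest score first
    return [(i, j) for _, i, j in scored[:top_k]]


# Made-up class distributions for 4 detected objects over 3 classes.
probs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
num_candidates = len(probs) * (len(probs) - 1)
kept = prune_pairs(probs, top_k=3)
print(num_candidates, "candidate pairs ->", len(kept), "kept:", kept)
```

With 4 objects there are already 12 ordered pairs; with 100 objects there would be 9,900, which is why pruning before the graph network matters.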

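Given an image, the model first generates object proposals and assumes that every pair of proposals may be related. RePN then filters these candidate relations so that the graph becomes sparse. Finally, a graph convolutional network integrates the information and updates the labels of the object nodes and relationship edges.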
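The last step, aggregating information over the sparse graph with attention, can be sketched roughly as follows. This is a simplified illustration and not the paper's aGCN: it assumes dot-product attention scores, and it omits the learned weight matrices and the nonlinearity. The function name `attention_update` is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_update(features, neighbors):
    """One attention-weighted aggregation step over a sparse graph.

    features:  list of node feature vectors
    neighbors: neighbors[i] lists the node indices connected to node i
    """
    updated = []
    for i, z_i in enumerate(features):
        if not neighbors[i]:
            updated.append(list(z_i))  # isolated node: keep as is
            continue
        # Assumed attention score: dot product with each neighbor.
        scores = [
            sum(a * b for a, b in zip(z_i, features[j])) for j in neighbors[i]
        ]
        alphas = softmax(scores)
        # Attention-weighted sum of neighbor features.
        agg = [
            sum(alpha * features[j][d] for alpha, j in zip(alphas, neighbors[i]))
            for d in range(len(z_i))
        ]
        # Add the aggregated message to the node's own feature.
        updated.append([a + b for a, b in zip(z_i, agg)])
    return updated


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs = [[1], [0, 2], []]  # edges that survived pruning (made up)
new_feats = attention_update(feats, nbrs)
print(new_feats)
```

In a plain GCN every neighbor would contribute with a fixed weight; here the weights `alphas` depend on the features, so more relevant neighbors contribute more.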

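Let I denote the image, V a set of nodes where each node corresponds to the region of one object detected in I, E the set of edges, and O and R the object and relation labels. The scene graph generation model can then be written as $P(S=(V, E, O, R) | I)$, which the paper factorizes into three parts: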

$P(\mathcal{S} | \boldsymbol{I})=\overbrace{P(\boldsymbol{V} | \boldsymbol{I})}^{\text {Object Region }} \underbrace{P(\boldsymbol{E} | \boldsymbol{V}, \boldsymbol{I})}_{\text {Relationship } \atop \text { Proposal }} \overbrace{P(\boldsymbol{R}, \boldsymbol{O} | \boldsymbol{V}, \boldsymbol{E}, \boldsymbol{I})}^{\text {Graph Labeling }}$

Paper: https://arxiv.org/pdf/1808.00191.pdf

2. Neural Motifs: Scene Graph Parsing with Global Context

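The method first extracts proposals from the image and feeds them in sequence through a bidirectional LSTM to obtain the object context (c), which is passed along as global information. c is then fed into an LSTM decoder to obtain the labels. c and the labels are fed together into another bidirectional LSTM to obtain the edge context (d), which is likewise passed along as global information. The fully connected layer on top of d uses a different bias depending on the motif, and outputs the predicted relations.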

$\mathbf{C}=\operatorname{biLSTM}\left(\left[\mathbf{f}_{i} ; \mathbf{W}_{1} \mathbf{l}_{i}\right]_{i=1, \ldots, n}\right)$

$\begin{array}{l}{\mathbf{h}_{i}=\operatorname{LSTM}_{i}\left(\left[\mathbf{c}_{i} ; \hat{\mathbf{o}}_{i-1}\right]\right)} \\ {\hat{\mathbf{o}}_{i}=\operatorname{argmax}\left(\mathbf{W}_{o} \mathbf{h}_{i}\right) \in \mathbb{R}^{|\mathcal{C}|}(\text { one-hot })}\end{array}$

$\mathbf{D}=\operatorname{biLSTM}\left(\left[\mathbf{c}_{i} ; \mathbf{W}_{2} \hat{\mathbf{o}}_{i}\right]_{i=1, \ldots, n}\right)$
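The motif-dependent bias on the final layer can be illustrated with a small sketch: the logits computed from the edge context are shifted by a bias selected by the (subject label, object label) pair before the softmax. All numbers, the predicate list, and the names `PAIR_BIAS` and `predict_predicate` are made up for illustration; in the real model the context logits come from d and the bias would be learned or estimated from data.

```python
import math

PREDICATES = ["on", "riding", "wearing"]

# Hypothetical bias table indexed by (subject label, object label).
PAIR_BIAS = {
    ("man", "horse"): [0.5, 2.0, -1.0],
    ("man", "hat"): [-0.5, -2.0, 2.5],
}


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_predicate(context_logits, subj_label, obj_label):
    """Add the label-pair bias to the context logits, then take softmax."""
    bias = PAIR_BIAS.get((subj_label, obj_label), [0.0] * len(PREDICATES))
    logits = [c + b for c, b in zip(context_logits, bias)]
    probs = softmax(logits)
    best = max(range(len(PREDICATES)), key=probs.__getitem__)
    return PREDICATES[best], probs


# The same context logits give different predictions for different label pairs.
ctx = [1.0, 0.5, 0.5]
print(predict_predicate(ctx, "man", "horse")[0])  # riding
print(predict_predicate(ctx, "man", "hat")[0])    # wearing
print(predict_predicate(ctx, "dog", "grass")[0])  # on (no bias entry)
```

The point of the sketch is that the object labels alone already carry a strong prior over the predicate, and the context features only need to adjust it.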

Paper: https://rowanzellers.com/neuralmotifs