
Comments (1)

pkhungurn commented on May 17, 2024

The whole snippet that you quoted, plus the one line of code above it, is an implementation of Zhou et al.'s appearance flow algorithm. I reproduce the snippet in full below:

    # Excerpt from the model's forward pass: y is an intermediate feature tensor,
    # image is the input image, and affine_grid and grid_sample come from
    # torch.nn.functional.
    # Predict a per-pixel offset and reshape it from (n, 2, h*w) to (n, h, w, 2).
    grid_change = torch.transpose(self.zhou_grid_change(y).view(n, 2, h * w), 1, 2).view(n, h, w, 2)
    device = self.zhou_grid_change.weight.device
    # Identity flow field: every output pixel samples from its own input location.
    identity = torch.Tensor([[1, 0, 0], [0, 1, 0]]).to(device).unsqueeze(0).repeat(n, 1, 1)
    base_grid = affine_grid(identity, [n, c, h, w])
    grid = base_grid + grid_change  # full flow field = identity grid + predicted offsets
    resampled = grid_sample(image, grid, mode='bilinear', padding_mode='border')

As a result, the affine_grid function does not by itself correspond to an implementation of the appearance flow algorithm. However, it is a part of that implementation, as it appears in one of the six lines of the snippet above.

I think you might be confused because you mentioned that "affine_grid [...] builds a Spatial Transformer Networks." This is not true. What affine_grid does is create a flow field that realizes the affine transformation given as one of its arguments. Jaderberg et al. have a part of their Spatial Transformer Network learn to predict this argument. The flow field produced by affine_grid is then fed to grid_sample, which finally produces a resampled image for further processing. This is why the PyTorch documentation for affine_grid says:

"This function is often used in conjunction with grid_sample() to build Spatial Transformer Networks."

Note that it says "used [...] to build Spatial Transformer Networks." So affine_grid is a part of a Spatial Transformer Network; it does not, by itself, "build" one.
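
For contrast, here is a minimal sketch (my own illustration, not code from any of the papers or from my repository) of how a Spatial Transformer-style module typically uses affine_grid: a small hypothetical localization head, theta_head, predicts the six affine parameters, affine_grid expands them into a flow field, and grid_sample warps the image with it.

    import torch
    from torch import nn
    from torch.nn.functional import affine_grid, grid_sample

    class TinySpatialTransformer(nn.Module):
        def __init__(self, c, h, w):
            super().__init__()
            # Hypothetical localization head that predicts the 2x3 affine matrix.
            self.theta_head = nn.Linear(c * h * w, 6)
            # Initialize to the identity transformation so the module starts as a no-op.
            nn.init.zeros_(self.theta_head.weight)
            self.theta_head.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

        def forward(self, image):
            n, c, h, w = image.shape
            theta = self.theta_head(image.view(n, -1)).view(n, 2, 3)  # learned transformation
            grid = affine_grid(theta, [n, c, h, w], align_corners=False)
            return grid_sample(image, grid, mode='bilinear',
                               padding_mode='border', align_corners=False)

The network here learns only theta (six numbers per image); the flow field itself is never predicted directly.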

Moreover, even though affine_grid is a part of Spatial Transformer Networks, nothing prevents it from being used to build other networks, including Zhou et al.'s.

This brings us to the main difference between Zhou et al.'s paper and Jaderberg et al.'s paper. Zhou et al.'s network predicts the whole flow field, whereas Jaderberg et al.'s predicts the transformation parameters (affine transformations or thin plate splines) that are used to create the flow field.
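
To make the difference concrete, here is a rough shape comparison (illustrative sizes only, not from either paper):

    import torch

    n, h, w = 1, 64, 64                # example sizes
    # Jaderberg et al.: predict the transformation parameters, e.g. one 2x3 affine
    # matrix per image; affine_grid then expands it into an (n, h, w, 2) flow field.
    theta = torch.zeros(n, 2, 3)       # 6 numbers per image
    # Zhou et al.: predict the flow field itself, i.e. a 2D sampling location
    # (or offset) for every output pixel.
    flow = torch.zeros(n, h, w, 2)     # 2 * h * w = 8192 numbers per image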

My snippet implements Zhou et al.'s algorithm because it also predicts the whole flow field. That is the "grid" variable. The way it computes the grid variable is not direct, and this might have caused your confusion. It starts with "base_grid," which represents a flow field that copies every pixel to its original location. The base_grid is created by calling affine_grid with a fixed transformation (the identity). The fact that this transformation is not learned means that my snippet does not implement a Spatial Transformer Network.
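
A quick way to see what base_grid does (a standalone sketch, independent of my code) is to check that sampling with base_grid alone returns the input image unchanged:

    import torch
    from torch.nn.functional import affine_grid, grid_sample

    n, c, h, w = 1, 3, 8, 8
    image = torch.rand(n, c, h, w)
    # affine_grid with the identity transformation yields a flow field that maps
    # every output pixel back to its own location in the input.
    identity = torch.Tensor([[1, 0, 0], [0, 1, 0]]).unsqueeze(0).repeat(n, 1, 1)
    base_grid = affine_grid(identity, [n, c, h, w], align_corners=False)
    out = grid_sample(image, base_grid, mode='bilinear',
                      padding_mode='border', align_corners=False)
    print(torch.allclose(out, image, atol=1e-5))  # expected: True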

The main prediction happens when it computes the grid_change variable on the first line. grid_change acts as an offset to base_grid, so adding the two together yields the whole flow field. I did this because many pixels in the input image (especially those of the character's body) remain unchanged, so it is easier for the network to learn offsets than to create the whole flow field from scratch. A schematic version of the whole computation is sketched below.
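
As a rough sketch of the idea (the layer sizes here are placeholders, and the real zhou_grid_change in my code may be configured differently), the offset head only needs to output two channels, an x and a y offset per pixel; the transpose/view calls then rearrange its output into the (n, h, w, 2) layout that grid_sample expects:

    import torch
    from torch import nn
    from torch.nn.functional import affine_grid, grid_sample

    n, c, h, w = 1, 3, 64, 64
    feature_channels = 64                      # placeholder size
    y = torch.rand(n, feature_channels, h, w)  # stands in for the network's features
    image = torch.rand(n, c, h, w)

    # Hypothetical offset head: two output channels = (x, y) offset for every pixel.
    zhou_grid_change_head = nn.Conv2d(feature_channels, 2, kernel_size=3, padding=1)

    # (n, 2, h, w) -> (n, 2, h*w) -> (n, h*w, 2) -> (n, h, w, 2)
    grid_change = torch.transpose(zhou_grid_change_head(y).view(n, 2, h * w), 1, 2).view(n, h, w, 2)
    identity = torch.Tensor([[1, 0, 0], [0, 1, 0]]).unsqueeze(0).repeat(n, 1, 1)
    base_grid = affine_grid(identity, [n, c, h, w], align_corners=False)
    grid = base_grid + grid_change
    resampled = grid_sample(image, grid, mode='bilinear',
                            padding_mode='border', align_corners=False)
    print(resampled.shape)  # torch.Size([1, 3, 64, 64])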

I hope this answers your questions.

from talking-head-anime-demo.
