Published 2022. 8. 10. 15:45

Understanding Impersonator++


[Paper]

Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis

[training & synthesis for motion imitation]

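The pipeline consists of three main stages
For the style transfer and novel view tasks, only the Flow Composition stage is recomputed
A few-shot approach: a personalized model is trained for each input image

(a) Body Mesh Recovery: generates a 3D mesh for the input image

(b) Flow Composition: computes the transformation flow between Source ↔ Target and produces the warped image

(c) Liquid Warping GAN: preserves details such as texture and identity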

[Overview]

motion imitation
- Source → texture
- Target → pose
human motion imitation
- image-to-image translation based pipeline: learns a mapping function for images conditioned on inputs such as skeletons
- ✅ warping-based pipeline: warps the input image to the condition of the reference image

Liquid Warping Block: preserves details of the source image such as clothes and face
SMPL (Skinned Multi-Person Linear model): decomposes the human body into shape and pose (joint rotations)

Attentional Liquid Warping Block (AttLWB)
- Original LWB: adds the warped source features directly to the global features, so artifacts appear in overlapping regions
- + Attention: learns the similarity between each of the multiple source features and the global features, then fuses the sources as a linear combination weighted by the learned similarities
generalization → personalization
- Because a GAN fits the distribution of its training set, it may generalize poorly
- Few-shot adversarial learning is applied so that the model can focus on each individual input (personalization)

[modules]

Body Mesh Recovery

Flow Composition

Attentional Liquid Warping GAN

“synthesizing high-fidelity human images under the desired condition”

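1) Synthesizes the background

2) Predicts the color of invisible parts

3) Generates pixels for clothes, hair, and other regions that the SMPL reconstruction does not cover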

Generator

G_TSF (transfer stream)
- The stage that synthesizes the final result
- Input: the foreground warped by the bilinear sampler and the correspondence map C_t
- Uses AttLWB to preserve source information such as texture, style, and color
  - connects the source stream and the target stream
  - blends the source features from G_SID and fuses them into the transfer stream G_TSF
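
The bilinear sampler mentioned above can be sketched as follows. This is a minimal, dependency-free illustration of backward warping with a flow field, not the sampler used in the Impersonator++ code. The names `bilinear_sample` and `warp`, the single-channel list-of-rows image layout, and zero padding outside the image are all assumptions made for the example.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a single-channel image (list of rows) at a fractional (x, y).

    Pixels outside the image count as 0.0 (zero padding, an assumption).
    """
    h, w = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0  # fractional offsets inside the pixel cell

    def px(xi, yi):
        return image[yi][xi] if 0 <= xi < w and 0 <= yi < h else 0.0

    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * px(x0, y0) + wx * px(x0 + 1, y0)
    bottom = (1 - wx) * px(x0, y0 + 1) + wx * px(x0 + 1, y0 + 1)
    return (1 - wy) * top + wy * bottom


def warp(image, flow):
    """Backward warp: output[y][x] is the image sampled at flow[y][x] = (src_x, src_y)."""
    return [[bilinear_sample(image, sx, sy) for (sx, sy) in row] for row in flow]
```

Here `flow` plays the role of the transformation flow: for each target pixel it stores the source coordinate to read from, so an identity flow returns the image unchanged.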

Discriminator

Attentional Liquid Warping Block

Can handle multiple sources
- (motion imitation) multi-view inputs
- The different parts of features are aggregated into G_TSF by their transformation flow independently

- (original LWB) In overlap areas the feature magnitude grows, which causes artifacts
AttLWB
- First learns the similarities between each of the multiple source features and the global features
- Combines the multiple sources in feature space as a linear combination weighted by the learned similarities
- SPADE: the fused source features are used to denormalize the feature map of G_TSF → global stream
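
The difference between direct addition and attention-weighted fusion can be sketched at a single spatial location. This is only an illustration under stated assumptions: the similarity here is a scaled dot product followed by a softmax over the sources, whereas the paper's block learns the similarities and then applies SPADE. The names `lwb_add` and `attention_fuse` are hypothetical.

```python
import math


def lwb_add(global_feat, warped_sources):
    """LWB-style fusion: add every warped source feature to the global feature."""
    out = list(global_feat)
    for src in warped_sources:
        out = [o + s for o, s in zip(out, src)]
    return out


def attention_fuse(global_feat, warped_sources):
    """Attention-style fusion: softmax-weighted linear combination of the sources.

    The similarity is a scaled dot product with the global feature (an assumption;
    the real block uses learned projections).
    """
    scale = math.sqrt(len(global_feat))
    scores = [
        sum(g * s for g, s in zip(global_feat, src)) / scale
        for src in warped_sources
    ]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(sc - peak) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * src[c] for w, src in zip(weights, warped_sources))
        for c in range(len(global_feat))
    ]
    return fused, weights
```

With two identical sources that overlap at the same location, `lwb_add` grows the feature magnitude with every added source, while `attention_fuse` keeps it at the scale of a single source because the weights sum to 1.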

[Loss Function]

[Personalization]

one/few-shot adversarial learning to push the network to focus on each individual by several steps of fast personal adaptation
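Pretrain the generator and discriminator on a large dataset to obtain their initial parameters
For each person P_i with sn provided samples, fine-tune the pretrained model to produce a personalized model
- For the discriminator, the pretrained parameters are discarded and it is trained from scratch
- Only the global discriminator is used, to prevent overfitting and reduce the time required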

[Inference for motion imitation]

[Terminology]

ℝ : the set of real numbers
weak-perspective projection

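To the human eye, nearby objects look large and distant objects look small. As a result, a three-dimensional object appears to change shape depending on the direction from which we view it. Consider a cube. Seen head-on, a face is a square, but seen from the side or from above, its corners no longer appear as right angles and the faces look like parallelograms or trapezoids. This distortion comes from perspective: near things look large and far things look small. What matters is the rule that produces this appearance.

Parallel lines that recede from the viewer appear to converge toward a point on the horizon line, which lies at eye level. If you have ever stood on a railway track and followed the two parallel rails into the distance, you will have seen them appear to meet far away. The point at which they converge is called the vanishing point.

Projection means mapping a 3D object onto a 2D plane so that it can be shown on a screen. Perspective projection applies the rule above: nearby things are drawn large, distant things are drawn small, and the vanishing point is taken into account. Orthographic projection, in contrast, ignores the vanishing point, which plays the central role in perspective. Points are projected along parallel rays, so the apparent size does not depend on distance and the object is shown without perspective distortion.

Weak-perspective projection sits between orthographic and perspective projection: it is an orthographic projection followed by one uniform scale, which approximates perspective projection well when the depth variation within the object is small compared with its distance from the camera.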
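
The three projections can be compared with a minimal sketch. It assumes a pinhole camera at the origin looking along +z with focal length `f`, and it sets the weak-perspective scale to `f` divided by the mean depth of the points. The function names are hypothetical and the translation term is omitted for brevity.

```python
def perspective(points, f):
    """Perspective projection: each point is scaled by its own depth z."""
    return [(f * x / z, f * y / z) for x, y, z in points]


def orthographic(points):
    """Orthographic projection: drop z, so distance does not change size."""
    return [(x, y) for x, y, _ in points]


def weak_perspective(points, f):
    """Weak perspective: orthographic projection followed by one shared scale.

    The scale uses the mean depth of all points (an assumption for this sketch).
    """
    z_mean = sum(z for _, _, z in points) / len(points)
    s = f / z_mean
    return [(s * x, s * y) for x, y, _ in points]
```

For two points with the same (x, y) but depths 9 and 11, `perspective` yields two different image positions, while `weak_perspective` maps both to the same position because it replaces the per-point depth with a single average depth.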

On perspective projection: https://seo10000.tistory.com/93
Mesh → vertices & faces

autoencoder: https://deepinsight.tistory.com/126


bilinear sampling

CycleGAN

SPADE https://happy-jihye.github.io/gan/gan-9/


Total Variation Regularization
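
As a reminder of what this term measures, here is a minimal sketch of the anisotropic total variation of a single-channel image, that is, the sum of absolute differences between horizontally and vertically adjacent pixels. The function name `total_variation` and the list-of-rows layout are assumptions for the example, and the exact form and weighting used in the paper's loss may differ.

```python
def total_variation(image):
    """Anisotropic total variation of a single-channel image (list of rows)."""
    h, w = len(image), len(image[0])
    # Differences between each pixel and its right-hand neighbour.
    horizontal = sum(
        abs(image[y][x + 1] - image[y][x]) for y in range(h) for x in range(w - 1)
    )
    # Differences between each pixel and the neighbour below it.
    vertical = sum(
        abs(image[y + 1][x] - image[y][x]) for y in range(h - 1) for x in range(w)
    )
    return horizontal + vertical
```

A constant image has total variation 0, so penalizing this quantity pushes a map toward being spatially smooth.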
