[PyTorch Implementation] PointNet: Explanation and Code

복만 2022. 8. 12. 16:06

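This post covers the architecture of PointNet, a representative model for point cloud data, along with a PyTorch implementation.

PointNet can perform either classification or segmentation after feature extraction,

but this post only covers the classification network.

The code is adapted, with some modifications, from the GitHub repo below, which is the most-starred PyTorch implementation.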

https://github.com/fxia22/pointnet.pytorch/tree/f0c2430b0b1529e3f76fb5d6cd6ca14be763d975

PointNet

Paper link, Slide

PointNet extracts features directly from the point cloud, without first converting it into a voxel grid.

Feature extraction from point cloud data has to take the following three things into account.

1. The model must be invariant to the order of the input.

Inputs ordered as [pointA, pointB, pointC] and as [pointB, pointA, pointC] describe the same object, so the model must always produce the same result for both.
In other words, the model must be invariant to input permutation, and to achieve this it has to combine the per-point information with a symmetric function.
A symmetric function takes n vectors as input and outputs a new vector that does not depend on the input order. Addition and multiplication are examples.
PointNet defines and uses the following symmetric function.
- $f(\{x_1, \dots, x_n\}) = \gamma \circ g(h(x_1), \dots, h(x_n))$
  - The expression above is a symmetric function whenever $g$ is symmetric.
  - $h$: MLP (applied to each point)
  - $g$: max pooling
  - $\gamma$: MLP
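
The composition above can be checked numerically. The following is a minimal sketch, not the code from the referenced repo: the layer widths (16, 32, 4) and the names `h`, `gamma`, and `f` are chosen only for illustration. It applies the same MLP to every point, max-pools over the point axis, and confirms that shuffling the points leaves the output unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# h: per-point MLP (the same weights are applied to every point).
# The widths here are arbitrary illustration values, not the paper's.
h = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 32))
# gamma: MLP applied to the pooled global feature.
gamma = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))


def f(points: torch.Tensor) -> torch.Tensor:
    """points: (N, 3) -> (4,) feature that ignores the order of the N points."""
    per_point = h(points)                  # (N, 32)
    pooled = per_point.max(dim=0).values   # g: max pooling over points -> (32,)
    return gamma(pooled)


points = torch.randn(5, 3)
shuffled = points[torch.randperm(5)]       # same points, different order

out_original = f(points)
out_shuffled = f(shuffled)
same = torch.allclose(out_original, out_shuffled, atol=1e-6)
print(same)
```

Replacing the max with another symmetric operation such as a sum or a mean would keep the invariance. Replacing it with something order-dependent, such as flattening the per-point features, would break it.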

2. Combining local and global information

This item only applies to the segmentation network, so it is skipped here.

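3. The model must be invariant to geometric transformations.

Beyond input order, the model must also be invariant to linear transformations of the input (the paper's example is a rigid transformation).
To build a representation that is invariant to geometric transformations, the network learns the parameters of an affine transformation that maps the input into a canonical space.
- My understanding is that the canonical space is a reference space that stays unchanged under linear transformations of the input, and that the network learns the transformation that maps into it.
The network that learns these transformation parameters is called T-Net, and the mapping is carried out by multiplying the input features by the learned transformation matrix (matrix multiply).
- T-Net consists of a shared MLP, max pooling, and fc layers. It takes an NxC input and outputs a CxC transformation matrix.
- This matrix is multiplied with the input to apply the transformation.
This is applied twice: once to the input points and once to the intermediate features.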
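
A T-Net of this shape can be sketched as follows. This is a simplified illustration rather than the exact code of the referenced repo: the class name `TNet`, the layer widths (64, 128, 1024, 512, 256), the omission of batch normalization, and the choice to add the identity matrix to the output so that training starts near "no transformation" are all assumptions of this sketch. The input is laid out as (B, C, N) because `nn.Conv1d` expects channels before points.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TNet(nn.Module):
    """Predicts a k x k transformation matrix from a (B, k, N) input."""

    def __init__(self, k: int = 3):
        super().__init__()
        self.k = k
        # Shared MLP: a kernel-size-1 Conv1d applies the same weights to every point.
        self.conv1 = nn.Conv1d(k, 64, 1)
        self.conv2 = nn.Conv1d(64, 128, 1)
        self.conv3 = nn.Conv1d(128, 1024, 1)
        # Fully connected layers on the pooled global feature.
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, k * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size = x.size(0)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = torch.max(x, dim=2).values      # max pooling over points -> (B, 1024)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)                     # (B, k*k)
        # Assumption of this sketch: offset by the identity matrix.
        identity = torch.eye(self.k, device=x.device, dtype=x.dtype).flatten()
        return (x + identity).view(batch_size, self.k, self.k)


# Usage: align a batch of 2 point clouds with 100 xyz points each.
points = torch.randn(2, 3, 100)             # (B, C, N)
trans = TNet(k=3)(points)                   # (B, 3, 3)
# (B, N, C) @ (B, C, C) -> (B, N, C), then back to (B, C, N).
aligned = torch.bmm(points.transpose(1, 2), trans).transpose(1, 2)
print(trans.shape, aligned.shape)
```

The same module with `k=64` would serve for the intermediate features, since only the input channel count and the output matrix size change.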
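For the intermediate features, a 64x64 transformation matrix has to be predicted, and its dimension is much higher than that of the input transform, which makes optimization difficult. A regularization term is therefore added in this case, which pushes the predicted feature transformation matrix $A$ toward an orthogonal matrix.
- $L_{reg} = \|I - AA^T\|_F^2$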
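
The regularization term translates almost directly into code. The sketch below assumes a batch of predicted matrices of shape (B, k, k) and averages the squared Frobenius norm over the batch. The function name `feature_transform_regularizer` and the batch averaging are choices made for this illustration.

```python
import torch


def feature_transform_regularizer(trans: torch.Tensor) -> torch.Tensor:
    """trans: (B, k, k) predicted matrices A. Returns mean of ||I - A A^T||_F^2."""
    k = trans.size(1)
    identity = torch.eye(k, device=trans.device, dtype=trans.dtype).unsqueeze(0)
    diff = identity - torch.bmm(trans, trans.transpose(1, 2))   # I - A A^T
    # Squared Frobenius norm = sum of squared entries, per matrix.
    return diff.pow(2).sum(dim=(1, 2)).mean()


# An orthogonal matrix (here the identity) gives zero penalty.
loss_identity = feature_transform_regularizer(torch.eye(64).unsqueeze(0))
# A = 2I gives I - 4I = -3I, so the penalty for k = 3 is 3 * 9 = 27.
loss_scaled = feature_transform_regularizer(2.0 * torch.eye(3).unsqueeze(0))
print(loss_identity.item(), loss_scaled.item())
```

In training, this value would typically be multiplied by a small weight and added to the classification loss. The weight is a hyperparameter and is not specified in this section.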