Shuo Li presented a talk at the GVIL weekly seminar: PointNet, PointNet++, and PU-Net


Instead of using 3D convolutions, PointNet directly consumes point clouds, which respects the permutation invariance of the input points.

A point cloud is an unordered set of vectors. Each point P_i is a vector of its (x, y, z) coordinates plus extra feature channels such as color and surface normal.

For classification, the proposed deep network outputs k scores, one for each of the k candidate classes. For semantic segmentation, the model outputs n × m scores, one for each of the n points and each of the m semantic subcategories.

The input is a subset of points from a Euclidean space. It has three main properties:

Unordered. Unlike pixel arrays in images or voxel arrays in volumetric grids, a point cloud is a set of points without a specific order. In other words, a network that consumes a set of N 3D points needs to be invariant to the N! permutations of the input set in data feeding order.
Interaction among points. The points are from a space with a distance metric. It means that points are not isolated, and neighboring points form a meaningful subset. Therefore, the model needs to be able to capture local structures from nearby points, and the combinatorial interactions among local structures.
Invariance under transformations. As a geometric object, the learned representation of the point set should be invariant to certain transformations. For example, rotating and translating points all together should not modify the global point cloud category nor the segmentation of the points.

The max pooling layer works as a symmetric function to aggregate information from all the points.
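To illustrate why max pooling acts as a symmetric function, here is a minimal NumPy sketch. The sizes, the random weights, and the single linear layer standing in for the shared per-point MLP are all assumptions for illustration, not the actual PointNet architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 128 points with xyz coordinates, 64 feature channels.
n_points, in_dim, feat_dim = 128, 3, 64
points = rng.normal(size=(n_points, in_dim))

# One shared linear layer + ReLU stands in for the per-point MLP (assumption).
W = rng.normal(size=(in_dim, feat_dim))
b = rng.normal(size=feat_dim)

def global_feature(pts):
    """Apply the same weights to every point, then max-pool over the point axis."""
    per_point = np.maximum(pts @ W + b, 0.0)  # shape (n_points, feat_dim)
    return per_point.max(axis=0)              # shape (feat_dim,)

# Feeding the same points in a different order gives the same global feature.
shuffled = points[rng.permutation(n_points)]
same = np.allclose(global_feature(points), global_feature(shuffled))
print(same)  # True
```

Because the maximum is taken channel by channel over all points, the order of the rows has no effect on the result.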

Besides max pooling, the network has two other key modules: a structure that combines local and global information, and two joint alignment networks that align both input points and point features.


However, by design PointNet does not capture local structures induced by the metric space the points live in, which limits its ability to recognize fine-grained patterns and to generalize to complex scenes. PointNet++ introduces a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, the network is able to learn local features with increasing contextual scales.
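The hierarchical idea can be sketched roughly as follows. This is an illustration under stated assumptions: the helper names (`farthest_point_sampling`, `set_abstraction`), the choice of farthest point sampling for picking centroids, the fixed-radius grouping, the radii, and the single random linear layer used as a tiny PointNet are all hypothetical choices, not details given in the text above.

```python
import numpy as np

def farthest_point_sampling(xyz, n_centroids):
    """Pick n_centroids indices that are spread out over the point set."""
    chosen = [0]  # start from an arbitrary point (index 0 here)
    dist = np.linalg.norm(xyz - xyz[0], axis=1)
    for _ in range(n_centroids - 1):
        nxt = int(dist.argmax())  # point farthest from everything chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

def set_abstraction(xyz, feats, n_centroids, radius, W):
    """One level: sample centroids, group neighbors by distance, pool each group."""
    idx = farthest_point_sampling(xyz, n_centroids)
    pooled = []
    for c in xyz[idx]:
        mask = np.linalg.norm(xyz - c, axis=1) <= radius  # ball around the centroid
        # Centroid-relative coordinates plus any features from the previous level.
        local = np.concatenate([xyz[mask] - c, feats[mask]], axis=1)
        # Tiny PointNet: shared linear layer + ReLU, then max over the group.
        pooled.append(np.maximum(local @ W, 0.0).max(axis=0))
    return xyz[idx], np.stack(pooled)

rng = np.random.default_rng(0)
xyz = rng.uniform(size=(256, 3))
feats = np.zeros((256, 0))  # no extra input features at the first level

W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(3 + 16, 32))
# Two nested levels: fewer centroids and a larger radius at each step.
xyz1, f1 = set_abstraction(xyz, feats, n_centroids=64, radius=0.2, W=W1)
xyz2, f2 = set_abstraction(xyz1, f1, n_centroids=16, radius=0.4, W=W2)
print(xyz1.shape, f1.shape, xyz2.shape, f2.shape)
```

Each level reduces the number of points while widening the neighborhood each feature summarizes, which is one way to read "increasing contextual scales".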


The key idea of PU-Net is to learn multi-level features per point and expand the point set via a multi-branch convolution unit implicitly in feature space. The expanded feature is then split into a multitude of features, which are then reconstructed to an upsampled point set.
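The expand, split, and reconstruct steps can be traced through array shapes with a small NumPy sketch. All sizes, the upsampling rate of 4, the random weights, and the use of one linear layer per branch (a 1x1 convolution is a per-point linear map) are assumptions for illustration, not the actual PU-Net layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, feat_dim, branch_dim, rate = 100, 32, 16, 4  # hypothetical sizes

# Stand-in for the learned multi-level per-point features.
feats = rng.normal(size=(n_points, feat_dim))

# Multi-branch unit: each branch has its own weights.
branch_W = rng.normal(size=(rate, feat_dim, branch_dim))
expanded = np.concatenate(
    [np.maximum(feats @ Wb, 0.0) for Wb in branch_W], axis=1
)  # shape (n_points, rate * branch_dim)

# Split the expanded feature into `rate` separate features per input point.
split = expanded.reshape(n_points, rate, branch_dim)
split = split.reshape(n_points * rate, branch_dim)

# Reconstruct 3D coordinates from each feature with a shared linear layer.
W_xyz = rng.normal(size=(branch_dim, 3))
upsampled = split @ W_xyz
print(upsampled.shape)  # (400, 3)
```

The point set grows from 100 to 400 points without any operation on coordinates directly: the expansion happens on feature vectors, and coordinates are only produced at the end.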



Many of the top-ranked methods in the table below use multi-view projection and solve 3D point cloud classification in 2D space.

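| Algorithm | ModelNet40 Classification (Accuracy) | ModelNet40 Retrieval (mAP) | ModelNet10 Classification (Accuracy) | ModelNet10 Retrieval (mAP) |
|---|---|---|---|---|
| SO-Net [34] | 93.4% | | 95.7% | |
| Minto et al. [33] | 89.3% | | 93.6% | |
| RotationNet [32] | 97.37% | | 98.46% | |
| LonchaNet [31] | | | 94.37% | |
| Achlioptas et al. [30] | 84.5% | | 95.4% | |
| PANORAMA-ENN [29] | 95.56% | 86.34% | 96.85% | 93.28% |
| 3D-A-Nets [28] | 90.5% | 80.1% | | |
| Soltani et al. [27] | 82.10% | | | |
| Arvind et al. [26] | 86.50% | | | |
| LonchaNet [25] | | | 94.37% | |
| 3DmFV-Net [24] | 91.6% | | 95.2% | |
| Zanuttigh and Minto [23] | 87.8% | | 91.5% | |
| Wang et al. [22] | 93.8% | | | |
| ECC [21] | 83.2% | | 90.0% | |
| PANORAMA-NN [20] | 90.7% | 83.5% | 91.1% | 87.4% |
| MVCNN-MultiRes [19] | 91.4% | | | |
| FPNN [18] | 88.4% | | | |
| PointNet [17] | 89.2% | | | |
| Klokov and Lempitsky [16] | 91.8% | | 94.0% | |
| LightNet [15] | 88.93% | | 93.94% | |
| Xu and Todorovic [14] | 81.26% | | 88.00% | |
| Geometry Image [13] | 83.9% | 51.3% | 88.4% | 74.9% |
| Set-convolution [11] | 90% | | | |
| PointNet [12] | | | 77.6% | |
| 3D-GAN [10] | 83.3% | | 91.0% | |
| VRN Ensemble [9] | 95.54% | | 97.14% | |
| ORION [8] | | | 93.8% | |
| FusionNet [7] | 90.8% | | 93.11% | |
| Pairwise [6] | 90.7% | | 92.8% | |
| MVCNN [3] | 90.1% | 79.5% | | |
| GIFT [5] | 83.10% | 81.94% | 92.35% | 91.12% |
| VoxNet [2] | 83% | | 92% | |
| DeepPano [4] | 77.63% | 76.81% | 85.45% | 84.18% |
| 3DShapeNets [1] | 77% | 49.2% | 83.5% | 68.3% |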