Upside and downside of Spatial Pyramid Pooling

Spatial Pyramid Pooling (SPP) [1] is an excellent idea that removes the need to resize an image before feeding it to a neural network. In other words, it uses multi-level pooling to adapt to multiple image sizes while keeping the original features of the image. SPP is inspired by the classic spatial pyramid matching (bag-of-words) idea from computer vision.

In this note, I am going to show the math behind SPP, port it to TensorFlow, and analyze its upsides and downsides.

Spatial Pyramid Pooling

Inside the idea

Consider an n-level pooling pyramid with fixed output sizes \(a_1 \times a_1, a_2 \times a_2, ..., a_n \times a_n\), and an input image of size \(h \times w\). After some convolution and pooling layers, we obtain a feature map of size \(f_d \times f_h \times f_w\). Then, for each level \(i\), we apply one max-pooling pass over this feature map with stride \(\lfloor \frac{f_h}{a_i} \rfloor \times \lfloor \frac{f_w}{a_i} \rfloor\) and a window size chosen so that the pass yields exactly an \(a_i \times a_i\) grid of bins (in [1] the window is \(\lceil \frac{f_h}{a_i} \rceil \times \lceil \frac{f_w}{a_i} \rceil\)). It is easy to see that SPP does not touch the convolutional or fully connected parameters of the model: whatever the input size, the pyramid emits a fixed-length vector of \(f_d \sum_{i=1}^{n} a_i^2\) values.
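As a concrete illustration (the pyramid levels and channel count below are assumptions for the example, not taken from [1]), a 3-level pyramid with \(a_1 = 4, a_2 = 2, a_3 = 1\) on top of a feature map with \(f_d = 256\) channels gives

\[
f_d \sum_{i=1}^{3} a_i^2 = 256 \times (16 + 4 + 1) = 5376
\]

output values, whether the feature map is \(13 \times 13\) or, say, \(20 \times 15\).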

During training, we gather all images of the same size into a batch, train the parameters on that batch, and then transfer them to the next batch of a different size. The same scheme applies at test time.
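A minimal sketch of that batching scheme (plain NumPy, with made-up image sizes) might look like this:

from collections import defaultdict
import numpy as np

# Made-up dataset: images of two different sizes.
images = [np.random.rand(224, 224, 3),
          np.random.rand(180, 240, 3),
          np.random.rand(224, 224, 3)]

# Group images by spatial size so each group can be stacked into one batch.
groups = defaultdict(list)
for img in images:
    groups[img.shape[:2]].append(img)

# Train on one size group at a time; the shared network parameters
# simply carry over to the next group.
for size, group in groups.items():
    batch = np.stack(group)  # shape: [num_images_of_this_size, h, w, 3]
    print(size, batch.shape)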

TensorFlow port

import tensorflow as tf

def spatial_pyramid_pool(previous_conv, num_sample, previous_conv_size, out_pool_size):
    '''
    previous_conv: output tensor of the previous convolution layer,
                   shape [num_sample, height, width, channels]
    num_sample: number of images in the batch
    previous_conv_size: an int list [height, width], the spatial size of previous_conv
    out_pool_size: an int list of expected output sizes of the pooling levels

    returns: a tensor of shape [num_sample, n], the concatenation of all pooling levels
    '''
    for i in range(len(out_pool_size)):
        # Integer division so stride and window are ints (Python 3 '/' returns floats).
        h_strd = previous_conv_size[0] // out_pool_size[i]
        w_strd = previous_conv_size[1] // out_pool_size[i]
        # Window size chosen so that 'VALID' pooling yields exactly
        # out_pool_size[i] x out_pool_size[i] bins.
        h_wid = previous_conv_size[0] - h_strd * out_pool_size[i] + 1
        w_wid = previous_conv_size[1] - w_strd * out_pool_size[i] + 1
        max_pool = tf.nn.max_pool(previous_conv,
                                  ksize=[1, h_wid, w_wid, 1],
                                  strides=[1, h_strd, w_strd, 1],
                                  padding='VALID')
        if i == 0:
            spp = tf.reshape(max_pool, [num_sample, -1])
        else:
            spp = tf.concat(axis=1, values=[spp, tf.reshape(max_pool, [num_sample, -1])])

    return spp
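
A minimal usage sketch, continuing from the function above and assuming TensorFlow 1.x with a made-up 13 x 13 x 256 conv5 feature map for a batch of 8 same-size images:

batch_size = 8
# Hypothetical output of the last convolution layer (e.g. conv5 of AlexNet).
conv5 = tf.placeholder(tf.float32, [batch_size, 13, 13, 256])

# 3-level pyramid with 4x4, 2x2 and 1x1 bins.
spp = spatial_pyramid_pool(conv5, batch_size, [13, 13], [4, 2, 1])

# The result always has 256 * (16 + 4 + 1) = 5376 features per image,
# so the following fully connected layer can keep a fixed weight shape.
print(spp.get_shape())  # (8, 5376)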

You can see the full code, including an example with SPP on top of AlexNet, here.

Upsides and downsides

  • Creative idea: there is no need to resize the image, and the original features of the image are preserved.
  • It takes time to gather all images of the same size into a batch.
  • It does not work well on tasks that require fine detail, e.g., plant identification or rice identification, but it works well on object detection.
  • Sometimes the loss does not converge when transferring parameters between batches; this may be due to insufficient data or to the difficulty of the problem.

Conclusions

I have just analyzed some of the ideas behind SPP. SPP is a beautiful idea that brings a classic computer vision technique into modern neural networks. Phew, hope you enjoyed it. :D

References

[1] K. He, X. Zhang, S. Ren, and J. Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Written on June 30, 2017