Upside and downside of Spatial Pyramid Pooling

Spatial Pyramid Pooling (SPP) [1] is an excellent idea that removes the need to resize an image before feeding it to a neural network. In other words, it uses multi-level pooling to adapt to multiple image sizes while keeping the original features of the image. SPP is inspired by the classic spatial pyramid matching (bag-of-words) idea from computer vision.

In this note, I am going to show the math behind SPP, port it to TensorFlow, and analyze its upsides and downsides.

Spatial Pyramid Pooling

Inside the idea

Consider an n-level pooling pyramid with fixed output sizes \(a_1 \times a_1, a_2 \times a_2, ..., a_n \times a_n\), and an input image of size \(h \times w\). After some convolution and pooling layers, we obtain a feature map of size \(f_d \times f_h \times f_w\). Then, for each level \(i\), we apply one max-pooling pass over this feature map with stride \(\lfloor \frac{f_h}{a_i} \rfloor \times \lfloor \frac{f_w}{a_i} \rfloor\) and a window size chosen so that the pass yields exactly an \(a_i \times a_i\) grid of bins (in [1] the window is \(\lceil \frac{f_h}{a_i} \rceil \times \lceil \frac{f_w}{a_i} \rceil\)). It is easy to see that SPP does not touch the convolutional or fully connected parameters of the model: whatever the input size, the pyramid emits a fixed-length vector of \(f_d \sum_{i=1}^{n} a_i^2\) values.
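As a concrete illustration (the pyramid levels and channel count below are assumptions for the example, not taken from [1]), a 3-level pyramid with \(a_1 = 4, a_2 = 2, a_3 = 1\) on top of a feature map with \(f_d = 256\) channels gives

\[
f_d \sum_{i=1}^{3} a_i^2 = 256 \times (16 + 4 + 1) = 5376
\]

output values, whether the feature map is \(13 \times 13\) or, say, \(20 \times 15\).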

During training, we gather all images of the same size into a batch, train the parameters on that batch, and then transfer them to the next batch of a different size. The same scheme applies at test time.
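A minimal sketch of that batching scheme (plain NumPy, with made-up image sizes) might look like this:

from collections import defaultdict
import numpy as np

# Made-up dataset: images of two different sizes.
images = [np.random.rand(224, 224, 3),
          np.random.rand(180, 240, 3),
          np.random.rand(224, 224, 3)]

# Group images by spatial size so each group can be stacked into one batch.
groups = defaultdict(list)
for img in images:
    groups[img.shape[:2]].append(img)

# Train on one size group at a time; the shared network parameters
# simply carry over to the next group.
for size, group in groups.items():
    batch = np.stack(group)  # shape: [num_images_of_this_size, h, w, 3]
    print(size, batch.shape)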

TensorFlow port

import tensorflow as tf

def spatial_pyramid_pool(previous_conv, num_sample, previous_conv_size, out_pool_size):
    '''
    previous_conv: output tensor of the previous convolution layer,
                   shape [num_sample, height, width, channels]
    num_sample: number of images in the batch
    previous_conv_size: an int list [height, width], the spatial size of previous_conv
    out_pool_size: an int list of expected output sizes of the pooling levels

    returns: a tensor of shape [num_sample, n], the concatenation of all pooling levels
    '''
    for i in range(len(out_pool_size)):
        # Integer division so stride and window are ints (Python 3 '/' returns floats).
        h_strd = previous_conv_size[0] // out_pool_size[i]
        w_strd = previous_conv_size[1] // out_pool_size[i]
        # Window size chosen so that 'VALID' pooling yields exactly
        # out_pool_size[i] x out_pool_size[i] bins.
        h_wid = previous_conv_size[0] - h_strd * out_pool_size[i] + 1
        w_wid = previous_conv_size[1] - w_strd * out_pool_size[i] + 1
        max_pool = tf.nn.max_pool(previous_conv,
                                  ksize=[1, h_wid, w_wid, 1],
                                  strides=[1, h_strd, w_strd, 1],
                                  padding='VALID')
        if i == 0:
            spp = tf.reshape(max_pool, [num_sample, -1])
        else:
            spp = tf.concat(axis=1, values=[spp, tf.reshape(max_pool, [num_sample, -1])])

    return spp
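
A minimal usage sketch, continuing from the function above and assuming TensorFlow 1.x with a made-up 13 x 13 x 256 conv5 feature map for a batch of 8 same-size images:

batch_size = 8
# Hypothetical output of the last convolution layer (e.g. conv5 of AlexNet).
conv5 = tf.placeholder(tf.float32, [batch_size, 13, 13, 256])

# 3-level pyramid with 4x4, 2x2 and 1x1 bins.
spp = spatial_pyramid_pool(conv5, batch_size, [13, 13], [4, 2, 1])

# The result always has 256 * (16 + 4 + 1) = 5376 features per image,
# so the following fully connected layer can keep a fixed weight shape.
print(spp.get_shape())  # (8, 5376)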

You can see the full code, including an example with SPP on top of AlexNet, here.

Upsides and downsides

  • Creative idea: there is no need to resize the image, and the original features of the image are preserved.
  • It takes time to gather all images of the same size into a batch.
  • It does not work well on tasks that require fine detail, e.g., plant identification or rice identification, but it works well on object detection.
  • Sometimes the loss does not converge when transferring parameters between batches; this may be due to insufficient data or to the difficulty of the problem.

Conclusions

I have just analyzed some of the ideas behind SPP. SPP is a beautiful idea that brings a classic computer vision technique into modern neural networks. Phew, hope you enjoyed it. :D

References

[1] K. He, X. Zhang, S. Ren, and J. Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Written on June 30, 2017