CloudCV: Object Detection

Benchmarks and challenges like PASCAL VOC and ImageNet have played a crucial role in advancing computer vision algorithms. However, with minor exceptions, such challenges also result in massive duplication of effort, with each research group developing its own infrastructure and code-base. In fact, warnings about fragmentation and lack of code reuse have repeatedly been among the top observations of forward-looking NSF-funded workshops [1,2].

CloudCV can help reduce this fragmentation by serving as a unified data and code repository.


ImageNet Detection

  1. Deformable Parts Model (DPM) [19] models (without bounding-box prediction or context) for all 200 object categories in the ILSVRC2013 Detection Challenge, trained on the 'train' set
    ImageNet-Train DPM Models
    ImageNet-Train DPM Models with bounding box prediction

  2. DPM models (without bounding-box prediction or context) for all 200 object categories in the ILSVRC2013 Detection Challenge, trained on the 'trainval' set
    ImageNet-TrainVal DPM Models
    ImageNet-TrainVal DPM Models with bounding box prediction

    File containing the detection AP (average precision) values shown in the figures above

  3. Comparison between models trained on the PASCAL 2010 'trainval' set and tested on the PASCAL 2010 'test' set, and models trained on the ILSVRC2013 Detection Challenge 'train' set and tested on its 'val' set, for the 16 common categories.
  4. Top 100 most confident and top 100 least confident detections of each category, for models trained on the 'train' set and tested on the validation set.
  5. Comparison between models trained on the PASCAL 2010 'trainval' set and tested on the PASCAL 2010 'test' set, and models trained on the ILSVRC2013 Detection Challenge 'trainval' set and tested on its 'test' set, for the 16 common categories.

Features for ILSVRC2014 classification and localization tasks

We provide cached versions of 16 popular features for all 1.2 million images in the ILSVRC2014 classification and localization dataset. (This dataset has remained unchanged from ILSVRC2012 and ILSVRC2013.) Other features and datasets coming soon.

  1. Decaf
    Deep Convolutional Activation Features of [20]. The detailed network structure can be found on the Decaf homepage.
    Train Features: Download | Size = 43.2GB
    Test Features: Download | Size = 3.4GB
    Val Features: Download | Size = 1.7GB

  2. Decaf (With Center Only Option)
    Deep Convolutional Activation Features of [20]. The center only option was used. The detailed network structure can be found on the Decaf homepage.
    Train Features: Download | Size = 5.1GB
    Test Features: Download | Size = 340MB
    Val Features: Download | Size = 170MB

  3. GIST (gist):
    The GIST descriptor [4] computes the output energy of a bank of 24 filters. The filters are Gabor-like filters tuned to 8 orientations at 4 different scales. The square output of each filter is then averaged on a 4x4 grid.
    Train Features: Download | Size = 2.8GB
    Test Features: Download | Size = 222MB
    Val Features: Download | Size = 111MB
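The GIST computation described above — filter the image with a bank of Gabor-like filters, square the outputs, and average the energy on a 4x4 grid — can be sketched in a few lines of numpy. This is an illustrative toy version, not the code used to produce the cached features; the filter parameters (centre frequencies, Gaussian envelopes) are assumptions.

```python
import numpy as np

def gabor_bank(size=64, scales=4, orientations=8):
    """Build a simplified bank of Gabor-like filters (one per scale/orientation)."""
    ys, xs = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    filters = []
    for s in range(scales):
        f0 = 0.25 / (2 ** s)                      # centre frequency halves per scale
        for o in range(orientations):
            theta = np.pi * o / orientations
            u = xs * np.cos(theta) + ys * np.sin(theta)
            v = -xs * np.sin(theta) + ys * np.cos(theta)
            g = np.exp(-(u ** 2 + v ** 2) / (2 * (0.5 / f0) ** 2)) \
                * np.cos(2 * np.pi * f0 * u)
            filters.append(g)
    return filters

def gist_descriptor(image, grid=4):
    """Average squared filter outputs over a grid x grid spatial layout."""
    h, w = image.shape
    F = np.fft.fft2(image)
    desc = []
    for filt in gabor_bank(size=h):
        # convolve in the frequency domain, then take the squared magnitude
        energy = np.abs(np.fft.ifft2(F * np.fft.fft2(np.fft.ifftshift(filt)))) ** 2
        cells = energy.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
        desc.extend(cells.ravel())
    return np.array(desc)

img = np.random.rand(64, 64)
d = gist_descriptor(img)
print(d.shape)  # (512,) = 32 filters x 16 grid cells
```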

  4. GIST padding (gistPadding):
    This is the same as the gist feature, except that the borders of the input image are padded to avoid border artifacts.
    Train Features: Download | Size = 2.8GB
    Test Features: Download | Size = 223MB
    Val Features: Download | Size = 112MB

  5. HOG2x2 (hog2x2):
    First, histograms of oriented gradients (HOG) descriptors [5] are densely extracted on a regular grid at steps of 8 pixels. HOG features are computed using the publicly available code of [6], which gives a 31-dimensional descriptor for each node of the grid. Then, 2x2 neighboring HOG descriptors are stacked together to form a descriptor with 124 dimensions.
    Train Features: Download | Size = 4.5GB
    Test Features: Download | Size = 378MB
    Val Features: Download | Size = 189MB
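The 2x2 stacking step is easy to illustrate. The sketch below assumes the per-cell 31-dimensional HOG descriptors have already been computed (random placeholders stand in for them) and only shows how four neighboring cells are concatenated into one 124-dimensional descriptor.

```python
import numpy as np

def stack_hog_2x2(hog_grid):
    """Stack each 2x2 neighbourhood of HOG cells into one descriptor.

    hog_grid: (rows, cols, 31) array of per-cell HOG descriptors.
    Returns a (rows-1, cols-1, 124) array of stacked descriptors.
    """
    return np.concatenate([
        hog_grid[:-1, :-1],   # top-left cell
        hog_grid[:-1, 1:],    # top-right cell
        hog_grid[1:, :-1],    # bottom-left cell
        hog_grid[1:, 1:],     # bottom-right cell
    ], axis=-1)

grid = np.random.rand(10, 12, 31)   # placeholder for densely extracted HOG cells
desc = stack_hog_2x2(grid)
print(desc.shape)  # (9, 11, 124)
```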

  6. Dense SIFT (denseSIFT):
    SIFT descriptors are densely extracted [7] using a flat window at two scales (4 and 8 pixel radii) on a regular grid at steps of 5 pixels. The descriptors for the three HSV color channels are stacked together and quantized into 300 visual words by k-means.
    Train Features: Download | Size = 6.6GB
    Test Features: Download | Size = 568MB
    Val Features: Download | Size = 284MB
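The final quantization step — hard-assigning each descriptor to its nearest of the 300 k-means visual words and building a normalized histogram — can be sketched as follows. The codebook and descriptors below are random placeholders; real code would use the trained k-means centres.

```python
import numpy as np

def quantize_to_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword, then histogram."""
    # squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2ab
    d2 = ((descriptors ** 2).sum(1)[:, None]
          + (codebook ** 2).sum(1)[None, :]
          - 2.0 * descriptors @ codebook.T)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()          # L1-normalised bag-of-words histogram

rng = np.random.default_rng(0)
codebook = rng.random((300, 128))     # stand-in for the 300 k-means centres
descs = rng.random((1000, 128))       # stand-in for densely extracted SIFTs
h = quantize_to_histogram(descs, codebook)
print(h.shape)  # (300,)
```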

  7. LBP (lbp):
    Local Binary Patterns (LBP) [8] is a texture feature based on the occurrence histogram of local binary patterns.
    Train Features: Download | Size = 2.6GB
    Test Features: Download | Size = 211MB
    Val Features: Download | Size = 106MB
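A minimal numpy version of the basic 8-neighbour LBP code (without the multiresolution and rotation-invariance refinements of [8]) looks like this:

```python
import numpy as np

def lbp_histogram(image):
    """Basic 8-neighbour LBP: compare each neighbour to the centre pixel,
    pack the 8 comparison bits into a code, and histogram the codes."""
    c = image[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        neigh = image[1 + dy:image.shape[0] - 1 + dy,
                      1 + dx:image.shape[1] - 1 + dx]
        code |= (neigh >= c).astype(np.int32) << bit
    return np.bincount(code.ravel(), minlength=256)

img = (np.random.rand(32, 32) * 255).astype(np.uint8)
h = lbp_histogram(img)
print(h.shape, h.sum())  # (256,) 900 -- one code per 30x30 interior pixel
```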

  8. Rotation invariant LBP (lbphf):
    This feature is the rotation-invariant extension [9] of LBP.
    Train Features: Download | Size = 7.2GB
    Test Features: Download | Size = 581MB
    Val Features: Download | Size = 291MB

  9. Sparse SIFT histograms (sparse_sift):
    As in “Video Google” [10], SIFT features are built at Hessian-affine and MSER [11] interest points. Each set of SIFTs is clustered, independently, into dictionaries of 1,000 visual words using k-means. An image is represented by two 1,000-dimensional histograms where each SIFT is soft-assigned, as in [12], to its nearest cluster centers.
    Train Features: Download | Size = 6.7GB
    Test Features: Download | Size = 551MB
    Val Features: Download | Size = 276MB
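The soft-assignment step can be sketched as below. The Gaussian weighting, `k`, and `sigma` are illustrative assumptions; the exact kernel used by [12] may differ.

```python
import numpy as np

def soft_assign_histogram(descriptors, codebook, k=4, sigma=1.0):
    """Soft-assign each descriptor to its k nearest codewords with Gaussian
    weights (illustrative version of the scheme in [12])."""
    d2 = ((descriptors ** 2).sum(1)[:, None]
          + (codebook ** 2).sum(1)[None, :]
          - 2.0 * descriptors @ codebook.T)
    hist = np.zeros(len(codebook))
    nearest = np.argsort(d2, axis=1)[:, :k]
    for i, idx in enumerate(nearest):
        # subtract the minimum distance for numerical stability
        w = np.exp(-(d2[i, idx] - d2[i, idx].min()) / (2 * sigma ** 2))
        hist[idx] += w / w.sum()
    return hist / hist.sum()

rng = np.random.default_rng(1)
codebook = rng.random((1000, 128))    # stand-in for a 1,000-word dictionary
descs = rng.random((200, 128))
h = soft_assign_histogram(descs, codebook)
print(h.shape)  # (1000,)
```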

  10. SSIM (ssim):
    Self-similarity descriptors [13] are computed on a regular grid at steps of five pixels. Each descriptor is obtained by computing the correlation map of a 5x5 patch in a window with radius equal to 40 pixels, then quantizing it into 3 radial bins and 10 angular bins, obtaining 30-dimensional descriptor vectors. The descriptors are then quantized into 300 visual words by k-means.
    Train Features: Download | Size = 6.1GB
    Test Features: Download | Size = 530MB
    Val Features: Download | Size = 265MB
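A toy version of this descriptor: correlate the central 5x5 patch against every patch in the radius-40 window, then pool the correlation surface into 3 radial x 10 angular bins, giving a 30-dimensional vector. The exponential SSD-to-correlation mapping and its bandwidth are assumptions, not the exact scheme of [13].

```python
import numpy as np

def ssim_descriptor(window, patch_size=5, n_rad=3, n_ang=10):
    """Self-similarity sketch: correlation map of the centre patch, max-pooled
    into radial x angular bins."""
    r = window.shape[0] // 2
    p = patch_size // 2
    centre = window[r - p:r + p + 1, r - p:r + p + 1]
    corr = np.zeros_like(window)
    for y in range(p, window.shape[0] - p):
        for x in range(p, window.shape[1] - p):
            patch = window[y - p:y + p + 1, x - p:x + p + 1]
            # map SSD to a (0, 1] correlation value (bandwidth is an assumption)
            corr[y, x] = np.exp(-((patch - centre) ** 2).sum() / (2 * 50.0 ** 2))
    ys, xs = np.mgrid[:window.shape[0], :window.shape[1]]
    rad = np.hypot(ys - r, xs - r)
    ang = np.arctan2(ys - r, xs - r) % (2 * np.pi)
    rbin = np.minimum((rad / (r + 1) * n_rad).astype(int), n_rad - 1)
    abin = np.minimum((ang / (2 * np.pi) * n_ang).astype(int), n_ang - 1)
    desc = np.zeros(n_rad * n_ang)
    for b in range(n_rad * n_ang):
        desc[b] = corr[(rbin * n_ang + abin) == b].max()   # max-pool per bin
    return desc

win = np.random.rand(81, 81) * 255   # radius-40 window around the centre pixel
d = ssim_descriptor(win)
print(d.shape)  # (30,)
```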

  11. Tiny Images (tiny_image):
    This descriptor matches images by comparing them directly in color image space by reducing their dimensions drastically as described in [14].
    Train Features: Download | Size = 13GB
    Test Features: Download | Size = 1.1GB
    Val Features: Download | Size = 523MB
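The tiny-image descriptor is essentially aggressive downsampling. A block-averaging sketch, using the 32x32 canonical size from [14]:

```python
import numpy as np

def tiny_image(image, size=32):
    """Downsample a colour image to size x size by block averaging and
    flatten it into a single vector."""
    h, w, c = image.shape
    # crop so the dimensions divide evenly, then average each block
    image = image[: h - h % size, : w - w % size]
    bh, bw = image.shape[0] // size, image.shape[1] // size
    tiny = image.reshape(size, bh, size, bw, c).mean(axis=(1, 3))
    return tiny.ravel()

img = np.random.rand(240, 320, 3)
d = tiny_image(img)
print(d.shape)  # (3072,) = 32 x 32 x 3
```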

  12. Line Features (line_hists):
    Straight lines are detected from Canny edges using the method described in Video Compass [15]. For each image two histograms are built based on the statistics of detected lines -- one with bins corresponding to line angles and one with bins corresponding to line lengths.
    Train Features: Download | Size = 1GB
    Test Features: Download | Size = 85MB
    Val Features: Download | Size = 42MB
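Once line segments have been detected, the two histograms are straightforward to build. The bin counts and maximum length below are assumptions, since the page does not specify them.

```python
import numpy as np

def line_histograms(lines, n_angle=8, n_length=8, max_len=100.0):
    """Build one histogram over line angles and one over line lengths.

    lines: (N, 4) array of segments as (x1, y1, x2, y2).
    """
    dx = lines[:, 2] - lines[:, 0]
    dy = lines[:, 3] - lines[:, 1]
    angles = np.arctan2(dy, dx) % np.pi           # undirected lines: [0, pi)
    lengths = np.hypot(dx, dy)
    h_angle, _ = np.histogram(angles, bins=n_angle, range=(0, np.pi))
    h_len, _ = np.histogram(np.clip(lengths, 0, max_len),
                            bins=n_length, range=(0, max_len))
    return h_angle, h_len

segs = np.array([[0, 0, 10, 0], [0, 0, 0, 50], [5, 5, 25, 25]], float)
ha, hl = line_histograms(segs)
print(ha.sum(), hl.sum())  # 3 3
```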

  13. Texton Histograms (texton):
    A 512 entry universal texton dictionary [16] is built by clustering responses to a bank of filters with 8 orientations, 2 scales, and 2 elongations. For each image a 512-dimensional histogram is then built by assigning each pixel’s set of filter responses to the nearest texton dictionary entry.
    Train Features: Download | Size = 18GB
    Test Features: Download | Size = 1.5GB
    Val Features: Download | Size = 761MB

  14. Color Histograms (geo_color):
    Joint histograms of color in CIE L*a*b* color space are built for each image. These histograms have 4, 14, and 14 bins in L, a, and b respectively for a total of 784 dimensions.
    Train Features: Download | Size = 4.3GB
    Test Features: Download | Size = 354MB
    Val Features: Download | Size = 177MB
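The 4 x 14 x 14 joint histogram can be built directly with `np.histogramdd`. The channel ranges below are assumptions, and real code would first convert pixels from RGB to CIE L*a*b*.

```python
import numpy as np

def lab_joint_histogram(lab_pixels):
    """Joint colour histogram in CIE L*a*b* with 4 x 14 x 14 bins (784 dims),
    matching the layout described above."""
    hist, _ = np.histogramdd(
        lab_pixels,
        bins=(4, 14, 14),
        range=((0, 100), (-110, 110), (-110, 110)),  # assumed channel ranges
    )
    return hist.ravel() / lab_pixels.shape[0]

rng = np.random.default_rng(2)
# placeholder pixels: real code would convert RGB to L*a*b* first
pixels = np.column_stack([rng.uniform(0, 100, 500),
                          rng.uniform(-110, 110, 500),
                          rng.uniform(-110, 110, 500)])
h = lab_joint_histogram(pixels)
print(h.shape)  # (784,)
```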

  15. Geometric Probability Map (geo_map):
    The geometric class probabilities are computed for image regions using the method of Hoiem et al. [17]. Only the ground, vertical, porous, and sky classes are used because they are more reliably classified. The probability maps for each class are reduced to 8x8.
    Train Features: Download | Size = 1.2GB
    Test Features: Download | Size = 100MB
    Val Features: Download | Size = 50MB

  16. Geometry Specific Histograms (geo_texton):
    Inspired by “Illumination Context” [18], color and texton histograms are built for each geometric class (ground, vertical, porous, and sky). Specifically, for each color and texture sample, its contribution to each histogram is weighted by the probability that it belongs to that geometric class.
    Train Features: Download | Size = 9.2GB
    Test Features: Download | Size = 745MB
    Val Features: Download | Size = 373MB
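The probability-weighted accumulation above can be sketched as: each sample adds its class probability (rather than a unit count) to the corresponding histogram bin, once per geometric class. The texton indices and probabilities below are random stand-ins.

```python
import numpy as np

def geometry_weighted_histograms(words, probs, n_words=512):
    """For each geometric class, build a texton histogram where each sample's
    contribution is weighted by its class probability.

    words: (N,) texton index per sample.
    probs: (N, 4) per-sample probabilities for ground/vertical/porous/sky.
    """
    hists = np.zeros((probs.shape[1], n_words))
    for c in range(probs.shape[1]):
        # accumulate the class-c probability of each sample into its word's bin
        np.add.at(hists[c], words, probs[:, c])
    return hists

rng = np.random.default_rng(3)
words = rng.integers(0, 512, 1000)
probs = rng.random((1000, 4))
probs /= probs.sum(axis=1, keepdims=True)   # rows sum to 1 over the 4 classes
hists = geometry_weighted_histograms(words, probs)
print(hists.shape)  # (4, 512)
```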


References

We thank the Decaf team and Jianxiong Xiao for making their code publicly available [3,20].

[1] S. Negahdaripour and A.K. Jain. Challenges in computer vision research; future directions of research, 1991.
[2] A. Yuille and A. Oliva. Frontiers in computer vision. 2011.
[3] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN Database: Large-scale Scene Recognition from Abbey to Zoo. CVPR, 2010.
[4] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 2001.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.
[6] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2010.
[7] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006.
[8] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI, 2002.
[9] T. Ahonen, J. Matas, C. He, and M. Pietikäinen. Rotation invariant image description with local binary pattern histogram fourier features. SCIA, 2009.
[10] J. Sivic and A. Zisserman. Video data mining using configurations of viewpoint invariant regions. CVPR, 2004.
[11] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 2004.
[12] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. CVPR, 2008.
[13] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. CVPR, 2007.
[14] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large database for non-parametric object and scene recognition. PAMI, 2008.
[15] J. Kosecka and W. Zhang. Video compass. ECCV, 2002.
[16] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. ICCV, 2001.
[17] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 2007.
[18] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art. SIGGRAPH, 2007.
[19] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI, 2010.
[20] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv preprint arXiv:1310.1531, 2013.