This paper presents a method for incorporating spatial information into the bag-of-words (BOW) framework for object categorization. The main idea is to account for the local spatial distribution of the visual words. Rather than searching for rigid local patterns, we treat the visual words in close spatial proximity as a pouch of words and represent the image as a bag of word-pouches. To this end, sub-windows are extracted from the images and each is characterized by a local bag of words. A clustering step is then applied in the local bag-of-words space to construct the word-pouches. We show that this representation is complementary to the classical BOW, so the concatenation of the two representations is used as the final descriptor. Experiments are conducted on two well-known image datasets.
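
The pipeline described above (sub-windows, local bags of words, clustering into word-pouches, concatenation with the classical BOW) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the clustering algorithm, the window size, the normalization, or how sub-windows are assigned to pouches, so the following are all assumptions made for illustration: a dense grid of visual-word indices per image, sliding square sub-windows, L1-normalized histograms, plain k-means, and hard nearest-centre assignment. The function names `local_bows`, `nearest`, `kmeans`, and `describe` are hypothetical.

```python
import random
from collections import Counter

def local_bows(word_map, vocab_size, win=4, stride=2):
    """Return one L1-normalised word histogram per sliding sub-window."""
    rows, cols = len(word_map), len(word_map[0])
    hists = []
    for y in range(0, rows - win + 1, stride):
        for x in range(0, cols - win + 1, stride):
            counts = Counter(word_map[r][c]
                             for r in range(y, y + win)
                             for c in range(x, x + win))
            hists.append([counts[v] / (win * win) for v in range(vocab_size)])
    return hists

def nearest(hist, centers):
    """Index of the centre with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(hist, c)) for c in centers]
    return dists.index(min(dists))

def kmeans(data, k, iters=15, seed=0):
    """Plain Lloyd's k-means (assumed); the centres act as word-pouches."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for point in data:
            groups[nearest(point, centers)].append(point)
        for j, members in enumerate(groups):
            if members:  # keep the old centre if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def describe(word_map, vocab_size, pouches, win=4, stride=2):
    """Concatenate the global BOW with the bag of word-pouches."""
    flat = [w for row in word_map for w in row]
    counts = Counter(flat)
    global_bow = [counts[v] / len(flat) for v in range(vocab_size)]
    local = local_bows(word_map, vocab_size, win, stride)
    votes = Counter(nearest(h, pouches) for h in local)
    pouch_bow = [votes[j] / len(local) for j in range(len(pouches))]
    return global_bow + pouch_bow

# Toy usage on random word maps (synthetic data, for illustration only).
rng = random.Random(1)
VOCAB, POUCHES = 6, 3
images = [[[rng.randrange(VOCAB) for _ in range(8)] for _ in range(8)]
          for _ in range(4)]
training = [h for img in images for h in local_bows(img, VOCAB)]
pouches = kmeans(training, POUCHES)
descriptor = describe(images[0], VOCAB, pouches)
print(len(descriptor))  # VOCAB + POUCHES entries
```

In this sketch the final descriptor has one entry per visual word plus one per word-pouch. Each half is normalized separately, which is one possible choice; the relative weighting of the two parts is not stated in the abstract.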