Words Matter: Scene Text for Image Classification and Retrieval



Text in natural images typically adds meaning to an object or scene. In particular, text specifies which business places serve drinks (e.g. cafe, tea house) or food (e.g. restaurant, pizzeria), and what kind of service is provided (e.g. massage, repair). The mere presence of text, its words and its meaning are closely related to the semantics of the object or scene. This paper exploits the textual content of images for fine-grained business place classification and logo retrieval. There are four main contributions. First, we show that the textual cues extracted by the proposed method are effective for both tasks: combining the proposed textual and visual cues outperforms visual-only classification and retrieval by a large margin. Second, to extract the textual cues, a generic and fully unsupervised word box proposal method is introduced. The method reaches state-of-the-art word detection recall with a limited number of proposals. Third, contrary to what is widely acknowledged in the text detection literature, we demonstrate that high recall in word detection is more important than a high f-score, at least for the two tasks considered in this work. Last, this paper provides a large annotated text detection data set with 10K images and 27,601 word boxes.



  • Most of the time, stores use text to indicate what type of food (pizzeria, diner), drink (tea, coffee) and service (dry cleaning, repair) they provide. This text information is helpful even for human observers in understanding what type of business place it is. For instance, images of two different business places (a pizzeria and a bakery) can have a very similar appearance, even though they are different types of business places.
  • Only the text information makes it possible to identify what types of business places these are. Moreover, text is also useful for identifying similar products (logos) such as Heineken, Foster and Carlsberg.
  • The common approach to text recognition in images is to detect text before it can be recognized. State-of-the-art word detection methods focus on obtaining a high f-score by balancing precision and recall.
  • Existing word detection methods usually follow a bottom-up approach: character candidates are computed by a connected component or sliding window approach.
  • Candidate character regions are further verified and combined to form word candidates. This is done using the geometric, structural and appearance properties of text, based on hand-crafted rules or learning schemes.


  • Unfortunately, there exists no single best method for detecting words with high recall, due to large variations in text style, size and orientation.
  • Existing systems rely on weak classifiers.
  • As a result, they obtain a poor f-score.


  • In this paper, we focus on the classification of different business places, e.g. bakery, cafe and bookstore. Various business places have subtle differences in visual appearance.
  • We exploit the recognized text in images for fine-grained classification of business places. Automatic recognition and indexing of business places will be useful in many practical scenarios.
  • We propose a multimodal approach which uses recognized text and visual cues for fine-grained classification and logo retrieval.
  • We propose to combine character candidates generated by different state-of-the-art detection methods. To obtain robustness against varying imaging conditions, we use color spaces with photometrically invariant properties, such as robustness against shadows, highlights and specular reflections.
  • The proposed method computes text lines and generates word box proposals based on the character candidates. The word box proposals are then used as input to a state-of-the-art word recognition method to yield textual cues. Finally, the textual cues are combined with visual cues for fine-grained classification and logo retrieval.


  • For instance, the method can be used to extract information from Google Street View images, and Google Maps can use that information to recommend bakeries or restaurants close to the user's location.
  • Instead of using the f-score, our aim is to obtain a high recall. A high recall is required because textual cues that are not detected will not be considered in the next (recognition) phase of the framework.
  • The proposed method reaches state-of-the-art results on both tasks. Second, to extract the word-level textual cues, a generic, efficient and fully unsupervised word proposal method is introduced. The proposed method reaches state-of-the-art word detection recall with a limited number of proposals. Third, contrary to what is widely acknowledged in the text detection literature, we experimentally show that high recall in word detection is more important than a high f-score, at least for both applications considered in this work.
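The recall-versus-f-score trade-off can be made concrete with the standard detection metrics. The sketch below (Python rather than the project's MATLAB, purely for illustration; the counts are hypothetical) shows how a proposal method that emits many boxes can reach very high recall while its f-score stays poor:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# A proposal generator emitting many boxes: almost every true word is
# covered (high recall), but most proposals are background (low precision).
p, r, f = precision_recall_f1(tp=95, fp=405, fn=5)
```

With these counts, recall is 0.95 while the f-score is only about 0.32; a detector tuned to maximize f-score would discard proposals that the recognition phase still needs.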



  1. Word Level Textual Cue Encoding
  2. Visual Cue Encoding
  3. Classification and Retrieval


Word Level Textual Cue Encoding:

It has the following steps:

  1. Image Acquisition
  2. Color Channel Generation
  3. Character Detection
  4. Word Proposal Generation & Word Recognition
Image Acquisition:

Images are acquired from the gallery.

Color Channel Generation:

In this stage, the RGB image is converted into an HSV image, and the hue, saturation and value (intensity) channels are extracted for further processing. The intensity channel in particular is used for character detection.
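Since the project itself is implemented in MATLAB (where rgb2hsv would be the natural call), the channel split can be sketched in Python with the standard-library colorsys module; the nested-list image format here is just an illustrative assumption:

```python
import colorsys

def hsv_channels(rgb_image):
    """Split an RGB image (nested lists of (r, g, b) tuples in [0, 1])
    into separate hue, saturation and value (intensity) channel images."""
    hue, sat, val = [], [], []
    for row in rgb_image:
        h_row, s_row, v_row = [], [], []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            h_row.append(h)
            s_row.append(s)
            v_row.append(v)
        hue.append(h_row)
        sat.append(s_row)
        val.append(v_row)
    return hue, sat, val

# The value channel of this 1x2 image is what feeds character detection.
_, _, v = hsv_channels([[(1.0, 0.0, 0.0), (0.2, 0.2, 0.2)]])
```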

Character Detection:

For character detection, two methods are combined: MSER region detection and text saliency map generation. The V channel is used for MSER region detection; however, MSER alone does not detect all text regions properly. Therefore a saliency map is also generated for text detection, from which the text saliency is finally extracted.
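Text saliency can be modelled in many ways; as a minimal stand-in for the saliency generation described above, the sketch below scores each pixel of the intensity channel by its absolute deviation from the local mean, so that high-contrast strokes light up. This is a pure-Python illustration, not the project's actual saliency model:

```python
def text_saliency(v, radius=1):
    """Local-contrast saliency: |pixel - mean of its neighbourhood|.
    Text strokes contrast strongly with their background, so they
    receive high saliency values."""
    h, w = len(v), len(v[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            patch = [v[yy][xx]
                     for yy in range(max(0, y - radius), min(h, y + radius + 1))
                     for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = abs(v[y][x] - sum(patch) / len(patch))
    return out
```

A bright stroke pixel on a dark background gets a much higher score than a pixel in a flat region, which is exactly the cue the word proposal stage needs.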

Word Proposal Generation & Word Recognition:

Word detection and recognition are done using morphological operations and optical character recognition. They involve the following stages:

Stage 1:

In this stage, text saliency image acquisition and word region detection are performed. First, a text saliency image is taken as input. The image is in RGB format and is converted to a grayscale image for further processing.
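The grayscale conversion (rgb2gray in the project's MATLAB toolchain) is a weighted sum of the color channels; a one-line Python sketch using the ITU-R BT.601 luma weights, which rgb2gray also applies:

```python
def rgb_to_gray(rgb_image):
    """Grayscale conversion with ITU-R BT.601 luma weights
    (the same weighting MATLAB's rgb2gray uses)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in rgb_image]

# White maps to 1.0, black to 0.0; green contributes most to luminance.
gray = rgb_to_gray([[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]])
```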

Stage 2:

In this stage, word extraction and word segmentation are performed. Morphological dilation and erosion operations are applied to fill holes. After the morphological operations, local thresholding is applied to convert the gray image into a binary image. To further enhance contrast, the pixel intensity values are scaled between 0 and 1. If unwanted gaps and holes are still present in the word region, region growing segmentation is performed to segment the characters from the word region.
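The hole-filling step can be sketched as morphological closing (dilation followed by erosion) with a 3x3 structuring element. This pure-Python illustration stands in for MATLAB's imdilate/imerode and is not the project's exact parameterization:

```python
def dilate(binary):
    """3x3 dilation: a pixel becomes 1 if any pixel in its
    neighbourhood (clipped at the border) is 1."""
    h, w = len(binary), len(binary[0])
    return [[1 if any(binary[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))) else 0
             for x in range(w)] for y in range(h)]

def erode(binary):
    """3x3 erosion: a pixel stays 1 only if its whole (clipped)
    neighbourhood is 1."""
    h, w = len(binary), len(binary[0])
    return [[1 if all(binary[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))) else 0
             for x in range(w)] for y in range(h)]

def fill_holes(binary):
    """Morphological closing: dilation then erosion fills small holes
    inside word regions while keeping the region's outline."""
    return erode(dilate(binary))

# A word region with a one-pixel hole in the middle is closed up.
closed = fill_holes([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
```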

Stage 3:

In this stage, word recognition is done using template matching. Each segmented character is matched against the character templates stored in a database, and the word is finally recognized.
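Template matching on binary character patches reduces to a pixel-agreement score. The tiny 3x3 templates below are hypothetical placeholders; only the matching logic mirrors the stage described above:

```python
def match_score(patch, template):
    """Fraction of pixels on which a binary character patch and a
    template of the same size agree."""
    total = len(patch) * len(patch[0])
    agree = sum(p == t
                for p_row, t_row in zip(patch, template)
                for p, t in zip(p_row, t_row))
    return agree / total

def recognize_char(patch, templates):
    """Return the label of the best-matching stored template."""
    return max(templates, key=lambda label: match_score(patch, templates[label]))

# Hypothetical 3x3 templates: a vertical bar 'I' and a ring 'O'.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "O": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
}
label = recognize_char([[0, 1, 0], [0, 1, 0], [0, 1, 0]], TEMPLATES)
```

Recognized characters are then concatenated in reading order to form the word.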

Visual Cue Encoding:

This stage extracts the visual features. The SURF feature descriptor is used for visual feature extraction, and the strongest keypoints are selected from the SURF detections.
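In MATLAB, detectSURFFeatures returns points that carry a detector response, and selectStrongest keeps the top ones. The selection step alone is easy to sketch in Python, with keypoints represented as hypothetical dicts:

```python
def select_strongest(keypoints, k):
    """Keep the k keypoints with the highest detector response,
    mirroring selectStrongest on MATLAB's SURFPoints."""
    return sorted(keypoints, key=lambda kp: kp["response"], reverse=True)[:k]

# Illustrative keypoints: location plus detector response strength.
points = [
    {"xy": (10, 12), "response": 0.8},
    {"xy": (40, 33), "response": 0.2},
    {"xy": (25, 70), "response": 0.5},
]
strongest = select_strongest(points, 2)
```

Keeping only the strongest keypoints discards unstable detections and keeps the visual descriptor compact.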

Classification and Retrieval:

Classification is performed over the recognized words and the visual features. Based on the recognized words and the features, classification and similar-image retrieval are carried out.
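One simple way to combine the two cues is late fusion: score each database image by a weighted sum of its textual and visual similarity to the query, then rank. The weight alpha and the precomputed similarity scores below are illustrative assumptions, not the project's learned combination:

```python
def fused_score(text_sim, visual_sim, alpha=0.5):
    """Late fusion of a textual and a visual similarity score;
    alpha is an assumed mixing weight."""
    return alpha * text_sim + (1.0 - alpha) * visual_sim

def rank_images(database, alpha=0.5):
    """Rank database entries (with precomputed similarities to the
    query) by their fused score, best match first."""
    return sorted(database,
                  key=lambda img: fused_score(img["text_sim"],
                                              img["visual_sim"], alpha),
                  reverse=True)

# Hypothetical database with precomputed per-cue similarities.
db = [
    {"name": "bakery_01", "text_sim": 0.9, "visual_sim": 0.4},
    {"name": "cafe_07", "text_sim": 0.2, "visual_sim": 0.8},
    {"name": "bookstore_03", "text_sim": 0.1, "visual_sim": 0.1},
]
ranking = rank_images(db)
```

Here the strong textual match lifts the bakery image above the visually closer cafe image, which is the behaviour the multimodal combination is after.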



  • System : Pentium Dual Core
  • Hard Disk : 120 GB
  • Monitor : 15'' LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB


  • Operating System : Windows 7
  • Coding Language : MATLAB
  • Tool : MATLAB R2013a


Sezer Karaoglu, Ran Tao, Theo Gevers and Arnold W. M. Smeulders, "Words Matter: Scene Text for Image Classification and Retrieval", IEEE Transactions on Multimedia, 2017.

