An Efficient and Accurate Scene Text Detector [EAST]

Many approaches have recently been proposed for extracting text from natural scenes. Some are conventional, while others are based on deep learning, and the deep learning approaches are generally more accurate than the conventional ones. This article explores one deep learning-based text detection algorithm: EAST (Efficient and Accurate Scene Text Detector).

EAST uses a single neural network to detect text in a natural scene and outputs each detection as a quadrilateral around the text. It combines a convolutional network with Non-Max Suppression (NMS): the convolutional network detects the text in the image, and NMS merges the many overlapping detections of the same text into a single box. The network is trained directly to predict the presence of text and its geometry in the natural scene, producing dense per-pixel predictions.



The general pipeline of this approach is as follows. First, an image is fed to a Fully Convolutional Network (FCN), which generates a pixel-level text score map and geometry maps, so that every pixel yields a dense box prediction. Two geometric shapes are available for text regions: a rotated box (RBOX) and a quadrangle (QUAD). A loss function is defined over both maps, i.e. the score map and the geometry map, and is used for training. At prediction time, thresholding is applied to the score map, and every region whose score is greater than the predefined threshold is passed on to Non-Max Suppression (NMS). The output of NMS is the final output.
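The thresholding step can be illustrated with a minimal sketch. It assumes the score map is a 2-D NumPy array of values in [0, 1]; the function name `select_candidates` and the threshold of 0.8 are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def select_candidates(score_map, score_threshold=0.8):
    """Return (row, col) indices of pixels whose text score exceeds the threshold.

    The threshold value is an assumption made for this sketch.
    """
    rows, cols = np.where(score_map > score_threshold)
    # np.where yields indices in row-major order, so the candidates
    # come out row by row.
    return np.stack([rows, cols], axis=1)

# Toy 2x3 score map standing in for the network output.
score_map = np.array([[0.10, 0.90, 0.95],
                      [0.20, 0.85, 0.30]])
candidates = select_candidates(score_map)
```

Only the geometries predicted at these pixel positions would then be handed to NMS.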

Neural Network Design

Several aspects must be taken into account when designing the neural network. The sizes of images and of text in natural scenes vary widely, and the geometry of the text cannot be generalized easily. To handle this, the idea of HyperNet, which merges feature maps from different levels, is used. Combining HyperNet with a U-shape gives a network that can utilize different levels of features at a small computation cost. The schematic view of this approach is shown below.
[Figure: schematic view of the EAST network, from the Research Paper [A]]

This model can be divided into three parts:
  • Feature Extracting Stem 
  • Feature Merging Branch  
  • Output layer

Feature Extracting Stem:

This stem can be a convolutional network pre-trained on the ImageNet dataset, with interleaved convolution and pooling layers. The authors of EAST used both PVANet and VGG16 in their experiments. For details, refer to the original research paper mentioned at the bottom of this article.

Feature Merging Branch:

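This branch merges the feature maps output by the different layers of the stem. Merging everything at once on large feature maps is computationally expensive, so a U-shape is used to merge them gradually. The branch produces the final feature map and sends it to the output layer.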
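A single merging step can be sketched as follows. This is only an illustration under stated assumptions: feature maps are NumPy arrays shaped (channels, height, width), nearest-neighbour repetition stands in for the unpooling operation, and the convolutions that would follow each concatenation are left out. The names `unpool2x` and `merge_step` are hypothetical.

```python
import numpy as np

def unpool2x(feature):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)

def merge_step(coarse, fine):
    """Upsample the coarser map and concatenate it with the finer one along channels."""
    up = unpool2x(coarse)
    assert up.shape[1:] == fine.shape[1:], "spatial sizes must match after unpooling"
    return np.concatenate([up, fine], axis=0)

coarse = np.zeros((8, 4, 4))  # deeper stage: more channels, smaller map
fine = np.ones((4, 8, 8))     # shallower stage: fewer channels, larger map
merged = merge_step(coarse, fine)
```

Repeating this step from the deepest stage to the shallowest one gives the gradual, U-shaped merge, with each step working on only two maps at a time.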

Output Layer:

The output layer produces a score map and a geometry map. The geometry map can be in either RBOX or QUAD form, and it encodes the coordinates of the text box in the natural scene.
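To make the geometry map more concrete, here is a minimal sketch of decoding one pixel. It assumes an RBOX-style layout in which each pixel stores its distances to the top, right, bottom and left edges of the box, ignores the rotation angle for simplicity, and assumes the output map is a quarter of the input resolution (`stride=4`). The function name `decode_aabb` is hypothetical.

```python
def decode_aabb(row, col, distances, stride=4):
    """Turn per-pixel distances (top, right, bottom, left) into (x1, y1, x2, y2).

    stride=4 is an assumption: it maps a pixel of the output map back to
    input-image coordinates. Rotation is ignored in this sketch.
    """
    top, right, bottom, left = distances
    x, y = col * stride, row * stride  # pixel position in the input image
    return (x - left, y - top, x + right, y + bottom)

box = decode_aabb(row=10, col=20, distances=(5.0, 30.0, 7.0, 12.0))
```

In QUAD form the map would instead hold coordinate offsets for the four corners, so the decoding would add eight offsets to the pixel position rather than four distances.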

Loss Function:

The loss function of EAST combines the score map loss and the geometry loss:

L = L_s + λ_g · L_g

Here L_s is the score map loss and L_g is the geometry loss. The lambda in the formula is the weight that balances the two losses. Its value is set to 1 in the research paper.
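The weighted sum of the two losses can be sketched in a few lines. The score map term below is written as a class-balanced cross-entropy, which is an assumption about how that term is computed; the geometry loss is simply passed in as a number. Averaging over pixels and the names `balanced_cross_entropy` and `total_loss` are choices made for this sketch.

```python
import numpy as np

def balanced_cross_entropy(pred, target, eps=1e-7):
    """Class-balanced cross-entropy over a score map (assumed form of the score loss)."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    beta = 1.0 - target.mean()            # weight for the (usually rare) text pixels
    loss = -(beta * target * np.log(pred)
             + (1.0 - beta) * (1.0 - target) * np.log(1.0 - pred))
    return loss.mean()

def total_loss(score_loss, geometry_loss, lambda_g=1.0):
    """Score loss plus the geometry loss weighted by lambda (1 in the paper)."""
    return score_loss + lambda_g * geometry_loss

pred = np.array([0.5, 0.5])
target = np.array([1.0, 0.0])
score_loss = balanced_cross_entropy(pred, target)
loss = total_loss(score_loss, geometry_loss=0.25)
```

With the weight at 1, both terms contribute equally; raising it would make the network favour accurate geometry over accurate text/non-text scores.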


The ADAM optimizer is used for end-to-end training of this network. Training continues until the performance stops improving. The learning rate for ADAM starts from 1e-3.

Non-Max Suppression Merging(NMS):

The geometries obtained after thresholding are passed to the NMS phase, which produces the final result. EAST uses locality-aware NMS, which merges the geometries row by row before applying standard NMS. In the best case, locality-aware NMS runs in O(n), where n is the number of candidate geometries, whereas conventional naive NMS runs in O(n*n).
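The row-by-row merging idea can be sketched as below. This is a simplified illustration, not the reference implementation: boxes are axis-aligned `(x1, y1, x2, y2, score)` tuples rather than rotated boxes or quadrangles, candidates are assumed to arrive already in row-first order, and the IoU threshold of 0.3 is an assumed value.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2, score)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def weighted_merge(g, p):
    """Score-weighted average of the coordinates; the scores are summed."""
    coords = (g[4] * g[:4] + p[4] * p[:4]) / (g[4] + p[4])
    return np.append(coords, g[4] + p[4])

def standard_nms(boxes, thresh):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    keep = []
    for b in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(b, k) <= thresh for k in keep):
            keep.append(b)
    return keep

def locality_aware_nms(boxes, thresh=0.3):
    """Merge consecutive overlapping boxes in one pass, then run standard NMS."""
    merged, prev = [], None
    for g in boxes:
        g = np.asarray(g, dtype=float)
        if prev is not None and iou(g, prev) > thresh:
            prev = weighted_merge(g, prev)  # fold g into the running box
        else:
            if prev is not None:
                merged.append(prev)
            prev = g
    if prev is not None:
        merged.append(prev)
    return standard_nms(merged, thresh)

candidates = [
    (0, 0, 10, 10, 0.9),    # two heavily overlapping neighbours...
    (1, 0, 11, 10, 0.9),
    (50, 50, 60, 60, 0.8),  # ...and one box far away
]
result = locality_aware_nms(candidates)
```

Because neighbouring pixels tend to predict nearly the same box, the single pass usually shrinks the candidate list a great deal, so the quadratic standard NMS at the end only sees a few boxes.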

That is all about EAST. Refer to the original research paper to explore it further. The link to the research paper is in the image below.
[Image: link to the research paper]

One of the popular implementations of EAST is here.