Introducing CRAFT: Character Region Awareness for Text Detection

Visualization of character-level detection using CRAFT.
Character Region Awareness for Text Detection (CRAFT) is a neural-network-based scene text detector. It was proposed by Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee on 3 April 2019. CRAFT detects text areas by exploring each character and the affinity between characters: it localizes individual character regions and links the detected characters into a text instance. The network is trained with a newly proposed representation for affinity. It was evaluated on six benchmarks, including the TotalText and CTW-1500 datasets, which contain highly curved text in natural images, and the authors report that it significantly outperforms state-of-the-art detectors.
CRAFT is built on a convolutional neural network that produces a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group the characters into a single instance. To compensate for the lack of character-level annotations, it uses a weakly supervised learning framework that estimates character-level ground truths from existing real word-level datasets.
The first task for the detection of text is to precisely localize each individual character in a natural image. To this end, we train a deep neural network to predict character regions and the affinity between characters. Since there is no public character-level dataset available, the model is trained in a weakly-supervised manner.
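To make the roles of the two scores concrete, here is a minimal sketch of how they could be combined into text instances. It assumes a pixel counts as text when either score exceeds a threshold, and that connected pixels form one instance. The function name `group_text_instances` and the threshold values are illustrative assumptions, not the authors' implementation, and the sketch returns axis-aligned boxes only.

```python
from collections import deque


def group_text_instances(region, affinity, region_thr=0.4, affinity_thr=0.4):
    """Return one (x_min, y_min, x_max, y_max) box per connected text blob.

    `region` and `affinity` are 2D lists of scores in [0, 1] with equal shape.
    The thresholds are illustrative values, not tuned ones.
    """
    h, w = len(region), len(region[0])
    # A pixel belongs to text if either score is above its threshold.
    mask = [[region[y][x] > region_thr or affinity[y][x] > affinity_thr
             for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Two characters joined by an affinity peak, plus one isolated character.
region = [[0.0, 0.9, 0.0, 0.9, 0.0, 0.0, 0.9]]
affinity = [[0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0]]
print(group_text_instances(region, affinity))
```

In this toy input the affinity peak bridges the first two characters into one box, while the third character stays a separate instance. With an all-zero affinity map, each character would come out as its own box.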


The backbone of CRAFT is a fully convolutional network architecture based on VGG-16 with batch normalization. The final output of the model has two channels as score maps: the region score and the affinity score.
Schematic illustration of CRAFT network architecture
The figure above depicts the network architecture schematically.


Ground Truth Label Generation

For every training image, CRAFT generates ground truth labels for both the region score and the affinity score from character-level bounding boxes. The region score denotes the probability that a given pixel is the center of a character, and the affinity score denotes the probability that it is the center of the space between adjacent characters. Unlike a binary segmentation map, which labels each pixel discretely, we encode the probability of the character center with a Gaussian heatmap. The figure below summarizes the label generation pipeline for a synthetic image.
Illustration of ground truth generation procedure in CRAFT framework
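The label generation can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's pipeline: it assumes axis-aligned character boxes given as `(x, y, w, h)`, whereas the paper warps a Gaussian onto arbitrary quadrilaterals with a perspective transform. The names `gaussian_patch`, `render_score_map`, and `affinity_quad`, and the `sigma_ratio` value, are hypothetical. The affinity box follows the paper's description (corners at the centers of the upper and lower triangles formed by each character box's diagonals), taking "center" to mean the centroid.

```python
import math


def gaussian_patch(width, height, sigma_ratio=0.25):
    """2D Gaussian that peaks at 1.0 in the centre of a width x height box."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    sx, sy = width * sigma_ratio, height * sigma_ratio
    return [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
             for x in range(width)] for y in range(height)]


def render_score_map(img_w, img_h, boxes):
    """Paint one Gaussian per box (x, y, w, h), keeping the per-pixel maximum."""
    score = [[0.0] * img_w for _ in range(img_h)]
    for bx, by, bw, bh in boxes:
        patch = gaussian_patch(bw, bh)
        for dy in range(bh):
            for dx in range(bw):
                y, x = by + dy, bx + dx
                if 0 <= y < img_h and 0 <= x < img_w:
                    score[y][x] = max(score[y][x], patch[dy][dx])
    return score


def affinity_quad(box_a, box_b):
    """Corners of the affinity box between two adjacent character boxes.

    For an axis-aligned box, the upper triangle (top edge plus box centre)
    has its centroid at (x + w/2, y + h/6); the lower one at (x + w/2, y + 5h/6).
    """
    def centroids(box):
        x, y, w, h = box
        cx = x + w / 2.0
        return (cx, y + h / 6.0), (cx, y + 5.0 * h / 6.0)

    up_a, low_a = centroids(box_a)
    up_b, low_b = centroids(box_b)
    return [up_a, up_b, low_b, low_a]  # clockwise from the top-left corner


chars = [(1, 1, 3, 3), (5, 1, 3, 3)]
region_gt = render_score_map(10, 6, chars)
print(region_gt[2][2], region_gt[2][6])        # peaks at the two character centres
print(affinity_quad((0, 0, 6, 6), (6, 0, 6, 6)))
```

The affinity ground truth would be produced the same way as the region map, by rendering a Gaussian into each affinity box instead of each character box.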

Weakly-Supervised Learning

Unlike synthetic datasets, real images in a dataset usually have word-level annotations. Here, we generate character boxes from each word-level annotation in a weakly-supervised manner, as summarized in the image below.
Illustration of the overall training stream for the proposed method
When a real image with word-level annotations is provided, the interim model learned so far predicts the character region score of each cropped word image, from which character-level bounding boxes are generated. To reflect the reliability of the interim model’s prediction, the confidence map value over each word box is computed in proportion to the number of detected characters divided by the number of ground truth characters. This value is used as the learning weight during training.
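The confidence computation can be written out as a small sketch. It uses the formula from the paper, s_conf(w) = (l(w) - min(l(w), |l(w) - l_c(w)|)) / l(w), where l(w) is the ground truth word length and l_c(w) is the number of detected characters. When too few characters are found this reduces to the ratio described above, and it also penalizes over-segmentation. The paper additionally falls back to evenly divided character boxes with a confidence of 0.5 when the score drops below 0.5. The function names and the `(x, y, w, h)` box layout are assumptions made for illustration.

```python
def word_confidence(gt_length, detected_length):
    """Confidence of the pseudo character boxes for one word, in [0, 1]."""
    if gt_length <= 0:
        raise ValueError("gt_length must be positive")
    error = min(gt_length, abs(gt_length - detected_length))
    return (gt_length - error) / gt_length


def character_boxes_or_fallback(word_box, gt_length, detected_boxes, min_conf=0.5):
    """Keep the detected boxes if they are reliable, else split the word evenly.

    `word_box` and each character box use the (x, y, w, h) layout, which is an
    assumption of this sketch.
    """
    conf = word_confidence(gt_length, len(detected_boxes))
    if conf >= min_conf:
        return detected_boxes, conf
    # Unreliable split: assume constant character width across the word box.
    x, y, w, h = word_box
    step = w / gt_length
    even = [(x + i * step, y, step, h) for i in range(gt_length)]
    return even, min_conf


print(word_confidence(6, 6))   # every character found
print(word_confidence(6, 5))   # one character missed
print(word_confidence(6, 13))  # heavy over-segmentation
boxes, conf = character_boxes_or_fallback((0, 0, 40, 10), 4, [(0, 0, 40, 10)])
print(conf, boxes[1])
```

During training, the returned confidence would be written into every pixel of the word box and multiplied into the per-pixel loss on the region and affinity scores.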


Multi-language Issue

The IC17 dataset contains Bangla and Arabic characters, which are not included in the synthetic text dataset. Moreover, both languages are difficult to segment into individual characters because every character is written cursively. Therefore, CRAFT could not distinguish Bangla and Arabic characters as well as it does Latin, Korean, Chinese, and Japanese characters. East Asian characters, by contrast, can be easily separated with a constant width, which helps train the model to high performance via weak supervision.

Comparison with End-to-end methods 

CRAFT is trained with the ground truth boxes only for detection, but it is comparable with other end-to-end methods. From the analysis of failure cases, it is expected that CRAFT will benefit from recognition results, especially when the ground truth words are separated by semantics rather than visual cues.

Generalization ability

CRAFT is capable of capturing general characteristics of text, rather than overfitting to a particular dataset.


This article is based entirely on the original CRAFT paper, and all the images are taken from the paper itself. The original paper can be found here. The code for the original CRAFT implementation is open source and is available here.
