Introducing CRAFT: Character Region Awareness for Text Detection
Visualization of character-level detection using CRAFT
CRAFT is a convolutional neural network that produces two outputs: a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group the characters into a single instance. To compensate for the lack of character-level annotations, it uses a weakly supervised learning framework that estimates character-level ground truths from existing real word-level datasets.
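To make the grouping idea concrete, here is a minimal sketch of how the two score maps could be combined into text instances. It is not the authors' post-processing: the function name `group_characters`, both thresholds, and the use of 4-connected components are assumptions chosen for illustration. A pixel counts as text if either score is high enough, so an affinity blob between two characters merges them into one instance.

```python
from collections import deque

def group_characters(region, affinity, region_thresh=0.5, affinity_thresh=0.5):
    """Label connected text instances on thresholded score maps.

    region and affinity are 2D lists of floats with the same shape.
    Both thresholds are hypothetical defaults, not values from the source.
    Returns (labels, count): labels is 0 for background, 1..count otherwise.
    """
    height, width = len(region), len(region[0])
    # A pixel belongs to text if either the region or the affinity score is high.
    text = [
        [region[y][x] >= region_thresh or affinity[y][x] >= affinity_thresh
         for x in range(width)]
        for y in range(height)
    ]
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not text[y][x] or labels[y][x]:
                continue
            # Start a new instance and flood-fill its 4-connected neighbours.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and text[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

In a one-row example with characters at columns 0, 2, and 5 and a high affinity score at column 1, the first two characters receive the same label and the third becomes a separate instance.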
The first step in text detection is to precisely localize each individual character in a natural image. To this end, we train a deep neural network to predict character regions and the affinity between characters. Since no public character-level dataset is available, the model is trained in a weakly-supervised manner.
Architecture
The backbone of CRAFT is a fully convolutional network based on VGG-16 with batch normalization. The final output of the model has two channels of score maps: the region score and the affinity score.
Schematic illustration of CRAFT network architecture
The figure above depicts the network architecture schematically.
Training
Ground Truth Label Generation
For every training image, CRAFT generates ground truth labels for both the region score and the affinity score from character-level bounding boxes. The region score denotes the probability that a given pixel is the center of a character, and the affinity score denotes the probability that it is the center of the space between adjacent characters. Unlike a binary segmentation map, which labels each pixel discretely, we encode the probability of the character center with a Gaussian heatmap. The figure below summarizes the label generation pipeline for a synthetic image.
Illustration of ground truth generation procedure in CRAFT framework
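As a rough illustration of the Gaussian encoding, the sketch below renders a region score map from character boxes. It is a simplification under stated assumptions: boxes are axis-aligned with integer pixel coordinates, whereas the CRAFT paper describes warping an isotropic Gaussian onto each (possibly skewed) box with a perspective transform. The function names and the `sigma_scale` parameter are hypothetical.

```python
import math

def gaussian_box_heatmap(height, width, box, sigma_scale=0.25):
    """Render a Gaussian heatmap for one axis-aligned character box.

    box is (x_min, y_min, x_max, y_max) in integer pixel coordinates.
    sigma_scale (hypothetical) sets each standard deviation as a
    fraction of the box size along that axis.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    sigma_x = max((x_max - x_min) * sigma_scale, 1e-6)
    sigma_y = max((y_max - y_min) * sigma_scale, 1e-6)
    heatmap = [[0.0] * width for _ in range(height)]
    # Only pixels inside the box receive a score; the peak is at the center.
    for y in range(max(0, y_min), min(height, y_max + 1)):
        for x in range(max(0, x_min), min(width, x_max + 1)):
            dx = (x - cx) / sigma_x
            dy = (y - cy) / sigma_y
            heatmap[y][x] = math.exp(-0.5 * (dx * dx + dy * dy))
    return heatmap

def region_score_map(height, width, char_boxes):
    """Combine per-character heatmaps by taking the pixel-wise maximum."""
    score = [[0.0] * width for _ in range(height)]
    for box in char_boxes:
        single = gaussian_box_heatmap(height, width, box)
        for y in range(height):
            for x in range(width):
                score[y][x] = max(score[y][x], single[y][x])
    return score
```

The same routine could be applied to affinity boxes placed between adjacent characters to obtain the affinity score map; how those boxes are constructed is not covered by this sketch.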
Weakly-Supervised Learning
Unlike synthetic datasets, real image datasets usually have only word-level annotations. Here, we generate character boxes from each word-level annotation in a weakly-supervised manner, as summarized in the image below.
Illustration of the overall training stream for the proposed method
When a real image with word-level annotations is provided, the interim model learned so far predicts the character region score of each cropped word image, from which character-level bounding boxes are generated. To reflect the reliability of the interim model’s prediction, the value of the confidence map over each word box is computed in proportion to the number of detected characters divided by the number of ground truth characters. This value is used as the learning weight during training.
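The confidence weighting can be sketched as a small function. When the interim model detects no more characters than the word actually has, the value below equals the ratio of detected to ground truth characters described above. How to handle over-detection is an assumption of this sketch: extra characters are penalized symmetrically so that the value stays between 0 and 1. The name `word_confidence` is hypothetical.

```python
def word_confidence(num_detected, num_ground_truth):
    """Confidence of the pseudo character boxes for one word box.

    Equals num_detected / num_ground_truth when characters are missed.
    Over-detection is penalized by the same amount (an assumption here),
    and the result is clamped to the range [0, 1].
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    error = abs(num_ground_truth - num_detected)
    return (num_ground_truth - min(num_ground_truth, error)) / num_ground_truth
```

For a six-character word, detecting five characters gives 5/6, detecting all six gives 1.0, and detecting eight gives 4/6. Multiplying the per-pixel loss inside a word box by this value would down-weight words whose pseudo labels are likely wrong.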
Discussion
Multi-language Issue
The IC17 dataset contains Bangla and Arabic characters, which are not included in the synthetic text dataset. Moreover, both languages are difficult to segment into individual characters because every character is written cursively. Therefore, CRAFT could not distinguish Bangla and Arabic characters as well as it does Latin, Korean, Chinese, and Japanese. East Asian characters, by contrast, can be easily separated with a constant width, which helps train the model to high performance via weak supervision.
Comparison with End-to-end methods
CRAFT is trained with ground truth boxes for detection only, yet it is comparable with other end-to-end methods. From the analysis of failure cases, it is expected that CRAFT will benefit from recognition results, especially when the ground truth words are separated by semantics rather than visual cues.
Generalization ability
CRAFT is capable of capturing general characteristics of text, rather than overfitting to a particular dataset.