Python, TensorFlow 2

FOTS: SCENE TEXT PARSING

An Implementation of the FOTS (Fast Oriented Text Spotting) Paper

Keshu Sharma

--

Photo by Markus Spiske on Unsplash

This blog is organized as follows:

1. Scene Text Detection and Why It’s Needed

2. Problem Statement

3. Dataset Used

4. Losses Used

5. Traditional Approaches

6. Machine Learning Pipeline

7. Results

8. Profile

9. Post Training Quantization & Deployment

10. Future Work

11. References

1. Scene Text Detection and Why It’s Needed

Scene text detection and recognition is a very basic task that all of us perform daily. In daily life, text is present in a variety of natural scenes such as road signs, shop names, posters and signboards, and this text often helps us understand the context of an image. Scene text detection can even be applied to vehicle identification systems, where we read the number from a vehicle's number plate.

Why Text Detection?

  1. In the digital era we have vast collections of text images and documents. If we humans were to inspect all of them manually, we would surely miss something, so we need text detection models to make this work easier.
  2. These methods have made human work much easier in applications like passport recognition, automatic number plate recognition, converting handwritten text to digital text, etc.

2. Problem Statement

This is a combination of a computer vision and NLP problem. We take images as input, create bounding boxes for the text present in them, and label each box with the text recognized by our model. We try to extract text from complex images that contain things like cluttered backgrounds, complex objects and text-like strokes.

Basically this problem consists of 3 parts: Text Detection, RoI Rotate and Text Recognition.
Text Detection is the process in which we extract the text present in images and get a bounding box for each instance (somewhat similar to object detection).
RoI Rotate is the process in which we transform all differently oriented text regions to horizontal orientation so that it is easy for the Text Recognition branch to work on them.
Text Recognition is the process in which we read the text from the extracted text regions in the image.

While modeling, we have trained the Text Detection and Text Recognition branches separately, and built the complete pipeline by combining them with RoI Rotate at the end.

3. Dataset Used

We use the ICDAR 2015 Robust Reading competition dataset for our model. This dataset includes 1000 training images and 500 testing images, captured with Google Glass without taking care of positioning.

This dataset is organized into the following directories:

  1. ch4_training_images: contains all 1000 images that we have used for training our text detection branch.
  2. ch4_training_localization_transcription_gt: contains 1000 files of the form gt_img_<img_no>, each listing the coordinates as well as the words present in that particular training image. Some words are marked as ### to denote don't-care words in that image.
  3. ch4_training_word_images_gt: contains 4468 word images and 2 files, coords.txt and gt.txt. coords.txt contains the coordinates of the exact text in the images of this directory, and gt.txt contains the English word for each corresponding image.
  4. ch4_test_images: contains the 500 test images.

Some random images from the training set of the ICDAR 2015 dataset.

To avoid overfitting the Text Recognition branch on the small set of 4468 word images in ICDAR 2015, we also make use of 130,000 images from the SynthText dataset. Also, because of issues like the presence of vertical text, we did not use these 4468 images while training the Text Recognition branch; instead we used a combination of SynthText and word images generated manually from the original scene images of ICDAR 2015.

Some random images from the SynthText dataset

4. Losses Used

As we have trained the Recognition and Detection branches separately, we have used a variety of losses in this case study. The losses are as follows:

  1. Dice Loss: used in the Text Detection branch for classifying each pixel of the input image as a text or non-text region.

Here the loss is 1 - 2|X ∩ Y| / (|X| + |Y|), where X is the actual text region and Y is the predicted region. To read more about dice loss click here.
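A minimal NumPy sketch of this loss (the training code uses the TensorFlow equivalent on batched score maps; the training mask zeroes out don't-care pixels):

```python
import numpy as np

def dice_loss(y_true, y_pred, training_mask, eps=1e-5):
    """Dice loss between ground-truth and predicted score maps.
    Pixels where training_mask == 0 (don't-care regions) are ignored."""
    y_true = y_true * training_mask
    y_pred = y_pred * training_mask
    intersection = np.sum(y_true * y_pred)
    return 1.0 - 2.0 * intersection / (np.sum(y_true) + np.sum(y_pred) + eps)
```

A perfect prediction drives the loss to 0, while a completely disjoint prediction gives a loss of 1.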

2. IoU (Intersection over Union) Loss: used for training the Text Detection branch to produce proper bounding boxes for text regions.

To read more about IoU loss click here.

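Since every text pixel stores its distances to the four box edges in the geo map, the IoU of the true and predicted boxes can be computed channel-wise. A NumPy sketch, assuming the channel order top, bottom, left, right as in our geo map:

```python
import numpy as np

def iou_loss(d_true, d_pred, eps=1.0):
    """Per-pixel -log(IoU) between true and predicted boxes, each encoded
    as (..., 4) distances from the pixel to the top, bottom, left and
    right edges of its bounding box."""
    t_top, t_bot, t_left, t_right = np.split(d_true, 4, axis=-1)
    p_top, p_bot, p_left, p_right = np.split(d_pred, 4, axis=-1)
    area_true = (t_top + t_bot) * (t_left + t_right)
    area_pred = (p_top + p_bot) * (p_left + p_right)
    # overlap height and width come from the smaller distance on each side
    inter = (np.minimum(t_top, p_top) + np.minimum(t_bot, p_bot)) * \
            (np.minimum(t_left, p_left) + np.minimum(t_right, p_right))
    union = area_true + area_pred - inter
    return -np.log((inter + eps) / (union + eps))
```

Identical boxes give IoU 1 and hence a loss of 0; the loss grows as the boxes diverge.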

3. CTC (Connectionist Temporal Classification) Loss: used for training the Text Recognition branch to convert the text regions given by the Detection branch into text. To get more details about this loss click here.
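In TensorFlow 2 the CTC loss is available out of the box; a toy example with our 100-class output (99 characters plus the blank at the last index):

```python
import tensorflow as tf

batch, time_steps, num_classes = 2, 10, 100  # 99 characters + 1 CTC blank
logits = tf.random.normal((batch, time_steps, num_classes))
labels = tf.constant([[5, 12, 7, 0], [3, 3, 9, 22]], dtype=tf.int32)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=tf.constant([3, 4]),          # true length of each label row
    logit_length=tf.fill([batch], time_steps),
    logits_time_major=False,
    blank_index=num_classes - 1,               # blank is the extra 100th class
)
# `loss` holds one negative log-likelihood per sequence in the batch.
```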

5. Traditional Approaches

Some traditional approaches to this problem from before the deep learning era are Connected Component Analysis (CCA) and sliding windows. CCA methods first extract candidate components using techniques like color clustering or extremal region extraction, and then filter out non-text regions using manually designed classification rules. In the sliding-window method, windows of varying sizes slide over the image and each window is classified as a text or non-text region. To read more about these methods click here. For the recognition step, several character-level segmentation techniques were used.

Earlier, most approaches used only the Text Detection and Text Recognition branches for scene text parsing, but the FOTS method that we have implemented uses an additional RoI Rotate step as well.

There also exist solutions like Tesseract OCR, but these have several drawbacks; for example, Tesseract works best on images with text over a plain white background.

6. Machine Learning Pipeline

Architecture of Complete Model

As shown in the figure, we have 3 major branches in our architecture: the Text Detection branch, RoI Rotate and the Text Recognition branch. The shared convolutions are built as part of the Text Detection branch. For simplicity we have trained the Text Detection and Text Recognition branches separately and combined all three in the final inference pipeline.

6.1. DATA PREPARATION

Data preparation is the most important part of our model. We have constructed generators that fit the text detection branch and the text recognition branch separately.

6.1.1 DATA GENERATOR FOR TEXT DETECTION

While creating the generator function for text detection, we create 3 things that are used while training this model:

  1. Score Map: for each pixel it denotes whether the pixel belongs to a text region or not (1 for a text region, 0 for a non-text region).
  2. Geo Map: a 5-channel map. The first 4 channels contain the distances to the top, bottom, left and right sides of the bounding box that contains the text pixel identified by the score map, and the last channel contains the orientation of the bounding box.
  3. Training Mask: marks the don't-care text regions that we have to ignore while training the Detection branch.

We have also resized all images to a resolution of (512, 512, 3) in this generator, so the text detection branch works with images of this resolution only.

Original image, score map, the 5 geo-map channels and training mask for an image
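For axis-aligned boxes, the three targets can be sketched as below; the actual generator also handles rotated quadrilaterals and fills the 5th geo-map channel with the box angle:

```python
import numpy as np

def make_targets(h, w, boxes, dont_care):
    """Build score map, geo map and training mask for one image.
    `boxes` are axis-aligned (x1, y1, x2, y2) rectangles here for
    simplicity; `dont_care` flags the '###' regions."""
    score_map = np.zeros((h, w), np.float32)
    geo_map = np.zeros((h, w, 5), np.float32)
    training_mask = np.ones((h, w), np.float32)
    for (x1, y1, x2, y2), ignore in zip(boxes, dont_care):
        if ignore:                              # exclude '###' boxes from loss
            training_mask[y1:y2, x1:x2] = 0
            continue
        score_map[y1:y2, x1:x2] = 1
        ys, xs = np.mgrid[y1:y2, x1:x2]
        geo_map[y1:y2, x1:x2, 0] = ys - y1      # distance to top edge
        geo_map[y1:y2, x1:x2, 1] = y2 - ys      # distance to bottom edge
        geo_map[y1:y2, x1:x2, 2] = xs - x1      # distance to left edge
        geo_map[y1:y2, x1:x2, 3] = x2 - xs      # distance to right edge
        # channel 4 (orientation) stays 0 for axis-aligned boxes
    return score_map, geo_map, training_mask
```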

6.1.2. DATA GENERATOR FOR TEXT RECOGNITION

For the Text Recognition model we have created 2 generators (train and test). We resize the word images to (64, 128, 3) and encode the word for each image as a vector of size 23. For these vectors we have created a vocabulary containing all the characters that can constitute a word.

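A sketch of such a vocabulary and word encoding; the implementation uses a 99-character set, which is not reproduced exactly here, so this illustration uses Python's printable characters instead:

```python
import string

# Assumed character set: digits, letters, punctuation and space
# (the exact 99-character vocabulary in the implementation differs).
VOCAB = string.digits + string.ascii_letters + string.punctuation + " "
char_to_idx = {c: i for i, c in enumerate(VOCAB)}
MAX_LEN = 23  # every word becomes a fixed-length vector of size 23

def encode_word(word, pad_value=len(VOCAB)):
    """Map a word to a length-23 integer vector, padding with an index
    outside the vocabulary (reserved for the CTC blank)."""
    ids = [char_to_idx[c] for c in word[:MAX_LEN]]
    return ids + [pad_value] * (MAX_LEN - len(ids))
```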

For More code details on these generators click here.

6.2 MODEL TRAINING

Here we have trained the Text Detection and Text Recognition branches separately, and for inference we combine both with RoI Rotate. An overview of training these branches is given below:

6.2.1. TEXT DETECTION BRANCH

For the text detection branch we used a ResNet50 model pretrained on the ImageNet dataset. We first downscale the image by a factor of 1/32 and then upscale it back by a factor of 32 (this differs from the paper, where the features are upscaled by a factor of 16 only). While training this branch we used 2 losses, IoU loss and dice loss. For more details click here.

To avoid the dice loss dominating the IoU loss, we define the complete loss as:

Loss = 0.01 * Dice Loss + IoU Loss

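A minimal Keras sketch of the detection branch: a pretrained ResNet50 backbone whose 1/32-scale features are upsampled back to the input resolution, topped with a 1-channel score-map head and a 5-channel geo-map head. Filter counts here are illustrative; the full implementation also merges intermediate backbone features:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_detection_branch(input_shape=(512, 512, 3), weights="imagenet"):
    """Sketch of the detection branch (shared convolutions + heads)."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape)
    x = backbone.output                       # 1/32 of the input resolution
    for filters in (128, 64, 32, 32, 32):     # upsample by 2, five times
        x = layers.UpSampling2D()(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    score_map = layers.Conv2D(1, 1, activation="sigmoid", name="score")(x)
    geo_map = layers.Conv2D(5, 1, name="geo")(x)
    return tf.keras.Model(backbone.input, [score_map, geo_map])
```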

While training this text detection branch we used a batch size of 10 and trained for 50 epochs, using various TensorFlow callbacks:

  1. ReduceLROnPlateau callback: reduces the learning rate when the loss plateaus, e.g. when the model weights get stuck near a local minimum.
  2. ModelCheckpoint callback: saves model weights during training.
  3. TensorBoard callback: used to visualize the loss and layer weights in TensorBoard.
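In Keras these three callbacks look as follows (file paths here are illustrative):

```python
import tensorflow as tf

# Callbacks used while training the detection branch.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="loss", factor=0.5, patience=3, min_lr=1e-6),
    tf.keras.callbacks.ModelCheckpoint(
        "detection_weights.h5", monitor="loss",
        save_best_only=True, save_weights_only=True),
    tf.keras.callbacks.TensorBoard(log_dir="logs/detection"),
]
# model.fit(train_generator, epochs=50, callbacks=callbacks)
```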

A snapshot of training this branch is given below:

TensorBoard loss graph:

From this TensorBoard loss graph we can see that the loss curve starts flattening as it reaches 50 epochs, so we trained this model for only 50 epochs.

6.2.2. TEXT RECOGNITION BRANCH

For the text recognition branch we used a series of convolution operations, each followed by batch normalization, ReLU activation and max pooling (which halves the dimensions along the height axis only).

After these operations we used 2 layers of bidirectional GRUs and finally a dense layer with 100 units (100 because our vocabulary has 99 characters and we need one extra for the blank symbol used by the CTC loss and CTC decoder).

We used CTC loss for training this network architecture.
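A Keras sketch of this recognition branch with illustrative filter counts; pooling halves the height only, and after collapsing the height axis the width axis becomes the time axis for the GRUs:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 100  # 99-character vocabulary + 1 CTC blank

def build_recognition_branch(input_shape=(64, 128, 3)):
    """Sketch of the recognition branch: conv blocks, then 2 BiGRUs
    and a softmax over the vocabulary at each time step."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(pool_size=(2, 1))(x)   # halve height only
    x = layers.Permute((2, 1, 3))(x)                   # (width, height, ch)
    x = layers.Reshape((input_shape[1], -1))(x)        # width becomes time
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```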

While training the Text Recognition branch we used a batch size of 128 and trained the whole architecture described above on the SynthText dataset so as to avoid overfitting. Here too we used callbacks such as TensorBoard, ReduceLROnPlateau and ModelCheckpoint. For more implementation details click here.

A snapshot of training is given below:

TensorBoard loss graph:

Here the red curve shows the CTC loss on the train data and the blue curve on the test data. The model is not overfitting, as the difference between the train and test losses is small.

6.3. FINAL INFERENCE PIPELINE

The final inference pipeline consists of all 3 parts: Text Detection (including the shared convolutions), RoI Rotate and Text Recognition. In RoI Rotate we generate the coordinates of the boxes where text is present by making use of the score map and geo maps predicted by the Text Detection branch. After getting these coordinates we apply the Text Recognition branch to read the text in each detected region. Click here to see the inference pipeline in more detail.
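The crop-and-rotate step of RoI Rotate can be sketched with plain NumPy nearest-neighbour sampling (the actual implementation uses an affine warp with interpolation; `center` and `angle` here come from the predicted box):

```python
import numpy as np

def roi_rotate(image, center, angle, out_h, out_w):
    """Sample a rotated text region into a horizontal patch.
    `angle` is the box orientation in radians, `center` its (x, y) middle."""
    cx, cy = center
    cos, sin = np.cos(angle), np.sin(angle)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    # offsets from the patch centre, rotated back into image coordinates
    dx, dy = xs - out_w / 2.0, ys - out_h / 2.0
    src_x = np.round(cx + cos * dx - sin * dy).astype(int)
    src_y = np.round(cy + sin * dx + cos * dy).astype(int)
    src_x = np.clip(src_x, 0, image.shape[1] - 1)
    src_y = np.clip(src_y, 0, image.shape[0] - 1)
    return image[src_y, src_x]
```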

7. RESULTS

Some images on which our model performs well:

Our model does not do well on images with complex text regions and complex backgrounds. Some images on which the model performs poorly are shown below:

8. Profile

In this section I am sharing my LinkedIn profile, so you can know my background, and my GitHub repository, where you can see more implementation details for this paper.

Click here to visit my LinkedIn profile & here to visit the GitHub repository of the code.

9. Post Training Quantization & Deployment

Here we have applied TensorFlow post-training quantization techniques, namely dynamic-range quantization and float16 quantization. Although these techniques reduce the model sizes considerably, the latency with dynamic-range quantization is very high on machines with little computation power. To see more about our implementation click here.
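With the TFLite converter, float16 post-training quantization takes only a few lines; for dynamic-range quantization, simply omit the `supported_types` line:

```python
import tensorflow as tf

def quantize_float16(keras_model):
    """Apply float16 post-training quantization and return the
    serialized TFLite model as bytes."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
    return converter.convert()
```

The returned bytes can be written straight to a `.tflite` file and served with the TFLite interpreter.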

After quantization we deployed our code using the Streamlit library on Zeet. Click here to see our deployed model.

I have also created a video about the same model deployment.

10. Future Work

Due to computation and memory limitations we could not train the text detection branch on the SynthText dataset, which is why our model does not work well on text with complex backgrounds, as shown in the Results section. With more computation power we could also use the complete SynthText dataset for training the Text Recognition branch (currently we have used only 130,000 images). There is also scope to train the model at the original image resolution in the future (we did not do this either because of limited computation power).

For now our model works on a single language (English); by using more data from SynthText or ICDAR 2017 we could build a multilingual text parsing model.

--

--

Keshu Sharma

Keshu is interested in exploring machine learning and the current research in these areas.