Code: https://github.com/DevashishPrasad/CascadeTabNet
Scope: Table detection and table structure recognition
This paper: CascadeTabNet, a single end-to-end deep learning model that solves both problems at once
Architecture:
- Cascade R-CNN (Cai and Vasconcelos, 2018) is a multistage model for detection
- HRNet (Wang et al., 2019) for semantic segmentation and object detection
- Findings: Cascaded multistage model with HRNet backbone yields the best result
- Network (implemented using mmdetection)
- Image is fed to HRNetV2p_W32 to produce a feature map
- RPN (region proposal network) takes the feature map and predicts object proposals (B0, regions of interest, RoI)
- B0 = bounding box classification scores + box regression points
- A BBox head takes RoI B0 and the feature map and makes predictions (B1)
- Further BBox heads take B1 and the feature map to produce B2, and then B2 to produce B3 (final bounding box)
- A mask layer takes B2 and feature map to produce masks for objects (S, segmentation output)
- mmdetection: cascade_mask_rcnn_hrnetv2p_w32_20e
- Branching at output of B2: Produces B3 and S (from segmentation branch)
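The data flow of the network bullets above (feature map, B0, B1, B2, then B3 and S branching off B2) can be traced with a minimal sketch. The function `cascade_forward` and all toy stages below are hypothetical stand-ins written for illustration; they are not mmdetection code and not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cascade_forward(image, backbone: Callable, rpn: Callable,
                    bbox_heads: Sequence[Callable], mask_head: Callable):
    """Run the stages in order and return ([B0, B1, ...], S)."""
    features = backbone(image)           # feature map
    boxes = rpn(features)                # B0: initial proposals
    stages = [boxes]
    masks = None
    for i, head in enumerate(bbox_heads):
        if i == len(bbox_heads) - 1:
            # The mask branch uses the boxes that feed the last head (B2).
            masks = mask_head(features, boxes)
        boxes = head(features, boxes)    # refine B_i into B_(i+1)
        stages.append(boxes)
    return stages, masks


# Toy stages (assumptions): each head shrinks every box by one pixel per side.
def toy_head(features, boxes: List[Box]) -> List[Box]:
    return [(x1 + 1, y1 + 1, x2 - 1, y2 - 1) for x1, y1, x2, y2 in boxes]


stages, masks = cascade_forward(
    image="page.png",                            # placeholder input
    backbone=lambda img: img,                    # identity "feature map"
    rpn=lambda feats: [(0.0, 0.0, 10.0, 10.0)],  # one proposal as B0
    bbox_heads=[toy_head, toy_head, toy_head],
    mask_head=lambda feats, boxes: [("mask", b) for b in boxes],
)
b0, b1, b2, b3 = stages
```

With three heads, `b3` is the final bounding box list and `masks` is computed from `b2`, matching the branching described in the notes.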
Transfer learning in 2 stages
- General task of table detection: Using general dataset
- images of different types of documents, containing tables of various types
- goal: detect tables in images
- Detect tables according to table types (bordered, semi-bordered, borderless)
- two-stage transfer learning strategy makes it possible to train with only a small amount of data
First iteration
- start with a CNN model initialized with pretrained ImageNet COCO model weights
- learn only task-specific high level features
- result: predicts the detection masks for tables in images
Second iteration
- model fine-tuned on smaller dataset for specific tasks
- predicting cell masks in borderless tables
- detecting table types
- need smaller data set with advanced annotations
- semi-bordered and borderless tables share the same class because the border is not helpful
Image augmentation techniques
- used to generate additional training data and to avoid overfitting
- digital documents in datasets are perfectly axis-aligned
- so shear and rotation transforms should not be applied
- dilation transform: Thicken the black pixel regions
- convert into binary image first
- use a 2x2 kernel filter with one iteration to do the transform
- smudge transform: Spread black pixels to create smeary, blurred black pixel regions
- implemented using various distance transforms
- algorithm: Gilani et al. (2017), using Euclidean distance transform, linear distance transform, max distance transform
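The two transforms above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the threshold of 128, the top-left kernel anchor, and the function names are all hypothetical, and `smudge` uses a simple chessboard distance grown with a 3x3 window as a stand-in, not the algorithm of Gilani et al. (2017).

```python
import numpy as np


def binarize(gray, threshold=128):
    """Return a boolean ink mask: True where the pixel is black."""
    return np.asarray(gray) < threshold


def grow(mask, kh, kw, pad):
    """A pixel becomes True if any pixel under the kh x kw window is True."""
    padded = np.pad(mask, pad, constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(kh):
        for dx in range(kw):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def dilate_ink(ink, iterations=1):
    """Thicken black regions with a 2x2 kernel (anchor choice is an assumption)."""
    for _ in range(iterations):
        ink = grow(ink, 2, 2, ((0, 1), (0, 1)))
    return ink


def smudge(ink, radius=3):
    """Smear ink: gray level rises linearly with distance to the nearest ink pixel."""
    dist = np.full(ink.shape, radius, dtype=np.int32)
    dist[ink] = 0
    reach = ink.copy()
    for d in range(1, radius):
        grown = grow(reach, 3, 3, 1)
        dist[grown & ~reach] = d      # pixels first reached at distance d
        reach = grown
    return (255 * dist // radius).astype(np.uint8)


# Toy page: white 7x7 image with a single black pixel in the middle.
page = np.full((7, 7), 255, dtype=np.uint8)
page[3, 3] = 0
ink = binarize(page)
dilated = np.where(dilate_ink(ink), 0, 255).astype(np.uint8)
smudged = smudge(ink, radius=3)
```

On real scans the same functions apply to a grayscale array loaded with any image library; a library routine such as an OpenCV dilation would normally replace the hand-written `grow`.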
Pipeline
- CascadeTabNet takes images containing zero or more tables and predicts segmentation masks for tables (bordered and borderless)
- Branches: Bordered and Borderless
- Borderless: Predict cells, then arrange them into rows and columns based on their positions. Detect cells using a contour-based text detection algorithm
- Bordered: Uses a line detection algorithm. No cell segmentation is needed because line information is easier to use
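Arranging predicted cells into rows and columns by position, as in the borderless branch, might look like the sketch below. The grouping rule (cluster cell centers that lie within a pixel tolerance `tol`) and the names `cluster_1d` and `arrange_cells` are assumptions made for illustration; the notes do not specify the authors' exact procedure.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def cluster_1d(centers: List[float], tol: float) -> List[int]:
    """Label each center; a gap larger than `tol` in sorted order starts a new group."""
    order = sorted(range(len(centers)), key=lambda i: centers[i])
    labels = [0] * len(centers)
    current = 0
    for prev, idx in zip(order, order[1:]):
        if centers[idx] - centers[prev] > tol:
            current += 1
        labels[idx] = current
    return labels


def arrange_cells(cells: List[Box], tol: float = 10.0) -> Dict[Tuple[int, int], Box]:
    """Map (row, column) to the cell box, using vertical and horizontal centers."""
    rows = cluster_1d([(y1 + y2) / 2 for _, y1, _, y2 in cells], tol)
    cols = cluster_1d([(x1 + x2) / 2 for x1, _, x2, _ in cells], tol)
    return {(r, c): cell for r, c, cell in zip(rows, cols, cells)}


# Four cell boxes of a 2x2 table, given in arbitrary order.
cells = [(110, 0, 200, 20), (0, 0, 100, 20), (0, 30, 100, 50), (110, 30, 200, 50)]
grid = arrange_cells(cells)
```

If two cells land in the same (row, column) slot, this sketch keeps only the last one; a real pipeline would need to handle merged or overlapping cells explicitly.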
Data
- General dataset is a merge of 3 datasets:
- ICDAR19 (cTDaR)
- Marmot
- https://github.com/sgrpanchal31/table-detection-dataset
- Removed some ground truth errors in dataset
- Create 4 training sets
- original images
- original + dilate transform
- original + smudge transform
- original + dilate + smudge
- Evaluation using the TableBank dataset
Bibliographic data
@inproceedings{
title = "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents",
author = "Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure",
booktitle = "Proceedings of CVPR2020 Workshop on Text and Documents in the Deep Learning Era",
year = "2020",
address = "Seattle, WA",
}