
The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions

By Keith Cowing
Status Report
February 23, 2023
Detailed architecture diagram showing how a scanned page is processed through our full model pipeline (left panels), including OCR and image processing, PDF mining, and feature generation, followed by feeding the features into the Mega-Yolo model, whose outputs feed into the final found boxes. On the right, we depict the annotation process performed on the scanned page in order to generate ground truth models for comparison with our model’s found boxes. Source: cs.DL

Scientific articles published prior to the “age of digitization” in the late 1990s contain figures that are “trapped” within their scanned pages.

While progress has been made in extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages that have already been processed with Optical Character Recognition (OCR); the method uses both grayscale and OCR features.

We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify “high localization” levels as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS), we find F1 scores of 90.9% for figures and 92.2% for figure captions at the IOU cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
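To make the IOU cut-off concrete, here is a minimal Python sketch of how IOU between two boxes and an F1 score at a given threshold can be computed. It is illustrative only: the box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `f1_at_iou`, and the greedy one-to-one matching of found boxes to ground-truth boxes are all assumptions and are not taken from the paper, which may match and score boxes differently.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(found: List[Box], truth: List[Box], threshold: float = 0.9) -> float:
    """F1 score where a found box is a true positive only if its best
    remaining ground-truth match has IOU >= threshold.

    Greedy one-to-one matching is an assumption made for this sketch.
    """
    unmatched = list(truth)
    true_pos = 0
    for box in found:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            true_pos += 1
            unmatched.remove(best)  # each ground-truth box matches at most once
    false_pos = len(found) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0
```

Under these assumptions, a found box that covers only half of a ground-truth figure has an IOU of 0.5 and counts as a miss at the 0.9 cut-off, even though it would pass the looser 0.5 threshold that is common in general object detection.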

Jill P. Naiman, Peter K. G. Williams, Alyssa Goodman

Comments: 29 pages, 10 figures, accepted for publication in the International Journal on Digital Libraries, special issue follow up to TPDL 2022 conference. arXiv admin note: substantial text overlap with arXiv:2209.04460
Subjects: Digital Libraries (cs.DL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Cite as: arXiv:2302.11583 [cs.DL] (or arXiv:2302.11583v1 [cs.DL] for this version)
Submission history
From: Jill Naiman
[v1] Wed, 22 Feb 2023 19:00:01 UTC (4,503 KB)

SpaceRef co-founder, Explorers Club Fellow, ex-NASA, Away Teams, Journalist, Space & Astrobiology, Lapsed climber.