| Field | Value |
|---|---|
| pretty_name | PubLayNet |
- Homepage: https://developer.ibm.com/exchanges/data/all/publaynet/
- Repository: https://github.com/shunk031/huggingface-datasets_PubLayNet
- Paper (Preprint): https://arxiv.org/abs/1908.07836
- Paper (ICDAR2019): https://ieeexplore.ieee.org/document/8977963
PubLayNet is a dataset for document layout analysis. It contains page images from research papers and articles, with annotations for layout elements on each page such as "text", "list", and "figure". The dataset was built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central.
import datasets as ds

dataset = ds.load_dataset(
    path="shunk031/PubLayNet",
    decode_rle=True,  # If True, decode run-length encoded (RLE) masks into binary masks.
)
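To show what decoding RLE into a binary mask involves, here is a minimal sketch. It assumes COCO-style uncompressed RLE (runs alternating between 0s and 1s, starting with 0s, in column-major order). This is an assumption about the encoding, not a description of how this loader implements `decode_rle`. The function name `decode_uncompressed_rle` and the 3x3 input are hypothetical.

```python
from typing import List


def decode_uncompressed_rle(counts: List[int], height: int, width: int) -> List[List[int]]:
    """Decode COCO-style uncompressed RLE into a height x width binary mask.

    Runs alternate between 0s and 1s, starting with 0s, and pixels are
    laid out in column-major order (down each column first).
    """
    if sum(counts) != height * width:
        raise ValueError("run lengths must cover exactly height * width pixels")

    # Expand the runs into a flat, column-major list of 0/1 values.
    flat: List[int] = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value

    # Convert the column-major flat list into row-major nested lists.
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


# Hypothetical 3x3 example, not taken from PubLayNet.
mask = decode_uncompressed_rle([2, 3, 4], height=3, width=3)
for row in mask:
    print(row)
```

In practice a library such as `pycocotools` would normally handle this step. The pure-Python version is only meant to make the encoding concrete.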
@inproceedings{zhong2019publaynet,
  title={PubLayNet: largest dataset ever for document layout analysis},
  author={Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno},
  booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
  pages={1015--1022},
  year={2019},
  organization={IEEE}
}

Thanks to ibm-aur-nlp/PubLayNet for creating this dataset.