Welcome to pdf_bboxes!
Hi! This is a python CLI program which takes a PDF as an input and provides bounding boxes over each word on all pages of the PDF in a nice and clean json, image or csv format.
Note: This program only accepts 'Text PDFs' and not 'Image or Scanned PDFs'.
Note: This program only accepts 'Text PDFs' and not 'Image or Scanned PDFs'.
Usage :
This program can be used on any computer-generated PDFs, be it Invoices, Research Papers, e-books, etc. Its output can be used to make datasets for machine learning algorithms like GCN (Graph Convolutional Network) or other ML algorithms.git clone https://github.com/abhishekgautam101/pdf_bboxes.git
cd pdf_bboxes
pip install -r requirements.txt
How to run?
In order to run this program, you need to pass 3 mandatory arguments:--input INPUT Path of a PDF file.
--output_format {json,csv,img} 'json' for JSON, 'csv' for CSV, 'img' for images with bounding boxes; for all the pages of PDF.
python pdf_bboxes.py --input '/home/gautam/Desktop/python/ocr/SEAGATE.PDF' --output_format 'json' --output_folder '/home/gautam/Desktop/python/ocr/abcd/'
To get output as a CSV file :
To get output as a CSV file :
python pdf_bboxes.py --input '/home/gautam/Desktop/python/ocr/SEAGATE.PDF' --output_format 'csv' --output_folder '/home/gautam/Desktop/python/ocr/abcd/'
To get output as Images of each page of the PDF having bounding boxes over each word and recognized text over the bounding boxes :
To get output as Images of each page of the PDF having bounding boxes over each word and recognized text over the bounding boxes :
python pdf_bboxes.py --input '/home/gautam/Desktop/python/ocr/SEAGATE.PDF' --output_format 'img' --output_folder '/home/gautam/Desktop/python/ocr/abcd/'
Output JSON file and Images will be found in the output_folder you passed as an argument.
Output JSON file and Images will be found in the output_folder you passed as an argument.
You can find the source code here:
Buy Me a Coffee - https://www.buymeacoffee.com/agautam
It is really a great work and the way in which you are sharing the knowledge is excellent.Thanks for your informative article
ReplyDeleteInvoices Apps
Hi
ReplyDeletei used this command:
python pdf_bboxes.py --input 'empty_test.pdf' --output_format "img" --output_folder .'/results'
it got me:
The system cannot find the file specified.
Traceback (most recent call last):
File "pdf_bboxes.py", line 32, in
char_filepath = extract_chars(filepath, char_filename)
File "C:\Users\lenovo\Desktop\OCR-SiP\pdf_bboxes\helper.py", line 14, in extract_chars
subp.check_call(pdfplumber_cmd, shell=True)
File "C:\Users\lenovo\anaconda3\envs\TensorFlow_GPU_Test\lib\subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'pdfplumber --types char < ''empty_test.pdf'' > 'char_output_dir/'empty_test.pdf'_chars.csv'' returned non-zero exit status 1.
Please check that 'empty_test.pdf' is accessible, if not sure, give full path of 'empty_test.pdf'. And in your command [--output_folder .'/results'] , you placed dot(.) before apostrophe('). Please fix the paths that you are providing, everything should work fine. Thanks.
Delete