Information extraction from scanned documents

Please use this identifier to cite or link to this item: https://irju.jdvu.ac.in/jspui/handle/123456789/9002

Title:	Information extraction from scanned documents
Authors:	Das, Palash Priya
Advisors:	Chattopadhyay, Matangini
Keywords:	Digital contents;Tesseract or Eclipse IDE
Issue Date:	2023
Publisher:	Jadavpur University, Kolkata, West Bengal
Abstract:	Documents are one of the most important parts of the education system. To carry out research today, students and teachers need access to previously published research papers. These papers are available as digital content, mostly in formats such as PDF and DOCX. Extracting text from such documents is not always possible, because in some cases the text cannot be copied into another document. In some PDF files and in scanned documents, text cannot be selected, edited, copied and pasted, or searched. The solution to this problem is text extraction using remote services and programmatic access, which makes the documents ready for use. This research work presents how text is extracted from documents, especially question papers and various other files. We extracted text from various documents and PDF files using Tesseract and the Eclipse IDE. We also used Postman and curl to fetch or extract text, and we used Amazon Textract and the Google Cloud Console, including their programmatic access, for the same purpose.
URI:	http://20.198.91.3:8080/jspui/handle/123456789/9002
Appears in Collections:	Dissertation

Files in This Item:

File	Description	Size	Format
M.Tech (School of Education Technology) Palash Priya Das.pdf		5.84 MB	Adobe PDF

IR@JU Digital Repository