jPDFText Developer Guide
Contents
Introduction
Getting Started
Extracting Text
Extracting Text Page by Page
Extracting Words as a Vector of Strings
Extracting Words Page by Page
Getting Basic Document Information
Distribution and JAR files
Javadoc API
Source Code Samples
Introduction
jPDFText is a Java library that integrates seamlessly into your application to extract words from PDF documents. jPDFText provides the following functions:
- Load PDF documents from files, network drives, URLs or input streams
- Get basic information from the PDF document such as title, author, keywords, page count, etc.
- Extract words from PDF documents as a vector of Strings
- Extract words page by page
Like all of our libraries, jPDFText is built on top of Qoppa’s proprietary format and doesn’t require any third-party programs or drivers.
Getting Started
The starting point for using jPDFText is the com.qoppa.pdfText.PDFText class. This class loads a PDF document and extracts its text. It provides three constructors, which load a PDF file from the file system, a URL or an InputStream. Each constructor takes a second parameter, an object that implements IPasswordHandler, which is queried if the PDF file requires a password to open. For PDF files that are not encrypted, this second parameter can be null:
    PDFText pdfText = new PDFText(new URL("http://www.mysite.com/content.pdf"), null);
Extracting Text
Once a PDFText object has been created, the host application simply needs to call the getText method to get the text from the loaded PDF document. The text is returned as a String.
    // get text as a String
    String text = pdfText.getText();

    // print the text
    System.out.println(text);
Extracting Text Page by Page
To extract the text page by page, use the getText method that takes a page number as a parameter. You can get the number of pages from the PDFText object through the getPageCount method.
    // get page count
    int pageCount = pdfText.getPageCount();

    for (int count = 0; count < pageCount; count++)
    {
        // get text for page (count+1)
        String text = pdfText.getText(count + 1);

        // print the text
        System.out.println(" PAGE " + (count + 1));
        System.out.println(text);
    }
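When the page text needs to be kept for later processing rather than printed, the same loop can be wrapped in a small helper. The following is a minimal sketch and not part of jPDFText: the class name PageTextCollector and the textForPage parameter are hypothetical, and an IntFunction stands in for pdfText.getText(int) so that the example runs without the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of jPDFText.
class PageTextCollector {

    // pageCount stands in for pdfText.getPageCount();
    // textForPage stands in for pdfText.getText(int).
    static List<String> collect(int pageCount, IntFunction<String> textForPage) {
        List<String> pages = new ArrayList<>(pageCount);
        for (int count = 0; count < pageCount; count++) {
            // Same page numbering as the guide's loop: count + 1.
            String text = textForPage.apply(count + 1);
            // Store an empty string if no text came back for the page.
            pages.add(text == null ? "" : text);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Fake three-page document used in place of a real PDFText object.
        String[] fake = {"first page", "second page", "third page"};
        List<String> pages = collect(fake.length, n -> fake[n - 1]);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println(" PAGE " + (i + 1));
            System.out.println(pages.get(i));
        }
    }
}
```

With a loaded document, the call might look like PageTextCollector.collect(pdfText.getPageCount(), n -> pdfText.getText(n)), assuming getText(int) does not declare a checked exception. If it does, the lambda would need to catch and wrap it.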
Extracting Words as a Vector of Strings
Once a PDFText object has been created, the host application simply needs to call the getWords method to get the list of words from the loaded PDF document.
    // get list of words as a vector of strings
    Vector words = pdfText.getWords();

    // loop through the words and print them
    for (int count = 0; count < words.size(); count++)
    {
        System.out.println("[" + words.get(count) + "] ");
    }
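A word list is often the input to further processing such as counting. The sketch below shows one way to tally word frequencies from a vector like the one getWords returns. The class name WordFrequency is hypothetical, the case-insensitive comparison is a design choice made for the example, and the vector is built by hand so the code runs without jPDFText.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.Vector;

// Hypothetical helper, not part of jPDFText.
class WordFrequency {

    // Counts how often each word occurs, ignoring case.
    // Vector<?> is used because the guide's sample works with a raw Vector.
    static Map<String, Integer> count(Vector<?> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Object word : words) {
            String key = String.valueOf(word).toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hand-built vector used in place of pdfText.getWords().
        Vector<String> words = new Vector<>();
        words.add("PDF");
        words.add("text");
        words.add("pdf");
        // Print each distinct word with its count, in alphabetical order.
        for (Map.Entry<String, Integer> entry : count(words).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

A TreeMap keeps the result sorted alphabetically; a HashMap would work equally well if ordering does not matter.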
Extracting Words Page by Page
To extract words page by page, use the getWords method that takes a page number as a parameter. You can get the number of pages from the PDFText object through the getPageCount method.
    // get page count
    int pageCount = pdfText.getPageCount();

    for (int count = 0; count < pageCount; count++)
    {
        // get words for page (count+1)
        Vector words = pdfText.getWords(count + 1);

        // loop through the words and print them
        System.out.println(" PAGE " + (count + 1));
        for (int wordCount = 0; wordCount < words.size(); wordCount++)
        {
            System.out.println("[" + words.get(wordCount) + "] ");
        }
    }
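Per-page word lists make it possible to record where each word occurs. The following sketch builds a simple word-to-pages index. It is an illustration only: the class name WordPageIndex and the wordsForPage parameter are hypothetical, and an IntFunction stands in for pdfText.getWords(int) so the example runs without the library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.IntFunction;

// Hypothetical helper, not part of jPDFText.
class WordPageIndex {

    // pageCount stands in for pdfText.getPageCount();
    // wordsForPage stands in for pdfText.getWords(int).
    static Map<String, SortedSet<Integer>> build(
            int pageCount, IntFunction<List<String>> wordsForPage) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int count = 0; count < pageCount; count++) {
            // Same page numbering as the guide's loop: count + 1.
            int pageNumber = count + 1;
            for (String word : wordsForPage.apply(pageNumber)) {
                // Create the page set on first sight of the word, then add the page.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(pageNumber);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Fake two-page document used in place of a real PDFText object.
        List<List<String>> fakePages = Arrays.asList(
                Arrays.asList("hello", "world"),
                Arrays.asList("hello", "again"));
        Map<String, SortedSet<Integer>> index =
                build(fakePages.size(), n -> fakePages.get(n - 1));
        for (Map.Entry<String, SortedSet<Integer>> entry : index.entrySet()) {
            System.out.println(entry.getKey() + " -> pages " + entry.getValue());
        }
    }
}
```

Using a set per word means a word that appears several times on one page is recorded only once for that page.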
Getting Basic Information about the PDF Document (Title, Author, etc.)
To get basic information about the loaded PDF document, use the DocumentInfo object returned by PDFText.getDocumentInfo. This object exposes document properties such as title, author, subject and keywords.
    System.out.println(pdfText.getDocumentInfo().getTitle());
    System.out.println(pdfText.getDocumentInfo().getAuthor());
    System.out.println(pdfText.getDocumentInfo().getKeywords());
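Many PDF files leave some of these properties unset. This guide does not say what the getters return in that case, so the sketch below assumes a value may come back as null or blank and substitutes a placeholder. The class name DocInfoFormatter and the "(not set)" placeholder are hypothetical, and the values are supplied through a plain map so the example runs without jPDFText.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of jPDFText.
class DocInfoFormatter {

    // Formats label/value pairs one per line, replacing null or blank
    // values (an assumed possibility) with a placeholder.
    static String format(Map<String, String> fields) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String value = entry.getValue();
            boolean missing = value == null || value.trim().isEmpty();
            out.append(entry.getKey())
               .append(": ")
               .append(missing ? "(not set)" : value.trim())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In an application these values would come from calls such as
        // pdfText.getDocumentInfo().getTitle(); fixed values are used here.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Title", "Sample Document");
        fields.put("Author", null);
        fields.put("Keywords", "  ");
        System.out.print(format(fields));
    }
}
```

A LinkedHashMap preserves insertion order, so the properties print in the order they were added.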
Distribution and JAR Files
Required and optional jar files for jPDFText can be found on the jPDFText Download page.