How to Extract Text from PDF Files with Google Apps Script

This tutorial explains how you can parse and extract text elements from invoices, expense receipts and other PDF documents with the help of Apps Script.

An external accounting system generates paper receipts for its customers which are then scanned as PDF files and uploaded to a folder in Google Drive. These PDF invoices have to be parsed and specific information, like the invoice number, the invoice date and the buyer’s email address, needs to be extracted and saved into a Google Spreadsheet.

Here’s a sample PDF invoice that we’ll use in this example.

PDF Invoice for Extraction

Our PDF extractor script will read the file from Google Drive and use Google Drive API to convert to a text file. We can then use RegEx to parse this text file and write the extracted information into a Google Sheet.

Let’s get started.

Step 1. Convert PDF to Text

Assuming that the PDF files is already in our Google Drive, we’ll write a little function that will convert the PDF file to text. Please ensure the Advanced Drive API as describes in this tutorial.

/*
 * Convert PDF file to text
 * @param {string} fileId - The Google Drive ID of the PDF
 * @param {string} language - The language of the PDF text to use for OCR
 * return {string} - The extracted text of the PDF file
 */

const convertPDFToText = (fileId, language) => {
  fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
  language = language || 'en'; // English

  // Read the PDF file in Google Drive
  const pdfDocument = DriveApp.getFileById(fileId);

  // Use OCR to convert PDF to a temporary Google Document
  // Restrict the response to include file Id and Title fields only
  const { id, title } = Drive.Files.insert(
    {
      title: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: pdfDocument.getMimeType() || 'application/pdf',
    },
    pdfDocument.getBlob(),
    {
      ocr: true,
      ocrLanguage: language,
      fields: 'id,title',
    }
  );

  // Use the Document API to extract text from the Google Document
  const textContent = DocumentApp.openById(id).getBody().getText();

  // Delete the temporary Google Document since it is no longer needed
  DriveApp.getFileById(id).setTrashed(true);

  // (optional) Save the text content to another text file in Google Drive
  const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');
  return textContent;
};

Step 2: Extract Information from Text

Now that we have the text content of the PDF file, we can use RegEx to extract the information we need. I’ve highlighted the text elements that we need to save in the Google Sheet and the RegEx pattern that will help us extract the required information.

Text Content of PDF

const extractInformationFromPDFText = (textContent) => {
  const pattern = /Invoice\sDate\s(.+?)\sInvoice\sNumber\s(.+?)\s/;
  const matches = textContent.replace(/\n/g, ' ').match(pattern) || [];
  const [, invoiceDate, invoiceNumber] = matches;
  return { invoiceDate, invoiceNumber };
};

You may have to tweak the RegEx pattern based on the unique structure of your PDF file.

Step 3: Save Information to Google Sheet

This is the easiest part. We can use the Google Sheets API to easily write the extracted information into a Google Sheet.

const writeToGoogleSheet = ({ invoiceDate, invoiceNumber }) => {
  const spreadsheetId = '<<Google Spreadsheet ID>>';
  const sheetName = '<<Sheet Name>>';
  const sheet = SpreadsheetApp.openById(spreadsheetId).getSheetByName(sheetName);
  if (sheet.getLastRow() === 0) {
    sheet.appendRow(['Invoice Date', 'Invoice Number']);
  }
  sheet.appendRow([invoiceDate, invoiceNumber]);
  SpreadsheetApp.flush();
};

If you a more complex PDF, you may consider using a commercial API that use Machine Learning to analyze the layout of documents and extract specific information at scale Some popular web services for extracting PDF data include Amazon Textract, Adobe’s Extract API and Google’s own Vision AI.They all offer generous free tiers for small-scale use.

PDF JSON

Amit Agarwal

Amit Agarwal

Google Developer Expert, Google Cloud Champion

Amit Agarwal is a Google Developer Expert in Google Workspace and Google Apps Script. He holds an engineering degree in Computer Science (I.I.T.) and is the first professional blogger in India.

Amit has developed several popular Google add-ons including Mail Merge for Gmail and Document Studio. Read more on Lifehacker and YourStory

0

Awards & Titles

Digital Inspiration has won several awards since it's launch in 2004.

Google Developer Expert

Google Developer Expert

Google awarded us the Google Developer Expert award recogizing our work in Google Workspace.

ProductHunt Golden Kitty

ProductHunt Golden Kitty

Our Gmail tool won the Lifehack of the Year award at ProductHunt Golden Kitty Awards in 2017.

Microsoft MVP Alumni

Microsoft MVP Alumni

Microsoft awarded us the Most Valuable Professional (MVP) title for 5 years in a row.

Google Cloud Champion

Google Cloud Champion

Google awarded us the Champion Innovator title recognizing our technical skill and expertise.

Email Newsletter

Sign up for our email newsletter to stay up to date.

We will never send any spam emails. Promise.