KBA-01124: Support for full text search of PDF files

Question:

Why can't I find a PDF file by using a full text search?

Answer:

The full text search feature is driven by Microsoft Indexing Service. By default, the Indexing Service supports DOC, DOCX, XLS, XLSX, PPT, TXT and HTM files. Spitfire sfATC extends support to PDF files by extracting the text and storing it alongside the PDF file. The text may come from OCR or be extracted directly from the file. Because of this extra processing, there is a delay between when the file is cataloged and when its text is available for indexing. Additionally, Adobe supplies a free download (an iFilter) that adds support for PDF files that contain text.

Download the iFilter installer from Adobe, for example here, or search for the latest version.

Please be aware that an iFilter can only index PDF files that contain text; some PDF files (particularly those created by a copier/scanner) contain only images. If the iFilter does not support GZ compressed file streams, use ICTool to add the PDF extension to the list of file types that are excluded from compression.

After downloading and installing the add-in, reboot the server (or restart Indexing Service) and use Enterprise Manager to rebuild your full text catalogs.

Additional Comments:

See the Adobe readme file included with the download for additional information.  As you might expect, there are limitations such as:

  • PDF iFilter will not extract text from PDF files that are password-protected.  Password-protected PDF files will not appear on an unfiltered files list.
  • PDF iFilter will not extract text from PDF files that have protection against copying.
  • PDF files composed only of images are not supported by the iFilter. The Adobe Reader find text feature does not work for such PDF files either.  sfATC attempts to extract the images from these PDF files and OCR them using Microsoft Office Document Imaging.
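
To see which of the limitations above may apply to a given file, a rough check of the raw PDF bytes can help. The Python sketch below is only an illustration and is not part of sfATC or the Adobe iFilter; the function name classify_pdf and its return labels are hypothetical. It assumes that an /Encrypt entry indicates password or copy protection and that a file with no /Font resources is probably image-only. Both are heuristics: PDF files that store their objects in compressed object streams may hide /Font entries, so a "no-fonts-found" result is a hint, not proof.

```python
import re


def classify_pdf(data: bytes) -> str:
    """Roughly classify raw PDF bytes (heuristic; labels are hypothetical)."""
    # A PDF header should appear near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return "not-pdf"
    # Assumption: an /Encrypt entry means the file is password- or copy-protected.
    if re.search(rb"/Encrypt\b", data):
        return "encrypted"
    # Assumption: text is drawn with fonts, so /Font resources suggest a text layer.
    if re.search(rb"/Font\b", data):
        return "has-fonts"
    # No visible font resources: possibly a scanned, image-only PDF.
    return "no-fonts-found"
```

To use it, read the file in binary mode and pass the bytes, for example classify_pdf(open(path, "rb").read()), where path is the location of the PDF in question.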

See also KBA-01327 for similar support for DWG files.


KBA-01124; Last updated: March 6, 2019 at 15:02 pm;