Using Alternative Filters for Filtering PDF Files

Introduction

Oracle Text licenses filters from Stellent Corporation (known as the INSO filters after a previous vendor, and by the internal name INSO_FILTER) for filtering many file formats. Unfortunately these filters are not available for all the platforms on which Oracle Text runs, and they do occasionally have problems filtering certain PDF files.

This paper presents some alternatives to the Stellent/INSO filters for users who either do not have access to them, or who do not wish to use them (for whatever reason).

Microsoft's Index Server does not support PDF filtering at all, but Adobe Systems have provided a "plug-in" filter for PDF files for use with Index Server. With a small amount of work, this same filter may be used within Oracle Text as well. This software is provided free of charge by Adobe, though it does require a full version of Adobe Acrobat to be installed (not the free Acrobat Reader).

Alternatively, BCL Computers have a commercial product known as Magellan, which is a plug-in for Adobe Acrobat (and hence also requires the full version of Acrobat). A time- and function-limited demo version of Magellan is available for testing purposes, but the full version must be purchased from BCL Computers (http://www.bclcomputers.com).

Finally, there is XPDF. This is free software licensed under the GNU General Public License (though commercial licences are available if preferred). The great advantage of XPDF is that it runs on almost any platform, and does not require Adobe Acrobat to be installed. The disadvantage is that it is rather simplistic about text extraction, and often confuses the ordering of words.

Warnings and Limitations

The techniques described herein are not officially supported by Oracle Corporation. Customers make use of them at their own risk.

The author believes that there are no licensing issues surrounding the software described in this article. However, both downloads require you to agree to a software licence, and it is the customer's responsibility to verify the legal position with respect to these agreements.

Limitations

Apart from XPDF, which runs on almost any platform, the methods described here will only work on Windows systems. They have been tested under Windows 2000, but should work equally well on NT 4.0 or Windows XP.

Whilst fast and accurate, the Adobe IFilter produces simple text output intended for indexing, not display purposes. Some effects of this are:

  • No "within paragraph" searches - all the text is run together and paragraph breaks are not visible.
  • No pretty layout, so not very suitable for display purposes.

BCL's Magellan product produces much better formatted text, which avoids the limitations above. It can also generate web pages complete with images, but unfortunately this is difficult to use within the Oracle Text filter framework.

Adobe IFilter

As mentioned above, the Adobe IFilter is intended for use with Microsoft Index Server. It is therefore written as a DLL (shared library) which implements Microsoft's IFilter COM interface. Since Oracle Text makes use of a more straightforward command-line/file interface for custom filters, we need a wrapper around the DLL. It should be possible to write such a wrapper from scratch in C++, but fortunately Microsoft have provided a debugging utility program which is exactly what we need.

Downloads

There are two files needed for this. The first is PDFFilt.dll, available from Adobe Systems at

http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276

This file contains an installer, which will install PDFFilt.dll into a directory, by default C:\Program Files\Adobe\PDF IFilter 5.0. You must ensure that your PATH variable includes this directory, or else copy the DLL file to somewhere on your path, such as C:\WINNT.

The second file is from Microsoft, and is a tool called filtdump.exe, originally intended as a debugging tool for Index Server. See

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnindex/html/msdn_is-index.asp

There is a download link at the top for 5180.exe, a self-extracting zip archive containing FiltDump.exe and FiltReg.exe. We only need the first of these files. Extract FiltDump.exe and put it somewhere in your PATH; I suggest %ORACLE_HOME%\bin. Note that the archive has a path name of "5180" coded into it already, so WinZip will put the files into a directory with that name under the directory where you extract the archive, and you will need to move FiltDump.exe to its final location afterwards.

Implementing the User Filter

First we must decide whether we need to filter documents of any kind (in which case the IFilter software will be used ONLY for PDF files, and INSO will be used for the rest), or whether we are dealing only with PDF documents. The second situation is simpler.

Filtering only PDF documents

Use the following commands when setting up your filter preference:
                                         
execute ctx_ddl.create_preference ('my_pdf_filter', 'user_filter')
 execute ctx_ddl.set_attribute ('my_pdf_filter', 'command', 'ifilter.bat')

                                      
This tells Oracle Text to use a batch file, ifilter.bat, to filter the documents. The actual filter program requires that the file to be filtered has a .PDF extension, and this is not normally true of the temporary files generated by Oracle Text for filtering. So our batch file must copy the file to a temporary file with a .PDF extension, then invoke the actual filter.

So we must create the file ifilter.bat in %ORACLE_HOME%\bin with the following lines in it:

                                         
copy %1 %1.pdf

FiltDump.exe -b %1.pdf > %2
del %1.pdf

                                      

Testing

Here is the full SQL required to test the filter we have now created. It uses the file datastore, so you MUST set the name of the PDF file you want to test, and the 'path' setting in the set_attribute statement to reflect the location of your file.
                                         
/* drops will give "not found" errors on first run */
drop table xpdf;
execute ctx_ddl.drop_preference ('my_file_datastore')
execute ctx_ddl.drop_preference ('my_pdf_filter')

create table xpdf (pk number primary key, thefile varchar2(256));

/* Change the name below to suit your document */
insert into xpdf values (1, 'book.pdf');

execute ctx_ddl.create_preference ('my_file_datastore', 'file_datastore')

/* Change the path here to suit your system */
execute ctx_ddl.set_attribute ('my_file_datastore', 'path', 'e:\')

execute ctx_ddl.create_preference ('my_pdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_pdf_filter', 'command', 'ifilter.bat')

create index xpdfi on xpdf (thefile) indextype is ctxsys.context
parameters ('datastore my_file_datastore filter my_pdf_filter');

select * from ctx_user_index_errors;

                                      

Debugging

  • Remember to check CTX_USER_INDEX_ERRORS.
  • Select token_text from dr$xpdfi$i to see if anything has been indexed.

Modified ifilter.bat for Debugging

The following modified batch file copies the PDF input file to e:\dbgin.pdf and, if everything worked successfully, the plain text output to e:\dbgout.txt. You will probably need to change e:\ to point to your own temporary directory. This way, you can check that the filter is getting a valid PDF file by attempting to open e:\dbgin.pdf, and you can try running FiltDump.exe from the command line on the same file.
                                         
REM debug version of filter file
copy %1 %1.pdf
copy %1.pdf e:\dbgin.pdf

FiltDump.exe -b %1.pdf > %2
del %1.pdf
copy %2 e:\dbgout.txt

                                      

Filtering any Documents

To be able to mix PDF and other types of documents, we need to be able to detect whether a document is PDF, in order to pass it to the Adobe filter, and if not we need to call the normal CTXHX executable.

To enable this, I have written a small C program which examines the first five bytes of the file to check for the string "%PDF-", with which all PDF files begin. If it finds the string, it returns a value of 0, and if it does not it returns 1 (2 indicates an error condition such as "file not found").

Download TestForPDF.exe and put it somewhere in your PATH, such as %ORACLE_HOME%\bin. (If you're interested, the source code is here.) Use the same setup as above, but this time ifilter.bat should look like this:

                                         
@echo off

TestForPDF.exe %1

if errorlevel 2 goto failed
if errorlevel 1 goto notapdf

REM passed OK - this is a pdf file
copy %1 %1.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
goto end

:failed
REM TestForPDF.exe generated an error
goto end


:notapdf
REM No - that is not a pdf file - use ctxhx
ctxhx.exe %1 %2 %3 %4 %5
goto end

:end

                                      

Magellan

Magellan is a commercial plug-in for Adobe Acrobat. The normal demo version (obtainable from BCL Computers) can only be run from within Acrobat. However, they do have a command-line version, a demo of which can be supplied on request or downloaded from here.

Running Magellan is a little more complicated than the Adobe filter, since it requires an input file containing the many options, including the input and output files.

There follows an example of how this may be achieved.

WARNING: The example uses a single, hard-coded file name for the conversion options. It is therefore NOT suitable for parallel indexing, or even the creation of two or more separate indexes at the same time on the same machine, as both indexing streams will attempt to use the same file. This can no doubt be worked around, but the difficulty of programming Windows command files has led me to avoid attempting this for the example.

The Magellan plug-in must be installed, and you should check that the directory where the executable is held (C:\Program Files\BCL Computers\BCL Magellan 5 on my system) is in your PATH variable, which it is not by default. Alternatively, you can use the full path to the Magellan executable in the batch file mfilter.bat below.

Use the following commands when setting up your filter preference:

                                         
execute ctx_ddl.create_preference ('my_magellan_filter', 'user_filter')

 execute ctx_ddl.set_attribute ('my_magellan_filter', 'command', 'mfilter.bat')

                                      
This tells Oracle Text to use a batch file, mfilter.bat, to filter the documents.

The syntax of a call to Magellan is magellan <inputfile> where the inputfile provides the conversion options (such as whether to generate a table of contents and images) and the name of input and output files.

We must create a wrapper around the executable, which creates this input file. To do this, first copy the file stub.txt into %ORACLE_HOME%\bin. This stub contains the general conversion options, and our batch file will just append the file names to this.

Now we must create the file mfilter.bat in %ORACLE_HOME%\bin with the following lines in it (change the TMPDIR setting to something appropriate for your system):

                                         
REM batch file to filter pdf to html
REM a wrapper round magellan command line version

REM accepts pdf filename as first arg, html file (to create) as second arg

REM set the TMPDIR variable to a temporary directory:

set TMPDIR=c:\temp

copy %1 %TMPDIR%\filterdata.pdf
copy stub.txt %TMPDIR%\maginfile.txt
echo FILE=%TMPDIR%\filterdata.pdf^&%TMPDIR%\filterdata.html >> %TMPDIR%\maginfile.txt
magellan.exe %TMPDIR%\maginfile.txt
copy %TMPDIR%\filterdata.html %2
del %TMPDIR%\filterdata.pdf
del %TMPDIR%\filterdata.html

                                      

Notes about the batch file:

  • In the current demo version of Magellan, there is a bug in the handling of the output filename. The 'base' of the filename - "filterdata" in the example above - will always be the same as the base of the input filename.
  • There is an unresolved issue where Magellan fails on some files, such as pure image files. The output from the previous conversion is then used again, leading to spurious hits in the final application. It would seem that deleting filterdata.html before the conversion should solve the problem, but it didn't seem to do so for me. This is left for the implementor to resolve.

Filtering any Documents

Filtering most documents with INSO, and only PDF files with Magellan, is certainly possible. A batch file should be created which combines the features of the expanded version of ifilter.bat above with the Magellan batch file mfilter.bat.

XPDF

XPDF can be downloaded from foolabs. Pre-compiled binaries are available for many platforms, or the source code can be downloaded and built on any platform supporting a standard C compiler (including GNU C).

Unlike the previous two solutions, XPDF is completely self-contained, and therefore does not require Adobe Acrobat to be installed.

XPDF seems to be successful on many files which cause problems for the INSO filter. For example, it will deal with encrypted files (though - intentionally - not copy protected ones). It is only marginally slower than Adobe's IFilter, which means it is pretty fast.

The disadvantage of XPDF is that it does not deal well with columns of text, or other ways that text is interleaved for display. Consider the following two-column fragment:

                                         
                                           
 This is the text           Column2 should be
  in column one, it          dealt with as a
  should be treated          different sentence
  as a single sentence.      altogether.
                                        
                                      
The output from this in XPDF would typically be
                                         
 This is the text Column2 should be in column one, it dealt with as
  a should be treated different sentence as a single sentence. altogether.

                                      

Implementing XPDF

XPDF makes use of the gzip executable for decompression of data within the PDF file. You must have gzip available in your path. Having WinZip - which can decompress gzip files - is not sufficient.

Gzip can be downloaded in source or executable form from gzip.org.

The relevant part of XPDF for our purposes is called pdftotext.exe. This executable takes its arguments in the same way that Oracle Text passes them to its filters, so in this case there is no need to create a wrapper round the executable. However, pdftotext.exe must be copied into %ORACLE_HOME%\bin.

                                         
 execute ctx_ddl.create_preference ('my_xpdf_filter', 'user_filter')
  execute ctx_ddl.set_attribute ('my_xpdf_filter', 'command', 'pdftotext.exe')

                                      
(Note: on platforms other than Windows, the ".exe" suffix may not be required).

There are various options available for pdftotext, including options for specifying output character sets (see the XPDF documentation for more details). If these are to be used, then a wrapper procedure will be needed:

                                         
 execute ctx_ddl.create_preference ('my_xpdf_filter', 'user_filter')
  execute ctx_ddl.set_attribute ('my_xpdf_filter', 'command', 'xfilter.bat')

                                      
And xfilter.bat contains:
                                         
 pdftotext.exe <options> %1 %2

                                      

Performance

Good performance comparisons are difficult, since the INSO filter fails to properly filter many of my test files. However, the following general conclusions can be made:

  • The IFilter filter is very fast. It is as fast as INSO on small files, and very much more efficient on large files. In a test of 17 miscellaneous PDF files occupying 47MB, it took 17 seconds to create the index.
  • XPDF is slightly slower than the IFilter. On my system it took 23 seconds to index the same test set. I have no information on performance on non-Windows systems, but imagine it will be similar. Its tendency to confuse word order will not be a problem if searches use only AND and OR predicates. However, if they use phrase searches, proximity searches, or WITHIN SENTENCE searches, then it may cause a problem.
  • The INSO filter is fast on small files, but can slow down considerably on large files (larger than 1MB). Also, it has a tendency to get stuck in a loop, which means the index run is effectively stopped until it times out.
  • The Magellan filter is much slower than any of the others for small files, and certainly slower than the IFilter on large files. This is presumably because it is doing far more work related to the layout of the document, rather than generating pure text. On the test set above, it took 90 seconds. If using Magellan, care must be taken to ensure in advance that performance is going to be adequate.

Last modified 23-January-2001 by Roger Ford