Using Alternative Filters for Filtering PDF Files
Oracle Text licenses filters from Stellent Corporation for filtering many file formats. These are often called the INSO filters, after a previous vendor and the internal name INSO_FILTER. Unfortunately, these filters are not available on every platform on which Oracle Text runs, and they occasionally have problems filtering certain PDF files.
This paper presents some alternatives to the Stellent/INSO filters for users who either do not have access to them, or who do not wish to use them (for whatever reason).
Microsoft's Index Server does not support PDF filtering at all, but Adobe Systems have provided a "plug-in" filter for PDF files for use with Index Server. With a small amount of work, this same filter may be used within Oracle Text as well. This software is provided free of charge by Adobe, though it does require a full version of Adobe Acrobat to be installed (not the free Acrobat Reader).
Alternatively, BCL Computers have a commercial product known as Magellan, which is a plug-in for Adobe Acrobat (and hence also requires the full version of Acrobat). A time- and function-limited demo version of Magellan is available for testing purposes, but the full version must be purchased from BCL Computers (http://www.bclcomputers.com).
Finally, there is XPDF. This is free software licensed under the GNU General Public Licence (though commercial licences are available if preferred). The great advantage of XPDF is that it runs on almost any platform and does not require Adobe Acrobat to be installed. The disadvantage is that it is rather simplistic about text extraction, and often confuses the ordering of words.
Warnings and Limitations
The techniques described herein are not officially supported by Oracle Corporation. Customers make use of them at their own risk.
The author believes that there are no licensing issues surrounding the software described in this article. However, both downloads require you to agree to a software licence, and it is the customer's responsibility to verify the legal position with respect to these agreements.
The methods described here will only work on Windows systems. They have been tested under Windows 2000, but should work equally well on NT4.0 or Windows XP.
Whilst fast and accurate, the Adobe ifilter produces simple text output intended for indexing, not display purposes. Some effects of this are:
BCL's Magellan product produces much better formatted text, which avoids the limitations above. It can also generate web pages complete with images, but unfortunately this is difficult to use within the Oracle Text filter framework.
Adobe IFilter
As mentioned above, the Adobe IFilter is intended for use with Microsoft Index Server. It is therefore written as a DLL (shared library) which implements Microsoft's IFilter COM interface. Since Oracle Text uses a more straightforward command-line/file interface for custom filters, we need a wrapper around the DLL. It should be possible to write such a wrapper from scratch in C++, but fortunately Microsoft have provided a debugging utility program which is exactly what we need.
Downloads
There are two files needed for this. The first is
This file contains an installer, which will install
The second file is from Microsoft, and is a tool called filtdump.exe, originally intended as a debugging tool for Index Server. See
There is a download link at the top for
Implementing the User Filter
First we must decide whether we need to filter documents of any kind (in which case the IFilter software will be used ONLY for PDF files, and INSO will be used for the rest), or whether we are dealing only with PDF documents. The second situation is simpler.
Filtering only PDF documents
Use the following commands when setting up your filter preference:
execute ctx_ddl.create_preference ('my_pdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_pdf_filter', 'command', 'ifilter.bat')
This tells Oracle Text to use a batch file,
So we must create the file
copy %1 %1.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
Testing
Here is the full SQL required to test the filter we have now created. It uses the file datastore, so you MUST set the name of the PDF file you want to test, and the
/* drops will give "not found" errors on first run */
drop table xpdf;
execute ctx_ddl.drop_preference ('my_file_datastore')
execute ctx_ddl.drop_preference ('my_pdf_filter')
create table xpdf (pk number primary key, thefile varchar2(256));
/* Change the name below to suit your document */
insert into xpdf values (1, 'book.pdf');
execute ctx_ddl.create_preference ('my_file_datastore', 'file_datastore')
/* Change the path here to suit your system */
execute ctx_ddl.set_attribute ('my_file_datastore', 'path', 'e:\')
execute ctx_ddl.create_preference ('my_pdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_pdf_filter', 'command', 'ifilter.bat')
create index xpdfi on xpdf (thefile)
  indextype is ctxsys.context
  parameters ('datastore my_file_datastore filter my_pdf_filter');
select * from ctx_user_index_errors;
Modified ifilter.bat for Debugging
The following modified bat file copies the PDF input file to
REM debug version of filter file
copy %1 %1.pdf
copy %1.pdf e:\dbgin.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
copy %2 e:\dbgout.txt
Filtering any Documents
To mix PDF and other types of documents, we need to detect whether a document is a PDF. If it is, we pass it to the Adobe filter; if not, we call the normal CTXHX executable.
To enable this, I have written a small C program which examines the first five bytes of the file to check for the string "
testForPDF.exe and put it somewhere in your
@echo off
TestForPDF.exe %1
if errorlevel 2 goto failed
if errorlevel 1 goto notapdf
REM passed OK - this is a pdf file
copy %1 %1.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
goto end
:failed
REM TestForPDF.exe generated an error
goto end
:notapdf
REM No - that is not a pdf file - use ctxhx
ctxhx.exe %1 %2 %3 %4 %5
goto end
:end
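The header check itself is simple enough to sketch. The following is an illustration in Python, not the author's C program, which is not reproduced here. It assumes the string being tested for is "%PDF-", the five-byte marker that begins a PDF file, and it assumes the exit codes implied by the errorlevel tests in the batch file: 0 for a PDF, 1 for any other file, 2 for an error. The names classify_file and main are hypothetical.

```python
# Assumption: the five bytes checked are the standard PDF header marker.
PDF_MAGIC = b"%PDF-"

# Assumed exit codes, inferred from the errorlevel tests in the batch file.
IS_PDF, NOT_PDF, FAILED = 0, 1, 2


def classify_file(path):
    """Return IS_PDF, NOT_PDF or FAILED for the file at `path`."""
    try:
        with open(path, "rb") as f:
            header = f.read(len(PDF_MAGIC))
    except OSError:
        # Missing or unreadable file: report an error, not "not a PDF".
        return FAILED
    return IS_PDF if header == PDF_MAGIC else NOT_PDF


def main(argv):
    """Command-line entry point: expects exactly one file name."""
    if len(argv) != 2:
        return FAILED
    return classify_file(argv[1])
```

To run this as a command, the script would end by passing the result of main(sys.argv) to sys.exit, so that the batch file sees the value as its errorlevel.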
Magellan is a commercial plug-in for Adobe Acrobat. The normal demo version (obtainable from BCL Computers) can only be run from within Acrobat. However, they do have a command-line version, a demo of which can be supplied on request or downloaded.
Running Magellan is a little more complicated than the Adobe filter, since it requires an input file containing the many options, including the input and output files.
There follows an example of how this may be achieved.
WARNING: The example uses a single, hard-coded file name for the conversion options. It is therefore NOT suitable for parallel indexing, or even for creating two or more separate indexes at the same time on the same machine, as both indexing streams will attempt to use the same file. This can no doubt be worked around, but the difficulty of programming Windows command files has led me to avoid attempting this for the example.
The Magellan plug-in must be installed, and you should check that the directory where the executable is held (
Use the following commands when setting up your filter preference:
execute ctx_ddl.create_preference ('my_magellan_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_magellan_filter', 'command', 'mfilter.bat')
This tells Oracle Text to use a batch file,
The syntax of a call to Magellan is
We must create a wrapper around the executable, which creates this input file. To do this, first copy the file
REM batch file to filter pdf to html
REM a wrapper round magellan command line version
REM accepts pdf filename as first arg, html file (to create) as second arg
REM set the TMPDIR variable to a temporary directory:
set TMPDIR=c:\temp
copy %1 %TMPDIR%\filterdata.pdf
copy stub.txt %TMPDIR%\maginfile.txt
echo FILE=%TMPDIR%\filterdata.pdf^&%TMPDIR%\filterdata.html >> %TMPDIR%\maginfile.txt
magellan.exe %TMPDIR%\maginfile.txt
copy %TMPDIR%\filterdata.html %2
del %TMPDIR%\filterdata.pdf
del %TMPDIR%\filterdata.html
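One possible way round the hard-coded file name mentioned in the warning is to give each conversion its own working directory. The sketch below is an untested illustration in Python, not part of the original example. It assumes Python is available on the indexing machine, that stub.txt and the FILE= line work exactly as in the batch file above, and that Magellan itself tolerates concurrent runs, which the source does not confirm. The names build_options and filter_pdf, and the run parameter, are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile


def build_options(stub_text, pdf_path, html_path):
    """Return the options-file text: the stub followed by the FILE= line."""
    if stub_text and not stub_text.endswith("\n"):
        stub_text += "\n"
    return stub_text + "FILE=%s&%s\n" % (pdf_path, html_path)


def filter_pdf(src, dest, stub_path="stub.txt", run=subprocess.call):
    """Convert src (PDF) to dest (HTML) in a directory private to this call."""
    # mkdtemp creates a fresh, uniquely named directory each time, so two
    # indexing streams never share an options file.
    workdir = tempfile.mkdtemp(prefix="magfilter_")
    try:
        pdf = os.path.join(workdir, "filterdata.pdf")
        html = os.path.join(workdir, "filterdata.html")
        opts = os.path.join(workdir, "maginfile.txt")
        shutil.copyfile(src, pdf)
        with open(stub_path) as f:
            stub_text = f.read()
        with open(opts, "w") as f:
            f.write(build_options(stub_text, pdf, html))
        # Same invocation as the batch file: magellan.exe <options file>
        status = run(["magellan.exe", opts])
        if os.path.exists(html):
            shutil.copyfile(html, dest)
        return status
    finally:
        # Remove the private directory whether or not the conversion worked.
        shutil.rmtree(workdir, ignore_errors=True)
```

The run parameter exists only so that the external program can be replaced when trying the wrapper out without Magellan installed. Oracle Text would still need a small batch file as the filter command, passing its two file arguments through to the script.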
Notes about the batch file:
Filtering any Documents
Filtering most documents with INSO, and PDF only with Magellan, is certainly possible. A batch file should be created which combines the features of the expanded version of
XPDF can be downloaded from foolabs. Pre-compiled binaries are available for many platforms, or the source code can be downloaded and built on any platform supporting a standard C compiler (including GNU C).
Unlike the previous two solutions, XPDF is completely self-contained, and therefore does not require Adobe Acrobat to be installed.
XPDF seems to be successful on many files which cause problems for the INSO filter. For example, it will deal with encrypted files (though, intentionally, not copy-protected ones). It is only marginally slower than Adobe's IFilter, which means it is pretty fast.
The disadvantage of XPDF is that it does not deal well with columns of text, or other ways that text is interleaved for display. Consider the following two-column fragment:
This is the text        Column2 should be
in column one, it       dealt with as a
should be treated       different sentence
as a single sentence.   altogether.
The output from this in XPDF would typically be
This is the text Column2 should be in column one, it dealt with as a should be treated different sentence as a single sentence. altogether.
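The effect can be reproduced with a toy model. The sketch below is an illustration in Python and has nothing to do with how XPDF works internally. The two columns are inferred from the output above by assuming a four-line layout, and the function names are hypothetical. Reading across each printed line gives the interleaved text, while reading down each column gives the intended sentences. The last two functions show the consequence for searching: a query that only needs every word to be present still matches, but a phrase query does not.

```python
# Assumed layout: two columns of four lines each, inferred from the output.
COLUMN_ONE = ["This is the text", "in column one, it",
              "should be treated", "as a single sentence."]
COLUMN_TWO = ["Column2 should be", "dealt with as a",
              "different sentence", "altogether."]


def row_order(left, right):
    """Read across each printed line: left cell, then right cell."""
    return " ".join(l + " " + r for l, r in zip(left, right))


def column_order(left, right):
    """Read all of the left column, then all of the right column."""
    return " ".join(left + right)


def words(text):
    """Split into lower-case words, dropping trailing punctuation."""
    return [w.strip(".,").lower() for w in text.split()]


def has_all_words(text, query):
    """AND-style match: every query word occurs somewhere in the text."""
    present = set(words(text))
    return all(w in present for w in words(query))


def has_phrase(text, query):
    """Phrase match: the query words occur consecutively and in order."""
    t, q = words(text), words(query)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))
```

With these definitions, row_order(COLUMN_ONE, COLUMN_TWO) returns the interleaved text shown above. For the query "text in column one", has_all_words is true for both orderings, but has_phrase is true only for the column_order text.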
Implementing XPDF
XPDF makes use of the
Gzip can be downloaded in source or executable form from gzip.org.
The relevant part of XPDF for our purposes is called
execute ctx_ddl.create_preference ('my_xpdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_xpdf_filter', 'command', 'pdftotext.exe')
(Note: on platforms other than Windows, the ".exe" suffix may not be required).
There are various options available for
execute ctx_ddl.create_preference ('my_xpdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_xpdf_filter', 'command', 'xfilter.bat')
And
pdftotext.exe <options> %1 %2
Performance
Good performance comparisons are difficult, since the INSO filter fails to properly filter many of my test files. However, the following general conclusions can be made:
If searches use only AND and OR predicates, this will not be a problem. However, if they use phrase searches, proximity searches, or WITHIN SENTENCE searches, then it may cause a problem.