Using Alternative Filters for Filtering PDF Files
Introduction
Oracle Text licenses filters from Stellent Corporation for filtering
many file formats. (These are still often called the INSO filters, after
a previous vendor, and are known internally as INSO_FILTER.) Unfortunately
these filters are not available for all the platforms on which Oracle Text
runs, and they do occasionally have problems filtering certain PDF files.
This paper presents some alternatives to the Stellent/INSO filters
for users who either do not have access to them or do not wish to
use them (for whatever reason).
Microsoft's Index Server does not support PDF filtering at all, but
Adobe Systems have provided a "plug-in" filter for PDF files for use
with Index Server. With a small amount of work, this same filter may
be used within Oracle Text as well. This software is provided free of
charge by Adobe, though it does require a full version of Adobe Acrobat
to be installed (not the free Acrobat Reader).
Alternatively, BCL Computers have a commercial product known as
Magellan, which is a plug-in for Adobe Acrobat (and hence also requires
the full version of Acrobat). A time- and function-limited demo version
of Magellan is available for testing purposes, but the full version must be
purchased from BCL Computers (http://www.bclcomputers.com).
Finally, there is XPDF. This is free software licensed under the
GNU General Public Licence (though commercial licences are available if
preferred). The great advantage of XPDF is that it runs on almost any
platform, and does not require Adobe Acrobat to be installed. The
disadvantage is that it is rather simplistic about text extraction, and
often confuses the ordering of words.
Warnings and Limitations
The techniques described herein are not officially supported by Oracle
Corporation. Customers make use of them at their own risk.
The author believes that there are no licencing issues surrounding
the software described in this article. However, both downloads require
you to agree to a software licence, and it is the customer's
responsibility to verify the legal position with respect to these
agreements.
Limitations
The Adobe IFilter and Magellan methods described here work only on
Windows systems. They have been tested under Windows 2000, but should work
equally well on NT4.0 or Windows XP.
Whilst fast and accurate, the Adobe IFilter produces simple text output
intended for indexing, not display purposes. Some effects of this are:
- No "within paragraph" searches - all the text is run together
and paragraph breaks are not visible.
- No pretty layout, so not very suitable for display purposes.
BCL's Magellan product produces much better formatted text, which avoids
the limitations above. It can also generate web pages complete with images,
but unfortunately this is difficult to use within the Oracle Text filter
framework.
Adobe IFilter
As mentioned above, the Adobe IFilter is intended for use with
Microsoft Index Server. It is therefore written as a DLL (shared
library) which implements Microsoft's IFilter COM interface. Since
Oracle Text makes use of a more straightforward command-line/file
interface for custom filters, we need to utilize a wrapper around the
DLL. It should be possible to write such a wrapper from scratch in
C++, but fortunately Microsoft have provided a debugging utility program
which is exactly what we need.
Downloads
There are two files needed for this. The first is PDFFilt.dll, available
from Adobe Corporation at
http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276
This file contains an installer, which will install PDFFilt.dll into
a directory, by default C:\Program Files\Adobe\PDF IFilter 5.0
You must ensure that your PATH variable includes this directory, or else
copy the DLL file to somewhere on your path, such as C:\WINNT.
The second file is from Microsoft, and is a tool called filtdump.exe,
originally intended as a debugging tool for Index Server. See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnindex/html/msdn_is-index.asp
There is a download link at the top for 5180.exe, a self-extracting
zip archive containing FiltDump.exe and FiltReg.exe. We only need the
first of these files. Extract FiltDump.exe and put it somewhere in your PATH
(I suggest %ORACLE_HOME%\bin). Note that the archive has a path name of
"5180" coded into it, so WinZip will extract the files into a directory of
that name beneath the directory you extract to. You will need to move
FiltDump.exe out of that directory afterwards.
Implementing the User Filter
First we must decide whether we need to filter documents of any kind
(in which case the IFilter software will be used ONLY for PDF files,
and INSO will be used for the rest), or whether we are dealing only
with PDF documents. The second situation is simpler.
Filtering only PDF documents
Use the following commands when setting up your filter preference:
execute ctx_ddl.create_preference ('my_pdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_pdf_filter', 'command', 'ifilter.bat')
This tells Oracle Text to use a batch file, ifilter.bat, to filter
the documents. The actual filter program requires that the file to be filtered
has a .PDF extension, and this is not normally true of the temporary
files generated by Oracle Text for filtering. So our batch file must copy the file
to a temporary file with a .PDF extension, invoke the actual filter on the
copy, and then delete the copy.
So we must create the file ifilter.bat in %ORACLE_HOME%\bin with
the following lines in it:
copy %1 %1.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
Testing
Here is the full SQL required to test the filter we have now created.
It uses the file datastore, so you MUST set the name of the PDF
file you want to test, and the 'path'
setting in the set_attribute statement to reflect the location
of your file.
/* drops will give "not found" errors on first run */
drop table xpdf;
execute ctx_ddl.drop_preference ('my_file_datastore')
execute ctx_ddl.drop_preference ('my_pdf_filter')
create table xpdf (pk number primary key, thefile varchar2(256));
/* Change the name below to suit your document */
insert into xpdf values (1, 'book.pdf');
execute ctx_ddl.create_preference ('my_file_datastore', 'file_datastore')
/* Change the path here to suit your system */
execute ctx_ddl.set_attribute ('my_file_datastore', 'path', 'e:\')
execute ctx_ddl.create_preference ('my_pdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_pdf_filter', 'command', 'ifilter.bat')
create index xpdfi on xpdf (thefile) indextype is ctxsys.context
parameters ('datastore my_file_datastore filter my_pdf_filter');
select * from ctx_user_index_errors;
Debugging
- Remember to check CTX_USER_INDEX_ERRORS.
- Run "select token_text from dr$xpdfi$i" to see if anything has been indexed.
Modified ifilter.bat for Debugging
The following modified batch file copies the PDF input file to
e:\dbgin.pdf and, if everything worked successfully, the plain text output
to e:\dbgout.txt. You will probably need to change E:\
to point to your own temporary directory. This way, you can check that the
filter is getting a valid PDF file by attempting to open e:\dbgin.pdf,
and you can try running FiltDump.exe from the command line on
the same file.
REM debug version of filter file
copy %1 %1.pdf
copy %1.pdf e:\dbgin.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
copy %2 e:\dbgout.txt
Filtering any Documents
To be able to mix PDF and other types of documents, we need to be able to detect whether
a document is PDF, in order to pass it to the Adobe filter, and if not we need to
call the normal CTXHX executable.
To enable this, I have written a small C program which examines the first five bytes
of the file to check for the string "%PDF-", with which all PDF files begin. If it finds
the string, it returns a value of 0, and if it does not it returns 1 (2 indicates an error
condition such as "file not found").
Download TestForPDF.exe
and put it somewhere in your PATH, such as %ORACLE_HOME%\bin.
(The source code is also available for download, if you're interested.)
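The author's source is not reproduced in this article. As a rough guide, a check of this kind might look like the following minimal sketch in C. The function name test_for_pdf is hypothetical; the return values follow the convention described above (0 for a PDF, 1 for anything else, 2 for an error such as a missing file).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: this is not the author's TestForPDF source, only a
 * sketch of the same idea. Returns 0 if the file starts with "%PDF-",
 * 1 if it does not, and 2 if the file cannot be opened or read. */
int test_for_pdf(const char *path)
{
    static const char magic[] = "%PDF-";
    const size_t magic_len = sizeof magic - 1;   /* 5 bytes, without the NUL */
    char header[8];
    size_t got;
    FILE *fp;

    if (path == NULL)
        return 2;

    fp = fopen(path, "rb");   /* binary mode, so no newline translation */
    if (fp == NULL)
        return 2;

    got = fread(header, 1, magic_len, fp);
    if (got < magic_len && ferror(fp)) {
        fclose(fp);
        return 2;             /* read error rather than a short file */
    }
    fclose(fp);

    /* A file shorter than five bytes cannot carry the PDF signature. */
    if (got < magic_len)
        return 1;

    return memcmp(header, magic, magic_len) == 0 ? 0 : 1;
}
```

A command-line tool would wrap this in a main() that returns test_for_pdf(argv[1]) as its exit status, since the exit status is what the "if errorlevel" tests in the batch file inspect.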
Use the same setup as above, but this time the ifilter.bat should look like this:
@echo off
TestForPDF.exe %1
if errorlevel 2 goto failed
if errorlevel 1 goto notapdf
REM passed OK - this is a pdf file
copy %1 %1.pdf
FiltDump.exe -b %1.pdf > %2
del %1.pdf
goto end
:failed
REM TestForPDF.exe generated an error
goto end
:notapdf
REM No - that is not a pdf file - use ctxhx
ctxhx.exe %1 %2 %3 %4 %5
goto end
:end
Magellan
Magellan is a commercial plug-in for Adobe Acrobat. The normal demo
version (obtainable from BCL Computers)
can only be run from within Acrobat. However, they do have a command line
version, a demo of which can be supplied on request or downloaded.
Running Magellan is a little more complicated than the Adobe filter, since
it requires an input file containing the many options, including the input
and output files.
There follows an example of how this may be achieved.
WARNING: The example uses a single, hard-coded
file name for the conversion options. It is therefore NOT suitable for parallel
indexing, or even for creating two or more separate indexes at the same
time on the same machine, as both indexing streams will attempt to use the
same file. This can no doubt be worked around, but the difficulty of
programming Windows command files has led me to avoid attempting this for
the example.
The Magellan plug-in must be installed, and you should check that
the directory where the executable is held (C:\Program Files\BCL
Computers\BCL Magellan 5 on my system) is in your
PATH variable - which it is not by default (alternatively
you can use the full path to the Magellan executable in the batch file
mfilter.bat below).
Use the following commands when setting up your filter preference:
execute ctx_ddl.create_preference ('my_magellan_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_magellan_filter', 'command', 'mfilter.bat')
This tells Oracle Text to use a batch file, mfilter.bat, to filter
the documents.
The syntax of a call to Magellan is magellan <inputfile>
where the inputfile provides the conversion options (such as whether to generate
a table of contents and images) and the name of input and output files.
We must create a wrapper around the executable, which creates this input file.
To do this, first copy the file stub.txt into
%ORACLE_HOME%\bin. This stub contains the general conversion options,
and our batch file will just append the file names to this.
Now we must create the file mfilter.bat in %ORACLE_HOME%\bin with
the following lines in it (change the TMPDIR setting to something appropriate for your
system):
REM batch file to filter pdf to html
REM a wrapper round magellan command line version
REM accepts pdf filename as first arg, html file (to create) as second arg
REM set the TMPDIR variable to a temporary directory:
set TMPDIR=c:\temp
copy %1 %TMPDIR%\filterdata.pdf
copy stub.txt %TMPDIR%\maginfile.txt
echo FILE=%TMPDIR%\filterdata.pdf^&%TMPDIR%\filterdata.html >> %TMPDIR%\maginfile.txt
magellan.exe %TMPDIR%\maginfile.txt
copy %TMPDIR%\filterdata.html %2
del %TMPDIR%\filterdata.pdf
del %TMPDIR%\filterdata.html
Notes about the batch file:
- In the current demo version of Magellan, there is a bug in the handling of the
output filename. The 'base' of the filename - "filterdata" in the example above -
will always be the same as the base of the input filename.
- There is an unresolved issue where Magellan fails on some files, such as pure
image files. The output from the previous conversion is then used again, leading
to spurious hits in the final application. It would seem that deleting
filterdata.html before the conversion should solve the problem,
but it did not do so for me. This is left for the implementor to resolve.
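One way round the hard-coded options file mentioned in the warning above might be to build the options file in a small program rather than with copy and echo, so that the caller can choose a different file name for each indexing stream. The following C sketch is an illustration under stated assumptions, not part of Magellan or Oracle Text: the function name write_magellan_options and its parameters are hypothetical, and the only format detail taken from this article is the FILE=<input>&<output> line that mfilter.bat appends to the stub.

```c
#include <stdio.h>

/* Hypothetical helper: copies the stub options file to options_path and
 * appends the line "FILE=<pdf_path>&<html_path>", mirroring what the copy
 * and echo commands in mfilter.bat do.
 * Returns 0 on success and 1 on any I/O failure. */
int write_magellan_options(const char *stub_path, const char *options_path,
                           const char *pdf_path, const char *html_path)
{
    FILE *in, *out;
    int c;
    int last = '\n';
    int failed;

    /* Text mode, so that line endings follow the platform convention. */
    in = fopen(stub_path, "r");
    if (in == NULL)
        return 1;
    out = fopen(options_path, "w");
    if (out == NULL) {
        fclose(in);
        return 1;
    }

    /* Copy the general conversion options unchanged. */
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        last = c;
    }

    /* Make sure the FILE= line starts on a line of its own. */
    if (last != '\n')
        putc('\n', out);
    fprintf(out, "FILE=%s&%s\n", pdf_path, html_path);

    failed = ferror(in) || ferror(out);
    fclose(in);
    if (fclose(out) != 0)
        failed = 1;
    return failed ? 1 : 0;
}
```

The caller would still have to pick unique names for the options file and for the temporary PDF and HTML files (for example, names derived from the temporary file name that Oracle Text passes as the first argument), and then run magellan.exe on the options file it has just written.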
Filtering any Documents
Filtering most documents with INSO, and PDF only with Magellan, is certainly possible.
A batch file should be created which combines the features of the expanded version
of ifilter.bat above, with the Magellan
batch file mfilter.bat.
XPDF
XPDF can be downloaded from foolabs. Pre-compiled binaries are available for many platforms,
or the source code can be downloaded and built on any platform supporting
a standard C compiler (including GNU C).
Unlike the previous two solutions, XPDF is completely self-contained, and
therefore does not require Adobe Acrobat to be installed.
XPDF seems to be successful on many files which cause problems for the
INSO filter. For example, it will deal with encrypted files (though -
intentionally - not copy protected ones). It is only marginally slower
than Adobe's IFilter, which means it is pretty fast.
The disadvantage of XPDF is that it does not deal well with columns of
text, or with other ways in which text is interleaved for display.
Consider the following two-column fragment:
This is the text         Column2 should be
in column one, it        dealt with as a
should be treated        different sentence
as a single sentence.    altogether.
The output from this in XPDF would typically be
This is the text Column2 should be in column one, it dealt with as
a should be treated different sentence as a single sentence. altogether.
Implementing XPDF
XPDF makes use of the gzip executable for decompression
of data within the PDF file. You must have gzip available
in your path. Having WinZip - which can decompress gzip files -
is not sufficient.
Gzip can be downloaded in source or executable form from
gzip.org.
The relevant part of XPDF for our purposes is called
pdftotext.exe. This executable takes its arguments in the same
way that Oracle Text supplies them to its filters (input file, then
output file), so in this case there is no need to create a wrapper
round the executable. However, pdftotext.exe must be copied into
%ORACLE_HOME%\bin. Then set up the filter preference:
execute ctx_ddl.create_preference ('my_xpdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_xpdf_filter', 'command', 'pdftotext.exe')
(Note: on platforms other than Windows, the ".exe" suffix may not be
required).
There are various options available for pdftotext,
including options for specifying output character sets (see the XPDF
documentation for more details). If these are to be used, then a wrapper
batch file will be needed:
execute ctx_ddl.create_preference ('my_xpdf_filter', 'user_filter')
execute ctx_ddl.set_attribute ('my_xpdf_filter', 'command', 'xfilter.bat')
And xfilter.bat contains:
pdftotext.exe <options> %1 %2
Performance
Good performance comparisons are difficult, since the INSO filter fails to
properly filter many of my test files. However, the following general conclusions
can be made:
- Where a filter confuses the order of words, as XPDF can, searches that
use only AND and OR predicates will not be affected. However, phrase
searches, proximity searches, and WITHIN SENTENCE searches may give
wrong results.
- The Adobe IFilter is very fast. It is as fast as INSO on small files,
and very much more efficient on large files. In a test of 17
miscellaneous PDF files occupying 47MB, it took 17 seconds to create
the index.
- XPDF is slightly slower than the IFilter. On my system it took 23
seconds to index the same test set. I have no information on
performance on non-Windows systems, but imagine it will be similar.
- The INSO filter is fast on small files, but can slow down considerably
on large files (larger than 1MB). Also, it has a tendency to get stuck
in a loop, which means the index run is effectively stopped until it
times out.
- The Magellan filter is much slower than any of the others for small
files, and certainly slower than the IFilter on large files. This is
presumably because it is doing far more work related to the layout of
the document, rather than generating pure text. On the same test set,
it took 90 seconds. If using Magellan, care must be taken to ensure in
advance that performance is going to be adequate.
Last modified 23-January-2001 by Roger Ford