Outside In Clean Content

Outside In Clean Content addresses particularly challenging issues in native file processing. Focusing specifically on widely used formats (Microsoft Office and PDF), its extended extraction provides all text, properties, hidden information and system data embedded in native files. This extraction can also analyze and process malformed documents, which is critical to accurate text extraction from PDFs. In addition, Clean Content can programmatically modify native files, enabling features such as scrubbing, property modification and document assembly. Outside In Clean Content is a pure Java technology that offers Java, C/C++ and .NET APIs.

  • Extracts text, metadata and hidden information from Microsoft Office (Word, Excel and PowerPoint, versions 97-2010) and PDF documents
  • Identifies, reports and optionally removes or modifies more than 40 metadata and hidden data elements
  • Bursts and reassembles slides from multiple PowerPoint presentations
  • Provides accurate text offset information to automate native search hit-highlighting of PDFs in Adobe Reader
  • Architected for the high document throughput required by the most performance-sensitive environments
  • Easy integration via a Java API for Java or any Java-compatible environment such as JSP and J2EE, or via C/C++ or .NET APIs for integration with traditional languages
  • No Microsoft Office dependency, which eliminates the reliability, scalability and platform dependency issues that arise when automating Office applications to process files in high volumes
  • Available on Windows with Java, C/C++ and .NET interfaces, on Linux x86 with Java and C/C++ interfaces, and on Solaris SPARC with a Java interface. Supported on any JVM compliant with Java 1.5 or above

