Introduction to Textract

For IT managers, programmers, and Computer Science students

There is a world of opportunity out there -- organizations that need help in accessing their data, and perhaps in sharing their data with others. Here are tools you can use to help them (and to make money doing it).

Over one hundred of Marpex Inc.'s analysis, filtering, text extraction, and formatting programs, plus innumerable functions and utilities, are offered to the open source community. This collection of source code will be progressively documented and loaded into the "textract" project at SourceForge over the months ahead. The entire set is available effective December 1, 2008 as examples of search code technology applied to source code. The GNU General Public License encourages you to further develop and to share any of these programs and functions.

Intro to the source code

These programs and functions were prepared in Microsoft Visual Studio C++. Those who know and love C++ will quickly recognize that the Marpex code is closer to glorified C. It is not for a moment represented as true unadulterated C++. But, by and large, the code works and works efficiently. Multithreading is not used; that's a negative. The Microsoft .NET framework has been totally avoided; that's a strong plus. (Need I say why?)

You will find programs and functions to detect patterns in document files, to extract text from files, and to format and tag text. All this was written for one purpose: to gather text in a format suitable for indexing. My interest is in search; I created the FindIt search engine in 1984-85, and have been working for the last several years on another engine based on a "Method and System for Compression Indexing and Efficient Proximity Search of Text Data". I submitted the patent application in March 2004, and U.S. Patent 7,433,893 was issued on October 7, 2008.

Any search engine can serve people better if it is supported by an extensive and widely available set of preparation tools. Therefore I have released many of my preparation programs and utility functions to the open source community. The amount of text data in this world is growing exponentially. That data needs to be set up for indexing and search. If you sense an opportunity for entrepreneurship here, that's good! If you like what you see at http://www.WordsCloseTogether.com, and want to make that your search engine of choice, so much the better!

Analyzing document types

If you are after hacking tools, there is nothing here to help in reverse engineering programs.

If you want tools to analyze files that are produced by software programs, the source code here may be helpful.
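
As an illustration only, here is a minimal sketch of one common first step in analyzing a file: checking its leading bytes against well-known public signatures. It is not taken from the Marpex sources; the function name guessDocumentType and the labels it returns are hypothetical, and the signature table is deliberately tiny.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper, not part of the textract project.
// Guesses a document family from the first bytes of a file's content.
std::string guessDocumentType(const unsigned char* data, std::size_t len) {
    struct Signature {
        const char* bytes;   // leading bytes to look for
        std::size_t size;    // how many bytes to compare
        const char* label;   // illustrative label returned to the caller
    };
    static const Signature known[] = {
        { "%PDF-", 5, "pdf" },                              // PDF header
        { "{\\rtf", 5, "rtf" },                             // Rich Text Format
        { "PK\x03\x04", 4, "zip" },                         // ZIP container, used by .docx
        { "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8, "ole2" }   // OLE2 compound file, used by legacy .doc
    };
    const std::size_t count = sizeof known / sizeof known[0];
    for (std::size_t i = 0; i < count; ++i) {
        // Only compare when the buffer is long enough to hold the signature.
        if (len >= known[i].size &&
            std::memcmp(data, known[i].bytes, known[i].size) == 0) {
            return known[i].label;
        }
    }
    return "unknown";
}
```

A caller would read the first eight or so bytes of a file into a buffer and pass them in. A signature match only identifies the container; telling a .docx apart from any other ZIP archive, for example, requires looking inside the file.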

In case the distinction escapes some firm's legal department: Software programs are typically owned by the firms that produce them; the program is licensed and ownership does not pass. However, the user owns the documents that he or she creates with the licensed program. It would be an interesting world indeed if, for example, Microsoft were to claim ownership of every .doc and .docx file ever created using Microsoft Word. One has visions of judges and lawyers being ordered to surrender their documents to Microsoft!

Porting to Unix and other environments

Many of the programs here are console routines. Yes, a DOS-style command prompt is still on the great majority of personal computers; in Windows, click Run, type in "cmd.exe", and you are there. DOS is not a regime to be imposed on the unGeek world, but it does make for quick programming and even quicker ports to other environments.

For the most part, porting to Unix involves reverting from Microsoft-specific functions to their standard or POSIX counterparts: the buffer-protecting sprintf_s and fopen_s go back to their not-so-secure predecessors, and the case-insensitive _strnicmp becomes strncasecmp. A few minutes of editing is often all that is needed.
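
One way to keep that editing in a single place is a small compatibility header. The sketch below is an assumption about how such a port might be organized, not a description of the Marpex code; the port_ names are hypothetical. It uses snprintf-style bounded formatting on every platform rather than falling back to plain sprintf, which is a deliberate substitution.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <cstring>

#ifdef _MSC_VER
  // Microsoft toolchain: keep the Microsoft-specific name.
  #define port_strnicmp _strnicmp
#else
  #include <strings.h>              // POSIX home of strncasecmp
  #define port_strnicmp strncasecmp
#endif

// Hypothetical wrapper: opens a file and returns nullptr on failure
// on every platform, hiding the fopen_s / fopen difference.
inline std::FILE* port_fopen(const char* path, const char* mode) {
#ifdef _MSC_VER
    std::FILE* fp = nullptr;
    return fopen_s(&fp, path, mode) == 0 ? fp : nullptr;
#else
    return std::fopen(path, mode);
#endif
}

// Hypothetical wrapper: bounded formatted print built on vsnprintf.
// Returns the number of characters the full output needs (excluding
// the terminator); the buffer is always terminated when size > 0.
inline int port_sprintf(char* buf, std::size_t size, const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    const int n = std::vsnprintf(buf, size, fmt, args);
    va_end(args);
    return n;
}
```

With a header like this, a call such as sprintf_s(buf, sizeof buf, "%s-%d", name, n) becomes port_sprintf(buf, sizeof buf, "%s-%d", name, n), and the same source builds under both Visual Studio and a Unix compiler. Note that sprintf_s treats an overlong result as an error, while vsnprintf silently truncates, so callers that care should compare the return value against the buffer size.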

A few of the programs here have a CHtmlView interface, and ride on functions from Internet Explorer. I passed samples of these to a local high school senior and he was able to port an extensive program to Unix in well under a day. (Mike Roberts, stand up and take a bow.)

Thanks for your interest.

Douglas Lowry, Ph.D.
President
Marpex Inc.
Steubenville, Ohio 43952-1438