Lightning Searching for Text Data -- Redmondmag.com

Product Reviews

Lightning Searching for Text Data

Search your own system or use the dtSearch engine in your products.

By Mike Gunderloy
12/05/2001

dtSearch makes a fast search engine for text data, and the product has been around for over a decade now. I took a look at the Desktop member of the product line, which also includes a network version, a version to add searching to your web site, and a developer version of the search engine (more on that later). To use dtSearch, you choose the data that you're interested in and turn the program loose to build its own index. Indexing about 1.5 gigabytes of stuff, including lots of files, two huge Outlook stores, and a couple of web sites, took almost exactly ten hours on a fast machine. When the software finished building the index, it was nearly the size of the source data.

The payoff, of course, is that finding things with the index is dramatically faster than finding them without it. A search on "guinea fowl" on my desktop, for example, takes less than a second to pull out the 48 documents containing those silly birds from the nearly 100,000 that I indexed. The dtSearch Desktop interface then allows browsing through the found documents, displaying them in its own interface or letting you launch external viewers, with the search text highlighted. Supported search options include Boolean, stemmed, fuzzy, synonym, phonic, phrase, and "near" searches. dtSearch can also search unindexed documents, though this slows it down substantially.
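To see why a prebuilt index pays off, here is a minimal Python sketch of an inverted index, the general technique behind most full-text search tools. It is an illustration only and makes no claim about how dtSearch works internally; the function names (`tokenize`, `build_index`, `search_all`) and the sample documents are made up for this example.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search_all(index, query):
    """Return the documents containing every word in the query (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, not taken from the review.
docs = {
    "farm.txt": "Guinea fowl are noisy birds.",
    "pets.txt": "A guinea pig is not a fowl.",
    "menu.txt": "Roast fowl with herbs.",
}
index = build_index(docs)
print(sorted(search_all(index, "guinea fowl")))  # ['farm.txt', 'pets.txt']
```

The slow part, reading every document, happens once at indexing time. Each search then only intersects a few precomputed sets. Note that this toy performs a Boolean AND, so `pets.txt` matches even though the two words are not adjacent; a phrase or "near" search would also need to store word positions.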

You can build multiple indexes and search them all at once with the FindPlus feature, which also enables a desktop user to make use of a network index for additional searching. This opens the possibility of distributed search indexing. The program understands quite a few file formats, having no trouble pulling information out of Word, Excel, Access, or Outlook files, as well as common formats such as RTF or PDF. I looked for, but could not find, a complete list of supported formats.

You'll also want to use some care in deciding what to index. By default, the indexer uses a list of file extensions to decide what NOT to index, but the default list is hardly complete -- it doesn't block any of the common video file formats, for example. The result can be an index full of nonsense words made from traipsing through binary file formats. You can extend the list of blocked extensions yourself, supply your own list of specific extensions to index while ignoring everything else (this is where the list of supported formats would have come in handy), or organize your hard drive to keep documents separated from other stuff.
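The two filtering strategies described above, blocking listed extensions versus indexing only listed extensions, can be sketched in a few lines of Python. This is a hypothetical illustration, not dtSearch's configuration format or code; the `BLOCKED` and `ALLOWED` lists and the `should_index` function are assumptions invented for the example.

```python
from pathlib import Path

# Hypothetical extension lists for illustration; these are NOT dtSearch's defaults.
BLOCKED = {".exe", ".dll", ".avi", ".mpg", ".mov"}
ALLOWED = {".doc", ".xls", ".rtf", ".pdf", ".txt"}

def should_index(path, allowed=None, blocked=frozenset()):
    """Decide whether a file should be indexed, based on its extension.

    If `allowed` is given, only those extensions are indexed (allowlist).
    Otherwise everything is indexed except extensions in `blocked`.
    """
    ext = Path(path).suffix.lower()
    if allowed is not None:
        return ext in allowed
    return ext not in blocked

files = ["report.DOC", "holiday.avi", "setup.exe", "clip.wmv", "notes.txt"]

# Blocklist: clip.wmv slips through because .wmv was never listed.
print([f for f in files if should_index(f, blocked=BLOCKED)])
# ['report.DOC', 'clip.wmv', 'notes.txt']

# Allowlist: only known document types are indexed.
print([f for f in files if should_index(f, allowed=ALLOWED)])
# ['report.DOC', 'notes.txt']
```

The blocklist run shows the failure mode the review complains about: any binary format missing from the list still gets indexed. An allowlist avoids that, at the cost of needing to know every format you care about up front.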

I also took a look at the dtSearch engine from a programmer's point of view. You can incorporate dtSearch's index and search technology within your own application through either a C++ API or supplied ActiveX objects. Either way, you have access to the entire range of indexing and searching functionality. There are a variety of ways to license the engine, including a single server license ($999), royalty-based licenses starting at $2,500, or royalty-free licenses starting at $9,995. The sample code that I looked at worked well from VB.

If you're buried in documents and need to find things quickly, and have plenty of hard drive space, dtSearch Desktop offers a straightforward interface and impressive speed. If you need wide-ranging search capabilities in your own application, their Text Retrieval Engine package is definitely worth considering.

About the Author

Mike Gunderloy, MCSE, MCSD, MCDBA, is a former MCP columnist and the author of numerous development books.
