Lightning Searching for Text Data -- Redmondmag.com

Product Reviews

Lightning Searching for Text Data

Search your own system or use the dtSearch engine in your products.

By Mike Gunderloy
12/05/2001

dtSearch makes a fast search engine for text data that's been around for over a decade now. I took a look at the Desktop member of their product line, which also includes a network version, a version to add searching to your web site, and a developer version of the search engine (more on that later). To use dtSearch, you choose the data that you're interested in and turn the program loose to build its own index. Indexing about 1.5 gigabytes of stuff, including lots of files, two huge Outlook stores, and a couple of web sites, took almost exactly ten hours on a fast machine. By the time the software finished building it, the index was nearly the size of the source data.

The difference, of course, is that finding things with the index is vastly faster than finding them without it. A search on "guinea fowl" on my desktop, for example, takes less than a second to pull out the 48 documents containing those silly birds from the nearly 100,000 that I indexed. The dtSearch Desktop interface then allows browsing through the found documents, displaying them in its own interface or letting you launch external viewers, with the search text highlighted. Supported search options include Boolean, stemmed, fuzzy, synonym, phonic, phrase, and "near" searches. dtSearch can also search unindexed documents, though this slows it down substantially.
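To see why an index makes such a difference, here is a minimal Python sketch of a positional inverted index supporting Boolean AND, phrase, and "near" searches. It is a toy illustration of the general technique, not dtSearch's implementation (the review doesn't describe the product's internals), and every function name and sample document in it is hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [word positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, words):
    """Boolean AND: documents that contain every one of the words."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()


def search_phrase(index, phrase):
    """Documents in which the words of the phrase appear consecutively."""
    words = tokenize(phrase)
    hits = set()
    for doc_id in search_and(index, words):
        starts = index[words[0]][doc_id]
        if any(all(s + k in index[w][doc_id] for k, w in enumerate(words))
               for s in starts):
            hits.add(doc_id)
    return hits


def search_near(index, first, second, distance):
    """Documents where the two words occur within `distance` words of each other."""
    hits = set()
    for doc_id in search_and(index, [first, second]):
        a, b = index[first][doc_id], index[second][doc_id]
        if any(abs(i - j) <= distance for i in a for j in b):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents, for illustration only.
docs = {
    "farm.txt": "Guinea fowl are noisy birds.",
    "pets.txt": "A guinea pig is not a fowl.",
    "menu.txt": "Roast chicken with herbs.",
}
index = build_index(docs)

print(sorted(search_and(index, ["guinea", "fowl"])))    # ['farm.txt', 'pets.txt']
print(sorted(search_phrase(index, "guinea fowl")))      # ['farm.txt']
print(sorted(search_near(index, "guinea", "fowl", 3)))  # ['farm.txt']
```

Each query touches only the entries for the words being searched rather than rereading every document, which is why the up-front indexing time and disk space pay off at search time.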

You can build multiple indexes and search them all at once with the FindPlus feature, which also enables a desktop user to make use of a network index for additional searching. This opens the possibility of distributed search indexing.

The program understands quite a few file formats, having no trouble pulling information out of Word, Excel, Access, or Outlook files, as well as common formats such as RTF or PDF. I looked for, but could not find, a complete list of supported formats. You'll also want to use some care in deciding what to index. By default, the indexer uses a list of file extensions to decide what NOT to index, but the default list is hardly complete -- it doesn't block any of the common video file formats, for example. The result can be an index full of nonsense words made from traipsing through binary file formats. You can extend the list of blocked extensions yourself, supply your own list of specific extensions to index while ignoring everything else (this is where the list of supported formats would have come in handy), or organize your hard drive to keep documents separated from other stuff.
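The two filtering strategies, blocking known-bad extensions versus allowing only known-good ones, can be sketched in a few lines of Python. This is only an illustration of the idea: the extension lists and function names below are assumptions made for the example, not dtSearch's actual defaults or settings.

```python
from pathlib import Path

# Hypothetical lists for illustration; not dtSearch's real defaults.
BLOCKED_EXTENSIONS = {".exe", ".dll", ".avi", ".mpg", ".mov"}
ALLOWED_EXTENSIONS = {".doc", ".xls", ".mdb", ".pst", ".rtf", ".pdf"}


def should_index(path, allowed=None, blocked=BLOCKED_EXTENSIONS):
    """Decide whether a file should be indexed, based on its extension.

    With an allow list, only listed extensions are indexed and everything
    else is ignored. Without one, everything is indexed except the
    extensions on the block list.
    """
    ext = Path(path).suffix.lower()
    if allowed is not None:
        return ext in allowed
    return ext not in blocked


def files_to_index(root, allowed=None, blocked=BLOCKED_EXTENSIONS):
    """Walk a directory tree and return the files that pass the filter."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and should_index(p, allowed, blocked)
    )


# A video format missing from the block list slips through...
print(should_index("clip.wmv"))                              # True
# ...but is ignored when an allow list is used instead.
print(should_index("clip.wmv", allowed=ALLOWED_EXTENSIONS))  # False
```

A block list fails open, so any binary format nobody thought to list ends up in the index, while an allow list fails closed at the cost of needing to know every format worth indexing.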

I also took a look at the dtSearch engine from a programmer's point of view. You can incorporate dtSearch's index and search technology within your own application through either a C++ API or the supplied ActiveX objects. Either way you have access to the entire range of indexing and searching functionality. There are a variety of ways to license the engine, including a single server license ($999), royalty-based licenses starting at $2,500, or royalty-free licenses starting at $9,995. The sample code that I looked at worked well from VB.

If you're buried in documents and need to find things quickly, and have plenty of hard drive space, dtSearch Desktop offers a straightforward interface and impressive speed. If you need wide-ranging search capabilities in your own application, their Text Retrieval Engine package is definitely worth considering.

About the Author

Mike Gunderloy, MCSE, MCSD, MCDBA, is a former MCP columnist and the author of numerous development books.
