Wednesday, September 16, 2009

The word “and” has always bugged me. I hate starting sentences with it but sometimes can’t help myself. Whenever I have a list of three or more items in a sentence, I can never tell whether there should be a comma before the “and” separating the last two items. Plus it causes me no end of grief in search interfaces.

The impetus behind my most recent foray into Lucene.NET was one query phrase in particular:

research and development

Specifically, I want to be able to find documents that contain this phrase. Not both the word “research” AND the word “development” but the phrase “research and development”. Also, preferably it would return documents that contain “research & development” or, if you *really* want to impress someone, “r&d”.

In the spirit of my search term, I’ve been doing some research and development to try to figure this little query out. More of the latter at first but it’s been increasingly obvious that to do a decent search interface, you also need plenty of the former. To that end, this will be a typically epic free-form meandering of my process with my usual caveat: If any of this is useful to you, that’s not my fault. I won’t go into much detail here because: a) Simone Chiaretta will almost certainly cover it shortly (if he hasn’t already), and b) there is plenty of documenta---…actually, that’s not true. Oh well, I’m still not covering the inner workings of parsers and analyzers.

QueryParser.Parse

By all accounts, QueryParser is the class to use when dealing with user-entered input. You can use a fairly easy-to-learn syntax and let Lucene handle the heavy lifting of whether to search for an entire phrase or individual words. It also includes a way of parsing AND, OR, or NOT.

This has the most appeal to me for obvious reasons so it was the one I settled on first. Then came “research and development”. Searching for the phrase with quotes around it came back with false positives. I.e. documents that contained either research or development. So I halted the development and started some research.

StandardAnalyzer

This led to much reading about Analyzers. (So I’ll echo many others’ sentiments by recommending the book, Lucene in Action, which has been a great resource.) I started out indexing and searching with the StandardAnalyzer. But this has a couple of side effects. For one, when indexing, it strips out common stop words, like the, a, and an. As well as and.

On the search side, it will also do some parsing of the query phrase when you use it with QueryParser.Parse. In short, when you search for “research and development” (with quotes) using a StandardAnalyzer, the query is parsed to the following:

contents:"research development"

I.e. The and is taken out of the search phrase altogether. Not quite what I had in mind so a new track was laid.
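To make the stop-word behaviour concrete, here is a toy Python model of what happens to that phrase. It is a sketch, not Lucene's code: the name `toy_standard_analyze` and the abbreviated stop list are made up for illustration, and the real tokenizer is a good deal smarter than a regular expression.

```python
import re

# Abbreviated, illustrative stop list (an assumption; the real default list is longer).
STOP_WORDS = frozenset({"a", "an", "and", "or", "the", "of", "to"})

def toy_standard_analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphanumeric runs, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# The "and" disappears before the phrase ever reaches the index or the query.
print(toy_standard_analyze("research and development"))  # ['research', 'development']
```

Since the same filtering is applied to the documents at index time, "research development" is all that is left to match on either side.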

SimpleAnalyzer

The SimpleAnalyzer indexes everything. Every word (and every position of every word if you tell it to). Obviously, the size of your index will grow considerably. In my testing ground, it quintupled in size from 19 MB to 100 MB based on 1600-odd Word and PDF documents.

On the search side, if you use a SimpleAnalyzer with the QueryParser, it does correctly identify the phrase “research and development” when you include it in quotes. So all appears happy and good…

…except that it doesn’t handle “r&d” (with or without the quotes) very well. The query is reduced to:

contents:"r d"

I.e. Find all documents with the letter r and the letter d as individual letters in them. Which, truth be told, isn’t such a bad thing on the surface. It means we’ll catch not only documents containing “r&d” but also those containing “R & D”. But by the same token (pun intended), it will also match documents containing “R. Buford D. Justice”.
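The tokenizing side of this is easy to mimic. The sketch below is a toy stand-in (the name `toy_simple_analyze` is hypothetical, and it only handles ASCII letters) that does what the SimpleAnalyzer is described as doing: lowercase everything and break on anything that isn't a letter.

```python
import re

def toy_simple_analyze(text):
    """Lowercase the text and keep only maximal runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

for sample in ["research and development", "r&d", "R & D", "R. Buford D. Justice"]:
    print(sample, "->", toy_simple_analyze(sample))
# research and development -> ['research', 'and', 'development']
# r&d -> ['r', 'd']
# R & D -> ['r', 'd']
# R. Buford D. Justice -> ['r', 'buford', 'd', 'justice']
```

Nothing is thrown away, so "and" survives, but the ampersand is just another separator and "r&d" comes out as two one-letter tokens.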

PhraseQuery

Another option I looked at was the PhraseQuery. If you use this, it will always search for the exact phrase. None o’ this “research development” or “r d” nonsuch.

But here, the analyzers come into play as well. If I search for “research and development”, that means the word and needs to be indexed. Which implies a SimpleAnalyzer during indexing. If I search for “r&d”, the SimpleAnalyzer won’t work because it breaks up words separated by an ampersand.
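Why the word and has to be in the index is easier to see with a toy positional index. This is a rough sketch of the general idea behind exact-phrase matching, not Lucene's implementation; `tokenize`, `build_index` and `phrase_search` are names invented here.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Toy analyzer: lowercase, letters only, nothing filtered out."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of docs where the phrase's terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms:
        return set()
    postings = [index.get(t, {}) for t in terms]
    # Only docs containing every term can possibly contain the phrase.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits

docs = {
    1: "Our research and development budget",
    2: "Research on land development",
    3: "Development and research",
}
index = build_index(docs)
print(phrase_search(index, "research and development"))  # {1}
```

If the analyzer had dropped "and" at index time, there would be no position for it and the middle term of the phrase would have nothing to line up with.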

From here…

That brings everyone up to speed to where I am now. I’ve posed the question on StackOverflow (my first!) and at the moment, the only answer to it suggests I write my own analyzer, one that acts like the StandardAnalyzer but doesn’t throw out the word and. That sounds reasonable to me, at least until someone searches for “research or development”.
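For what it's worth, here is one shape such an analyzer could take, modelled as plain Python rather than as a real Lucene Analyzer subclass. Everything in it is an assumption on my part: the name `toy_custom_analyze`, the short stop list, keeping "r&d" as a single token, and treating a lone "&" as "and" are choices made for this sketch, not something the StackOverflow answer spelled out.

```python
import re

# Illustrative stop list with "and" deliberately left out (an assumption).
STOP_WORDS = frozenset({"a", "an", "or", "the", "of", "to"})

# Alphanumeric runs optionally joined by "&" (so "r&d" stays whole), or a lone "&".
TOKEN_RE = re.compile(r"[a-z0-9]+(?:&[a-z0-9]+)*|&")

def toy_custom_analyze(text):
    """Tokenize, map a standalone '&' to 'and', and drop the remaining stop words."""
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok == "&":
            tok = "and"
        if tok not in STOP_WORDS:
            tokens.append(tok)
    return tokens

print(toy_custom_analyze("research and development"))  # ['research', 'and', 'development']
print(toy_custom_analyze("Research & Development"))    # ['research', 'and', 'development']
print(toy_custom_analyze("r&d"))                       # ['r&d']
print(toy_custom_analyze("research or development"))   # ['research', 'development']
```

The last line is the catch already mentioned: as long as "or" stays on the stop list, that phrase loses its middle word just like before.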

Another option I’m considering is to tell the indexer to index specific phrases like “research and development” or “oil and gas” or other common ones used in the domain. Not sure I like the long-term maintenance of either option but search is a journey, not a destination, I suppose.
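A rough sketch of that second idea, again as a toy in Python: known phrases are glued into a single token before indexing, so they can't be pulled apart by stop-word filtering. The phrase list, the underscore-joined token format and the name `tokenize_with_phrases` are all hypothetical, and the same function would have to be applied to the query text as well.

```python
import re

# Hypothetical, hand-maintained list of domain phrases.
DOMAIN_PHRASES = [
    ("research", "and", "development"),
    ("oil", "and", "gas"),
]

def tokenize_with_phrases(text, phrases=DOMAIN_PHRASES):
    """Tokenize on letters, emitting each known phrase as one underscore-joined token."""
    words = re.findall(r"[a-z]+", text.lower())
    out = []
    i = 0
    while i < len(words):
        for phrase in phrases:
            n = len(phrase)
            if tuple(words[i:i + n]) == phrase:
                out.append("_".join(phrase))
                i += n
                break
        else:
            # No phrase starts at this position, so emit the single word.
            out.append(words[i])
            i += 1
    return out

print(tokenize_with_phrases("Oil and gas research and development spending"))
# ['oil_and_gas', 'research_and_development', 'spending']
```

The maintenance worry is visible right in the code: somebody has to keep `DOMAIN_PHRASES` up to date.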

There’s a fundamental argument buried in here somewhere. Lucene gives you so much control over your indexing/searching that if you’re one of those Type A’s that can’t stand when something is just “good enough”, you can very easily drive yourself up the wall trying to optimize things. It really does require you to put some thought into how users will use your search. As much as Microsoft Indexing Service allowed me to throw up a search interface haphazardly, I believe you do yourself an injustice by not considering the ins and outs of the process.

 

By the way, a couple people asked about the code I used to extract text from Word and PDF docs. Lovingly provided by the venerable (and I hope I used the right word there) Brian Donahue, the relevant classes are attached in their entirety. The only thing different about this code snippet compared to others I’ve been sent in the past is that this one worked out of the box with absolutely no help from me. Seriously, I can’t even tell you what the internal method names are, that’s how little I looked at it. Call Parser.Parse(filename) and watch the magic fly.

Kyle the Found-ational

Thursday, September 10, 2009

Long-winded background for this long-winded post can be found here and here. The short version is: I have an app that searches using a full-text search of a document repository consisting of Office docs and PDFs.

The current version uses Microsoft Indexing Service and it is all but obsolete. Which is fine with me for the time being because of the economics of the situation. Namely, the app isn’t big enough to warrant putting the effort into updating it just for the sake of the technology.

Two things happened recently though that made me decide it was time to update. First was Simone Chiaretta’s masterfully-timed tutorial series on getting started with Lucene.NET. The second was the boss discovering the current version doesn’t actually work.

By many orders of magnitude, the boss is the biggest user of this application. And recently, he went about searching for a relatively common term: R&D. He was met with a nicely formatted 500 Server Error page and asked me if I would be so bold as to fix it.

A day and a half later and I simply had no fix. The existing app uses a SQL query to search the Indexing Service which, at the time, I thought was tres clever. But after trying a dozen methods of escaping it, there was no way I could get it to accept an ampersand. Furthermore, I also discovered the page failed when including words like ‘AND’ and ‘OR’. This problem was fixable but required some parsing of the search term and again, the cost/benefit for making this sort of thing bullet-proof just wasn’t there.

So when Simone’s tutorial started coming across my RSS feed, it just made sense to re-think the problem.

The main reason I resisted moving away from Indexing Service for so long is thus: I don’t need to manage the index myself. Once configured, the only thing I needed to do to add a document to the index was drop it into a folder.

But therein lay the problem: Because I don’t manage the index, I have no control over it. During my travels, I discovered I was getting false positives for some terms and that it was not returning all documents in some cases.

How did I discover this? I implemented Lucene and compared the results. If I discovered a discrepancy and it was because of how I indexed with Lucene, well, then I tweaked the indexing process. If the discrepancy was with Indexing Service…well, then I said, “&*%$ it! You’re getting replaced!”

Another bit of fortuitousness came in the form of one Brian Donahue. One of my fears going into this was how I would get the text out of the documents in order to index it. I had waking nightmares of Windows API calls and IFilters, dreading having to deal with this. So when, after I outlined my pain on Twitter, Brian responded with (and I’m paraphrasing) “Here, take this code. It will do that for you and it WORKS RIGHT OUT OF THE BOX!”, I don’t mind admitting I developed an unhealthy infatuation with him for a short time after that.

In fact, it was through the process of managing the index myself with Lucene that I discovered one reason some documents weren’t getting indexed. They were RTF documents but they had a .doc extension so it was using the wrong IFilter to capture the text. It was through Brian’s code for extracting text that I figured this out. It threw errors on some documents and I couldn’t figure out why until I tried to Save As… when working with them in Word. Change the extension to its rightful .rtf and the indexing process hummed along. But with the Indexing Service, these documents simply weren’t indexed. No error, no notification. It’s possible a message was posted to the event log but that’s a little too passive even for me.
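If you manage the index yourself, a mismatch like this can be caught up front by looking at the first few bytes of each file instead of trusting the extension. The sketch below is not part of Brian's code; the function names are made up, and it only knows a handful of well-known file signatures.

```python
def sniff_document_type(header: bytes) -> str:
    """Guess a file's real format from its leading bytes."""
    if header.startswith(b"{\\rtf"):
        return "rtf"   # RTF files begin with the text {\rtf
    if header.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole"   # legacy binary Office formats such as .doc
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"PK\x03\x04"):
        return "zip"   # zip container, which includes .docx
    return "unknown"

def sniff_file(path):
    """Read just enough of the file at `path` to identify it."""
    with open(path, "rb") as f:
        return sniff_document_type(f.read(8))

# An RTF document wearing a .doc extension still starts with {\rtf
print(sniff_document_type(b"{\\rtf1\\ansi"))  # rtf
```

Anything named .doc that comes back as "rtf" can be renamed or routed to the right extractor before the indexer ever sees it.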

I hope to have some more technical details but I want to wait until Simone is further along so I don’t duplicate. I’ll very likely piggy-back off a couple of his posts but to summarize: Lucene.NET rocks and should be used any time you have a button labeled “Search”, “Find”, “Locate”, or “Git it!” Fear not the index-management process because like WebForms, the problem isn’t hard enough that it needs to be abstracted.

Kyle the Co-located

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Copyright © 2010 Kyle Baley. All rights reserved.
 