The word “and” has always bugged me. I hate starting sentences with it but sometimes can’t help myself. Whenever I have a list of three or more items in a sentence, I can never tell whether there should be a comma before the “and” separating the last two items. ~~And~~ Plus it causes me no end of grief in search interfaces.
The impetus behind my most recent foray into Lucene.NET was one query phrase in particular:
research and development
Specifically, I want to be able to find documents that contain this phrase. Not both the word “research” AND the word “development” but the phrase “research and development”. ~~And~~ Also, preferably it would return documents that contain “research & development” or, if you *really* want to impress someone, “r&d”.
In the spirit of my search term, I’ve been doing some research and development to try to figure this little query out. More of the latter at first but it’s been increasingly obvious that to do a decent search interface, you also need plenty of the former. To that end, this will be a typically epic free-form meandering of my process with my usual caveat: If any of this is useful to you, that’s not my fault. I won’t go into much detail here because: a) Simone Chiaretta will almost certainly cover it shortly (if he hasn’t already), and b) there is plenty of documenta---…actually, that’s not true. Oh well, I’m still not covering the inner workings of parsers and analyzers.
By all accounts, QueryParser is the class to use when dealing with user-entered input. You can use a fairly easy-to-learn syntax and let Lucene handle the heavy lifting of whether to search for an entire phrase or individual words. It also includes a way of parsing AND, OR, or NOT.
This has the most appeal to me for obvious reasons so it was the one I settled on first. Then came “research and development”. Searching for the phrase with quotes around it came back with false positives. I.e. documents that contained either research or development. So I halted the development and started some research.
This led to much reading about Analyzers. (~~And~~ So I’ll echo many others’ sentiments by recommending the book, Lucene in Action, which has been a great resource.) I started out indexing and searching with the StandardAnalyzer. But this has a couple of side effects. For one, when indexing, it strips out common stop words, like “the”, “a”, and “an”. ~~And~~ As well as “and”.
On the search side, it will also do some parsing of the query phrase when you use it with QueryParser.Parse. In short, when you search for “research and development” (with quotes) using a StandardAnalyzer, the query is parsed down to the phrase “research development”.
I.e. the “and” is taken out of the search phrase altogether. Not quite what I had in mind, so a new track was laid.
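To see why the “and” vanishes, here is a rough Python sketch of what a stop-word-filtering analyzer does to the phrase. This is a toy model, not Lucene code: `STOP_WORDS` is an abbreviated stand-in for the default English stop list, and `standard_like_analyze` is a made-up helper that only lowercases, splits, and filters.

```python
import re

# Abbreviated stand-in for an English stop-word list (assumption: the real
# default list is longer, but it does include these words).
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in"}


def standard_like_analyze(text):
    """Toy model of a stop-word-filtering analyzer: lowercase, split into
    alphanumeric runs, then drop anything in STOP_WORDS."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


indexed = standard_like_analyze("Our research and development budget")
query = standard_like_analyze("research and development")

print(indexed)  # ['our', 'research', 'development', 'budget']
print(query)    # ['research', 'development']
```

Because the same filtering runs on both the indexing side and the query side, the “and” is gone from both, and nothing downstream can tell the phrase ever had three words.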
The SimpleAnalyzer indexes everything. Every word (and every position of every word if you tell it to). Obviously, the size of your index will grow considerably. In my testing ground, it quintupled in size from 19 MB to 100 MB based on 1600-odd Word and PDF documents.
On the search side, if you use a SimpleAnalyzer with the QueryParser, it does correctly identify the phrase “research and development” when you include it in quotes. So all appears happy and good…
…except that it doesn’t handle “r&d” (with or without the quotes) very well. The query is reduced to the two separate terms “r” and “d”.
I.e. find all documents with the letter r and the letter d as individual letters in them. Which, truth be told, isn’t such a bad thing on the surface. It means we’ll catch not only documents containing “r&d” but also those containing “R & D”. But by the same token (pun intended), it will also match documents containing “R. Buford D. Justice”.
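A small Python sketch makes the Buford problem concrete. Again, this is a toy model and not Lucene: `simple_like_analyze` and `contains_all_terms` are hypothetical helpers that assume a letters-only analyzer which lowercases the text and splits on anything that isn’t a letter (ASCII only here, for brevity).

```python
import re


def simple_like_analyze(text):
    """Toy model of a letters-only analyzer: lowercase the text and treat
    every run of ASCII letters as a token, so '&' and '.' split words."""
    return re.findall(r"[a-z]+", text.lower())


def contains_all_terms(doc_text, query_text):
    """True when every query token also appears somewhere in the document."""
    doc_tokens = set(simple_like_analyze(doc_text))
    return all(t in doc_tokens for t in simple_like_analyze(query_text))


print(simple_like_analyze("r&d"))                             # ['r', 'd']
print(contains_all_terms("Our R & D team", "r&d"))            # True
print(contains_all_terms("R. Buford D. Justice", "r&d"))      # True
print(contains_all_terms("research and development", "r&d"))  # False
```

The last line is the other half of the problem: a document that spells the phrase out in full never matches the abbreviation.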
Another option I looked at was the PhraseQuery. If you use this, it will always search for the exact phrase. None o’ this “research development” or “r d” nonsuch.
But here, the analyzers come into play as well. If I search for “research and development”, that means the word “and” needs to be indexed. Which implies a SimpleAnalyzer during indexing. If I search for “r&d”, the SimpleAnalyzer won’t work because it breaks up words separated by an ampersand.
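Here is a minimal Python sketch of why an exact phrase search depends on what was indexed. It assumes a phrase match simply means “these tokens, consecutively, in this order”. The real PhraseQuery works off stored positions and has more knobs, so treat `phrase_match` as a hypothetical stand-in.

```python
def phrase_match(doc_tokens, phrase_tokens):
    """True when phrase_tokens occur in doc_tokens consecutively, in order."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


phrase = ["research", "and", "development"]

# Index that kept every word: the phrase is findable.
kept_all = ["our", "research", "and", "development", "budget"]
# Index that dropped stop words: "and" is gone, so the three-term phrase is too.
dropped_stop_words = ["our", "research", "development", "budget"]
# Same words, wrong order: not the phrase.
wrong_order = ["development", "and", "research"]

print(phrase_match(kept_all, phrase))            # True
print(phrase_match(dropped_stop_words, phrase))  # False
print(phrase_match(wrong_order, phrase))         # False
```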
That brings everyone up to speed to where I am now. I’ve posed the question on StackOverflow (my first!) and at the moment, the only answer to it suggests I write my own analyzer, one that acts like the StandardAnalyzer but doesn’t throw out the word “and”. That sounds reasonable to me, at least until someone searches for “research or development”.
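The custom-analyzer idea, and its weak spot, can be sketched in a few lines of Python. This is not how you would write a Lucene analyzer. `DEFAULT_STOP_WORDS` is an abbreviated, assumed stop list and `make_analyzer` is a made-up helper, but the shape of the trade-off is the same.

```python
import re

# Abbreviated stand-in for a default English stop list (assumption).
DEFAULT_STOP_WORDS = frozenset({"a", "an", "and", "or", "the", "of", "to", "in"})


def make_analyzer(stop_words):
    """Build a toy analyzer that lowercases, splits into alphanumeric runs,
    and drops the given stop words."""
    def analyze(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [t for t in tokens if t not in stop_words]
    return analyze


# Same stop list as before, minus "and".
keep_and = make_analyzer(DEFAULT_STOP_WORDS - {"and"})

print(keep_and("research and development"))  # ['research', 'and', 'development']
print(keep_and("research or development"))   # ['research', 'development']
```

Every word rescued from the stop list fixes one phrase and leaves the next one broken, which is exactly the maintenance worry.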
Another option I’m considering is to tell the indexer to index specific phrases like “research and development” or “oil and gas” or other common ones used in the domain. Not sure I like the long-term maintenance of either option but search is a journey, not a destination, I suppose.
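For the phrase-list idea, one possible shape (a sketch only, in Python rather than anything Lucene-specific) is to rewrite known phrases and their variants into a single canonical token before tokenizing, on both the indexing and the query side. The `DOMAIN_PHRASES` table, the canonical token names, and `normalize_phrases` are all hypothetical.

```python
import re

# Hypothetical, hand-maintained table of domain phrases. Each pattern maps
# the known variants of a phrase to one canonical token.
DOMAIN_PHRASES = [
    (re.compile(r"\bresearch\s+and\s+development\b"
                r"|\bresearch\s*&\s*development\b"
                r"|\br\s*&\s*d\b"),
     "research_and_development"),
    (re.compile(r"\boil\s+and\s+gas\b|\boil\s*&\s*gas\b"), "oil_and_gas"),
]


def normalize_phrases(text):
    """Lowercase, collapse known phrase variants into canonical tokens,
    then split into tokens (underscores kept so canonical tokens survive)."""
    text = text.lower()
    for pattern, canonical in DOMAIN_PHRASES:
        text = pattern.sub(canonical, text)
    return re.findall(r"[a-z0-9_]+", text)


print(normalize_phrases("Our R&D team"))
# ['our', 'research_and_development', 'team']
print(normalize_phrases("Research & Development"))
# ['research_and_development']
print(normalize_phrases("R. Buford D. Justice"))
# ['r', 'buford', 'd', 'justice']
```

With this in place “r&d”, “R & D” and the spelled-out phrase all land on the same token, and Mr. Justice is left alone. The price is the table itself, which someone has to keep feeding.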
There’s a fundamental argument buried in here somewhere. Lucene gives you so much control over your indexing/searching that if you’re one of those Type A’s that can’t stand when something is just “good enough”, you can very easily drive yourself up the wall trying to optimize things. It really does require you to put some thought into how users will use your search. As much as Microsoft Indexing Services allowed me to throw up a search interface haphazardly, I believe you do yourself an injustice by not considering the ins and outs of the process.
By the way, a couple people asked about the code I used to extract text from Word and PDF docs. Lovingly provided by the venerable (and I hope I used the right word there) Brian Donahue, the relevant classes are attached in their entirety. The only thing different about this code snippet compared to others I’ve been sent in the past is that this one worked out of the box with absolutely no help from me. Seriously, I can’t even tell you what the internal method names are, that’s how little I looked at it. Call Parser.Parse(filename) and watch the magic fly.
Kyle the Found-ational