www.codinghillbilly.com   kyle.baley.org  Subscribe / Contact
 
 
 
 
Thursday, September 24, 2009

It’s the one-year anniversary of when the TeamCity/CodeBetter collaboration was brought online. I know this because the product has a helpful feature whereby it warns you quite clearly at the top of the page when your license is about to expire. Several others noticed this and Jayzus bless Twitter for keeping those kind reminders out of my inbox.

At last count, we had 49 projects with 77 build configurations. I always like browsing through each one’s website to see what people are working on. Three that caught my fancy that I hadn’t heard of previously are HORN, UppercuT, and crap4n. By the way, I really do mean “last count” because as far as I can tell, there is no way to export a list of projects and/or configurations (see my wish list later in this post).

There have been some growing pains of course but for the most part, things are running smoothly now. Over the last year, we’ve added two more agents, Silverlight support, and Git support (through a community plug-in; official version is slated for November). We’re in the process of installing a version of NCover that doesn’t require an administrator to run (thanks to Stephen, Daniel, Joe, and Peter at NCover for their work on this).

A couple of projects haven’t made it up yet because of edge-case requirements. XNA support still eludes us as that requires jumping through more hoops than any of us have had time for. I’ve also been looking through the list and noticed one that got me excited, which will hopefully go up soon.

After working with TeamCity for a large number of projects, I’m generally very happy with it. It was clearly not designed to manage a public portal of OSS projects like this but it’s held up well to the challenge. That said, I do have a wish list of features for an “OSS Portal” edition of the product:

  • Ability to sort projects by name on the main page.
  • Ability to collapse projects in the admin page.
  • A proper URL field so I don’t have to embed it in the project’s description. A spot for a logo would be nice too.
  • Quick navigation to a project/configuration. Maybe some shortcut keys or a filter box at the top. The dropdown list under the Projects tab has one. Be nice if the page itself did too.
  • Easy way to notify all project administrators of upcoming maintenance
  • Display project administrators somewhere prominent for each project on the main page and on the admin page
  • Configure Visible Projects: C’mon, JetBrains. You’re not Microsoft. Use checkboxes, not that goofy dual ListBox dialog.
  • Stats or report page showing a list of projects. Optionally, it would show each build configuration for each project and the last time that configuration was run.
    • Alternatively, they could just add a print stylesheet to the main page because all the info I need is there, except the project administrator. It’s just not conducive to inserting into blog posts…
  • This is a stretch but I’d like to see a custom project template (Note: NOT the “copy project” feature). Something like a “Standard SVN” project template where all it asks for is the project name, the SVN URL, and the project administrator. Then it would create a project and a default build configuration with pre-selected settings and all I’d need to do is put in the build configuration.
  • If it hasn’t been fixed yet, a tray notifier that lets you connect to more than one server. (Someone wasn’t doing their homework when they didn’t include this capability from version 1.)
  • Easier way to configure notifications. I.e. from the main page, click “Watch this project” or “Watch this configuration”

Looking forward to next year’s recap.

Kyle the Wrapped-Up

Thursday, September 17, 2009

Quick productivity blorg today. Because if hillbillies are known for anything, it’s their efficacy. And their ability to use words they don’t quite know the meaning of but can fake in context. (I’m referring, of course, to “efficacy”. I know full well what a “blorg” is so I don’t need you telling me in the comments.)

Alt-F4 is not the most natural keystroke on the Kinesis keyboard. Or even on a regular keyboard. Yet I use it pretty often. Especially recently while testing out Lucene on my document repository. Word and PDF docs pile up very quickly while I’m opening them and verifying results. So I was looking for a faster way of closing them.

My chosen method: Press Ctrl three times in rapid succession to close the active window.

The implementation: AutoHotKey. Here is the script that does it:

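; The ~ prefix lets Ctrl keep its normal function while the hotkey fires.
~Ctrl::
CloseOnThird()
return

CloseOnThird() {
   ; Static: Count keeps its value between calls.
   Static Count
   ; Strip hotkey modifier symbols so KeyWait gets the bare key name ("Ctrl").
   key := RegExReplace(A_ThisHotKey,"[\*\~\$\#\+\!\^( UP)]")
   ; A repeat of the same hotkey within 400 ms extends the streak (capped at 3).
   If ( A_ThisHotKey = A_PriorHotKey and A_TimeSincePriorHotkey < 400 )
        Count += Count < 3 ? 1 : 0
   Else Count = 1
   ; Wait up to 0.4 s for the key to go down; ErrorLevel is 1 if that times out.
   KeyWait %key%, DT0.4
   ; Timed out on a streak of three: close the active window.
   If (ErrorLevel and Count = 3)
      WinClose,A
}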
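
For anyone who doesn't read AutoHotKey, here is a rough sketch of the same counting idea in Python. It is not a port of the script: the function name, the list-of-timestamps input, and the 400 ms default are my own assumptions for illustration. It takes key-down times in milliseconds and reports when a "close" would fire.

```python
def closes_on_third(press_times_ms, window_ms=400):
    """Return the press times at which a triple press would trigger a close.

    press_times_ms: key-down times in milliseconds, in ascending order.
    window_ms: maximum gap between presses that still counts as "rapid".
    """
    fired = []
    count = 0
    previous = None
    for index, now in enumerate(press_times_ms):
        # A press soon after the previous one extends the streak (capped at 3);
        # anything slower starts a new streak.
        if previous is not None and now - previous < window_ms:
            count = min(count + 1, 3)
        else:
            count = 1
        previous = now

        # Only fire once the streak has gone quiet for a full window.
        has_next = index + 1 < len(press_times_ms)
        quiet = (not has_next) or (press_times_ms[index + 1] - now >= window_ms)
        if count == 3 and quiet:
            fired.append(now)
    return fired


if __name__ == "__main__":
    print(closes_on_third([0, 100, 200]))    # three quick presses
    print(closes_on_third([0, 500, 1000]))   # three slow presses
```

The "wait for quiet" check is why a frantic fourth or fifth press still only closes one window in this sketch.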

The end result: at the end of a strenuous day, I can mash the Ctrl key over and over again until the computer shuts down. Tres satisfying.

Kyle the Ctrl’d

Wednesday, September 16, 2009

The word “and” has always bugged me. I hate starting sentences with it but sometimes can’t help myself. Whenever I have a list of three or more items in a sentence, I can never tell whether I should put a comma before the “and” separating the last two items. Plus it causes me no end of grief in search interfaces.

The impetus behind my most recent foray into Lucene.NET was one query phrase in particular:

research and development

Specifically, I want to be able to find documents that contain this phrase. Not both the word “research” AND the word “development” but the phrase “research and development”. Also, preferably it would return documents that contain “research & development” or, if you *really* want to impress someone, “r&d”.

In the spirit of my search term, I’ve been doing some research and development to try to figure this little query out. More of the latter at first but it’s been increasingly obvious that to do a decent search interface, you also need plenty of the former. To that end, this will be a typically epic free-form meandering of my process with my usual caveat: If any of this is useful to you, that’s not my fault. I won’t go into much detail here because: a) Simone Chiaretta will almost certainly cover it shortly (if he hasn’t already), and b) there is plenty of documenta---…actually, that’s not true. Oh well, I’m still not covering the inner workings of parsers and analyzers.

QueryParser.Parse

By all accounts, QueryParser is the class to use when dealing with user-entered input. You can use a fairly easy-to-learn syntax and let Lucene handle the heavy lifting of whether to search for an entire phrase or individual words. It also includes a way of parsing AND, OR, or NOT.

This has the most appeal to me for obvious reasons so it was the one I settled on first. Then came “research and development”. Searching for the phrase with quotes around it came back with false positives. I.e. documents that contained either research or development. So I halted the development and started some research.

StandardAnalyzer

This led to much reading about Analyzers. (So I’ll echo many others’ sentiments by recommending the book, Lucene in Action, which has been a great resource.) I started out indexing and searching with the StandardAnalyzer. But this has a couple of side effects. For one, when indexing, it strips out common stop words, like the, a, and an. As well as and.

On the search side, it will also do some parsing of the query phrase when you use it with QueryParser.Parse. In short, when you search for “research and development” (with quotes) using a StandardAnalyzer, the query is parsed to the following:

contents:"research development"

I.e. The and is taken out of the search phrase altogether. Not quite what I had in mind so a new track was laid.
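
To make the stop-word effect concrete, here is a toy model in Python. It is not Lucene's StandardAnalyzer and doesn't pretend to be: the function name is made up, the tokenizing is a plain regular expression, and the stop list is only a small subset I picked for illustration.

```python
import re

# Illustrative subset only; the real default stop list is longer.
STOP_WORDS = {"a", "an", "and", "or", "the"}


def analyze_with_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    # The "and" disappears, so the phrase shrinks to two terms.
    print(analyze_with_stop_words("research and development"))
```

If the same filtering runs at index time and at query time, the index never sees the word and, so no amount of quoting on the search side brings it back.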

SimpleAnalyzer

The SimpleAnalyzer indexes everything. Every word (and every position of every word if you tell it to). Obviously, the size of your index will grow considerably. In my testing ground, it quintupled in size from 19MB to 100MB based on 1600-odd Word and PDF documents.

On the search side, if you use a SimpleAnalyzer with the QueryParser, it does correctly identify the phrase “research and development” when you include it in quotes. So all appears happy and good…

…except that it doesn’t handle “r&d” (with or without the quotes) very well. The query is reduced to:

contents:"r d"

I.e. Find all documents with the letter r and the letter d as individual letters in them. Which, truth be told, isn’t such a bad thing on the surface. It means we’ll catch not only documents containing “r&d” but also those containing “R & D”. But by the same token (pun intended), it will also match documents containing “R. Buford D. Justice”.
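
Here is a toy Python model of that letters-only splitting. Again, this is a sketch of the idea and not Lucene's SimpleAnalyzer: the function name is hypothetical and the "runs of letters" rule is my simplification.

```python
import re


def simple_analyze(text):
    """Lowercase the text and keep only runs of letters as tokens.

    Digits, punctuation, and symbols such as & act purely as separators.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


if __name__ == "__main__":
    print(simple_analyze("research and development"))  # "and" survives
    print(simple_analyze("r&d"))                        # the ampersand does not
    print(simple_analyze("R & D"))                      # same tokens as "r&d"
```

Nothing is thrown away as a stop word here, which is why the index grows, and the ampersand never makes it into a token, which is why “r&d” and “R & D” end up looking identical.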

PhraseQuery

Another option I looked at was the PhraseQuery. If you use this, it will always search for the exact phrase. None o’ this “research development” or “r d” nonsuch.

But here, the analyzers come into play as well. If I search for “research and development”, that means the word and needs to be indexed. Which implies a SimpleAnalyzer during indexing. If I search for “r&d”, the SimpleAnalyzer won’t work because it breaks up words separated by an ampersand.

From here…

That brings everyone up to speed to where I am now. I’ve posed the question on StackOverflow (my first!) and at the moment, the only answer to it suggests I write my own analyzer, one that acts like the StandardAnalyzer but doesn’t throw out the word and. That sounds reasonable to me, at least until someone searches for “research or development”.

Another option I’m considering is to tell the indexer to index specific phrases like “research and development” or “oil and gas” or other common ones used in the domain. Not sure I like the long-term maintenance of either option but search is a journey, not a destination, I suppose.
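
A minimal sketch of that second option, in Python and with no Lucene involved: before tokenizing, rewrite every known variant of a domain phrase to a single canonical token, and do it identically for documents and queries. The phrase table, the underscore-joined token names, and the function names are all hypothetical.

```python
import re

# Hypothetical domain phrase table: every variant maps to one canonical token.
PHRASE_ALIASES = {
    "research and development": "research_and_development",
    "research & development": "research_and_development",
    "r&d": "research_and_development",
    "r & d": "research_and_development",
    "oil and gas": "oil_and_gas",
}

# Illustrative stop list, applied only after the phrases are protected.
STOP_WORDS = {"a", "an", "and", "or", "the"}


def _build_alias_pattern(aliases):
    # Try longer variants first and refuse to match inside a larger word.
    variants = sorted(aliases, key=len, reverse=True)
    body = "|".join(re.escape(variant) for variant in variants)
    return re.compile(r"(?<![a-z0-9])(?:" + body + r")(?![a-z0-9])")


_ALIAS_PATTERN = _build_alias_pattern(PHRASE_ALIASES)


def analyze(text):
    """Collapse known phrases to canonical tokens, then tokenize and drop stop words."""
    lowered = text.lower()
    collapsed = _ALIAS_PATTERN.sub(lambda m: PHRASE_ALIASES[m.group(0)], lowered)
    tokens = re.findall(r"[a-z0-9_]+", collapsed)
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("Our R&D team and the Research and Development budget"))
    print(analyze("research & development"))
```

The maintenance worry is visible right in the table: every new phrase or spelling someone cares about is another entry to add, and a changed table means re-indexing.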

There’s a fundamental argument buried in here somewhere. Lucene gives you so much control over your indexing/searching that if you’re one of those Type A’s that can’t stand when something is just “good enough”, you can very easily drive yourself up the wall trying to optimize things. It really does require you to put some thought into how users will use your search. As much as Microsoft Indexing Service allowed me to throw up a search interface haphazardly, I believe you do yourself an injustice by not considering the ins and outs of the process.

 

By the way, a couple people asked about the code I used to extract text from Word and PDF docs. Lovingly provided by the venerable (and I hope I used the right word there) Brian Donahue, the relevant classes are attached in their entirety. The only thing different about this code snippet compared to others I’ve been sent in the past is that this one worked out of the box with absolutely no help from me. Seriously, I can’t even tell you what the internal method names are, that’s how little I looked at it. Call Parser.Parse(filename) and watch the magic fly.

Kyle the Found-ational

Thursday, September 10, 2009

Long-winded background for this long-winded post can be found here and here. The short version is: I have an app that searches using a full-text search of a document repository consisting of Office docs and PDFs.

The current version uses Microsoft Indexing Service and it is all but obsolete. Which is fine with me for the time being because of the economics of the situation. Namely, the app isn’t big enough to warrant putting the effort into updating it just for the sake of the technology.

Two things happened recently though that made me decide it was time to update. First was Simone Chiaretta’s masterfully-timed tutorial series on getting started with Lucene.NET. The second was the boss discovering the current version doesn’t actually work.

By many orders of magnitude, the boss is the biggest user of this application. And recently, he went about searching for a relatively common term: R&D. He was met with a nicely formatted 500 Server Error page and asked me if I would be so bold as to fix it.

A day and a half later and I simply had no fix. The existing app uses a SQL query to search the Indexing Service which, at the time, I thought was tres clever. But after trying a dozen methods of escaping it, there was no way I could get it to accept an ampersand. Furthermore, I also discovered the page failed when including words like ‘AND’ and ‘OR’. This problem was fixable but required some parsing of the search term and again, the cost/benefit for making this sort of thing bullet-proof just wasn’t there.

So when Simone’s tutorial started coming across my RSS feed, it just made sense to re-think the problem.

The main reason I resisted moving away from Indexing Service for so long is thus: I don’t need to manage the index myself. Once configured, the only thing I needed to do to add a document to the index was drop it into a folder.

But therein lay the problem: Because I don’t manage the index, I have no control over it. During my travels, I discovered I was getting false positives for some terms and that it was not returning all documents in some cases.

How did I discover this? I implemented Lucene and compared the results. If I discovered a discrepancy and it was because of how I indexed with Lucene, well, then I tweaked the indexing process. If the discrepancy was with Indexing Service…well, then I said, “&*%$ it! You’re getting replaced!”

Another bit of fortuitousness came in the form of one Brian Donahue. One of my fears going into this was how I would get the text out of the documents in order to index it. I had waking nightmares of Windows API calls and IFilters, dreading having to deal with this. So when, after outlining my pain on Twitter, Brian responded with, and I’m paraphrasing, “Here, take this code. It will do that for you and it WORKS RIGHT OUT OF THE BOX!”, I don’t mind admitting I developed an unhealthy infatuation with him for a short time after that.

In fact, it was through the process of managing the index myself with Lucene that I discovered one reason some documents weren’t getting indexed. They were RTF documents but they had a .doc extension so it was using the wrong IFilter to capture the text. It was through Brian’s code for extracting text that I figured this out. It threw errors on some documents and I couldn’t figure out why until I tried to Save As… when working with them in Word. Change the extension to its rightful .rtf and the indexing process hummed along. But with the Indexing Service, these documents simply weren’t indexed. No error, no notification. It’s possible a message was posted to the event log but that’s a little too passive even for me.
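
If you want to hunt for these impostors before indexing, one cheap check is the file signature: RTF files begin with the characters {\rtf, whatever the extension claims. Below is a minimal Python sketch of that check. The function names are hypothetical, and it has nothing to do with Brian’s code or with IFilters; it only peeks at the first few bytes.

```python
RTF_SIGNATURE = b"{\\rtf"  # the five bytes an RTF file starts with


def looks_like_rtf(path):
    """Return True if the file at path starts with the RTF signature."""
    with open(path, "rb") as handle:
        return handle.read(len(RTF_SIGNATURE)) == RTF_SIGNATURE


def mislabeled_rtf_files(paths):
    """Return the paths that are named .doc but are really RTF inside."""
    return [
        path for path in paths
        if path.lower().endswith(".doc") and looks_like_rtf(path)
    ]
```

Feed it the list of files headed for the index and rename (or re-route) whatever comes back before the extractor has a chance to choke on it.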

I hope to have some more technical details but I want to wait until Simone is further along so I don’t duplicate. I’ll very likely piggy-back off a couple of his posts but to summarize: Lucene.NET rocks and should be used any time you have a button labeled “Search”, “Find”, “Locate”, or “Git it!” Fear not the index-management process because like WebForms, the problem isn’t hard enough that it needs to be abstracted.

Kyle the Co-located

Wednesday, September 09, 2009

Bahamas Software Development User Group, we hardly knew ye.

It’s been just over a year since I started the short-lived group but alas! It is no more. In this post-mortem, we discuss What Went Wrong by providing smug pieces of advice fueled by 20/20 hindsight.

Know what you’re getting into

As much as you’d like to keep the process lean, there is always work to be done. Initially, you may be required to give the majority of the presentation. There may be sponsors to solicit, presenters to organize, and government officials to appease when you try to explain that that box of lasciviously-shaped USB keys is for an upcoming “code camp”.

Get help

If you want to follow the Ozark Symphony Orchestra around on its whirlwind tour of Athens, Prague, Vienna, and Paris, you’ll need someone to fill in for you. A group run by a single person isn’t a group.

Be prepared for skepticism

Okay, this one surprised me when I made up my list. And since I recognize the perils of having unwavering optimism, it shouldn’t have. Many people I talked to came up with half a dozen reasons why it wouldn’t work: people are too secretive, it’s just another marketing tool for Company X, I work all day so why would I bother coming out in the evening.

The culmination of this was when one person accused me of using the group as a front to bring my “cronies” in to steal jobs from Bahamians and threatened to call the immigration department on me. Which is odd since I don’t work for a local company. Short version: some people will always look at what you aren’t doing rather than what you are.

Be flexible

I started the group as a .NET-specific one. In the group’s death throes, I broadened the scope to software development in general to account for the small size of the population and the wide variety of skills and interests. Many people are web designers who have had to learn programming to meet customer demands. And a session titled “Integrating SharePoint with BizTalk” probably won’t have much relevance.

Know your public

This was, I believe, the one that effectively killed the group. I’ll have a follow-up post on it with more specifics when I’m able to keep my frustration at bay and can talk about it diplomatically.

 

In the end, whatever external factors exist, the primary reason the group didn’t work is that I didn’t have the fortitude to see it through. Maybe it was arrogance, maybe it was naiveté. Probably a bit of both. I wish this was only the first time I started something without anything more than good intentions. I doubt I’m the only one that starts things like this with an optimistic “let’s see what happens” without giving much thought to the work involved but it’s still kind of embarrassing that I folded up effectively because I didn’t feel like putting in the effort anymore.

I’d call it a lesson learned but we all know better…

Kyle the Unimproved

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Copyright © 2010 Kyle Baley. All rights reserved.
 