My coffee site INeedCoffee needed a better search engine. I had thrown some basic SQL together when the site was launched back in 1999. It did an OK job when the site didn’t have much content. Over the years, the quality of the search results got worse and worse. So I did what any coder would do, I looked for a free solution.
Google did a better job searching my site than my own code, so I looked at their Google Custom Search solution. I didn’t like their free ad version and I didn’t what to pay them $100 each and every year for the non-ad version. I decided that not only could I write my own search engine that was just as fast, but I could also deliver better results to the users. After all, I knew my content better than anyone else.
Assigning a Quality Score
The first thing I noticed about Google’s search results is that the best article on a given topic often wasn’t listed first. It had no way to know the quality, but I did. So I added a quality score of 1 to 5 for every article. The default was 3. The best content was rated a 4 or 5. Articles that needed better photos or improved in some way, I would give a 1 or 2. Later I’d also use this quality score when assigning weight on the sitemap.
Web Form -> Server Side Code -> Stored Procedure
The HTML search form is pretty basic. A single text box and a submit button. What server-side code you use to call the stored procedure is irrelevant. ASP.NET, Classic ASP, PHP – it is all good. The server-side code will call the search stored procedure.
Two Temp Tables
The search stored procedure will have two temp tables: #searchWords and #searchResults. The purpose of #searchWords is to chop up any search phrase into individual words and then record their position. Later that position will be used to order search results, which more weight being placed on the first and second words in a search query. The #searchResults table is the results being returned to the web page.
CREATE TABLE #searchWords ( word VARCHAR(100), position INT ) CREATE TABLE #searchResults ( url VARCHAR(100), title VARCHAR(100), longDesc VARCHAR(MAX), quality TINYINT, score INT )
Splitting Search Phrases
For this functionality, I found some code on StackOverflow that did the job. The SplitWordList user-defined function by Terrapin works perfectly. If the user places the search term inside quotes, I do not call the SplitWordFunction and inside enter the entire phrase as one row in the #searchWords table.
INSERT INTO #searchWords SELECT word, position FROM SplitWordList(@searchString)
Count String Function
For the actual search, I used the Count String Occurrence Function. The search words are compared first against the article title and then the content itself.
CREATE FUNCTION [dbo].[udfCountString]( @InputString VARCHAR(MAX), @SearchString VARCHAR(100) ) RETURNS INT BEGIN RETURN (LEN(@InputString) - LEN(REPLACE(@InputString, @SearchString, ''))) / LEN(@SearchString) END
I Like Cursors
The most straight forward approach I could think of for getting search results was to use two cursors. One with the content and one with the search words. Then write the hits to the #searchResults temp table. But cursors are often frowned upon for poor performance. I decided I would first code the search engine using Cursors and then if I ran into a performance problem, I’d come up with an alternate solution. But I didn’t need to, as I got rocking fast results using CURSORS.
DECLARE ContentCursor CURSOR FAST_FORWARD FOR SELECT url, title, longDesc, quality, page FROM Articles DECLARE SearchWordCursor CURSOR DYNAMIC FOR SELECT word, position FROM #searchWords OPEN SearchWordCursor OPEN ContentCursor FETCH NEXT FROM ContentCursor INTO @url, @title, @longDesc, @quality, @page WHILE @@FETCH_STATUS = 0 BEGIN FETCH FIRST FROM SearchWordCursor INTO @word, @position WHILE @@FETCH_STATUS = 0 BEGIN -- place more weight on the first search term SELECT @score = CASE @position WHEN 1 THEN 3 WHEN 2 THEN 2 ELSE 1 END -- search the TITLE SET @count = dbo.udfCountString(@title, @word) IF @count > 0 BEGIN INSERT INTO #searchResults VALUES (@url, @title, @longDesc, @quality, @score * 10) END -- search the PAGE SET @count = dbo.udfCountString(@page, @word) IF @count > 0 BEGIN INSERT INTO #searchResults VALUES (@url, @title, @longDesc, @quality, @score) END FETCH NEXT FROM SearchWordCursor INTO @word, @position END FETCH NEXT FROM ContentCursor INTO @url, @title, @longDesc, @quality, @page END CLOSE ContentCursor DEALLOCATE ContentCursor CLOSE SearchWordCursor DEALLOCATE SearchWordCursor
Working With the Results
Before dropping both temp tables, here is the query used to return the search results. If you look at the SQL above you will see that it is possible (likely) that a search hit will take place on both the title and the page content. I ran some tests and determined that a search hit against a word in the title was 10 times more important than the content, so I multiply the score time ten if there is a title match.
To flatten the results, I use a GROUP BY clause in the SQL. Then the results are returned in order from highest to lowest scores.
SELECT TOP 20 S.url, S.title, S.longDesc, S.quality, SUM(S.score) AS Score FROM #searchResults S GROUP BY S.url, S.title, S.longDesc, S.quality ORDER BY SUM(S.score) DESC, S.Quality DESC
Better Than Google?
I ran numerous tests comparing my search engine to Google. My hand-coded INeedCoffee search engine delivered better results at equal or faster speeds. And the best part is I don’t need to send Google a check for $100 every year.
All the above code is available on GitHub.