My coffee site INeedCoffee needed a better search engine. I had thrown some basic SQL together when the site was launched back in 1999. It did an OK job when the site didn’t have much content. Over the years, the quality of the search results got worse and worse. So I did what any coder would do, I looked for a free solution.
Google did a better job searching my site than my own code, so I looked at their Google Custom Search solution. I didn’t like their free ad version and I didn’t what to pay them $100 each and every year for the non-ad version. I decided that not only could I write my own search engine that was just as fast, but I could also deliver better results to the users. After all, I knew my content better than anyone else.
Assigning a Quality Score
The first thing I noticed about Google’s search results is that the best article on a given topic often wasn’t listed first. It had no way to know the quality, but I did. So I added a quality score of 1 to 5 for every article. The default was 3. The best content was rated a 4 or 5. Articles that needed better photos or improved in some way, I would give a 1 or 2. Later I’d also use this quality score when assigning weight on the sitemap.
Web Form -> Server Side Code -> Stored Procedure
The HTML search form is pretty basic. A single text box and a submit button. What server-side code you use to call the stored procedure is irrelevant. ASP.NET, Classic ASP, PHP – it is all good. The server-side code will call the search stored procedure.
Two Temp Tables
The search stored procedure will have two temp tables: #searchWords and #searchResults. The purpose of #searchWords is to chop up any search phrase into individual words and then record their position. Later that position will be used to order search results, which more weight being placed on the first and second words in a search query. The #searchResults table is the results being returned to the web page.
CREATE TABLE #searchWords (
word VARCHAR(100),
position INT
)
CREATE TABLE #searchResults (
url VARCHAR(100),
title VARCHAR(100),
longDesc VARCHAR(MAX),
quality TINYINT,
score INT
)
Splitting Search Phrases
For this functionality, I found some code on StackOverflow that did the job. The SplitWordList user-defined function by Terrapin works perfectly. If the user places the search term inside quotes, I do not call the SplitWordFunction and inside enter the entire phrase as one row in the #searchWords table.
INSERT INTO #searchWords
SELECT word, position
FROM SplitWordList(@searchString)
Count String Function
For the actual search, I used the Count String Occurrence Function. The search words are compared first against the article title and then the content itself.
CREATE FUNCTION [dbo].[udfCountString](
@InputString VARCHAR(MAX),
@SearchString VARCHAR(100)
)
RETURNS INT
BEGIN
RETURN (LEN(@InputString) -
LEN(REPLACE(@InputString, @SearchString, ''))) /
LEN(@SearchString)
END
I Like Cursors
The most straight forward approach I could think of for getting search results was to use two cursors. One with the content and one with the search words. Then write the hits to the #searchResults temp table. But cursors are often frowned upon for poor performance. I decided I would first code the search engine using Cursors and then if I ran into a performance problem, I’d come up with an alternate solution. But I didn’t need to, as I got rocking fast results using CURSORS.
DECLARE ContentCursor CURSOR FAST_FORWARD FOR
SELECT url, title, longDesc, quality, page
FROM Articles
DECLARE SearchWordCursor CURSOR DYNAMIC FOR
SELECT word, position FROM #searchWords
OPEN SearchWordCursor
OPEN ContentCursor
FETCH NEXT FROM ContentCursor
INTO @url, @title, @longDesc, @quality, @page
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH FIRST FROM SearchWordCursor
INTO @word, @position
WHILE @@FETCH_STATUS = 0
BEGIN
-- place more weight on the first search term
SELECT @score = CASE @position
WHEN 1 THEN 3
WHEN 2 THEN 2
ELSE 1
END
-- search the TITLE
SET @count = dbo.udfCountString(@title, @word)
IF @count > 0
BEGIN
INSERT INTO #searchResults
VALUES (@url, @title, @longDesc, @quality, @score * 10)
END
-- search the PAGE
SET @count = dbo.udfCountString(@page, @word)
IF @count > 0
BEGIN
INSERT INTO #searchResults
VALUES (@url, @title, @longDesc, @quality, @score)
END
FETCH NEXT FROM SearchWordCursor
INTO @word, @position
END
FETCH NEXT FROM ContentCursor
INTO @url, @title, @longDesc, @quality, @page
END
CLOSE ContentCursor
DEALLOCATE ContentCursor
CLOSE SearchWordCursor
DEALLOCATE SearchWordCursor
Working With the Results
Before dropping both temp tables, here is the query used to return the search results. If you look at the SQL above you will see that it is possible (likely) that a search hit will take place on both the title and the page content. I ran some tests and determined that a search hit against a word in the title was 10 times more important than the content, so I multiply the score time ten if there is a title match.
To flatten the results, I use a GROUP BY clause in the SQL. Then the results are returned in order from highest to lowest scores.
SELECT TOP 20 S.url, S.title, S.longDesc,
S.quality, SUM(S.score) AS Score
FROM #searchResults S
GROUP BY S.url, S.title, S.longDesc, S.quality
ORDER BY SUM(S.score) DESC, S.Quality DESC
Better Than Google?
I ran numerous tests comparing my search engine to Google. My hand-coded INeedCoffee search engine delivered better results at equal or faster speeds. And the best part is I don’t need to send Google a check for $100 every year.
Code
All the above code is available on GitHub.
Jim
Apr 25, 2014 — 12:00 pm
The only missing is the variables declaration for datas to fetch in
Jim
May 1, 2014 — 11:42 am
Came back to say thank you for the code. The code is brilliant and fast.
Keep up the good work.
MAS
May 1, 2014 — 4:30 pm
@Jim – Thanks. Glad to hear it worked.
Paa
Jul 21, 2014 — 5:24 pm
Hello Sir,
I’m building a basic “search engine” and i mean a true search engine such as yahoo, bing and google
i can use crawlers, I program using python… but these search engine stuff is new to me,
i have read a lot on the internet but still can really grasp the big picture on the server side.
i have asked many questions concerning SQL and have received alot of different answers…
such you don’t need SQL and you do need SQL i’m just very confused. i would seriously appreciate a very detailed
explanation of what goes on behind the HTML page. example after you hit the enter key what program grabs the keywords and then sends it to the server and how do the server interacts with the database”if there is any” which i think there should be.
MAS
Jul 21, 2014 — 5:30 pm
@Paa – This post is for people with knowledge of SQL and SQL Server. I’m sure there are many ways to write a search engine. This is but one solution, for which I am not offering support.