Techblog Index

The Bomb vs The Rifle: Effective data searching

Searching never seemed something really hard to do because SQL’s LIKE was there to aid the oppressed but… does LIKE the job right?

So we have data and when it gets bigger and bigger, we need to search through it. Let’s take some approaches to the Search function and find out the better way.

The Search Scope: Who will be using it and what they look for

The answer is short: people (do you hear the sound of evil?)
People has the natural-born fear of technology so we can expect that the search terms will be as less-accurate as possible. Imagine your user sitting before your search box and thinking “What should I put on this little box to get what I want?” and then putting a single word about it.

Example:
The user wants to find about a pretty neat Mont-Blanc pen.
The possible keywords this user will enter are: mont-blanc and pen (unless the user is high, there isn’t much more keywords to serve)

Let’s assume he doesn’t wan’t to be that specific (may be looking for other models) and search only for pen.

The Search Methods: The Bomb and the Rifle

The Bomb: The LIKE method
Widely diffused, the LIKE method is the most popular since it can be ran in any table within any datatype. This method uses a comparison function present in many DBMS that searches for anything matching a * (everything) pattern on the side specified with a percentage (%) symbol.

For example:

SELECT *
FROM keyword
WHERE keyword LIKE '%pen%'; -- will get *pen*

NOTE: PostgreSQL has an implementation called ILIKE that performs the same on a case-insensitive manner.

Drawbacks:
Speed issues
This method runs on a data retrieving method called SEQ SCAN (or Sequential Scan). It basically runs through all the table comparing your query with the data, one by one. It takes time. In this tiny case (13 records) it took 0.0009s (0.9ms). Big tables’ nightmare.

Relevancy issues
As it only returns rows that matches in any way, you can’t get a keyword density analysis before data has being retrieved and thus show less-relevant products to the client. Business owners’ nightmare.

Senseness issues
Imagine our pen example. Could you imagine what results pen* could bring up? I think our Mont-blanc user will need some parental control. Users’ nightmare.

That is why I call him ‘The Bomb’ since it returns you loads of data with short accuracy.

The Rifle Method: Indexing words
This method is what the Google Search Era represents now. Relevance, best-matching, keyword density.
When we say ‘We’ll have to wait until Google indexes…’ or ‘My page is poorly indexed…’ that means this indexing: Breaking your text into words and saving its position (starting from 0) on the field that will be searched for and storing it on a keyword index table. The basic keyword index table formula is the following: id-of-content-registry,the-keyword,the-field-where-keyword-is-located and the-position-of-the-keyword-on-the-field's-content. So when we search for ’007′ we perform a exact-match query:

SELECT *
FROM keyword_index
WHERE keyword = 'pen'; -- only exact pen match

NOTE: Surely Google analyzes much more than the position on the pages’ content such as context, semanticsness etc.

Gains:
Speed issues
This query (in the same tiny case (13 records)) took 0.0006s (0.6ms) what is an insignificant difference on this scenario with few records that will show up broadly when you reach high volume of data. Big tables’ heaven

Relevancy Issues
With the position field being stored we have info of each time the keyword appeared on a text. Performing a simple COUNT() will feed us with the information of how many times that keyword appears on the text (density) so we can sort via relevancy to the end-user. Business owners’ heaven

Senseness Issues
In this method we may get Mont-blanc pen, Lamy pen (and no Penthouse and other non-user-friendly things). Users’ heaven

The indexed systems may vary. You may have the position field or not. You may have even more fields. Depends on your search criteria.

So, depends the size of the target you wanna reach.
The Rifle (Indexed) search method gives you more speed and accuracy, The Bomb (LIKE) search method gives your more results.

Choose your weapon!

 
 

Reader's thoughts on "The Bomb vs The Rifle: Effective data searching"

1
  1. That’s it, searching with LIKE is very very bad when your database has searched beyond a few hundred rows.
    When you have multiple fields (or tables)to search, things become much worse, because you have to combine several LIKE searchs with an OR, which will not run fast, even with text indexing on the fields.

Leave a Reply