<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>HEAVYWORKS &#187; Database</title> <atom:link href="http://www.heavyworks.net/blog/category/database/feed" rel="self" type="application/rss+xml" /><link>http://www.heavyworks.net</link> <description>Extreme Software Engineering</description> <lastBuildDate>Fri, 27 Aug 2010 01:55:58 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3</generator> <item><title>Ordering by fields that contains null values</title><link>http://www.heavyworks.net/blog/posts/ordering-by-fields-that-contains-null-values</link> <comments>http://www.heavyworks.net/blog/posts/ordering-by-fields-that-contains-null-values#comments</comments> <pubDate>Thu, 02 Jul 2009 17:47:12 +0000</pubDate> <dc:creator>Jan Seidl</dc:creator> <category><![CDATA[Database]]></category> <category><![CDATA[ordering]]></category> <category><![CDATA[sql]]></category> <guid
isPermaLink="false">http://www.heavyworks.net/?p=242</guid> <description><![CDATA[In MySQL, as in several other database systems, null values come first in the query result set when a field is sorted in ascending order. This is a problem in many scenarios, especially when we are ordering by a position field that can contain an integer value for its position in the dataset or null if position is [...]
]]></description> <content:encoded><![CDATA[<p>In MySQL, as in several other database systems, <code>null</code> values come first in the query result set when a field is sorted in ascending order.</p><p>This is a problem in many scenarios, especially when we are ordering by a <code>position</code> field that can contain an integer value for its position in the dataset or <code>null</code> if the position is not defined. Rows with an undefined position weigh less than rows with a defined one, and so come first.</p><p>The following <acronym
title="Structured Query Language">SQL</acronym> query comes from a very common scenario: a <code>SELECT</code> that fetches all <code>city</code> records in order of &#8220;importance&#8221; (most common, not actually most important cities &#8211; don&#8217;t get mad if you live in an odd city).</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span>
  id<span style="color: #66cc66;">,</span> city
<span style="color: #993333; font-weight: bold;">FROM</span>
  cities
<span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span>
  <span style="color: #993333; font-weight: bold;">POSITION</span>;</pre></div></div><p>This returns all rows with a <code>null</code> <code>position</code> first, followed by the <code>not null</code> rows in ascending order of position.</p><p>This happens because our ordering pool looks like the following:<br
/> <code>position (integer or null), city field value (string)</code></p><p>So <code>null</code> values are considered smaller than 1 (the lowest positive integer) and therefore come first in our result set.<br
/> <span
id="more-242"></span></p><h2>Workaround</h2><p>A workaround is never a good choice, but I&#8217;m listing one here so you <em>DON&#8217;T, EVER</em> do it.</p><dl><dt>Using a huge integer value instead of null</dt><dd>Setting the default field value to a big integer so it always comes at the bottom of the list. This is particularly bad because it compromises your database integrity: your data will hold a fake value instead of a <code>null</code> value. <code>null</code> means &#8220;not set&#8221;, &#8220;undefined&#8221;; any other value does not mean the same thing. Database integrity matters so that your data can be handled and transported by other apps. You may not care about this now, but you may in the future.</dd></dl><h2>Doing it while maintaining data structure and integrity</h2><p>We will take advantage of <code>CASE</code>, which is native to most <acronym
title="Database Management System">DBMS</acronym>.</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span>
  id<span style="color: #66cc66;">,</span> city
<span style="color: #993333; font-weight: bold;">FROM</span>
  cities
<span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span>
  <span style="color: #993333; font-weight: bold;">CASE</span> <span style="color: #993333; font-weight: bold;">WHEN</span> <span style="color: #993333; font-weight: bold;">POSITION</span> <span style="color: #993333; font-weight: bold;">IS</span> <span style="color: #993333; font-weight: bold;">NULL</span> <span style="color: #993333; font-weight: bold;">THEN</span> <span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">ELSE</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">END</span><span style="color: #66cc66;">,</span>
  <span style="color: #993333; font-weight: bold;">POSITION</span><span style="color: #66cc66;">,</span>
  city</pre></div></div><p><em>NOTE: The <code>city</code> field is added to the ordering list so that rows with an undefined <code>position</code> are sorted in ascending order by city.</em></p><p><code>CASE</code> is a good <acronym
title="Structured Query Language">SQL</acronym> way to implement flow control at the database level. In this case, <code>CASE</code> returns 1 if the <code>position</code> field is <code>null</code> and 0 otherwise.</p><p>This makes the ordering pool look like:<br
/> <code>(0/1), position (integer or null), city field value (string)</code></p><p>This way, the other ordering criteria, <code>position</code> and <code>city</code>, are treated as secondary and tertiary respectively, after the (0/1) value. 1 is bigger than 0, so since every row with a <code>null</code> position gets 1, those rows come after the rows that have an integer <code>position</code>, which get 0 as their first criterion.</p><h2>Applying this concept on big tables</h2><p>As <code>CASE</code> must be evaluated for every single row, we may (and probably will) run into speed issues when using this ordering on big tables.</p><p>A solution is to maintain a <code>boolean</code> field like <code>has_position</code> and use it as the first criterion, so our ordering pool becomes something like:<br
/> <code>has_position, position (integer or null), city field value (string)</code></p><p>The <acronym
title="Structured Query Language">SQL</acronym> code is as follows:</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span>
  id<span style="color: #66cc66;">,</span> city
<span style="color: #993333; font-weight: bold;">FROM</span>
  cities
<span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span>
  has_position<span style="color: #66cc66;">,</span>
  <span style="color: #993333; font-weight: bold;">POSITION</span><span style="color: #66cc66;">,</span>
  city</pre></div></div><p><em>NOTE: Check which value comes first (true or false) when your <acronym
title="Database Management System">DBMS</acronym> orders a <code>boolean</code>, and apply ASC or DESC to <code>has_position</code> accordingly. In MySQL, where a boolean is stored as 0 or 1, 0 sorts first, so with 1 meaning &#8220;has a position&#8221; you would need <code>has_position DESC</code>.</em></p><p>The drawback is that once in a while (or whenever a position changes) you will have to run a <acronym
title="Structured Query Language">SQL</acronym> query to update the records:</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">UPDATE</span> cities <span style="color: #993333; font-weight: bold;">SET</span> has_position <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">WHERE</span> <span style="color: #993333; font-weight: bold;">POSITION</span> <span style="color: #993333; font-weight: bold;">IS</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span>;
<span style="color: #993333; font-weight: bold;">UPDATE</span> cities <span style="color: #993333; font-weight: bold;">SET</span> has_position <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">WHERE</span> <span style="color: #993333; font-weight: bold;">POSITION</span> <span style="color: #993333; font-weight: bold;">IS</span> <span style="color: #993333; font-weight: bold;">NULL</span>;</pre></div></div><p>You can even use <code>CASE</code> for this, but you will have the same speed issues.</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">UPDATE</span> cities <span style="color: #993333; font-weight: bold;">SET</span> has_position <span style="color: #66cc66;">=</span> <span style="color: #993333; font-weight: bold;">CASE</span> <span style="color: #993333; font-weight: bold;">WHEN</span> <span style="color: #993333; font-weight: bold;">POSITION</span> <span style="color: #993333; font-weight: bold;">IS</span> <span style="color: #993333; font-weight: bold;">NULL</span> <span style="color: #993333; font-weight: bold;">THEN</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">ELSE</span> <span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">END</span>;</pre></div></div><p>The difference is that you pay this cost only from time to time instead of on every query.</p><h2>Benchmarking</h2><p>We tested the three methods: ordering only by the <code>position</code> field (which doesn&#8217;t return the data the way we want), by <code>has_position, position</code> (the form proposed for big tables) and with the <code>CASE</code> trick.</p><p>The results are as expected:</p><p><img
src="http://www.heavyworks.net/wordpress/wp-content/uploads/position_ordering_benchmark_249751_rows.gif" alt="Benchmarking of ordering techniques on null fields with 249751 rows" title="Benchmarking of ordering techniques on null fields with 249751 rows" width="300" height="330" class="aligncenter size-full wp-image-253" /></p><p>The exact figures:</p><dl><dt>position</dt><dd>25.77 secs</dd><dt>has_position, position</dt><dd>12.83 secs</dd><dt>CASE trick</dt><dd>23.20 secs</dd></dl><p><em>The tests were run on an HP Pavilion DV6780SE with a Core2 Duo 1.66 GHz and 3 GB RAM, with MySQL 2.11.3deb1ubuntu1.1 running under Ubuntu Linux 8.04, on a table (without indexes) with 249,751 rows.</em></p><h3>Download the test files</h3><p>The test files include:</p><dl><dt>positioning.sql</dt><dd>Database schema</dd><dt>wordlist_pt_br.txt</dt><dd>pt_BR wordlist (use your preferred wordlist here)</dd><dt>mysql_case_speed_test.php</dt><dd>Benchmark script written in a structured-programming style. Call it with <code>?populate</code> to populate the database and generate random position numbers.</dd></dl><p><a
href="http://www.heavyworks.net/wordpress/wp-content/uploads/scripts/database/benchmark/order-with-null-values.zip">Download <em>order-with-null-values.zip</em></a></p><p>What&#8217;s your nifty trick?</p>]]></content:encoded> <wfw:commentRss>http://www.heavyworks.net/blog/posts/ordering-by-fields-that-contains-null-values/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>The Bomb vs The Rifle: Effective data searching</title><link>http://www.heavyworks.net/blog/posts/the-bomb-vs-the-rifle-effective-data-searching</link> <comments>http://www.heavyworks.net/blog/posts/the-bomb-vs-the-rifle-effective-data-searching#comments</comments> <pubDate>Sun, 28 Dec 2008 20:23:58 +0000</pubDate> <dc:creator>Jan Seidl</dc:creator> <category><![CDATA[Database]]></category> <category><![CDATA[indexing]]></category> <category><![CDATA[search]]></category> <guid
isPermaLink="false">http://www.heavyworks.net/?p=115</guid> <description><![CDATA[Searching never seemed really hard to do because SQL&#8217;s LIKE was there to aid the oppressed but&#8230; does LIKE do the job right? So we have data and when it gets bigger and bigger, we need to search through it. Let&#8217;s look at some approaches to the search function and find out which is better. The [...]
]]></description> <content:encoded><![CDATA[<p>Searching never seemed really hard to do because <acronym
title="Structured Query Language">SQL</acronym>&#8217;s <code><a
alt="SQL LIKE Operator" title="SQL LIKE Operator" href="http://www.w3schools.com/SQL/sql_like.asp">LIKE</a></code> was there to aid the oppressed but&#8230; does <code>LIKE</code> do the job right?</p><p>So we have data and when it gets bigger and bigger, we need to search through it. Let&#8217;s look at some approaches to the search function and find out which is better.<br
/> <span
id="more-115"></span></p><h3>The Search Scope: Who will be using it and what they look for</h3><p>The answer is short: <em>people</em> (do you hear the sound of evil?)<br
/> People have a natural-born fear of technology, so we can expect the search terms to be as inaccurate as possible. Imagine your user sitting in front of your search box, thinking &#8220;What should I put in this little box to get what I want?&#8221; and then typing a single word about it.</p><p>Example:<br
/> <em>The user wants to find out about a pretty neat Mont-Blanc pen.</em><br
/> The possible keywords this user will enter are: <em>mont-blanc</em> and <em>pen</em> (unless the user is high, there aren&#8217;t many more keywords to offer)</p><p>Let&#8217;s assume the user doesn&#8217;t want to be that specific (and may be looking for other models) and searches only for <em>pen</em>.</p><h3>The Search Methods: The Bomb and the Rifle</h3><p><strong>The Bomb: The <code>LIKE</code> method</strong><br
/> Widely used, the <code>LIKE</code> method is the most popular since it can be run on any table, with any datatype. This method uses a comparison operator present in many <acronym
title="Database Management System"><a
href="http://en.wikipedia.org/wiki/Database_management_system"><acronym
title="Database Management System">DBMS</acronym></a></acronym> that matches any sequence of characters (like a * wildcard) wherever a percent (%) symbol appears in the pattern.</p><p>For example:</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span>
<span style="color: #993333; font-weight: bold;">FROM</span> keyword
<span style="color: #993333; font-weight: bold;">WHERE</span> keyword <span style="color: #993333; font-weight: bold;">LIKE</span> <span style="color: #ff0000;">'%pen%'</span>; <span style="color: #808080; font-style: italic;">-- will get *pen*</span></pre></div></div><p><em>NOTE: <a
href="http://www.postgresql.org/">PostgreSQL</a> has a variant called <code>ILIKE</code> that does the same in a case-insensitive manner.</em></p><p><em>Drawbacks:</em><br
/> <strong>Speed issues</strong><br
/> This method relies on a data retrieval strategy called <code>SEQ SCAN</code> (or Sequential Scan), because a pattern that starts with % generally cannot use an ordinary index. It basically walks through the whole table, comparing your query with the data row by row. It takes time. In this tiny case (13 records) it took 0.0009s (0.9ms). Big tables&#8217; nightmare.</p><p><strong>Relevancy issues</strong><br
/> As it simply returns every row that matches in any way, you can&#8217;t get a keyword density analysis before the data has been retrieved, and so you end up showing less relevant products to the client. Business owners&#8217; nightmare.</p><p><strong>Senseness issues</strong><br
/> Take our <em>pen</em> example. Can you imagine what results <code>pen*</code> could bring up? I think our Mont-blanc user will need some parental control. Users&#8217; nightmare.</p><p>That is why I call it &#8216;The Bomb&#8217;: it returns loads of data with little accuracy.</p><p><strong>The Rifle Method: Indexing words</strong><br
/> This method is what the Google Search era stands for: relevance, best matching, keyword density.<br
/> When we say &#8216;We&#8217;ll have to wait until Google indexes&#8230;&#8217; or &#8216;My page is poorly indexed&#8230;&#8217; we are talking about this kind of indexing: breaking your text into words, recording each word&#8217;s position (starting from 0) in the field that will be searched, and storing it in a <em>keyword index table</em>. The basic keyword index table layout is the following: <code>id-of-content-registry</code>, <code>the-keyword</code>, <code>the-field-where-keyword-is-located</code> and <code>the-position-of-the-keyword-on-the-field's-content</code>. So when we search for &#8216;pen&#8217; we perform an <em>exact-match</em> query:</p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span>
<span style="color: #993333; font-weight: bold;">FROM</span> keyword_index
<span style="color: #993333; font-weight: bold;">WHERE</span> keyword <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">'pen'</span>; <span style="color: #808080; font-style: italic;">-- only exact pen match</span></pre></div></div><p><em>NOTE: Surely Google analyzes much more than the position within the page&#8217;s content, such as context, semantics, etc.</em></p><p><em>Gains:</em><br
/> <strong>Speed issues</strong><br
/> This query (in the same tiny case of 13 records) took 0.0006s (0.6ms), which is an insignificant difference in this scenario with few records, but one that will grow noticeably once you reach a high volume of data. Big tables&#8217; heaven.</p><p><strong>Relevancy Issues</strong><br
/> Because the <code>position</code> field is stored, we have a record of every time the keyword appears in a text. A simple <code>COUNT()</code> tells us how many times that keyword appears in the text (density), so we can sort results by relevance for the end user. Business owners&#8217; heaven.</p><p><strong>Senseness Issues</strong><br
/> With this method we may get <em>Mont-blanc pen</em>, <em>Lamy pen</em> (and no <em>Penthouse</em> or other non-user-friendly things). Users&#8217; heaven.</p><p>Indexing schemes may vary. You may have the <code>position</code> field or not. You may have even more fields. It depends on your search criteria.</p><p>So it depends on the size of the target you want to hit.<br
/> The Rifle (indexed) search method gives you more speed and accuracy; The Bomb (<code>LIKE</code>) search method gives you more results.</p><p>Choose your weapon!</p>]]></content:encoded> <wfw:commentRss>http://www.heavyworks.net/blog/posts/the-bomb-vs-the-rifle-effective-data-searching/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Quick Tip: Capitalize field in MySql</title><link>http://www.heavyworks.net/blog/posts/quick-tip-capitalize-field-in-mysql</link> <comments>http://www.heavyworks.net/blog/posts/quick-tip-capitalize-field-in-mysql#comments</comments> <pubDate>Wed, 17 Dec 2008 02:44:22 +0000</pubDate> <dc:creator>Jan Seidl</dc:creator> <category><![CDATA[Database]]></category> <category><![CDATA[capitalize]]></category> <category><![CDATA[mysql]]></category> <guid
isPermaLink="false">http://www.heavyworks.net/?p=24</guid> <description><![CDATA[This one was quite interesting. I wrote my PHP function to capitalize the field at insert time with ucwords(), but I already had some records in the database that I needed to convert without re-importing them. I googled for the answer and found two good pieces of code: If you want it simple, like ucfirst(), capitalizing only the [...]
]]></description> <content:encoded><![CDATA[<p>This one was quite interesting. I wrote my <acronym
title="PHP: Hypertext Preprocessor">PHP</acronym> function to capitalize the field at insert time with <a
href="http://www.php.net/ucwords">ucwords()</a>, but I already had some records in the database that I needed to convert without re-importing them.</p><p><span
id="more-24"></span></p><p>I googled for the answer and found two good pieces of code:</p><p>If you want it <a
href="http://en.wikipedia.org/wiki/KISS_principle">simple</a>, like <a
href="http://www.php.net/ucfirst">ucfirst()</a>, capitalizing only the very first letter of the first word, you can use:</p><p><div
class="wp_syntax"><div
class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">UPDATE</span> <span style="color: #ff0000;">`table`</span> <span style="color: #993333; font-weight: bold;">SET</span> target_field <span style="color: #66cc66;">=</span>
        CONCAT<span style="color: #66cc66;">&#40;</span>
                UCASE<span style="color: #66cc66;">&#40;</span><span style="color: #993333; font-weight: bold;">SUBSTRING</span><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">`source_field`</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">,</span>
                <span style="color: #993333; font-weight: bold;">LOWER</span><span style="color: #66cc66;">&#40;</span><span style="color: #993333; font-weight: bold;">SUBSTRING</span><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">`source_field`</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
        <span style="color: #66cc66;">&#41;</span>;</pre></div></div></p><p>Just replace &#8216;table&#8217; with the table&#8217;s name, and &#8216;target_field&#8217; and &#8216;source_field&#8217; with your&#8230; target and source fields?</p><p>If you want the whole job done, <a
href="http://joezack.com/index.php/2008/10/20/mysql-capitalize-function/">Joe Zack&#8217;s MySQL Capitalize Function</a> may suit you as well as it suited me.</p><p>Who&#8217;s got a better idea?</p>]]></content:encoded> <wfw:commentRss>http://www.heavyworks.net/blog/posts/quick-tip-capitalize-field-in-mysql/feed</wfw:commentRss> <slash:comments>1</slash:comments> </item> </channel> </rss>
