Techblog Index

Ordering by fields that contains null values

By default, null values are put on top of the query resultset when field is ordered by in ascendant form.

This comes to be a problem in many scenarios, specially when we are ordering by a position field that can contain an integer value for its position on the dataset or null if position is not defined. Rows that have undefined position have lower weight than the specified ones thus coming first.

The following SQL query is from a very common scenario that represents a SELECT to fetch all city registries in “importance” (most common, not in fact important cities – don’t get mad if you live in an odd city) order.

SELECT
  id, city
FROM
  cities
ORDER BY
  POSITION;

This brings us all null-valued position rows first and not null positioned in ascending order, at the bottom.

This happens because our ordering pool will look like the following:
position (integer or null), city field value (string)

So null values are considered smaller than 1 (lowest positive integer) and then comes first in our resultset.

Workaround

Workaround is never a good choice, but I’m listing here so you DON’T, EVER do that.

Using a huge integer value instead of null ones
Setting the default field value to a big integer so it will come always on the bottom of the list. This is particularly bad where it compromises your database integrity since your data will have a fake value instead of a null value. null means “not set”, “undefined”. Setting it otherwise will not mean the same. Database integrity is important so your data can be handled and transported by other apps. You may not bother with this now, but you may in the future.

Doing it maintaining data structure and integrity

We will take advantage of CASE native from most DBMS.

SELECT
  id, city
FROM
  cities
ORDER BY
  CASE WHEN POSITION IS NULL THEN 1 ELSE 0 END,
  POSITION,
  city

NOTE: Field city is added on ordering list in behalf of sorting in ascending order the rows with undefined position

CASE is a good SQL way to implement flow-control under database-level. In this case, if position field is null CASE returns 1, 0 otherwise.

This will make the order pool to be like:
(0/1), position (integer or null), city field value (string)

This way, other ordering criteria position and city will be treated as secondary and tertiary respectively, depending on the (0,1) value. 1 is bigger than 0 so if all null values are treated as 1, they will come after the ones that has a position integer representation that will come with 0 as first criteria.

Applying this concept on big tables

As CASE must do a calculation for every single row we may (and probably will) encounter speed issues when handling this ordering on big tables.

A solution is to mantain a boolean field like has_position and use it on the first criteria so our order pool gets something like:
has_position, position (integer or null), city field value (string)

The SQL code as it follows:

SELECT
  id, city
FROM
  cities
ORDER BY
  has_position,
  POSITION,
  city

NOTE: See which comes first (true/false) on boolean ordering on your DBMS and apply the ASC/DESC clause to has_position accordingly

The drawback is that you will have to run once in a while (or on position change) a SQL query to update the registries:

UPDATE cities SET has_position = 1 WHERE POSITION IS NOT NULL;
UPDATE cities SET has_position = 0 WHERE POSITION IS NULL;

You can even use CASE for this, but will have the same speed issues.

UPDATE cities SET has_position = CASE WHEN POSITION IS NULL THEN 0 ELSE 1 END;

The difference is that you will have this issue just time to time instead of every query.

Benchmarking

We tested the three methods: only with position field (that doesn’t returns data in the way we want), with has_position, position (best form proposed for big tables) and with the CASE trick.

The results are as expected:

Benchmarking of ordering techniques on null fields with 249751 rows

Exactly:

position
25,77 secs
has_position, position
12,83 secs
CASE trick
23,20 secs

The tests were run on a HP Pavillion DV6780SE with a Core2 Duo 1.66ghz, 3GB ram with 2.11.3deb1ubuntu1.1 mysql running under Ubuntu Linux 8.04 on a table (without indexes) with 249.751 rows.

Download the test files

Test files include

positioning.sql
Database schema
wordlist_pt_br.txt
pt_BR Wordlist (use you preferred wordlist here)
mysql_case_speed_test.php
Structured programmed benchmark Script. Call with ?populate to populate database and generate random position numbers.

Download order-with-null-values.zip

Whats your nifty trick?

 
 

Reader's thoughts on "Ordering by fields that contains null values"

2
  1. Can i suggest you a little performance test if this scenario using the PostgreSQL’s analyzer? They can provide the cost of the queries and give you some numbers to illustrate this approach.

  2. Done as Fernando’s recommendation

Leave a Reply