A common problem when optimizing COUNT()

This article was originally published at http://www.mysqlperformanceblog.com/2008/09/20/a-common-problem-when-optimizing-count/

A good coding standard can help here -- you can quickly write a tool to grep your source code and search for instances of such constructs, and flag them as errors unless some magic comment is embedded in the code next to them:

SELECT count(col1) FROM TABLE; -- COUNT_COL_OK

Alas, there is a lot of misinformation on the web about COUNT(), so it is not a good place to learn about what it does. 90% of what you can find online is just wrong. So I would advise a good book on the topic, except I can't think of one -- maybe Joe Celko's books address this topic clearly? What's your recommendation?

So when I see a query that just does COUNT(col1) I am left with these guesses:

You meant to count the number of rows. You should have written COUNT(*) to convey that meaning accurately. If the column is nullable, there's a bug in your query -- it won't do what you think it does.
You meant to count the number of distinct values, but you left out the DISTINCT keyword. That's also a bug in your query.
You meant to count the number of times the column's value is not null. This in my experience is pretty unlikely -- people rarely do this.

It is impossible to read the developer's mind in these cases, so I always end up getting stalled waiting for them to reply and tell me what the query means so I can optimize it.

The problem is when the COUNT() contains a column name, like this:

SELECT count(col1) FROM TABLE;

If you know your SQL well, you know COUNT() has two meanings. 1) count the number of rows 2) count the number of values. Sometimes, but not always, these are the same thing. COUNT(*) always counts the number of rows in the result. If you write COUNT(col1) it counts the number of times col1 is not null. If it's never null, the result is the same as the number of rows.

The problem with that is that you don't know by looking at the query whether the developer wanted to count rows or values -- or, quite possibly, the number of distinct values in that column. You see, there's another form for COUNT():

SELECT count(DISTINCT col1) FROM TABLE;

A common problem when optimizing COUNT()

Related Links

Professional Services

Open source software