About Me

My photo
"Enoughtheory.com" had its humble beginning in the year 2011 by ( Founder of Enoughtheory.com ) Mr Ravi Kant Soni , an Enterprise Java and Spring Framework Specialist, with a bachelor degree (B.E) in Information Science and Engineering from Reva Institute of Technology at Bangalore. He has been into the software development discipline for many years now. Ravi has worn many hats throughout his tenure, ranging from software development, designing multi-tenant applications, integration of new technology into an existing system, to his current love of writing a Spring Framework book. Currently, he is a lead engineer at HCL Technology. Ravi has focused on Web and Enterprise development using Spring Framework for most of his career and has been extensively involved in application design and implementation. He has developed applications for Core-Bank, HR and Payroll System, and e-Commerce systems using Spring Framework. Ravi Kant Soni is author of book "Learning Spring Application development" http://learningspringapplicationdevelopment.com

Wednesday 25 July 2012

GROUP BY, HAVING, SUM, AVG, and COUNT(*)


Aggregation

You can use a SQL SELECT to aggregate data. Aggregation combines rows together and performs some operation on their combined values. Very common aggregations are COUNT, SUM, and AVG.
The simplest use of aggregations is to examine an entire table and pull out only the aggregations, with no other columns specified. Consider this SQL:
SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sumSales
      ,AVG(sale_amount) as avgSales
  FROM orders
If you have a very small sales order table, say about 7 rows, like this:
ORDER |  DATE      | STATE | SALE_AMOUNT
------+------------+-------+-------------
 1234 | 2007-11-01 | NY    |       10.00
 1235 | 2007-12-01 | TX    |       15.00
 1236 | 2008-01-01 | CA    |       20.00
 1237 | 2008-02-01 | TX    |       25.00
 1238 | 2008-03-01 | CA    |       30.00
 1237 | 2008-04-01 | NY    |       35.00
 1238 | 2008-05-01 | NY    |       40.00
Then the simple query above produces a one-row output:
CNT  | SUM  | AVG
-----+------+-----
  7  | 175  |  25

Some Notes on The Syntax

When we use COUNT(*) we always put the asterisk inside.
Note that the example names the output columns by saying "as sumSales" and "as avgSales". This is important because without it we will get whatever the database server decides to call it, which will vary from platform to platform, so it is a good idea to learn to use the "AS" clause.

The WHERE Clause Filters BEFORE the Aggregation

If you want to get just the sales from New York state, you can put a WHERE clause in:
SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sumSales
      ,AVG(sale_amount) as avgSales
  FROM orders
 WHERE state = 'NY'
...and you will get only the results for NY:
CNT | SUM  | AVG
----+------+----------
  3 |  85  |  28.33333
Notice of course that the average has a repeating decimal. Most databases have a ROUND function of some sort, so I can correct that with:
SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),2) as avg
  FROM orders
 WHERE state = 'NY'
...and get:
CNT | SUM  | AVG
----+------+----------
  3 |  85  |  28.33

The Fun Begins With GROUP BY

The query above is fine, but it would be very laborious if you had to issue the query (or write a program to do it) for every possible state. The answer is the GROUP BY clause. The GROUP BY clause causes aggregations to occur in groups (naturally) for the columns you name.
SELECT state,
      ,COUNT(*) as cnt
      ,SUM(sale_amount)          as sumSales
      ,ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
Which gives us this result:
STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
TX    |  2  |  40  |  20
CA    |  2  |  50  |  25  

Every Column a GROUP BY or Aggregate

When you use the GROUP BY column then every column in the output must either be a group by column or must be an aggregate function. To understand this, imagine we put "Date" into the query above:
SELECT state,
     , date -- huh?? which value should we get??
     , COUNT(*) as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
Several states have more than one row in the database, so the database server has to decide which value of DATE to give you. Since it cannot know which one you want, it throws an error and says in short, "don't confuse me!"

Two More Aggregations, MIN and MAX

If we think again about the DATE column, in most practical situations we usually want to know the smallest or largest value, or both, so this query is not uncommon:
SELECT state,
     , MIN(date)                 as minDate
     , MAX(date)                 as maxDate
     , COUNT(*)                  as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
which yields:
STATE | minDate    | maxDate    |CNT | SUM  | AVG
------+------------+------------+----+------+-----
NY    | 2007-11-01 | 2008-05-01 | 3  |  85  |  28
TX    | 2007-12-01 | 2008-02-01 | 2  |  40  |  20
CA    | 2008-01-01 | 2008-03-01 | 2  |  50  |  25  

HAVING Clause is Like WHERE after GROUP BY

The HAVING clause lets us put a filter on the results after the aggregation has taken place. If your Sales Manager wants to know which states have an average sale amount of $25.00 or more, then the query would look like this:
SELECT state,
      ,COUNT(*) as cnt
      ,SUM(sale_amount)          as sumSales
      ,ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
HAVING AVG(sale_amount) >= 25
Which gives us this result, notice that Texas is now missing, as they were just not selling big enough orders (sorry 'bout that Rhonda).
STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
CA    |  2  |  50  |  25  

When to use WHERE, When to use HAVING

Then the Sales Manager might come down and say, 'I don't want the states who have no sales after December 2008'. We might automatically code the following, which is tragically wrong:
SELECT state,
     , MIN(date)                 as minDate
     , MAX(date)                 as maxDate
     , COUNT(*)                  as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 -- WRONG! Will filter out individual rows!
 WHERE date <= '2008-12-31'
 GROUP BY state
The problem here is that individual rows that happened after 2008-12-31 will get filtered out, which will give you all stats for all states on sales before 2009. That is not right. The idea is to completely eliminate all results for states with no sales in 2009 or later, even if they had sales before that time. So we use MAX and the HAVING clause:
SELECT state,
     , MIN(date)                 as minDate
     , MAX(date)                 as maxDate
     , COUNT(*)                  as cnt
     , SUM(sale_amount)          as sumSales
     , ROUND(AVG(sale_amount),0) as avgSales
  FROM orders
 GROUP BY state
HAVING MAX(date) >= '2008-12-31'

Using All Three

You can pull some pretty nice results out of a database in a single query if you know how to combine the WHERE, GROUP BY, and HAVING. If you have ever worked with a Sales Manager, you know they constantly want to know strange numbers, so let's say our Sales Manager says, "Can you tell me the average order size by state for all orders greater than 20? And don't bother with any average less 30.00" We say, "Sure, don't walk away, I'll print it out right now."
SELECT state
      ,COUNT(*)
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount) as avg
  FROM orders
 WHERE sale_amount > 20
 GROUP BY state
HAVING avg(sale_amount) >= 30
   AND max(date) >= '2008-12-31'

How to Do a Weighted Average

Consider the case of a table that lists test, homework and quiz scores for the students in a certain course. Each particular score is worth a certain percentage of a student's grade, and the teacher wants the computer to calculate each student's file score. If the table looks like:
STUDENT     | WEIGHT | SCORE
------------+--------+-------
NIRGALAI    |     40 |    90
NIRGALAI    |     35 |    95
NIRGALAI    |     25 |    85
JBOONE      |     40 |    80
JBOONE      |     35 |    95
JBOONE      |     25 |    70
PCLAYBORNE  |     40 |    70
PCLAYBORNE  |     35 |    80
PCLAYBORNE  |     25 |    90
Then we can accomplish this in one pull like so:
SELECT student
      ,SUM(weight * score) / 100 as final
  FROM scores
 GROUP BY student
The nice thing about this query is that it works even if data is missing. If a student missed a test, they automatically get a zero averaged in.

No comments:

Post a Comment