Calculating differences between rows in SQL - Cybertec


Recently we had some clients who had the desire to store timeseries in PostgreSQL. One of the questions, which seems to interest people in this area, is related to calculating the difference between values in timeseries data. How can one calculate the difference between the current and the previous row?
To answer this question I have decided to share some simple queries outlining what can be done. Note that this is not a complete tutorial about analytics and windowing functions but just a short introduction to what can be done in general.

Loading sample data

Let us load some sample data:

cypex=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); cypex=# COPY t_oil FROM PROGRAM 'curl https://www.cybertec-postgresql.com/secret/oil_ext.txt';
COPY 644

If you are a superuser you can use COPY … FROM PROGRAM directly. Otherwise you have to load the text file in a different way.

lag: How to access a different row in SQL

If you want to calculate the difference between two rows, you can make use of the “lag” function. However, there is a question naturally arising here: If we want to access the previous row. What is the previous row? We will need some kind of order. To do that SQL provides the OVER-clause:

cypex=# SELECT country, year, production, lag(production, 1) OVER (ORDER BY year) FROM t_oil WHERE country = 'USA' LIMIT 10; country | year | production | lag ---------+------+------------+------- USA | 1965 | 9014 | USA | 1966 | 9579 | 9014 USA | 1967 | 10219 | 9579 USA | 1968 | 10600 | 10219 USA | 1969 | 10828 | 10600 USA | 1970 | 11297 | 10828 USA | 1971 | 11156 | 11297 USA | 1972 | 11185 | 11156 USA | 1973 | 10946 | 11185 USA | 1974 | 10461 | 10946
(10 rows)

In my example I have ordered the data by year to make sure that the previous year can indeed be found in the previous row. Once we found the correct row the rest is simple:

cypex=# SELECT country, year, production - lag(production, 1) OVER (ORDER BY year) AS diff FROM t_oil WHERE country = 'USA' LIMIT 10; country | year | diff ---------+------+------ USA | 1965 | USA | 1966 | 565 USA | 1967 | 640 USA | 1968 | 381 USA | 1969 | 228 USA | 1970 | 469 USA | 1971 | -141 USA | 1972 | 29 USA | 1973 | -239 USA | 1974 | -485
(10 rows)

What is important to see here, is that the first row contains a NULL entry because there is no known difference to the previous row.

What many people need is the difference between the current and the first row in the data set. PostgreSQL (or ANSI SQL in general to be more precise) offers the “first_value” function which returns the first row given the order provided by us. Here is how it works:

cypex=# SELECT t_oil.country, t_oil.year, t_oil.production, t_oil.production - first_value(t_oil.production) OVER (ORDER BY t_oil.year) AS diff_first FROM t_oil WHERE t_oil.country = 'USA'::text LIMIT 10; country | year | production | diff_first ---------+------+------------+------------ USA | 1965 | 9014 | 0 USA | 1966 | 9579 | 565 USA | 1967 | 10219 | 1205 USA | 1968 | 10600 | 1586 USA | 1969 | 10828 | 1814 USA | 1970 | 11297 | 2283 USA | 1971 | 11156 | 2142 USA | 1972 | 11185 | 2171 USA | 1973 | 10946 | 1932 USA | 1974 | 10461 | 1447
(10 rows)

As you can see, in this case everything is relative to the first row. To visualize the results I have quickly built a mini CYPEX dashboard:

cypex visualization

The data looks correct so we can move on to the next examples.

Mixing data sets in analytics

But what happens if we start to look at two countries? If we order by year we might hit the wrong row. If we order by both columns we might still hit a row associated to a different country. The solution is the PARTITION BY clause. PostgreSQL will break up the data in various groups and calculate the difference again (for each group). Here is how it works:

cypex=# SELECT country, year, production, lag(production) OVER (PARTITION BY country ORDER BY year) AS diff FROM t_oil WHERE country IN ('Canada', 'Mexico') AND year < 1970; country | year | production | diff ---------+------+------------+------ Canada | 1965 | 920 | Canada | 1966 | 1012 | 920 Canada | 1967 | 1106 | 1012 Canada | 1968 | 1194 | 1106 Canada | 1969 | 1306 | 1194 Mexico | 1965 | 362 | Mexico | 1966 | 370 | 362 Mexico | 1967 | 411 | 370 Mexico | 1968 | 439 | 411 Mexico | 1969 | 461 | 439
(10 rows)

In this example each group contains a NULL value because there is no “previous” value. This is proof that PostgreSQL handles the groups separately.

Using sliding windows

One more thing many people are interested in is the necessity to calculate moving averages. I decided to include this example in this post about differences because the problem pops up so often that it deserves some more attention. In many cases this type of operation is calculated on the application level which is clearly the wrong place to do because of performance reasons:

cypex=# SELECT country, year, production, avg(production) OVER (ORDER BY year ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS mov FROM t_oil WHERE country IN ('Saudi Arabien') AND year BETWEEN 1975 AND 1990; country | year | production | mov ---------------+------+------------+----------------------- Saudi Arabien | 1975 | 7216 | 8465.6666666666666667 Saudi Arabien | 1976 | 8762 | 8487.7500000000000000 Saudi Arabien | 1977 | 9419 | 8758.4000000000000000 Saudi Arabien | 1978 | 8554 | 9369.2000000000000000 Saudi Arabien | 1979 | 9841 | 9668.0000000000000000 Saudi Arabien | 1980 | 10270 | 9176.4000000000000000 Saudi Arabien | 1981 | 10256 | 8455.8000000000000000 Saudi Arabien | 1982 | 6961 | 7394.4000000000000000 Saudi Arabien | 1983 | 4951 | 6060.6000000000000000 Saudi Arabien | 1984 | 4534 | 5051.0000000000000000 Saudi Arabien | 1985 | 3601 | 4578.6000000000000000 Saudi Arabien | 1986 | 5208 | 4732.4000000000000000 Saudi Arabien | 1987 | 4599 | 4952.6000000000000000 Saudi Arabien | 1988 | 5720 | 5653.4000000000000000 Saudi Arabien | 1989 | 5635 | 5764.7500000000000000 Saudi Arabien | 1990 | 7105 | 6153.3333333333333333
(16 rows)

In SQL you can use ROWS BETWEEN … PRECEDING AND … FOLLOWING. It defines the number of rows going into the aggregate function (in our case “avg”). The idea of a moving average is to flatten the curve and create a smoother line. The following picture shows, how this works:

Cypex dashboard

As you can see in the CYPEX visualization the moving average is a lot smoother than the rate underlying data. Saudi Arabia is a so called “swing producer”. Depending on the political situation the production rate might vary significantly so using a moving average actually makes a lot of sense.

Composite types and row comparisons

However, there is more: Some of you might know that PostgreSQL supports composite data type. Basically every row can be seen as a single element containing various components. Usually a SELECT clause lists all desired fields, but you can also see a table as a single field as shown in the next example:

cypex=# SELECT t_oil, lag(t_oil) OVER (ORDER BY year) FROM t_oil WHERE country = 'Mexico' AND year IN (1980, 1981); t_oil | lag -----------------------------------------+----------------------------------------- ("North America",Mexico,1980,2129,1048) | ("North America",Mexico,1981,2553,1172) | ("North America",Mexico,1980,2129,1048)
(2 rows)

In this case all columns of a row are packed into a single field. You can use the “lag” function normally …

The trick now is: You can use “=” to compare two rows directly. Why is that important? Sometimes you want to see if two rows were imported twice or you simply want to know if two consecutive rows are identical. This is how it works:

cypex=# SELECT t_oil = lag(t_oil) OVER (ORDER BY year) FROM t_oil WHERE country = 'Mexico' AND year IN (1980, 1981); ?column? ---------- f
(2 rows)

It is possible to compare entire rows to each other. PostgreSQL will inspect one field after the other and only issue true in case all fields are the same. In other words: “lag” can even be abused to detect duplicate rows.