Blog

PostgreSQL Expert Insight Series – Blog 3 of 4: Primary key considerations

Hello and welcome to the third of my blogs. If you’ve already read the first one, Blog 1 of 4: Aggregation, and the second one, Blog 2 of 4: Checking text within a larger string, thanks for reading! As I said before, I’ve always admired the PostgreSQL community’s desire to share knowledge and expertise. Following in that tradition, I’d like to share a few pearls of PostgreSQL wisdom that I’ve picked up over the years as a PostgreSQL developer. Now let’s get started with blog 3.

3 of 4: Primary key considerations

At some point, you will probably be asked to clean up data. It could be because you’re switching to a different data model that requires converting the data from the old way of storing it. Or because the data you receive from other parties isn’t normalised: for example, the lines to import contain the same address for different people.

Another way in which data can become contaminated is by defining the primary key as a serial column and forgetting to define a constraint that enforces uniqueness on a functional level. This may even be intentional, because no functional uniqueness exists.
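To illustrate, here is a minimal sketch (the table and column names are made up for this example) of a table where the serial primary key guarantees only technical uniqueness, so functional duplicates slip in unnoticed:

```sql
-- Hypothetical table: "person" has a technical key only.
CREATE TABLE person
(
    id       serial PRIMARY KEY   -- technically unique
,   name     text
,   address  text                 -- no unique constraint on (name, address)
);

-- Both rows are accepted, although they are functional duplicates.
INSERT INTO person (name, address) VALUES ('Alice', 'Main Street 1');
INSERT INTO person (name, address) VALUES ('Alice', 'Main Street 1');
```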

So when you set out to clean up a dataset, you might run into the question of how to do it. I have a couple of solutions.

One way to remove duplicate data is to write a cursor with an ORDER BY on all columns and, whenever the current record is equal to the previously retrieved one, delete the current record. This works, but for large data sets I wouldn’t recommend it. In the past, before PostgreSQL 12, a table could have the system column “oid”, but this is no longer available. So what’s the alternative? Looking at the other system columns, a good candidate is the “ctid”. This is a reference to the physical location where the row is stored. To me, this is unique enough, provided you do the clean-up in one database transaction. You may still have to wait a long time before it’s finished. If so, consider doing it in chunks by defining partitions on the data yourself, for example on the column with the most distinct values.
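To see what the ctid looks like, you can simply select it alongside the regular columns (the table name here is the hypothetical one from above). Each value is a pair of block number and tuple index within that block:

```sql
-- The ctid is a (block number, tuple index) pair, e.g. (0,1), (0,2), ...
SELECT ctid, *
FROM   person;
```

Keep in mind that a row’s ctid can change after an UPDATE or a VACUUM FULL, which is exactly why it should only be relied upon within a single transaction.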

For tables that need to be de-duplicated and are expected to be ready within a reasonable time, the following snippet could be used:

DELETE
FROM    table_with_dups t1
USING   table_with_dups t2
WHERE   t1.col1 = t2.col1
AND     t1.col2 = t2.col2
AND     ...
AND     t1.colN = t2.colN
AND     t1.ctid < t2.ctid;

The last condition, the comparison of the ctid columns, is the crux of deleting the duplicates.

As I suggested before, if you want to do it in chunks, just add an extra condition on one or more columns. In the end, however, all possible combinations of values must be covered.
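As a sketch of the chunked approach (the column name and its values are assumptions for illustration), the extra condition could look like this:

```sql
-- Chunk 1: restrict both sides of the self-join to one partition of the data.
DELETE
FROM    table_with_dups t1
USING   table_with_dups t2
WHERE   t1.col1 = t2.col1
AND     t1.col2 = t2.col2
AND     t1.ctid < t2.ctid
AND     t1.region = 'NORTH'
AND     t2.region = 'NORTH';
-- Repeat with 'SOUTH', 'EAST', ... until every value of region is covered.
```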

Of course a ‘reasonable’ time depends on, for example, the size of your table and how long a table can have row-exclusive locks without disturbing the application.

Another use of this ctid column is to get quick access to records. For example, if you need to perform multiple manipulations on one or more records, you could store the ctid value in a variable or array. Yes, there is a data type for that, namely tid. This might help if it makes your code more readable. If you would otherwise have to go through a primary key consisting of several columns, access via the ctid value is much faster. And in my opinion, it’s much easier to understand this way.
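A minimal PL/pgSQL sketch of this idea, assuming a batch table like the one in the example further on (the column names and the value 42 are made up):

```sql
DO $$
DECLARE
    v_row_location tid;   -- the data type that matches ctid
BEGIN
    -- Remember where the row physically lives...
    SELECT ctid INTO v_row_location
    FROM   batch
    WHERE  batch_id = 42;

    -- ...and use that location for fast, direct access afterwards,
    -- within the same transaction.
    UPDATE batch
    SET    status = 'CHECKED'
    WHERE  ctid = v_row_location;
END
$$;
```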

Another use is to collect the ctid(s) alongside other data from a query. I suggest doing some calculations and processing, and using the collection of ctid(s) to store the result in the table(s) the ctid(s) came from. This can also be done within one query if, for example, you use Common Table Expressions (CTEs). So you start with a CTE that makes the main selection, in another CTE the processing takes place, and the result is then used in the final query in combination with the ctids. To give you an idea of what I mean, see the abstracted code below.

WITH
    preparation AS
    (
    SELECT  b.ctid
    ,       b.batch_id
    FROM    batch b
    WHERE   b.status = 'NEW'
    )
,   do_process AS
    (
    SELECT  prp.ctid
    ,       fnc.*
    FROM    preparation prp
    JOIN    LATERAL
            ( SELECT * FROM function(prp.batch_id) ) fnc ON (TRUE)
    )
UPDATE  batch b
SET     status = 'DONE'
,       nr_of_transactions = dp.nr_trx
FROM    do_process dp
WHERE   b.ctid = dp.ctid;

I know it is a little overdone, but it’s just to illustrate a possibility, and it’s a nice bridge to my next blog. Make sure you check it out: Blog 4 of 4: Moving data around.
