Hello and welcome to the third of my blogs. If you’ve picked up the first one: Blog 1 of 4: Aggregation and the second one: Blog 2 of 4: Checking text within a larger string, thanks for reading! So as I said before, I’ve always admired the desire of the PostgreSQL community to share knowledge and expertise. Following in that tradition I’d like to share a few pearls of PostgreSQL wisdom that I’ve picked up over the years as a PostgreSQL developer. Now let’s get started with blog 3.
3 of 4: Primary key considerations
At some point, you will probably be asked to clean up data. It could be because you’re switching to a different data model that requires data conversion from the old way of storing it. Or because the data you receive from other parties isn’t normalised. For example, you have the same address for different people on the lines to import.
Another way in which data can become contaminated is by defining a primary key as serial and forgetting to define a constraint to make it unique on a functional level. This may even be intentional as there is no uniqueness.
So to clean up a dataset you might run into questions about how to do it. I have a couple of solutions.
One way to remove duplicate data is to write a cursor with an
order by on all columns and if the current record is equal to the previous retrieved data, delete the current one. This is one way to do it, but for large data sets I wouldn’t recommend it. In the past, pre PostgreSQL 12, a table could have the system column “oid”, but this is no longer available. So what’s the alternative? Looking at the other system columns, a good candidate is the “ctid”. This is a reference to the physical location where the data is stored. To me, this is unique enough if you do it in one database transaction. You may still have to wait a long time before it’s ready. If so, consider doing it in chunks by defining partitions on the data yourself. For example, partitions on the column with the most different values.
For tables that need to be de-duplicated and are expected to be ready within a reasonable time, the following snippet could be used:
DELETE FROM table_with_dups t1 USING table_with_dups t2 WHERE t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND ... AND t1.colN = t2.colN AND t1.ctid < t2.ctid
ctid-columns, is the crux of deleting the duplicates.
As I suggested before, if you want to do it in chunks, just add an extra condition to one or more columns. In the end, however, all possible combination of values must be covered.
Of course a ‘reasonable’ time depends on, for example, the size of your table and how long a table can have row-exclusive locks without disturbing the application.
Another use of this
ctid column is to get quick access to records. For example, if you need to perform multiple manipulations on a record or multiple records, you could store the
ctid value in a variable or array. Yes, the data type is there for that, namely
tid. This might help if it makes your code more readable. If you had to do it via a primary key consisting of several columns, it will be much faster with the
ctid value. And in my opinion, it’s much easier to understand this way.
Another use is to collect the
ctid(s) alongside other data from a query. I suggest doing some calculations and processing, and use the collection of the
ctid(s) to store the result in the table(s) where the
ctid(s) came from. This can also be done within one query if, for example, you use Common Table Expressions
CTEs. So you start with a query that makes the main selection, and in another
CTE the processing takes place and the result is then used in the final query in combination with the
ctid's. To give you an idea of what I mean, see the abstracted code below.
WITH preparation AS ( SELECT b.ctid , b.batch_id FROM batch b WHERE b.status = 'NEW' ) , do_process AS ( SELECT prp.ctid , fnc.* FROM preparation prp JOIN LATERAL ( SELECT * FROM function(prp.batch_id) fnc ON (TRUE) ) UPDATE batch b SET status = 'DONE' , nr_of_transactions = dp.nr_trx FROM do_process dp WHERE b.ctid = dp.ctid
I know it is a little overdone, but it’s just to illustrate a possibility and it’s a nice bridge to my next blog. Make sure you check it out Blog 4 of 4: Moving data around.
What are other people reading?
June 20, 2023, the PostgreSQL community has made available the new minor releases of PostgreSQL 11 through 15. For us, this means that based on these new minor releases, we have to start working on compiling the new releases of PostgresPURE. Meanwhile, we have done...
Introducing the new release of Cortex and future developments. It took some time and we had some delay but finally the new release of Cortex is there. The delay was mainly caused by use cases given by our clients, which gave us a lot of inspiration to extend and improve the capabilities of Cortex. …
For me, migrating Oracle databases to PostgreSQL is a combination of science and art. Science is designed to be objective and guided by data and art is subjective and heavily influenced by feelings and opinions. Traditionally, art and science are treated as two separate disciplines, but when they are studied together, it becomes clear how one influences the other.