Hello and welcome to the second of my blogs. If you’ve picked up the first one: Blog 1 of 4: Aggregation, thanks for reading! So as I said before, I’ve always admired the desire of the PostgreSQL community to share knowledge and expertise. Following in that tradition I’d like to share a few pearls of PostgreSQL wisdom that I’ve picked up over the years as a PostgreSQL developer.
Now let’s get started with blog 2 of4.
2 of 4: Checking text within a larger string
A few years back, I worked for a Dutch financial organisation, so as you can imagine, account numbers are an essential part of it for them. The mark-up of bank account numbers has to be done on the basis of the IBAN standard. An overview of the countries using IBAN, including their format, can be found here: https://en.wikipedia.org/wiki/International_Bank_Account_Number#IBAN_formats_by_country
In general, an IBAN consists of:
- 2 letters.
The 2-character country code relates to the ISO_3166-1 (Alpha-2) standard.
See https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 . The chapter: “Decoding table” on this webpage gives you a complete overview of the 26×26 letter combination and the colour, indicating the status. - 2 numbers.
In general, this number is used to check the validity of the IBAN (checksum). - A number of letters.
Indicating the bank code of the country. - Several numbers.
As I said before, it’s a general mark-up, because as you might expect…there are exceptions.
However, in the Netherlands it’s in the format: NL99 XXXX 9999 9999 99 with the “XXXX” being the bank code. These are in line with the first four letters of the Business Identifier Code (BIC).
See: https://en.wikipedia.org/wiki/ISO_9362
To check if this 4-letter combination for a specific bank exists within the IBAN, there are several options you can consider. Here’s a quick summary of the options you have in PostgreSQL:
- LIKE
- ILIKE (as a reminder, this is a case insensitive version of the LIKE operator)
- ~
- ~* (as a reminder, this is a case insensitive version of the ~ operator)
- SUBSTRING
- POSITION
- SIMILAR TO
- RIGHT in combination with LEFT
But which one is the fastest and why is this important? There are two reasons why…
- If you have to process a lot of records within a batch and you want to determine whether the money should be transferred to another account within the bank or whether it should be transferred to another bank; then you need to check that the beneficiary has the same 4 letters as the bank from which the transaction is being made.
- As a customer, you could transfer money to another account. By optimizing network and database activity the bank could save several milliseconds. Those milliseconds may not sound important to you as a customer, but from the perspective of the bank, which processes millions of transactions every day, it saves resources and possibly clogging up the process, which means you as a customer has to wait for the click of a button to transfer the money.
To give you can idea, let’s assume 5 million transactions a day and you can save just 1 millisecond. This saves 5,000 seconds, which is almost 1½ hours of waiting time (busy waiting).
Returning to the possibilities I mentioned above, the following fragments can be used to determine the bank code contained in the IBAN. I will use “XXXX” as the bank code in these snippets.
Alternative | Average duration on a raspberrypi v2 of 1 mln executions |
iban LIKE ‘____XXXX%’ | 00:01:24.485594 |
iban ILIKE ‘____XXXX%’ | 00:01:23.221809 |
iban ~ ‘….XXXX’ | 00:01:24.454268 |
iban ~* ‘….XXXX’ | 00:01:25.413386 |
substring(iban from 5 for 4) = ‘XXXX’ | 00:01:25.751606 |
position(‘XXXX’ in iban) = 5 | 00:01:23.835435 |
iban similar to ‘____XXXX%’ | 00:01:23.695471 |
right(left(iban, 8), 4) = ‘XXXX’ | 00:01:22.500365 |
I did some research on different machines, from a (low end) raspberry-pi v2 to a (high end) production machine with many cores and a lot of memory. But it made no significant difference on 1 million executions (“raspi 2”) or 10 million of the alternatives (different approaches for the same machine). Of course, the “raspi 2” was much slower. The results of the “raspi 2” are shown above as average durations for each alternative.
Sometimes it’s nice that ‘time’ apparently solves a problem. I experimented with search years ago with PostgreSQL 9.0. At the time, I was using completely different underlying hardware and operating system. The results were that (i)like
, the ~ and ~* were considerably slower than using position
. So if you’re upgrading your database, it’s worth taking this into account!
Make sure you check out Blog 3 of 4: Primary key considerations.
What are other people reading?
Why PostgreSQL is my favourite database
I’ve been working with database systems for over 20 years, including Oracle, Sybase and MySQL – but for me, PostgreSQL is the best of them all. So what makes PostgreSQL such a great database system? In all honesty, having used it for over a decade, I could fill many pages with my answer. But …
PostgreSQL Expert Insight Series – Blog 4 of 4: Moving data around
So here we are at the last of my series of four blogs. If you’ve picked up the other three: Blog 1 of 4: Aggregation, Blog 2 of 4: Checking text within a larger string and Blog 3 of 4: Primary key considerations, thanks for reading! And watch this space, I’m preparing...
PostgreSQL Expert Insight Series – Blog 3 of 4: Primary key considerations
Hello and welcome to the third of my blogs. If you’ve picked up the first one: Blog 1 of 4: Aggregation and the second one: Blog 2 of 4: Checking text within a larger string, thanks for reading! So as I said before, I’ve always admired the desire of the PostgreSQL...