Blog

PostgreSQL Expert Insight Series – Blog 2 of 4: Checking text within a larger string


Hello and welcome to the second of my blogs. If you’ve picked up the first one: Blog 1 of 4: Aggregation, thanks for reading! So as I said before, I’ve always admired the desire of the PostgreSQL community to share knowledge and expertise. Following in that tradition I’d like to share a few pearls of PostgreSQL wisdom that I’ve picked up over the years as a PostgreSQL developer.

Now let’s get started with blog 2 of 4.

2 of 4: Checking text within a larger string

A few years back, I worked for a Dutch financial organisation, so as you can imagine, bank account numbers are an essential part of their business. Bank account numbers are formatted according to the IBAN standard. An overview of the countries using IBAN, including their formats, can be found here: https://en.wikipedia.org/wiki/International_Bank_Account_Number#IBAN_formats_by_country

In general, an IBAN consists of:

  • 2 letters.
    The 2-character country code, based on the ISO 3166-1 (alpha-2) standard.
    See https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 . The chapter “Decoding table” on that page gives a complete overview of the 26×26 letter combinations, with a colour indicating the status of each.
  • 2 numbers.
    In general, these digits are used to check the validity of the IBAN (checksum).
  • A number of letters.
    These indicate the bank code within the country.
  • A number of digits.
    The domestic account number itself.

As I said before, it’s a general layout, because as you might expect…there are exceptions.
In the Netherlands, however, the format is: NL99 XXXX 9999 9999 99, with the “XXXX” being the bank code. This code corresponds to the first four letters of the bank’s Business Identifier Code (BIC).

See: https://en.wikipedia.org/wiki/ISO_9362
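As a sketch, the whole Dutch layout described above can be checked with a single regular expression in PostgreSQL (the IBAN in the example is made up, not a real account):

```sql
-- Sketch: validate the Dutch layout NL99 XXXX 9999 9999 99
-- (country code, 2 check digits, 4-letter bank code, 10 digits).
-- 'NL99XXXX9999999999' is a made-up IBAN for illustration only.
SELECT 'NL99XXXX9999999999' ~ '^NL[0-9]{2}[A-Z]{4}[0-9]{10}$';
-- returns true
```

Note that this only checks the layout, not the checksum digits.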

To check if this 4-letter combination for a specific bank exists within the IBAN, there are several options you can consider. Here’s a quick summary of the options you have in PostgreSQL:

  • LIKE
  • ILIKE (as a reminder, this is a case insensitive version of the LIKE operator)
  • ~
  • ~* (as a reminder, this is a case insensitive version of the ~ operator)
  • SUBSTRING
  • POSITION
  • SIMILAR TO
  • RIGHT in combination with LEFT
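To illustrate how the case-insensitive variants behave, here is a quick sketch, again with a made-up IBAN:

```sql
-- The patterns below look for a lower-case bank code, so only the
-- case-insensitive operators (ILIKE, ~*) match the upper-case IBAN.
SELECT 'NL99XXXX9999999999' LIKE  '____xxxx%' AS like_match,   -- false
       'NL99XXXX9999999999' ILIKE '____xxxx%' AS ilike_match,  -- true
       'NL99XXXX9999999999' ~  'xxxx'         AS regex_match,  -- false
       'NL99XXXX9999999999' ~* 'xxxx'         AS regex_ci;     -- true
```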

But which one is the fastest, and why does that matter? There are two reasons why…

  1. If you have to process a lot of records within a batch and you want to determine whether the money should be transferred to another account within the same bank or to another bank, then you need to check whether the beneficiary’s IBAN contains the same 4-letter bank code as the account from which the transaction is being made.
  2. As a customer, you could transfer money to another account. By optimizing network and database activity, the bank could save several milliseconds per transaction. Those milliseconds may not sound important to you as a customer, but for a bank that processes millions of transactions every day, it saves resources and avoids clogging up the process, which would otherwise mean you, as a customer, have to wait after clicking the button to transfer the money.
     To give you an idea: assume 5 million transactions a day and a saving of just 1 millisecond each. That saves 5,000 seconds, which is almost 1½ hours of waiting time (busy waiting).
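You can even let PostgreSQL do that arithmetic for you:

```sql
-- 5 million transactions, 1 millisecond saved on each:
SELECT 5000000 * interval '1 millisecond';
-- 01:23:20  (5,000 seconds, almost 1.5 hours)
```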

Returning to the possibilities I mentioned above, the following fragments can be used to determine the bank code contained in the IBAN. I will use “XXXX” as the bank code in these snippets.

Alternative                              Average duration of 1 mln executions on a Raspberry Pi 2
iban LIKE '____XXXX%'                    00:01:24.485594
iban ILIKE '____XXXX%'                   00:01:23.221809
iban ~ '....XXXX'                        00:01:24.454268
iban ~* '....XXXX'                       00:01:25.413386
substring(iban from 5 for 4) = 'XXXX'    00:01:25.751606
position('XXXX' in iban) = 5             00:01:23.835435
iban SIMILAR TO '____XXXX%'              00:01:23.695471
right(left(iban, 8), 4) = 'XXXX'         00:01:22.500365

I did some research on different machines, from a (low-end) Raspberry Pi 2 to a (high-end) production machine with many cores and a lot of memory. Whether I ran 1 million executions (on the Raspberry Pi 2) or 10 million (on the production machine), the differences between the alternatives remained insignificant on any given machine. Of course, the Raspberry Pi 2 itself was much slower. Its results are shown above as the average duration for each alternative.
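For anyone who wants to repeat this kind of measurement, a minimal sketch (not the original benchmark script) could look like the PL/pgSQL block below; in practice you would of course read the IBANs from a table rather than use a constant:

```sql
-- Hypothetical timing sketch: run one predicate a million times
-- and report the elapsed wall-clock time.
DO $$
DECLARE
    started timestamptz := clock_timestamp();
    hit     boolean;
BEGIN
    FOR i IN 1..1000000 LOOP
        hit := 'NL99XXXX9999999999' LIKE '____XXXX%';
    END LOOP;
    RAISE NOTICE 'LIKE took %', clock_timestamp() - started;
END
$$;
```

Swap in each of the alternatives from the table to compare them under identical conditions.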

Sometimes it’s nice to see that time apparently solves a problem. I experimented with these searches years ago on PostgreSQL 9.0, on completely different underlying hardware and a different operating system. Back then, (I)LIKE, ~ and ~* were considerably slower than using position. So if you’re upgrading your database, it’s worth re-testing this!

Make sure you check out Blog 3 of 4: Primary key considerations.
