PostgreSQL – How to Count Occurrences of an Anchored String

countpostgresqlstringsubstring

If I have a string in a column on a row in a table like this

1 2 2 2 2 2 2

How would I count the occurrence of a substring 2 inside the string. Assume nothing other than a space-delimiter of " ".

For the purposes of this, let's treat the numbers as substrings

Sample data

CREATE TABLE foo
  AS
    SELECT 1 AS id, '1 2 2 2 2 2 2'::text AS data;

TABLE foo
 id |     data      
----+---------------
  1 | 1 2 2 2 2 2 2

Best Answer

You can solve this with a

FASTEST was the pl/perl method which I placed last on this list because it requires pl/perl, and is likely not needed for most workloads.

FAST A string function, such as one the pattern explained below

length(str) - length(*replace(str, replaceStr))
  / length(replaceStr)

Something that converts from a string to an array.
SLOW Something that converts from a string to a table.

Possible solutions

String

Using `length` and `regexp_replace`

Most RDBMS's provide some method to calculate substring occurrences like this,

SELECT length(data) - length(replace(data, '2', ''))
  / length('2')
FROM foo;

This method isn't applicable here because without an anchor, we can't be sure if we're replacing the substring inside of something-space-delimited. As an example, the above replaces the 2 in 329. We can remedy that by using regexp_replace to anchor the substring.

SELECT length(data) - length(regexp_replace(data, '\m2\M', '', 'g'))
  / length('2')
FROM foo;

Because we're not splitting on simple spaces (' ') though we could with more complexity, we may also want to accommodate sub-strings of different lengths like in this question. This is why we explicitly include / length('2'). That reduces to a no-op, but if we we're search for something that was longer than one character, it'd be required.

SELECT length(data) - length(regexp_replace(data, '\m42\M', '', 'g'))
  / length('42')
FROM foo;

Using ARRAY[]

Splitting into an `ARRAY[]`

Here we have to subtract one match splits a string into two fragments, and thus the occurrences is one less than the fragment counts: this xyx split on y, produces {'x', 'x'} and we want the length to be 1 corresponding to the occurrences of y.

SELECT array_length(x, 1) - 1
FROM foo
CROSS JOIN LATERAL regexp_split_to_array(data, '\m2\M') AS t(x);
-- un-anchored version for reference.
-- CROSS JOIN LATERAL string_to_array(data, '2') AS t(x);

Alternatively, we can use string_to_array to separate something that's space-delimited and then count the matches,

SELECT id, array_length(array_positions(x, '2'), 1)
FROM foo
CROSS JOIN LATERAL string_to_array(data, ' ') AS t(x);

Using a TABLE

Splitting into a table with `regexp_split_to_table`

Here we split the regex into a table.. In this method you're using GROUP BY and count().

SELECT id, x
FROM foo
CROSS JOIN LATERAL regexp_split_to_table(data, ' ')
  AS t(x);

 id | x 
----+---
  1 | 1
  1 | 2
  1 | 2
  1 | 2
  1 | 2
  1 | 2
  1 | 2
(7 rows)

And, from there you can run regular SQL on it.

SELECT id, x, count(*)
FROM (
    SELECT id, x
    FROM foo
    CROSS JOIN LATERAL regexp_split_to_table(data, ' ')
      AS t(x)
) AS t(id,x)
GROUP BY id, x;

 id | x | count 
----+---+-------
  1 | 1 |     1
  1 | 2 |     6

Using `regex_matches`

Here we get away from splitting and instead use the \m, and \M anchors for word-boundaries.

SELECT count(*)
FROM foo
CROSS JOIN LATERAL regexp_matches(data, '\m2\M', 'g');

Procedural Languages

Perl

This method turned out to the be the fastest overall,

CREATE LANGUAGE plperl

CREATE FUNCTION count_occurances(inputStr text, regex text)
RETURNS smallint
AS $BODY$
  scalar @{[ $_[0] =~ m/$_[1]/g ]} 
$BODY$
LANGUAGE plperl
IMMUTABLE;

Summary and Performance Implications

Following the same format of data, performance implications can be obtained with

CREATE TABLE foo
AS
  SELECT
    1 AS id,
    array_to_string(
      ARRAY(SELECT trunc(random()*100+1)::int % 100 FROM generate_series(1,5000) AS t(x)),
      ' '
    ) AS data
;

Under these constraints, I found that the procedural method with plperl was the fastest. Next I found the following as being the fastest native method,

length(str) - regexp_replace(str, replacement, g)
  / length(replacement)

Keep in mind the tried and true method of string replacement outside of needing to anchor the string is still the fastest and most efficient native method however clunky it may be,

length(str) - replace(str, replacement)
  / length(replacement)

That said, the ARRAY[] method is massively faster than splitting to a table.

Related Solutions

SQL Server – NTEXT Columns and String Manipulation

Converting to nvarchar(max) should work unless you are doing something wrong with your CHARINDEX()

Try this code snippet, it should output what you want.

-- Create the table
CREATE TABLE [dbo].[PhilsTable](
    [comment] [ntext] NULL,
    [anothercomment] [nvarchar](50) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];

GO

-- insert very long string
INSERT INTO [dbo].[PhilsTable] (comment, anothercomment) VALUES (N'This is a test UPDATEHERE This is the end of the test' + REPLICATE (CAST(N'x' AS nvarchar(max)), 1000000), 'this goes in here');

-- verify data
SELECT DATALENGTH(comment), *  FROM [dbo].[PhilsTable];

-- perform replace
SELECT CAST(REPLACE(CAST(comment AS NVARCHAR(MAX)),'UPDATEHERE','UPDATEHERE' + anothercomment) AS NTEXT) FROM [dbo].[PhilsTable];

DROP TABLE [dbo].[PhilsTable];

Thanks go out to Andriy M for helping out with the REPLICATE statement.

Postgresql – How to disable escaping when using COPY FROM in PostgreSQL

The copy command by default uses text format with tab delimiter. So only one thing you need is to escape backslashes:

copy onegram (ngram, year, match_count, volume_count)
from program 'sed ''s/\\/\\\\/g'' < /home/tims/data/ngram/test.tsv';
select * from onegram;
╔════╤════════════╤══════╤═════════════╤══════════════╗
║ id │   ngram    │ year │ match_count │ volume_count ║
╠════╪════════════╪══════╪═════════════╪══════════════╣
║  1 │ aX13_X     │ 2006 │           8 │            5 ║
║  2 │ aX13_X     │ 2007 │           4 │            3 ║
║  3 │ aX13_X     │ 2008 │           2 │            1 ║
║  4 │ a\         │ 1852 │           1 │            1 ║
║  5 │ a\         │ 1935 │           1 │            1 ║
║  6 │ a\         │ 1937 │           2 │            2 ║
║  7 │ ACT1V1T1ES │ 2003 │          15 │           11 ║
║  8 │ ACT1V1T1ES │ 2004 │          63 │            6 ║
║  9 │ ACT1V1T1ES │ 2005 │           1 │            1 ║
║ 10 │ ACT1V1T1ES │ 2006 │           5 │            4 ║
║ 11 │ ACT1V1T1ES │ 2008 │           4 │            3 ║
║ 12 │ ACTION="   │ 1995 │           3 │            3 ║
║ 13 │ ACTION="   │ 1996 │           6 │            5 ║
║ 14 │ ACTION="   │ 1997 │           9 │            7 ║
║ 15 │ ACTION="   │ 1998 │          19 │           11 ║
║ 16 │ ACTION="   │ 1999 │          11 │            5 ║
╚════╧════════════╧══════╧═════════════╧══════════════╝