Splitting Strings in Oracle with the Model Clause

In database applications, it almost invariably happens that some information gets encoded into a comma separated string inside a column. After some time, it is recognized as an error in the modeling and this information must be split into individual values and inserted into a child table. The problem to find a efficient way to insert these values into the child table.

In Oracle, one can solve this problem with a hierarchical query or with PL/SQL [1]. In this post, I show an elegant solution with a query using the model clause [2], a feature introduced in Oracle 10g. With the model clause, it is possible to add columns and rows to the result set of a query and to fill theses generated rows and columns via sophisticated declarative rules. These rules describe the content of cells from the content of other cells. It is really like having a spreadsheet in SQL.

Lets begin and create a simple table entity with 2 columns: a primary key, a field for comma separated values and some dummy data.

create table entity (
    id number(19) primary key,
    values varchar2(2000)
); 

insert into entity values (1, 'abc,def,ghi');
insert into entity values (2, 'hello');
insert into entity values (3, 'foo,bar');
commit;

Now we want to normalize that information into a entity_val table.

create table entity_val (
	entity_id number(19) not null references entity(id),
	val varchar2(100),
	primary key (entity_id, val)
);

The solution using the model clause is the following insert as select query.

insert into entity_val
    select id, val
    from entity
    model
    partition by (id)
    dimension by (1 i)
    measures (
        vals vals,
        regexp_count(vals, ',') + 1 cnt,
        cast(null as varchar2(100)) val)
    rules (val[for i from 1 to cnt[1] increment 1] = regexp_substr(vals[1], '[^,]+', 1, CV(i)));

As you can see, the model clause is added at the end of a select sql statement. Here we select everything from the table entity and will build model(s) from the content of it.

The model clause allows actually the creation of many independent models by partitioning the result set returned by the primary sql statement. By partitioning by id, we indicate an independent model for each row of the table entity. The measures are additional computed columns and they are given dimensions to express their values relatively to other measures thanks to rules that are declared in the rule clause. We only need a simple numeric dimension i that has the start value of 1 representing the first extracted value. Later, we will extend the dimension i.

The first measure vals is the column vals itself to allow its use in the rules [3]. The second measure cnt is the number of values that has to be extracted from the column vals; as vals is encoded as comma separated string, it is the number of commas plus one; we extract this information with the function regexp_count. The third and last measure val represents one single extracted value. Because we will compute the real values for val with a rule, we just use null casted as varchar2 (as we extract individual strings).

The last part of the model clause are the rules that specify how the measures are computed from other measures. We have a single rule that will fill the measure val. At the beginning of the computation of the model, because we have partitioned by id and have a dimension i with start value 1, we have a simple model for each line of the table entity with 3 cells: vals[1] which contains the content of the column vals; cnt[1] containing the number of individual values, and val[1] containing null for the moment.

The model then applies the rule. On the left side of the equality, we have an expression of the type val[x] signaling that we want to compute the value of val at point x and on the right we have an expression whose value will be assigned to val[x]. In our case, x is the multi-cells expression for i from 1 to cnt[1] increment 1, which can be translated as for all integers between 1 and cnt[1]. The rule represents the computation of many cells. As for the moment only val[1] exists in the model, Oracle will create cnt[1]-1 new cells to accommodate for val[2] to val[cnt[1]]. The right side of our expression is a call to the regexp_substr function to extract the i-th non-comma word from vals[1]. The model clause function CV(i) return the current value of i.

Now we can return to the select clause of our sql statement: we select the columns id and val. Oracle will return one row for each combination of dimensions in the models. This means that we will have one row for each extracted value. You can run the following query to see the values for the dimension i and for all measures. The measures vals and cnt are only computed for i=1.

select id, i, vals, cnt, val 
from entity
model
partition by (id)
dimension by (1 i)
measures (
    vals vals,
    regexp_count(vals, ',') + 1 cnt,
    cast(null as varchar2(100)) val)
rules (val[for i from 1 to cnt[1] increment 1] = regexp_substr(vals[1], '[^,]+', 1, CV(i)))

[1]: Oracle 10g Splitting string into multiple rows

[2]: SQL for modeling

[3]: Rules