DBDeveloper: A Really Interesting Real World Data Issue

I was approached this week by a colleague that had a data issue. A column was showing duplicates but visually the values had to be different. How to figure out what was different? We would eventually want to consolidate all of the values to the same value but finding out the differences would be interesting and important - we wouldn't want to propagate the incorrect values.

Think about a table with a column and when you select distinct or select unique() or group by the values look alike but are different. How do you figure out what is different? The standard functions for strings in Oracle are okay but stripped down. There isn't a Left, Right or Mid like there might be in SQL Server or Access. There is substr(value, initial position, length). You could loop through and compare each character and exit the loop when something is different and that might give you where they are different. Then you could compare the ascii values of the characters at those positions to see what is different and decide upon a a good value and then update that character. But that is pretty complex and it takes a lot of manual intervention and you would not see an interesting issue this way.

I did some research and discovered a function called DUMP. Now, Oracle has taken a dump on my life on a daily basis for 12 years now. I finally get to take a DUMP on Oracle. Ah, revenge is sweet! Taking a dump of a column will give the type, length and ascii value of each character. This is brilliant! And native! I have obfuscated the data for an example that I can provide to you all. Imagine a table called dump_example:

create table dump_example (
dump_column varchar2(4000));

After adding in the values and we get this:

select * from dump_example;

"DUMP_COLUMN"
"20121130�EXAMPLE�TEXT�VALUE"
"20121130 EXAMPLE TEXT VALUE"
"20121130 EXAMPLE TEXT VALUE"

An interesting issue I ran into is the Length function and Dump diverging on the length for the given column:

If we select the length of the dump column from the dump example table

select dump_column, length(dump_column) len_dump_col from dump_example;

"DUMP_COLUMN" "LEN_DUMP_COL"

"20121130�EXAMPLE�TEXT�VALUE" 27

"20121130 EXAMPLE TEXT VALUE" 27

...we see that the length of all three values is 27. Now let's take a dump! Once done with that, we wipe and wash our hands with soap and hot water for a few minutes. And then, we do this:

select dump_column, dump(dump_column) len_dump_col from dump_example;

"DUMP_COLUMN" "DUMP_DUMP_COL"

"20121130�EXAMPLE�TEXT�VALUE" "Typ=1 Len=33: 50,48,49,50,49,49,51,48,239,191,189,69,88,65,77,80,76,69,239,191,189,84,69,88,84,239,191,189,86,65,76,85,69"

"20121130 EXAMPLE TEXT VALUE" "Typ=1 Len=27: 50,48,49,50,49,49,51,48,32,69,88,65,77,80,76,69,32,84,69,88,84,32,86,65,76,85,69"

"20121130 EXAMPLE TEXT VALUE" "Typ=1 Len=30: 50,48,49,50,49,49,51,48,194,160,69,88,65,77,80,76,69,194,160,84,69,88,84,194,160,86,65,76,85,69"

Notice the divergent length values. So, something in the selecting and displaying of the values suppresses the multiple white space and somehow the Length function ignores those extras as well. It is the ninth character that begins the divergence in actual value so we need to figure out what the character is related to 32, 194, and 239. If we go to http://www.asciitable.com/ we can see that 32 is the space and the others are on the extended character map and we should probably go with the space as the good character. We can update the other values to consolidate the table but that doesn't explain how two native functions in Oracle can get divergent results like they did. I still haven't quite figured out how or why that is happening. I've been considering posting this out on some expert sites and see if I get any information.

I was really pleased that this issue popped up this week. I was intrigued by the issue and enjoyed the investigation process and the new function that I learned about. Has anything happened recently in your daily work load that caused you to learn something new and exciting?

I plan on getting back to the introductory stuff in the next blog.

DBDeveloper

Friday, November 30, 2012

A Really Interesting Real World Data Issue

1 comment: