Friday, November 30, 2012

A Really Interesting Real World Data Issue

     I was approached this week by a colleague that had a data issue. A column was showing duplicates but visually the values had to be different. How to figure out what was different? We would eventually want to consolidate all of the values to the same value but finding out the differences would be interesting and important - we wouldn't want to propagate the incorrect values.

     Think about a table with a column and when you select distinct or select unique() or group by the values look alike but are different. How do you figure out what is different? The standard functions for strings in Oracle are okay but stripped down. There isn't a Left, Right or Mid like there might be in SQL Server or Access. There is substr(value, initial position, length). You could loop through and compare each character and exit the loop when something is different and that might give you where they are different. Then you could compare the ascii values of the characters at those positions to see what is different and decide upon a a good value and then update that character. But that is pretty complex and it takes a lot of manual intervention and you would not see an interesting issue this way.

     I did some research and discovered a function called DUMP. Now, Oracle has taken a dump on my life on a daily basis for 12 years now. I finally get to take a DUMP on Oracle. Ah, revenge is sweet! Taking a dump of a column will give the type, length and ascii value of each character. This is brilliant! And native! I have obfuscated the data for an example that I can provide to you all. Imagine a table called dump_example:


create table dump_example (
dump_column varchar2(4000));

     After adding in the values and we get this:

select * from dump_example;


"DUMP_COLUMN"
"20121130�EXAMPLE�TEXT�VALUE"
"20121130 EXAMPLE TEXT VALUE"
"20121130 EXAMPLE TEXT VALUE"
 
    An interesting issue I ran into is the Length function and Dump diverging on the length for the given column:
If we select the length of the dump column from the dump example table 

select dump_column, length(dump_column) len_dump_col from dump_example;

"DUMP_COLUMN" "LEN_DUMP_COL"
"20121130�EXAMPLE�TEXT�VALUE" 27
"20121130 EXAMPLE TEXT VALUE" 27
"20121130 EXAMPLE TEXT VALUE" 27

...we see that the length of all three values is 27. Now let's take a dump! Once done with that, we wipe and wash our hands with soap and hot water for a few minutes. And then, we do this:

select dump_column, dump(dump_column) len_dump_col from dump_example;

"DUMP_COLUMN" "DUMP_DUMP_COL"
"20121130�EXAMPLE�TEXT�VALUE" "Typ=1 Len=33: 50,48,49,50,49,49,51,48,239,191,189,69,88,65,77,80,76,69,239,191,189,84,69,88,84,239,191,189,86,65,76,85,69"
"20121130 EXAMPLE TEXT VALUE" "Typ=1 Len=27: 50,48,49,50,49,49,51,48,32,69,88,65,77,80,76,69,32,84,69,88,84,32,86,65,76,85,69"
"20121130 EXAMPLE TEXT VALUE" "Typ=1 Len=30: 50,48,49,50,49,49,51,48,194,160,69,88,65,77,80,76,69,194,160,84,69,88,84,194,160,86,65,76,85,69"

Notice the divergent length values. So, something in the selecting and displaying of the values suppresses the multiple white space and somehow the Length function ignores those extras as well. It is the ninth character that begins the divergence in actual value so we need to figure out what the character is related to 32, 194, and 239. If we go to http://www.asciitable.com/ we can see that 32 is the space and the others are on the extended character map and we should probably go with the space as the good character. We can update the other values to consolidate the table but that doesn't explain how two native functions in Oracle can get divergent results like they did. I still haven't quite figured out how or why that is happening. I've been considering posting this out on some expert sites and see if I get any information. 

I was really pleased that this issue popped up this week. I was intrigued by the issue and enjoyed the investigation process and the new function that I learned about. Has anything happened recently in your daily work load that caused  you to learn something new and exciting?

I plan on getting back to the introductory stuff in the next blog. 

Thursday, November 8, 2012

How to add, delete, and change data in a database.

I recently went into a little bit of detail about databases and the SELECT statement. That is useful when you want to see what is in your database and do reporting, etc. But, this presumes that the data already exists. What if you have a new database with tables and things but have no data? What if you want to change some data? What if you want to remove data from your table(s)? Where did your mom learn that little trick with her tongue? These are all compelling questions.

There are utilities out there that can load data in bulk into a table but let us focus on the native SQL statements that achieve our desired results. INSERT, UPDATE, DELETE and Tijuana, I'm guessing. Let's say we have a table called RECIPE and this table has these columns:
RECIPE_ID, NAME, TYPE, SOURCE, DATE_CREATED, DATE_MODIFIED
We want to add a record to this table. There are some intricacies here but let's keep it simple.
Your insert statement might look something like this:
INSERT INTO RECIPE VALUES (1, 'Chocolate Chip Cookies', 'Dessert', 'Grandma Betty', sysdate, sysdate);
NOTE: Sysdate is a mechanism to get the current date from the Oracle database. Many systems use it to keep track of when something happens by selecting it and storing it as we have done.
If you run this and commit it, you will have one record in your table. If you have auto commit turned on you are a maniac and an unsavory sort and I want nothing to do with you. Auto commit is a feature that, you guessed it, commits everything you run automatically. You might be thinking this saves you from forgetting to commit or it saves you a step. In my experience, it adds several steps when you have to unfuck whatever you just fucked up. Use with extreme caution.

Now, let's say you are on the phone with Grandma Betty and tell her that you added her cookie recipe and she says, "Well, you know, Dear, that is your Great Great Grandmother Ester's recipe." Oh Noes! What do? Calm the fuck down! I got this. You need to update your record.
UPDATE RECIPE SET SOURCE = 'Great Great Grandma Ester', DATE_MODIFIED = sysdate WHERE NAME = 'Chocolate Chip Cookies';
Commit;
You may actually use the RECIPE_ID in the where clause in place of the NAME but I wanted to keep it understandable. Hopefully this is understandable...

Now, let's say you try these cookies and they taste like Satan's taint. Great Great Grandma Ester went full retard and her recipe is awful and must be exterminated with extreme prejudice. No worries, Brah.
DELETE FROM RECIPE WHERE NAME = 'Chocolate Chip Cookies' AND SOURCE = 'Great Great Grandma Ester';
Commit that shit, yo.

With these tools, you can add, remove and modify data from your database as you desire. Now, I'm off to your mom's house. Toodles.