Pete On Software

RSS Feed

Archive for 'SQL'

Podcast Episode 13 – Itzik Ben-Gan on T-SQL

Itzik Ben-GanIn this week’s episode, I interviewed Itzik Ben-Gan. This was a huge treat for me because Itzik is someone that I’ve learned a lot from and looked up to for almost a decade. This wasn’t even an interview that I would have chased down on my own, but I will say that doing the interview has given me confidence to chase down some others.

If I didn’t land this interview for myself, then who did? Jeff Meyer (another Itzik fan) reached out to Mr. Ben-Gan on my behalf and got him to agree to come on the show and then looped me in to make the necessary appointments. All along the way, I was a little nervous about talking with Itzik. You know what they say about meeting your heroes…

In this case, that fear was totally unfounded. Itzik was amazingly nice and right off the bat, I could tell that he was exactly how I had hoped that he’d be. The guy just KNOWS T-SQL. We had a great conversation about T-SQL features: those that are underused, those that are powerful, and those that aren’t there yet – but should be. We also talked about his teaching and writing style and along the way collected some great resources.

Even if you aren’t a developer, you should listen to this week’s podcast. It is a pleasure to listen to a passionate expert talk about their craft.

You can also subscribe to the podcast at any of these places:
iTunes Link RSS Feed Listen on Stitcher

Thanks to all the people who listen, and a special thanks to those who have rated me. I really appreciate it.

The episodes have been archived. Click Here to see the archive page.

Row_Number() Part Two

Last Time, we looked at ROW_NUMBER() and saw a little of how it works and how we could use it to remove duplicates. Another one of my favorite uses for ROW_NUMBER() is to get only one row from a one-to-many join. I’ve had situations at work where I need to return one row per Person, but each person had several addresses, several injury codes, and several contacts (people working their case).

In many of those cases, there is nothing to indicate which one of those rows should receive precidence over the others. One might just select from those tables and then take just the MIN() of the primary key, grouping by all other fields, but that makes for very messy and hard to maintain code in my opinion. ROW_NUMBER() gives us a much easier way to handle this problem.

Examine this sql script that gives us a Person Table and a PersonAddress table. One person can have many addresses, but there is no way in the schema to give precidence to one of the addresses over any of the others.

Here is a sample of the results:
One to Many Addresses

Using a similar trick to what we used last time to find unique records, we can alter our select statement so that we only return one address per person. If we change our select statement to this

We get the following results:
One to Many Addresses With Row Numbers

You can see now that all we have to do is limit the result where the RowNum is equal to one and we have something that we can use (the CTE) in further joins or in reports to ensure that we don’t introduce duplicates to our query.

Our results:
One to One Addresses Thanks to RowNumber

Row_Number() Part One

Sometimes when you are performing queries, you have data that does not have a primary key and contains duplicate records. If it had had a primary key, you could group on the columns that constitute a duplicate and just keep the MIN or MAX of the primary key, deleting the others. However, lacking that primary key, you have no way of uniquely identifying a row. Additionally, UNIQUEIDENTIFIER types don’t work with MAX and MIN, so you’d have a hard time picking a row in that case as well.

One option is of course to add an identity column to the table and do the operation that I described. However, sometimes you don’t have the appropriate permissions or just aren’t allowed to modify the table to do that. You can use the ROW_NUMBER() function to assign a unique row number to each row by partition. I’ll talk a little more about partition later, but when not explicitly specified in the query, the partition refers to the entire table.

Examine the script below. (All Sql Code contained here will only work on Sql Server 2005 and later)

You get the results shown here:
Initial duplicates.

If we want to add a rownumber to the table, we would alter the part of the query that is

to say this

and get these results
Initial row number.

That helps some, but wouldn’t it be better if we could get a row number assigned within groups? That is what the PARTITION BY clause does. Change the SELECT sql again so that it now looks like this (note: you can PARTITION BY and ORDER BY any columns you want, and they can be the same or different from each other)

Now, we get this
Row Number with Partition

Okay, now all we have to do is keep the records that get assigned a one, forsaking all others. Change that same section of code again to read like this and then run the script. Note, I am using the CTE syntax here.

Now we have a table with no duplicates.
No more duplicates

That’s it, that is all there is to it. In my next blog, I will cover how we can use this trick in joins to ensure that we only get a 1 to 1 join, even when the data is 1 to Many.

Common Table Expressions

CTE (Common Table Expression) syntax can be a little tricky at first, but once you have a little primer, you will love using them as much as I do. Examine the following code below that uses a derived table from which to select. (Please realize that this is a contrived example and you could of course just join the tables in the query directly.)

The query between the parentheses creates a table structure (that I aliased as “c” and “t” for my two) that only lasts for the life of the query. I can then reference those tables in the query in any way that I would a regular table. However, that syntax can make it very difficult to read, especially in long queries with several complicated derived tables.

Sql Server 2005 introduced a new syntax to be used instead called Common Table Expressions (CTEs). Using a CTE, the above query would be written like this

Alternatively, you can also explictly declare you column names in the CTE definition and not have to alias the columns in the query itself.

One “gotcha” is that the CTE – beginning with the WITH keyword – must be the first statement that is encountered. All previous statements above it must be terminated with a GO or a semicolon. If you ran the following

You’d get the errors shown below (or something very close)

Some people even just make a habit of actually prefixing the WITH with a semicolon every time to make sure that everything goes off without a hitch. If I do run this script instead there are no issues.

This is just a quick primer on CTEs, but I hope that you can find it useful. I really find that it cleans my code up quite a bit.

Creating a Comma Separated List From a Sql Grouping

A few times recently when I’ve been creating a view for full text indexing on Sql Server, I’ve had need to return the results of a one to many relationship in one row. I didn’t need dynamic columns or anything, what would suit me best would be to have all of the values just in one comma separated list. After searching the net, I found an ingenious use of FOR XML PATH to solve this problem.

Here is the standard query to return the list of employees and their sales territories from the Northwind database.

The resulting set looks like the following:
Result set with no grouping or concatenation.

What might be best is some kind of built in function that did the following (kind of like C#’s string.Join()):

However, nothing like that was to be found. What I did find was someone who used FOR XML PATH in an interesting way. Normally, FOR XML PATH would return some normal looking xml, however if you pass in the ” as the argument, you just end up with all of the values that would be in an xml string, but with no resulting tag soup.

Running the following code:

Gives the following result:
Result set with grouping and concatenation.

This does do a subselect, but I’ve found that it doesn’t hurt performance too badly. The only other trick here is the STUFF() function, which replaces part of a string with another. In this case, I’m replacing only the first character (the 1, 1) arguments with just an empty string, effectively removing our extra comma.

I hope that someone may have found this useful and if you have any other techniques for solving this kind of problem, I’d love to hear them.

« Older Entries   Recent Entries »