Tuesday, February 21, 2012

Sample rows by criteria

This might be simple, but no matter how I try, I cannot get it working.

I want to select a sample, say 3 rows, of a table: SELECT TOP 3 * FROM table - that was easy.

Now say this table has a criteria column, an integer, with around 50 unique values. The whole table has 3 mill rows, so each criteria is seen between "a lot" and maybe only once for some others - but that should not be the issue here.

I'd like to get 3 sample rows FOR EACH value in the criteria column - or only one or two if only so many matches exist.

How could this be done? The table does have a unique primary key, int, if this is needed for the operation. Any ideas?

To select random samples take advantage of (1) ROW_NUMBER and (2) NEWID() function; for example:

declare @.example table
( rid integer,
criteria tinyint,
sampleValue smallint,
filler char(200)
)

insert into @.example
select iter,
1 + 49.9999 * dbo.rand(),
1200 * dbo.rand(),
'Record # ' + convert(varchar(6), iter)
from small_iterator (nolock)

select criteria,
rid,
seq,
sampleValue
from ( select criteria,
rid,
row_number ()
over ( partition by criteria
order by newid()
)
as seq,
sampleValue
from @.example
) x
where seq <= 3
order by criteria, seq

-- criteria rid seq sampleValue
-- -- -- -- --
-- 1 2242 1 197
-- 1 10438 2 204
-- 1 9782 3 881
-- ...
-- 50 15673 2 455
-- 50 5825 3 482

The SMALL_ITERATOR and DBO.RAND() functions used to create the sample data can be found here:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1330536&SiteID=1

No comments:

Post a Comment