In column "A", rows 1-1000, I generate random numbers using RAND(). I want to scale them to numbers between, say, 1 and 100.

In "B" for 1000 rows I have `=RAND()*(100-1)+1`

In "C" for 1000 rows I use the numbers that I generated in "A", so `=A1*(100-1)+1`

The averages in B and C are the same, as expected.

I only want to use the values 20% of the time, so for each of the 1000 rows:

For B, I use `=IF(A1<0.2,B1,0)`

For C, I use `=IF(A1<0.2,C1,0)`

If I do this, the averages for each of the two columns are vastly different.

**Can anybody explain why?**

Both approaches use random numbers. The first approach uses two different random numbers (the one in column "A" for the test and the one generated in the "B" column calculation), while the second approach reuses the column "A" number for both the test and the value.

## Best Answer

Columns A and B have the same uniform distribution, aside from the linear transformation, but are independent. B's probability density function (PDF) for A < 0.2 is the same as the PDF for any other selection of A.

Columns A and C are not independent. In fact, C is completely determined by A: C = 99·A + 1. If you select rows where A < 0.2, you are only selecting rows where C < 20.8. The probability of finding a value of C greater than 20.8 within that selection has dropped to zero. Clearly, the distribution has changed.

For a more obvious example of this, consider rolling two dice (A and B) and their total (C). I'd expect to see an average of 3.5 in columns A and B, and an average of 7 in column C. If from this table I only select rows where die A landed on 1, I'd still see an average of 3.5 in B (independent), but merely 4.5 instead of 7 in column C (dependent).
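A quick simulation makes the gap concrete. This is a sketch in Python rather than a spreadsheet; the column names mirror the question, and the row count is bumped up so the averages are stable:

```python
import random

random.seed(0)
n = 100_000

# Column A: RAND() equivalents in [0, 1).
a = [random.random() for _ in range(n)]
# Column B: fresh, independent draws scaled to [1, 100).
b = [random.random() * (100 - 1) + 1 for _ in range(n)]
# Column C: deterministically reuses column A's values.
c = [x * (100 - 1) + 1 for x in a]

# Apply the spreadsheet's IF(A1<0.2, value, 0) to both columns.
b_if = [bv if av < 0.2 else 0 for av, bv in zip(a, b)]
c_if = [cv if av < 0.2 else 0 for av, cv in zip(a, c)]

def mean(xs):
    return sum(xs) / len(xs)

# B is independent of the test, so its conditional mean stays ~50.5,
# giving roughly 0.2 * 50.5 ≈ 10.1 after zeroing 80% of rows.
print(f"mean(B with IF): {mean(b_if):.2f}")
# C is selected only where A < 0.2, i.e. C < 20.8, so its conditional
# mean collapses to ~10.9, giving roughly 0.2 * 10.9 ≈ 2.2.
print(f"mean(C with IF): {mean(c_if):.2f}")
```

The B column's average is about 10.1 while the C column's is about 2.2, exactly the discrepancy the question observes.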