PowerBASIC Forums
  Cafe PowerBASIC
  Dicing With Probability (Page 7)

Author Topic:   Dicing With Probability
Charles Pegge
Member
posted February 03, 2007 01:27 PM
Emil,
I think I managed to avoid both the pitfalls you mentioned by using
the frequencies, not the proportions, and by omitting the Chi-squared
contribution of the tail categories where the expected frequency falls
below 1. Maybe that cutoff ought to be 2.

------------------
www.pegge.net

David Roberts
Member
posted February 03, 2007 05:09 PM
It never ceases to amaze me how we manage to not be able to see the wood for the trees.
quote:
Assuming that RND is uniformly distributed then there is an x% probability that it will produce results which only occur x% of the time.

That is true.
quote:
This leads to a remarkable rule of thumb.

That is false.

There is nothing remarkable about it at all.

What would have been remarkable is if this had not been the case, as that would have cast doubt on RND being uniformly distributed.

If an event is expected to occur only 1% of the time, assuming some null hypothesis is true, and that event occurs on the first and only test, then we can rightly question the truth of the null hypothesis.

However, if we did the test 100 times then we should not be surprised to see one of the tests in the same vein as our single test.

I looked at 1000 RNDs 10000 times.

I then saw "...there is an x% probability that it will produce results which only occur x% of the time." Well, it would wouldn't it?

Oh, dear.

The rule of thumb was a rough rounding. I should be looking at the actual distribution compared with the expected one much along the lines that Charles has been doing with the binomial and Chi-squared. Since we are looking at large samples then the Central Limit Theorem allows us to use the normal distribution which is why I used it - pity about my interpretation.
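The point about repeating a 1% test can be put into numbers. The following is only a minimal sketch, assuming the tests are independent; the function names are made up for illustration.

```python
# Sketch: how often does a "1% event" show up when the test is repeated?
# Assumes independent tests and a true null hypothesis (hypothetical helper names).

def chance_of_at_least_one(alpha, tests):
    """Probability that at least one of `tests` independent tests falls in the alpha tail."""
    return 1.0 - (1.0 - alpha) ** tests

def expected_hits(alpha, tests):
    """Average number of tests that fall in the alpha tail."""
    return alpha * tests

if __name__ == "__main__":
    for n in (1, 100, 10000):
        print(n, "tests:",
              "P(at least one) =", round(chance_of_at_least_one(0.01, n), 4),
              "expected hits =", expected_hits(0.01, n))
```

With a single test a 1% result is surprising; with 100 tests it is more likely than not that at least one appears, and with 10000 sets of 1000 RNDs about a hundred such results are to be expected.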

Emil Menzel
Member
posted February 03, 2007 05:51 PM
Charles:
A relevant quote from R.R. Sokal & F.J. Rohlf, Biometry, 1969, p. 569
on using chi-square to test goodness of fit of empirical data
to a binomial distribution:
"Since expected frequencies < 5 should be avoided, we lump the
classes ["bins"] at both tails with the adjacent classes of adequate
size. Corresponding classes of observed frequencies should be
lumped to match" (p. 569)

This is the magic number I remember hearing most often, but don't ask
me where it came from. (Maybe because most statisticians are
pentadactyl??)

Some texts might advise dropping rather than pooling classes.

But in general statistics is best viewed as a supplement rather
than a substitute for common sense, so I would interpret Sokal & Rohlf
accordingly.
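As a rough illustration of the lumping rule quoted above, here is a sketch that computes the expected binomial frequencies for 26 coin flips repeated 1000 times and pools the tail classes inward until each end class reaches the minimum. The figures 26 and 1000 are taken from Charles's test; the function names and the exact pooling order are assumptions, not anything from Sokal & Rohlf.

```python
from math import comb

def binomial_expected(n, trials, p=0.5):
    """Expected count of each outcome k = 0..n over `trials` repeats of n flips."""
    return [trials * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def pool_tails(expected, minimum=5.0):
    """Lump tail classes with their inner neighbour until both end classes
    have an expected frequency of at least `minimum`.
    Returns a list of (first_k, last_k, pooled_expected)."""
    classes = [(k, k, e) for k, e in enumerate(expected)]
    while len(classes) > 1 and classes[0][2] < minimum:      # left tail
        (a, _, e1), (_, b, e2) = classes[0], classes[1]
        classes[:2] = [(a, b, e1 + e2)]
    while len(classes) > 1 and classes[-1][2] < minimum:     # right tail
        (a, _, e1), (_, b, e2) = classes[-2], classes[-1]
        classes[-2:] = [(a, b, e1 + e2)]
    return classes

if __name__ == "__main__":
    for first, last, e in pool_tails(binomial_expected(26, 1000)):
        print(f"#{first}-#{last}: expected {e:.2f}")
```

The observed frequencies would have to be lumped into the same classes before the Chi-squared is computed.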

------------------

Charles Pegge
Member
posted February 03, 2007 06:50 PM
Some curious results:

In this test the tails have dropped out by themselves because
fewer than two were expected.

Each seed seems to have its own Chi-Squared profile. It may be better to pick your seeds
specifically to get the best profile, rather than choose seeds at random.

This is one example: whereas seed=1 produces a large #12 value,
this one peaks at #9.

Results for Randomize(2)

#9 deviates from the norm. (1% significance = 50 @ 30 degrees of freedom)


Chi Squared Variance Probability Distribution for p=.5 with 1000 repeats of 26 in 30 sessions
Comparing with binomial distribution for (.5+.5)^26
30 degrees of freedom for each of 26 categories.
Randomizer seed=2

| #0 ChiSq = 0.
| #1 ChiSq = 0.
| #2 ChiSq = 0.
| #3 ChiSq = 0.
| #4 ChiSq = 0.
| #5 ChiSq = 0.
|_______________________________ #6 ChiSq = 27.1
|________________________________________ #7 ChiSq = 34.3
|__________________________ #8 ChiSq = 22.5
|____________________________________________________________ #9 ChiSq = 51.9
|_______________________ #10 ChiSq = 20.1
|________________________________________ #11 ChiSq = 34.6
|_________________________________ #12 ChiSq = 28.2
|______________________________________ #13 ChiSq = 32.5
|______________________ #14 ChiSq = 19.1
|_____________________________ #15 ChiSq = 24.9
|_________________________________ #16 ChiSq = 28.6
|_________________________________ #17 ChiSq = 28.6
|_____________________________ #18 ChiSq = 24.8
|___________________________________ #19 ChiSq = 30.4
|____________________________________________ #20 ChiSq = 37.8
| #21 ChiSq = 0.
| #22 ChiSq = 0.
| #23 ChiSq = 0.
| #24 ChiSq = 0.
| #25 ChiSq = 0.
| #26 ChiSq = 0.

total chiSq=445.398224885069 at 780 degrees of freedom
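The "1% significance = 50 @ 30 degrees of freedom" figure can be checked without tables. This is a sketch using the Wilson-Hilferty approximation, which is a swapped-in technique and not necessarily how the figure was obtained; the function name is hypothetical.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, df):
    """Approximate chi-squared quantile (Wilson-Hilferty cube-root approximation).
    Reasonable for moderate df such as 30; it is not an exact value."""
    z = NormalDist().inv_cdf(p)          # standard normal quantile
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

if __name__ == "__main__":
    print("upper 1% point, 30 df:", round(chi2_quantile_wh(0.99, 30), 1))
    print("lower 1% point, 30 df:", round(chi2_quantile_wh(0.01, 30), 1))
```

A Chi-squared variable has a mean equal to its degrees of freedom, so the total of 445.4 sits close to 15 x 30 = 450, the degrees of freedom of the fifteen categories that were actually scored, which may be a more natural comparison than 780.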

------------------
www.pegge.net

James Graham-Eagle
Member
posted February 03, 2007 07:10 PM
Charles,

The algorithm used to generate the values of RND is probably the linear congruential method ...
see http://www.cs.utsa.edu/~wagner/laws/rng.html
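For readers who do not follow the link, a linear congruential generator is just the recurrence x -> (a*x + c) mod m. The sketch below uses the well-known "Numerical Recipes" constants with m = 2^32 purely as an illustration; the constants PowerBASIC actually uses are not given in this thread, so nothing here should be read as PB's RND.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Endless stream of integers from x -> (a*x + c) mod m.
    The default constants are illustrative, not PowerBASIC's."""
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x

def rnd_stream(seed):
    """Scale the integers into [0, 1), in the manner of a BASIC RND function."""
    for x in lcg(seed):
        yield x / 2**32

if __name__ == "__main__":
    stream = rnd_stream(1)
    print([round(next(stream), 6) for _ in range(5)])
```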

------------------

Charles Pegge
Member
posted February 03, 2007 07:40 PM
Thanks James, good article, and what a simple method!

By the way, in my tests above, there is only one seeding
at the beginning, and the Chi profile remains consistent, even when
the number of experiment cycles or 'sessions' varies.

------------------
www.pegge.net

Emil Menzel
Member
posted February 03, 2007 09:23 PM
A reference I gave earlier http://www.powerbasic.com/support/forums/Forum7/HTML/002589.html
also implements and might help to explain linear congruential random
generators.


David:
Bob Zale made a brief statement about PB's random number generator
in the PB-DOS forum in 2004, and said there that the period is 2^32: http://www.powerbasic.com/support/forums/Forum3/HTML/001821.html

------------------

[This message has been edited by Emil Menzel (edited February 03, 2007).]

David Roberts
Member
posted February 04, 2007 02:58 AM
Thanks, Emil.

What caught my eye was

quote:
In fact, excessive use of RANDOMIZE (or any other changes) will cause deterioration.

I was using RANDOMIZE for each set of 1000 RNDs, 10000 sets in all. Since I was calling RND 'only' 10 million times, which is far fewer than 2^32 (over 4000 million), I removed RANDOMIZE except for its initial use. The results 'tightened' up. Old habits die hard - a lot of what we do isn't needed nowadays and harks back to the old 8 bit days when we had no speed, RAM and much besides.

Emil Menzel
Member
posted February 04, 2007 10:36 AM
>>a lot of what we do isn't needed nowadays and harps back to the
old 8 bit days when we had no speed, RAM and much besides.

Very true of statistics in general. The statistical methods & tabled
values of probabilities that R.A. Fisher and his students put forth in the
1930's are archaic in the light of what can be done by way of computing
and Monte Carlo methods with a PC as compared with an abacus or a slide
rule.

But that reminds me: In Ann Arbor, Michigan in the early 1950's there
used to be a yearly contest between the campus main computer and
one or more Asian students on an abacus. The last time I heard the abacus
was still winning. But then I left the U of M in 1952.

------------------

John Gleason
Member
posted February 04, 2007 01:58 PM
This may help to explain some of the results you are seeing, or
answer questions about the PowerBASIC pseudo random number generator.

1) The PowerBASIC PRNG produces 2^32 (about 4.29 billion) random "values",
written as 4GB below. Each value can be as large as an EXT, but for best
statistical randomness (Diehard, etc.) it's usually best to use
the BYTE size.

2) There is one sequence only of 4GB of random values. Various seed
numbers place you at different starting points in this sequence, and the
exact same starting point can be duplicated by many different seeds.

3) Sequences from a given starting point can overlap other sequences
from a different starting point significantly, and even nearly in
their entirety, even though their seeds may not seem to be related,
e.g. 1,2,3,4,5,6,7,8,9,10 overlaps 3,4,5,6,7,8,9,10,11,12,13

4) From any seed, after 4GB of values, you will repeat those values
exactly again. Since there is one sequence only of random values,
whatever seed is given will produce the same 4GB of values, just shifted
by a maximum of 2GB. In effect, all seeds produce sequences that
will overlap each other after a maximum of 2GB of values.

5) It follows that if you exceed 4GB of random values, simply
re-seeding the generator will not give you new random values.
Rather, it will repeat a portion of the sequence you have already
generated. If you re-seed before you reach 4GB of values, you risk
repeating part of your sequence even sooner.
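Points 2 to 5 are what one would expect from any full-period generator: there is a single cycle, and a seed only chooses where on it you start. The toy below is a sketch with a deliberately tiny period of 16 (a=5, c=3, m=16 are arbitrary illustrative constants, not PowerBASIC's) so that the whole cycle can be printed.

```python
def full_cycle(seed, a=5, c=3, m=16):
    """One full period of a tiny linear congruential generator, starting from `seed`."""
    out, x = [], seed % m
    for _ in range(m):
        x = (a * x + c) % m
        out.append(x)
    return out

def offset_between(seed_a, seed_b):
    """How many steps into seed_a's stream the stream of seed_b begins."""
    return full_cycle(seed_a).index(full_cycle(seed_b)[0])

if __name__ == "__main__":
    base = full_cycle(0)
    print("seed 0:", base)
    for seed in (7, 12):
        k = offset_between(0, seed)
        print(f"seed {seed}: same cycle, starting {k} steps later:", full_cycle(seed))
```

Every seed yields the same 16 values in the same cyclic order, so re-seeding can only jump to another position on the cycle; it cannot supply values that the cycle does not already contain.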

[This message has been edited by John Gleason (edited February 04, 2007).]

Emil Menzel
Member
posted February 04, 2007 06:59 PM
John:
Thank you for the list. Do you have a reference for it?

A feasible procedure for avoiding potential problems (assuming that
one really needs more than 4 billion random numbers) would, I presume,
be to switch to different RNGs periodically. Or are there problems
with that too?

Charles:
I like the strategy James used (a few pages back) to circumvent
problems about PB's random number generator and get back to the more
basic math problem. I had asked what is the best estimate of
Pi one could get with PB, and my method was based on the ratio of two
long integers. Obviously, this population of numbers is not
infinitely large and no number was infinitely accurate in
precision. All possible outcomes could readily be enumerated --
which (at least for my problem) rendered data sampling
irrelevant & unnecessary.

By the same token, all possible values of TIMER could also be enumerated.
For the problem of the mean and for the problem of the shape of the
frequency distribution, the precise sequence in which the RNDs occur is
of no significance. Therefore both of these problems can be simplified considerably.

That reminds me of a simple program I wrote a couple of years ago
to demonstrate the influence of the precision of one's random numbers
on the precision of one's estimates of means, sd's and pi. The user is
given a choice of integer data or precision to 1,2,...18 decimals.
All possible numbers (and pairs of numbers) that one can get are then
dealt with, exhaustively. Of course if one wants even 6-place numbers
one is going to have a long wait for the answer, but the point
of the exercise is usually clear long before that time. I suspect
that probability theorists might have a straightforward formula or principle
that will predict and explain such data, but I have not yet run across
it.
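A guess at what such an exhaustive program might look like, for anyone who wants to try the idea: the sketch enumerates every number with a given count of decimals in [0, 1), and every pair of them, instead of sampling. The quarter-circle method for pi, the use of the population sd and the function name are all assumptions; the original program is not shown in this thread.

```python
from math import sqrt, pi

def exhaustive_stats(decimals):
    """Mean, population sd and a pi estimate from ALL values i/10^decimals
    in [0, 1) and ALL pairs of them (no random sampling involved)."""
    n = 10 ** decimals
    values = [i / n for i in range(n)]
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Count pairs inside the unit quarter circle, using integers to avoid rounding.
    inside = sum(1 for i in range(n) for j in range(n) if i * i + j * j < n * n)
    return mean, sd, 4.0 * inside / (n * n)

if __name__ == "__main__":
    for d in (1, 2):   # the pair count grows as 100^d, so keep d small
        mean, sd, pi_est = exhaustive_stats(d)
        print(f"{d} decimal(s): mean={mean:.4f} sd={sd:.4f} "
              f"pi~{pi_est:.4f} (error {pi_est - pi:+.4f})")
```

The coarse grid biases all three figures, and the bias shrinks as decimals are added, which seems to be the point of the exercise.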

P.S. I hope that you are not advocating picking the right seed numbers
for hypothetically random numbers, to get the answers that one expects
on the basis of theory. My teachers would have turned pale at the
thought.

------------------

Charles Pegge
Member
posted February 04, 2007 07:52 PM
Emil,
I must warn you that at this very moment, I am deeply embroiled
in devising a method for picking the best seeds. It is based
on the Chi graph above. To qualify as a good seed the Chi2 of any
category must not exceed 35 with 30 degrees of freedom. I am going
to leave the test to run overnight. So far it has come up with
1122
1135
1220
1309
1412
where the seed is passed to the Randomize function as a double
precision variable, if that makes any difference.

Blum Blum Shub is worth checking out. I found it a bit more
noisy than PB's RND function, but maybe I didn't pick the best
primes.
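For reference, Blum Blum Shub squares the state modulo M = p*q, where p and q are primes congruent to 3 mod 4, and takes the low bit of each state. The sketch below uses the tiny textbook primes 11 and 23, which are far too small for real use; the primes used in the test mentioned above are not stated, so these are stand-ins.

```python
def blum_blum_shub(seed, p=11, q=23):
    """Endless stream of Blum Blum Shub states x -> x*x mod (p*q).
    p and q must both be primes congruent to 3 mod 4 (toy values by default)."""
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 mod 4")
    m = p * q
    x = seed % m
    while True:
        x = (x * x) % m
        yield x

def bbs_bits(seed, count, p=11, q=23):
    """The low bit of each of the first `count` states, usable as coin flips."""
    gen = blum_blum_shub(seed, p, q)
    return [next(gen) & 1 for _ in range(count)]

if __name__ == "__main__":
    gen = blum_blum_shub(3)
    print("states:", [next(gen) for _ in range(6)])
    print("bits:  ", bbs_bits(3, 6))
```

With primes this small the cycle is very short, so any serious comparison with RND would need much larger primes and a seed coprime to M.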

------------------
www.pegge.net

John Gleason
Member
posted February 04, 2007 08:38 PM
Emil,
The data above has been gleaned from many (dozens of) runs of the PB generator
through its entire cycle, then comparing the binary raw extended precision
ten-byte sequences that result. I'd be happy to post the code showing
examples of each point if you'd like, but as you have noted, it is
certainly preferable to circumvent any possible problems by using another
proven generator. I can post a quick and proven one--which I recently
discovered can even be done without using assembler--and only takes perhaps
ten lines of code.

I agree too that the idea of fitting the seed to the data... well,
gray probably would be the best of the worrisome shades that statistics gods
might turn. hehheh

Charles Pegge
Member
posted February 05, 2007 06:17 AM
John,
If you are willing to post your code, I would like to try it with my test to
see what a higher quality pseudo-random sequence looks like.

I tested about 3000 seeds with it last night and found only 47 which passed
the test (Chi2 < 35 in every category).

One of the good things about the Chi-Squared is that it will tell you if your
results are 'too good', which in this case would be a value below about 15 in any
category. I found this did not happen very often.
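It may be worth asking how many seeds would pass a "Chi2 below 35 in every category" rule by chance alone. The sketch below uses the exact closed form for the Chi-squared tail with an even number of degrees of freedom, and assumes that 15 categories are scored (#6 to #20) and that they behave independently, which is only approximately true; the function name is hypothetical.

```python
from math import exp

def chi2_sf_even(x, df):
    """P(X > x) for a chi-squared variable with an even number of degrees of
    freedom: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!  (exact closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0           # k = 0 term
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total

p_one = 1.0 - chi2_sf_even(35.0, 30)   # one category stays below 35
p_all = p_one ** 15                    # all 15 scored categories, assumed independent
expected_passes = 3000 * p_all         # out of about 3000 seeds tried

if __name__ == "__main__":
    print("P(one category < 35):", round(p_one, 4))
    print("P(all 15 < 35):      ", round(p_all, 4))
    print("expected passes in 3000 seeds:", round(expected_passes, 1))
    print("P(category > 50):", round(chi2_sf_even(50.0, 30), 4))
```

Under these assumptions the predicted number of passing seeds comes out in the mid-40s, which is close to the 47 actually found, so the passing seeds may be no more than what chance would select.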


The test of course applies specifically to a fixed range of numbers in the
pseudo-random sequence. In this case, 26*1000*30=780,000, with each number
being used as a 'coin flip'.

Here is a typical result:

Matching with expected binomial distribution:
New seed 2036
....
|______________________________________________ #6 ChiSq = 25.9
|________________________________________________ #7 ChiSq = 27.3
|__________________________________________________________ #8 ChiSq = 32.8
|________________________________________________ #9 ChiSq = 27.4
|_____________________________________________________ #10 ChiSq = 30.1
|______________________________________ #11 ChiSq = 21.7
|______________________________________ #12 ChiSq = 21.6
|___________________________________________________ #13 ChiSq = 28.7
|____________________________________________________________ #14 ChiSq = 34.
|______________________________________________ #15 ChiSq = 26.1
|___________________________________ #16 ChiSq = 19.7
|_____________________________________ #17 ChiSq = 21.2
|___________________________________________________ #18 ChiSq = 29.
|______________________________________ #19 ChiSq = 21.5
|_____________________________________________________ #20 ChiSq = 30.3
....

------------------
www.pegge.net

John Gleason
Member
posted February 05, 2007 08:28 AM
Charles,
no problemo, I'll post it as soon as possible.

All times are EasternTime (US)


Copyright © 1999-2006 PowerBASIC, Inc. All Rights Reserved.


Ultimate Bulletin Board 5.45c