  1. - Top - End - #1
    Ettin in the Playground
     
    Telok's Avatar

    Join Date
    Mar 2005
    Location
    61.2° N, 149.9° W
    Gender
    Male

    Statistics of certainty with low success rate, many trials, and imperfect information

    This is a math question, not a game problem. It's also a bit beyond my skill at stats, considering how long ago the courses were and that I usually only need the simpler parts. Part of it is that I'm not sure whether I'm missing pieces of this.

    A number of RPG task resolution systems have times or places in them where it is possible, but unlikely, for anyone to succeed at something. Be it hacking a computer, identifying a magic spell, decoding an alien message, treating a poison victim, whatever. On occasion a player will suggest crowdsourcing the solution (which is kind of what I'm doing here?) because with enough rolls someone will succeed.

    Assuming a bad/random/false result can occur, how many trials would need to be run in order to get a reasonably strong signal out of the noise?

    Let's use the d20, because the flat distribution keeps things simpler. Given: a 5% success rate, a 45% rate where it's an obvious failure, and a 50% rate of random answers that look like they could be successes. What more information do we need to determine the number of trials required to achieve, say, 75% confidence that we can get the true success result? Obviously what those random answers/error results look like will influence that, but I haven't been able to work out how to fold that into the question.
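    To make the setup concrete, here's a minimal sketch of the outcome model in Python (my sketch; the assignment of which d20 faces mean what is arbitrary, only the 5/45/50 split matters):

```python
import random

def roll_outcome(rng):
    """One d20 trial: 5% true success, 45% obvious failure, 50% plausible-looking error."""
    face = rng.randint(1, 20)
    if face == 1:
        return "success"            # 1 face  = 5%
    elif face <= 10:
        return "obvious_failure"    # 9 faces = 45%
    else:
        return "error"              # 10 faces = 50%

rng = random.Random(0)
sample = [roll_outcome(rng) for _ in range(10000)]
```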

  2. - Top - End - #2
    Firbolg in the Playground
    Join Date
    Dec 2010

    Re: Statistics of certainty with low success rate, many trials, and imperfect information

    Quote Originally Posted by Telok View Post
    This is a math question, not a game problem. It's also a bit beyond my skill at stats, considering how long ago the courses were and that I usually only need the simpler parts. Part of it is that I'm not sure whether I'm missing pieces of this.

    A number of RPG task resolution systems have times or places in them where it is possible, but unlikely, for anyone to succeed at something. Be it hacking a computer, identifying a magic spell, decoding an alien message, treating a poison victim, whatever. On occasion a player will suggest crowdsourcing the solution (which is kind of what I'm doing here?) because with enough rolls someone will succeed.

    Assuming a bad/random/false result can occur, how many trials would need to be run in order to get a reasonably strong signal out of the noise?

    Let's use the d20, because the flat distribution keeps things simpler. Given: a 5% success rate, a 45% rate where it's an obvious failure, and a 50% rate of random answers that look like they could be successes. What more information do we need to determine the number of trials required to achieve, say, 75% confidence that we can get the true success result? Obviously what those random answers/error results look like will influence that, but I haven't been able to work out how to fold that into the question.
    Let's split this into two parts. The first part is easy: how many trials do I need to get N 'valid' trials that aren't obvious failures? E.g. we'll just discard the obvious failures outright.
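    As a quick sketch of that first part (assuming each trial is independently 'valid' with probability 0.55, i.e. 5% success + 50% plausible error): the total trial count needed to collect a fixed number of valid trials is negative-binomially distributed, with mean n/0.55.

```python
import random

P_VALID = 0.55  # 5% success + 50% plausible error; obvious failures are discarded

def trials_for_valid(n_valid, rng):
    """Count total trials until n_valid non-obvious-failures have been collected."""
    total = valid = 0
    while valid < n_valid:
        total += 1
        if rng.random() < P_VALID:
            valid += 1
    return total

rng = random.Random(0)
# Monte Carlo check: the average should be close to 100 / 0.55 ~ 182
avg = sum(trials_for_valid(100, rng) for _ in range(500)) / 500
```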

    The second part is: given something that isn't an obvious failure, how should we update our beliefs based on the outcome? We can use Bayes' Rule here. Let's say we have two possible answers A=0 and A=1 (call this 'a' for whichever), and we get a random sample B=0 or B=1 (call this 'b' for whichever).

    p(A=a | B=b) = p(A=a) * p(B=b | A=a) / p(B=b)

    If we think that in general the correct answer is 0 half the time and 1 half the time (no prior belief weighting towards one or the other), then p(B=0) and p(B=1) will both be 1/2, and the initial values of p(A=0) and p(A=1) will also both be 1/2 (these change as we accumulate evidence).

    So what we need in order to evaluate an update when we see 'b' happen is p(B=b | A=a). We know this from the chances you gave: among non-failures, 1/11 are the real answer and 10/11 are random (even chance of 0 or 1). So:

    p(B=0|A=0) = p(B=1|A=1) = 6/11
    p(B=1|A=0) = p(B=0|A=1) = 5/11

    That means every time we observe a non-obvious-failure, we should multiply our belief in the answer corresponding to that outcome by 12/11, and the belief in the opposite by 10/11.

    If we want to get to 75% certainty, we need x * ln(12/11) + y * ln(10/11) = ln(0.75/0.5) where x is the number of observations of the dominant hypothesis, and y is the number of observations of the reverse. The mean number of observations of x after N trials is 6*N/11, and the mean number of observations of y after N trials is 5*N/11. So, we can expect to need

    N = ln(1.5) / ( 6/11 * ln(12/11) + 5/11 * ln(10/11)) ~= 100 valid trials

    Which means ~ 200 total trials.
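    Evaluating that expression numerically (nothing new, just the formula above in Python):

```python
import math

# Expected per-valid-trial log-evidence toward the true answer, from the rates above:
# 6/11 of valid observations match the truth (factor 12/11), 5/11 don't (factor 10/11).
drift = (6 / 11) * math.log(12 / 11) + (5 / 11) * math.log(10 / 11)

# Valid trials needed to move the belief from 0.5 to 0.75 on average
n_valid = math.log(1.5) / drift
# comes out just under 100
```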

    I think this is right, but generally I'd double check by simulating the process to make sure. If you've seen a particular number of 0's and 1's you can adjust your estimate of how many more trials you might need.

    Edit: Simulated it, and it looks like I've got a factor of 2 error somewhere. I'm getting about 113-117 trials needed for 75% of runs to indicate the truth.
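    A sketch of one way to run that simulation (my assumptions: the success criterion here is simply whether the accumulated log-odds favor the true answer after a fixed number of valid observations, which may not match the criterion behind the 113-117 figure):

```python
import math
import random

def run_once(n_valid, rng):
    """Simulate n_valid non-failure observations; check whether the
    accumulated log-odds end up favoring the true answer."""
    truth = rng.randrange(2)
    log_odds = 0.0  # log p(A=1)/p(A=0)
    for _ in range(n_valid):
        # 1/11 of valid observations are the true answer, 10/11 are a coin flip
        b = truth if rng.random() < 1 / 11 else rng.randrange(2)
        # likelihood ratio per observation: (6/11) / (5/11) = 6/5
        log_odds += math.log(6 / 5) if b == 1 else math.log(5 / 6)
    return (log_odds > 0) == (truth == 1)

def accuracy(n_valid, runs, rng):
    return sum(run_once(n_valid, rng) for _ in range(runs)) / runs

rng = random.Random(0)
# accuracy(n, 2000, rng) rises toward 1 as n grows
```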
    Last edited by NichG; 2020-05-04 at 07:51 PM.

  3. - Top - End - #3
    Bugbear in the Playground
    Join Date
    Nov 2013

    Re: Statistics of certainty with low success rate, many trials, and imperfect information

    Depending on the range of outcomes the 50% can produce, you may well be able to get extra certainty from whether different results match each other. Otherwise I think NichG's approach works for cases where you need to figure out what the right answer is. (Though note that if it isn't a fifty/fifty binary but one option is really unlikely a priori, it needs stronger evidence to become likely.)

    Although for the treating-poison scenario, if the attempts can't do damage too (which really should be possible), I think the answer is a bit different: only one attempt needs to succeed, and you don't need to know which method worked, just to be reasonably sure there is at least one true success among them.
    Last edited by Ibrinar; 2020-05-04 at 08:02 PM.

  4. - Top - End - #4
    Ettin in the Playground
     
    Telok's Avatar

    Join Date
    Mar 2005
    Location
    61.2° N, 149.9° W
    Gender
    Male

    Re: Statistics of certainty with low success rate, many trials, and imperfect information

    Thanks for the responses. I spent most of that day with a small child vomiting on me. Kid is better now.

    Being a better programmer than a statistician, I wrote some quick code. Again: 5% correct, 45% known failure, 50% error. It picked the correct answer from a list and the errors from the remainder of the list, then rolled the virtual d20 a bunch.

    At 11 entries the true result has the same percentage as each error result, so you can't tell them apart. That's expected. At 15 entries and 600 trials I was kind of confident of choosing the right result almost all the time. At 15 entries and 500 trials it was really iffy, almost a coin toss between the most common results at times. It took going up to 800 trials before I was really confident. I also switched to a full-on cryptographic pseudorandom generator, because even at 700 trials I would see results where several errors were more common than the correct answer.

    Going with 8 entries, 1 correct and 7 errors, was a complete toss-up: no effective difference in my ability to choose the correct answer from the list. At 300 trials with 6 entries the confidence was good. With only 2 entries it's just a question of the number of trials needed to get a single correct result; I was happy with 50 trials there.
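    A reconstruction of that simulation might look like the following (my sketch, not the original code; Python's stdlib generator rather than a crypto one, with entry 0 standing in for the correct answer):

```python
import random
from collections import Counter

def most_common_is_correct(entries, trials, runs, rng):
    """Fraction of runs in which the most frequent non-failure result is the true one."""
    wrong = list(range(1, entries))  # entry 0 is the true answer
    hits = 0
    for _ in range(runs):
        counts = Counter()
        for _ in range(trials):
            r = rng.random()
            if r < 0.05:
                counts[0] += 1                   # true success
            elif r < 0.55:
                counts[rng.choice(wrong)] += 1   # plausible-looking error
            # else: obvious failure, discarded
        if counts and counts.most_common(1)[0][0] == 0:
            hits += 1
    return hits / runs

rng = random.Random(0)
# e.g. most_common_is_correct(15, 800, 200, rng)
```

    Note that with fewer than 11 entries each error is more common than the truth, so 'most common' is the wrong rule there; the correct answer is then the least common result.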

  5. - Top - End - #5
    Bugbear in the Playground
    Join Date
    Nov 2013

    Re: Statistics of certainty with low success rate, many trials, and imperfect information

    Ah, so the 50% can't randomly give the right result, but always gives random wrong ones? Yeah, that changes things. As you saw, more entries help. But having very few entries also helps. If there are only two entries (and you know that, as well as the chances), it becomes obvious after a few trials, because you'll see, say, 1 count for the right answer and 7 for the false one, so it's easy to pick the right one. But that way requires people to know the number of entries to identify the solution (or at least whether it is lower or higher than 11).

    Edit: math-wise, it is a 1/11 chance for the right outcome and 10/(11*(entries-1)) for every other entry. Now, for a given number of trials, I think we can pick the highest/lowest count (depending on whether the entry count is over or under 11) and compare cumulative binomial probabilities to determine the relative probability of getting that many/few from the errors versus the successes. But to get the required number of trials, you'd then have to weight the possible outcomes by their probability and multiply by the confidence for each scenario, which sounds annoying (though I guess a script could do it easily enough and just increase the trial number until the chance is high enough). Is there a more elegant way?
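    The 'just increase the trial number until the chance is high enough' script could be sketched like this (simulation rather than the cumulative-binomial calculation; it picks the highest count when there are more than 11 entries and the lowest when there are fewer, as described):

```python
import random

def pick_is_correct(entries, trials, runs, rng):
    """Fraction of runs where the extreme-count entry (max above 11 entries,
    min below) is the true answer (entry 0)."""
    wrong = list(range(1, entries))
    extreme = max if entries > 11 else min
    hits = 0
    for _ in range(runs):
        counts = {e: 0 for e in range(entries)}
        for _ in range(trials):
            r = rng.random()
            if r < 0.05:
                counts[0] += 1                   # true success
            elif r < 0.55:
                counts[rng.choice(wrong)] += 1   # plausible-looking error
        hits += extreme(counts, key=counts.get) == 0
    return hits / runs

def trials_needed(entries, target=0.75, runs=200, step=50, cap=4000, seed=0):
    """Smallest trial count (in steps of `step`) whose estimated hit rate meets `target`."""
    rng = random.Random(seed)
    for t in range(step, cap + 1, step):
        if pick_is_correct(entries, t, runs, rng) >= target:
            return t
    return None
```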
    Last edited by Ibrinar; 2020-05-07 at 06:43 AM.

  6. - Top - End - #6
    Ettin in the Playground
     
    Telok's Avatar

    Join Date
    Mar 2005
    Location
    61.2° N, 149.9° W
    Gender
    Male

    Re: Statistics of certainty with low success rate, many trials, and imperfect information

    I don't know of a better way than brute-forcing it. But then I'm stuck eyeballing each result to figure out whether I can tell the difference, and running the whole thing several times just to get a gut check on confidence.

    The 5% success rate and the obvious failures are the trivial bits. It's the failure-by-X-amount returning one of Y possible false positives that I'm unsure about. I know this sort of thing gets done; you'd see it in massive chemical assays or in sentence-fragment parsing in document-analysis AI. I think I could even brute force that if I had a better handle on it.

    Ok: once the number of false positives differs from ten, the rate of each fp is no longer the same as the success rate, so we want the outlier count in the result set (highest with more than ten fps, lowest with fewer). Then, for a given trial count, measure how often that result really is the correct one. Then you can vary the trial count to find a range of probabilities of choosing the correct one.

    I may be able to work that.
