Bayesian Classification of CPAN Module Failures (Re: Module dependencies and test results)

Mon Aug 6 09:25:44 BST 2007

David Cantrell wrote:
> The data I'm spitting out are insufficient for calculating this.  If
> module A depends on B and C, both of which depend on D, then D appears
> as a dependency twice, but I only list it once. 

Hmm... I'm not sure that it matters.  If we were classifying documents then we 
would need to be counting word frequency to determine how often "D appears in 
A".  By analogy, the more times "Viagra" appears in a message, the more likely 
it is to be spam.

But in this case, I think the number of different ways in which D is a 
dependency of A is immaterial.  D only needs to be a dependency once for it to 
cause a failure.  Adding more or fewer dependency paths from A to D won't make 
A any more or less broken (as long as there is at least one).  One is enough.
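
To make that concrete, here's a rough sketch in Python.  The function name 
and the dependency table are made up for illustration (they're not taken from 
your data); the point is only that collecting dependencies into a set records 
"D is a dependency of A" once, however many paths lead to it.

```python
def transitive_deps(module, direct_deps):
    """Return the set of all modules that `module` depends on,
    directly or indirectly. Each dependency appears once, no matter
    how many paths lead to it."""
    seen = set()
    stack = list(direct_deps.get(module, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(direct_deps.get(dep, ()))
    return seen


# Hypothetical example: A depends on B and C, both of which depend on D.
direct_deps = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

deps_of_a = transitive_deps("A", direct_deps)
print(sorted(deps_of_a))
```

D is reachable via both B and C, but it ends up in the set only once, which 
is all the classifier needs to know.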

> Also I'm not convinced
> that the rest of the sums are right, given that if D fails when you try
> to install it as a dependency of B, then the probability of it failing
> when you try to install it as a dependency of C is 1 - these are not
> independent failures.

Remember that we're dealing with probabilities here, not deductive logic.

So rather than looking at what happens when *you* install modules A, B, C and 
D, we're looking at what happens when *lots* of different people install these 
modules on different systems.

Your failure with module D gives *you* a probability for failure of 1 in that 
very small sample set (1 failure from 1 test).  But when you consider the 
other 99 people who installed module D without a hitch it becomes clear that 
the *overall* probability for failure is 1/100.
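
The same arithmetic as a small Python sketch.  The "PASS"/"FAIL" strings and 
the two lists are invented to match the numbers above (1 failure out of 100 
installs), not real test reports.

```python
from fractions import Fraction


def failure_rate(results):
    """Fraction of results in the sample that are failures."""
    return Fraction(results.count("FAIL"), len(results))


# Hypothetical data: your single failed install of module D...
your_results = ["FAIL"]
# ...pooled with 99 other people who installed it without a hitch.
everyone = your_results + ["PASS"] * 99

print(failure_rate(your_results))  # your sample of one
print(failure_rate(everyone))      # the overall sample of 100
```

Your sample says D always fails; the pooled sample says it fails one time in 
a hundred.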

A