Friday, November 29, 2013

Debugging Bayesian filters

It is a common problem: a trained filter miscategorizes something and you'd like to know why - which feature really puts the most weight on the wrong category? Or sometimes you'd like to know which features put another object into the right category - maybe we should add them somehow to the object? In AI::NaiveBayes::Classification there is a method called find_predictors that finds the features that weigh most for and against the category that a given object is eventually classified under. The simple algorithm assumes that there are only two categories - but it should be possible to extend it to more categories. The returned numbers are hard to interpret on their own - what is important is how big they are in comparison with the other numbers in the result.
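To make the two-category idea concrete, here is a minimal sketch of how such a ranking could be computed. It is not the code of find_predictors: the rank_predictors function, the %log_prob table and all the probabilities in it are made up for illustration, and priors are ignored. Each feature gets a score equal to its count times the difference of its log-probabilities in the two categories, so a positive score pulls towards the first category and a negative one towards the second.

```perl
use strict;
use warnings;

# Hypothetical log P(feature | category) values. In a real classifier
# these come from training; the numbers here are invented.
my %log_prob = (
    spam => { viagra => log(0.20), meeting => log(0.01), free => log(0.10) },
    ham  => { viagra => log(0.01), meeting => log(0.15), free => log(0.05) },
);

# Rank the features of one document by how strongly they pull towards
# $winner (positive score) or towards $other (negative score).
sub rank_predictors {
    my ($features, $winner, $other) = @_;
    my %score;
    for my $f (keys %$features) {
        # Skip features that were never seen in training.
        next unless exists $log_prob{$winner}{$f}
                 && exists $log_prob{$other}{$f};
        $score{$f} = $features->{$f}
                   * ($log_prob{$winner}{$f} - $log_prob{$other}{$f});
    }
    # Biggest absolute influence first; ties broken by feature name.
    my @order = sort { abs($score{$b}) <=> abs($score{$a}) || $a cmp $b }
                keys %score;
    return [ map { [ $_, $score{$_} ] } @order ];
}

# Feature counts of one document (also invented).
my $ranked = rank_predictors(
    { viagra => 1, meeting => 1, free => 1 }, 'spam', 'ham',
);
printf "%-8s %+.3f\n", @$_ for @$ranked;
```

As in the real method, the absolute values say little by themselves; what you read off is the order and the relative sizes.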

We use the classifier for spam detection - whenever we get misclassified posts I check what words (or other features) push them into the wrong category and I decide what to do - should we improve the training examples, add more post features, or maybe just ignore the case? When improving the filters by adding or removing examples, I can check how that changes the classification and also how exactly it changes the influence of each important feature on the result.

Friday, January 11, 2013

Immutable objects and Dependency Injection

At some point we had:
in our code.

Text::FeatureCount was at that time an accumulator that counted the document features and saved them in internal structures. Multiple subroutines were using these structures - so it made sense to make them object attributes. But it also meant that changing Text::FeatureCount into something else was a big deal. We could make the class name variable: $feature_counter_class->new->analyze(... and add an attribute to the main object to store it. But I decided to make Text::FeatureCount an immutable object instead and inject a ready-made Text::FeatureCount into the object doing the work above. Now it can be used in the loop without re-constructing it to clean the internal structures.

When coding with immutable objects you have to pass the input data around from one method to another explicitly instead of keeping it readily available in the object attributes. This makes the code slightly un-object-oriented, but the benefit is that you don't need to re-construct the object, so you can use Dependency Injection on it and make the code more flexible. It is also easier to reason about the algorithm when some parts are immutable. Often this is a good trade-off.
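The pattern described above can be sketched in a few lines. This is not the original code: My::WordCounter, My::Trainer, the min_length option and the features_for method are hypothetical names used only for illustration. The counter keeps nothing but configuration, returns the counts instead of storing them, and is handed ready-made to the object that uses it.

```perl
use strict;
use warnings;

package My::WordCounter;   # hypothetical stand-in for the counter class

sub new {
    my ($class, %args) = @_;
    # Only configuration lives in the object; nothing is written to it
    # after construction.
    return bless { min_length => $args{min_length} // 1 }, $class;
}

# Returns a fresh hash of counts; no state is kept in $self.
sub analyze {
    my ($self, $text) = @_;
    my %count;
    for my $word ($text =~ /(\w+)/g) {
        next if length($word) < $self->{min_length};
        $count{ lc($word) }++;
    }
    return \%count;
}

package My::Trainer;       # hypothetical object doing the work

sub new {
    my ($class, %args) = @_;
    # The counter is injected, so any object with an analyze method
    # can be swapped in.
    return bless { counter => $args{counter} }, $class;
}

sub features_for {
    my ($self, @documents) = @_;
    # The same counter instance is reused for every document.
    return [ map { $self->{counter}->analyze($_) } @documents ];
}

package main;

my $trainer = My::Trainer->new(
    counter => My::WordCounter->new( min_length => 3 ),
);
my $features = $trainer->features_for( 'Buy cheap pills', 'Cheap cheap offer' );
```

Because analyze leaves the counter untouched, the second document's counts contain nothing from the first, and replacing the counter with a different implementation only changes the line that constructs the trainer.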

PS. After much other refactoring Text::FeatureCount mutated into Text::WordCounter (and AI::Classifier::Text::Analyzer) - soon to be released to CPAN.