I think your class could be really useful, but I completely miss the point of the results it shows when applied to the example you have provided.
Also, I believe it is damn slow (I was obliged to raise the maximum time limit to 120 seconds for it to produce any output).
Can you provide more examples, or at least explain what you're trying to achieve? I ask in the context of a tool we're writing that helps assign a taxonomy to a web page. In other words, we're building a tool whose goal is to tell whether a page talks about the economy, cars, planes, software, ...
Many of the things we do are common to your code: we chop sentences into words, we compare words to stems (the roots of words), ... In the end, we want to be able to rank a page against the intended subject and audience.
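To make the idea concrete, here is a rough sketch in Python of the kind of pipeline I mean. It is only an illustration, not our actual code and not yours: the function names, the suffix list and the scoring rule (share of the page's words whose stem matches a topic stem) are all assumptions made up for this example, and the "stemmer" is a crude stand-in for a real one.

```python
import re

# Hypothetical suffix list: a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Chop text into lowercase words (ASCII letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least 3 letters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def score(text, topic_words):
    """Fraction of the page's words whose stem matches a topic stem."""
    stems = [stem(w) for w in tokenize(text)]
    if not stems:
        return 0.0
    topic_stems = {stem(w) for w in topic_words}
    hits = sum(1 for s in stems if s in topic_stems)
    return hits / len(stems)


def rank_topics(text, topics):
    """Return (topic, score) pairs, best match first."""
    scored = [(name, score(text, words)) for name, words in topics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up topics and page text, for illustration only.
    topics = {
        "cars": ["car", "engine", "driver"],
        "software": ["code", "compiler", "bug"],
    }
    page = "The driver started the engines of both cars."
    for name, value in rank_topics(page, topics):
        print(name, value)
```

A real tool would need a proper stemmer, stop-word filtering and some weighting of rare words, but the overall shape (tokenize, stem, compare, rank) is what we have in common.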
By sharing some experience, we might be in a position to offer a good and unique tool to the developer community (we're open source oriented).
Can you get back to me on this subject? My email address is email@example.com. Thanks, Denis.