tsJensen

A quest for software excellence...

Atrax Keyword Extraction Algorithm

Two and a half years ago I wrote a C# implementation of an algorithm published in 2003 in a short academic paper by Yutaka Matsuo and Mitsuru Ishizuka in the International Journal on Artificial Intelligence Tools. Of course, my code is not a perfect implementation of the algorithm described in that paper, "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information". I made a number of decisions to make it as effective as possible while keeping it as fast as I could.

The code was written for Provo Labs, my employer at the time. I've recently obtained written permission from Provo Labs to release this code as open source under the Apache 2.0 license. You can get the code in the Atrax.Html project, part of the larger Atrax project, which I've just released at http://www.codeplex.com/atrax. Here's the core of the code.

// Copy the frequent terms G into an array; the last term is the MAX g in G.
string[] terms = new string[termsG.Count];
termsG.Values.CopyTo(terms, 0);
foreach (string w in terms)
{
    decimal sumZ = 0;
    for (int i = 0; i < terms.Length - 1; i++) // do calcs for all but MAX
    {
        string g = terms[i];
        if (w != g) // skip the diagonal (a term paired with itself)
        {
            int nw = termNw[w];
            decimal Pg = termPg[g];
            decimal D = nw * Pg; // expected co-occurrence of w with g
            if (D != 0.0m) // guard against division by zero
            {
                decimal Fwg = termFwg[w][g]; // observed co-occurrence of w with g
                decimal T = Fwg - D;
                decimal Z = (T * T) / D; // this g's contribution to the X2 score
                sumZ += Z;
            }
        }
    }
    termsX2[w] = sumZ; // X2 score for w
}

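// Sort terms by X2 score. The scores are the dictionary keys, so they must be
// unique: nudge a duplicate score down by a tiny amount until the key is free.
SortedDictionary<decimal, string> sortedX2 = new SortedDictionary<decimal, string>();
foreach (KeyValuePair<string, decimal> pair in termsX2)
{
    decimal x2 = pair.Value;
    while (sortedX2.ContainsKey(x2))
    {
        x2 = x2 - 0.00001m;
    }
    sortedX2.Add(x2, pair.Key);
}

// Now get a simple array of terms ordered from lowest to highest X2.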
string[] x2Terms = new string[sortedX2.Count];
sortedX2.Values.CopyTo(x2Terms, 0);
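
For readers without the paper at hand, the statistic that the first loop accumulates can be sketched in the paper's notation. Reading termNw[w] as n_w, termPg[g] as p_g, and termFwg[w][g] as freq(w, g) is an assumption based on the variable names, and g_max below is just a label for the last entry of the terms array. The first two lines give the paper's chi-square measure and its robust variant; the third line is what the code above appears to compute.

```latex
% Chi-square measure of the bias of w's co-occurrence with the frequent terms G
\[
\chi^2(w) = \sum_{g \in G} \frac{\bigl(\mathrm{freq}(w,g) - n_w\,p_g\bigr)^2}{n_w\,p_g}
\]

% Robust variant from the paper: drop the single largest contribution for each w
\[
\chi'^2(w) = \chi^2(w) - \max_{g \in G}
  \left\{ \frac{\bigl(\mathrm{freq}(w,g) - n_w\,p_g\bigr)^2}{n_w\,p_g} \right\}
\]

% What the loop appears to compute: skip g_max, the diagonal, and zero denominators
\[
\mathrm{sumZ}(w) = \sum_{\substack{g \in G \setminus \{g_{\max},\, w\} \\ n_w p_g \neq 0}}
  \frac{\bigl(\mathrm{freq}(w,g) - n_w\,p_g\bigr)^2}{n_w\,p_g}
\]
```

Note the difference: the robust variant removes the largest term separately for each w, whereas the loop leaves out the same fixed term g_max for every w. This looks like one of the places where the code departs from the paper.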

I have not spent much time on this algorithm in the past two years and would like to find others with similar interests to help me improve and perfect it. If you have an interest in this kind of research, please join me at the Atrax project page on CodePlex.