Data Mining
Machine Learning

The Message Machine

2 Technologies

Document Clustering
Decision Trees

Decision Trees are Good
because they look like this:

  dollar_amount >= 101.66 0.4
   state >= 33 0.125
    dollar_amount >= 227.817 0.5
     42252 => {42252=>1.0}
     42251 => {42251=>1.0}
    42252 => {42252=>10.0}
   42251 => {42251=>25.0, 42256=>3.0}

and they are nonlinear.

Decision Tree Basic Algorithm

build_tree(data)
  return leaf(data) if data has one class or cannot be split
  @best  = Infinity
  @left  = []
  @right = []
  for each @attr in data:
    for every @value of @attr:
      left  = rows where @attr < @value
      right = rows where @attr >= @value
      score = entropy(left, right)
      if score < @best:
        @best = score
        @left = left, @right = right
  build_tree(@left)
  build_tree(@right)

Entropy?
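
The slides never spell out the `entropy(left, right)` score used in the basic algorithm. Here is a minimal sketch, assuming it is the size-weighted Shannon entropy of the class labels on each side of the split. That is a common choice, but it is an assumption here, and the names `entropy` and `split_entropy` are illustrative only.

```ruby
# Shannon entropy, in bits, of a list of class labels.
# 0.0 means every label is the same; higher means more mixed.
def entropy(labels)
  total = labels.size.to_f
  labels.tally.values.sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

# Size-weighted entropy of a candidate split (assumed scoring rule).
# Lower is better: both halves are closer to containing a single class.
def split_entropy(left, right)
  total = (left.size + right.size).to_f
  [left, right].reject(&:empty?).sum do |side|
    (side.size / total) * entropy(side)
  end
end
```

A split that separates the classes perfectly scores 0, so picking the lowest score picks the purest split.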

You don't have to build your own: there is Weka.

Document Clustering and Similarity

The Simple Way: Min Hash

Invented at AltaVista

Simple Idea

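Turn every word in a document into a number via a hash function, and take the lowest number.
If two documents have the same number, they are probably similar: with a well-mixed hash function, the chance of a match equals the fraction of distinct words the two documents share (their Jaccard similarity).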

Example:

"The cat ran" => -1912293768743167748
"The cat ran away" => -1912293768743167748
"The dog sat" => -2092219696032009264

Demo

The Best Part?
It can work on any file, not just text-based documents.

The second best part?
One line of code:

  "The cat ran away".split(/\W+/).map(&:hash).min
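
One caveat about the one-liner: Ruby salts `String#hash` per process, so the numbers change between runs and cannot be stored or compared across processes. A single minimum also gives only a yes/no answer. The sketch below is one possible way around both, assuming MD5 as a stable hash and using several seeds to stand in for several hash functions. The names `stable_hash`, `signature`, and `estimated_similarity` are made up for illustration.

```ruby
require "digest/md5"

# Stable 64-bit hash of a word. The seed stands in for "a different hash
# function" (an assumption of this sketch, not part of the one-liner).
def stable_hash(word, seed)
  Digest::MD5.digest("#{seed}:#{word}").unpack1("Q>")
end

# MinHash signature: the lowest word hash under each seeded hash function.
def signature(text, num_hashes = 64)
  words = text.split(/\W+/).reject(&:empty?).uniq
  (0...num_hashes).map { |seed| words.map { |w| stable_hash(w, seed) }.min }
end

# Fraction of signature slots that match, an estimate of Jaccard similarity.
def estimated_similarity(sig_a, sig_b)
  matches = sig_a.zip(sig_b).count { |a, b| a == b }
  matches.to_f / sig_a.size
end
```

Comparing `signature("The cat ran")` with `signature("The cat ran away")` should give a noticeably higher estimate than comparing it with `signature("The dog sat")`. More hashes give a steadier estimate at the cost of more work per document.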

The Harder, More Accurate Way

TF-IDF

Pairwise comparisons with cosine similarity
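
A rough sketch of those two pieces, under stated assumptions: term frequency is count divided by document length, and inverse document frequency is the unsmoothed `log(N / df)`. Several other TF-IDF weightings exist and the slides do not say which one was used. `tokenize`, `tfidf_vectors`, and `cosine_similarity` are illustrative names.

```ruby
# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# Build one sparse TF-IDF vector (term => weight) per document.
def tfidf_vectors(docs)
  token_lists = docs.map { |doc| tokenize(doc) }

  # Document frequency: in how many documents does each term appear?
  doc_freq = Hash.new(0)
  token_lists.each { |tokens| tokens.uniq.each { |term| doc_freq[term] += 1 } }

  token_lists.map do |tokens|
    tokens.tally.to_h do |term, count|
      tf  = count.to_f / tokens.size
      idf = Math.log(docs.size.to_f / doc_freq[term]) # assumed: no smoothing
      [term, tf * idf]
    end
  end
end

# Cosine similarity between two sparse vectors stored as hashes.
def cosine_similarity(a, b)
  dot    = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end
```

With this weighting, a word that appears in every document (such as "the" in the earlier example) gets weight zero, so it no longer makes unrelated documents look alike.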

Simple algorithm, you can get fancier

cluster(docs)
  clusters = []
  for doc in docs:
    for cluster in clusters:
      if cosinesim(cluster, doc) > threshold:
        cluster.add(doc)
        break
    if(doc.cluster.nil?)
      clusters.add(Cluster.new(doc))
  return clusters
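
The pseudocode above could be made runnable along these lines. This is a sketch, not the code behind the demo: it assumes documents are already sparse vectors (term => weight hashes), that a cluster is represented by its first member, and that a document joins only the first cluster that passes the threshold. Clusters are plain arrays here rather than a `Cluster` class.

```ruby
# Cosine similarity between two sparse vectors stored as hashes.
def cosine_similarity(a, b)
  dot    = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Greedy single-pass clustering. Each vector joins the first cluster whose
# representative (assumed: its first member) is similar enough; otherwise
# it starts a new cluster of its own.
def cluster(vectors, threshold)
  clusters = []
  vectors.each do |vector|
    home = clusters.find do |members|
      cosine_similarity(members.first, vector) > threshold
    end
    if home
      home << vector
    else
      clusters << [vector]
    end
  end
  clusters
end
```

The result depends on the order of the input and on the threshold, which is part of why fancier algorithms exist.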

Thanks
