Data Mining
Machine Learning

The Message Machine
2 Technologies
  1. Document Clustering
  2. Decision Trees
Decision Trees are Good
because they look like this:
  dollar_amount >= 101.66 0.4
   state >= 33 0.125
    dollar_amount >= 227.817 0.5
     42252 => {42252=>1.0}
     42251 => {42251=>1.0}
    42252 => {42252=>10.0}
   42251 => {42251=>25.0, 42256=>3.0}
and they are nonlinear.
Decision Tree Basic Algorithm
build_tree(data)
  return leaf(data) if data has only one class
  @best  = Infinity
  @left  = []
  @right = []
  for each @attr in data:
    for every @value of @attr:
      left  = rows where @attr < @value
      right = rows where @attr >= @value
      score = entropy(left, right)
      if score < @best:
        @best = score
        @left = left, @right = right
  return leaf(data) if @left or @right is empty
  build_tree(@left)
  build_tree(@right)
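A minimal runnable sketch of the algorithm above, in Ruby. The row schema (hashes with a :label key), the Node struct, and the classify helper are assumptions made for illustration. The stopping rule (stop when no split lowers the entropy) is one simple choice among several.

```ruby
# Rows are hashes; :label holds the class to predict (hypothetical schema).
Node = Struct.new(:attr, :value, :left, :right, :counts)

# Shannon entropy (in bits) of the labels in a set of rows.
def entropy(rows)
  return 0.0 if rows.empty?
  rows.map { |r| r[:label] }.tally.values.sum do |n|
    p = n.to_f / rows.size
    -p * Math.log2(p)
  end
end

def build_tree(rows, attrs)
  counts = rows.map { |r| r[:label] }.tally
  best = nil
  best_score = entropy(rows) # a split must beat the unsplit node
  attrs.each do |attr|
    rows.map { |r| r[attr] }.uniq.each do |value|
      left, right = rows.partition { |r| r[attr] < value }
      next if left.empty? || right.empty?
      # Size-weighted entropy of the two halves.
      score = (left.size * entropy(left) + right.size * entropy(right)) / rows.size
      if score < best_score
        best_score = score
        best = [attr, value, left, right]
      end
    end
  end
  return Node.new(nil, nil, nil, nil, counts) if best.nil? # leaf
  attr, value, left, right = best
  Node.new(attr, value, build_tree(left, attrs), build_tree(right, attrs), counts)
end

# Walk the tree and return the most common label at the leaf reached.
def classify(node, row)
  return node.counts.max_by { |_, n| n }.first if node.attr.nil?
  classify(row[node.attr] < node.value ? node.left : node.right, row)
end

rows = [
  { dollar_amount: 50.0,  state: 12, label: 42251 },
  { dollar_amount: 75.0,  state: 40, label: 42251 },
  { dollar_amount: 150.0, state: 12, label: 42252 },
  { dollar_amount: 300.0, state: 40, label: 42252 },
]
tree = build_tree(rows, [:dollar_amount, :state])
puts classify(tree, { dollar_amount: 200.0, state: 5 })
```

On this toy data the root splits on dollar_amount, because that attribute separates the two labels cleanly and state does not.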
Entropy?
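Entropy measures how mixed a set of labels is: 0 when every label is the same, higher as the classes get more evenly mixed. A minimal sketch, assuming Shannon entropy in bits and a size-weighted average for scoring a two-way split. The names entropy and split_entropy are chosen here for illustration.

```ruby
# Shannon entropy (in bits) of a list of class labels.
def entropy(labels)
  return 0.0 if labels.empty?
  labels.tally.values.sum do |count|
    p = count.to_f / labels.size
    -p * Math.log2(p)
  end
end

# Size-weighted entropy of a two-way split; lower means purer halves.
def split_entropy(left, right)
  total = (left.size + right.size).to_f
  return 0.0 if total.zero?
  (left.size / total) * entropy(left) + (right.size / total) * entropy(right)
end

puts entropy([42251, 42251, 42252, 42252])             # evenly mixed: 1 bit
puts split_entropy([42251, 42251], [42252, 42252])     # both halves pure
puts split_entropy([42251, 42252], [42251, 42252])     # split did not help
```

The tree builder prefers the split with the lowest score, so the second split above beats the third.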
You don't have to build your own: there is Weka.
Document Clustering and Similarity
The Simple Way: Min Hash
Invented at AltaVista
Simple Idea
  1. Turn every word in a document into a number via a hash function, and take the lowest number.
  2. If 2 documents have the same number, they are probably similar.
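The two steps above can be sketched as follows. Two things here are assumptions beyond the slide: an MD5 digest replaces Ruby's built-in String#hash (which is salted per process, so its values cannot be stored and compared across runs), and the minimum is taken under several salted hashes so that the share of matching slots gives a graded similarity instead of a yes/no answer. The function names are hypothetical.

```ruby
require "digest/md5"

# One min-hash per salt; each salt stands in for a different hash function.
# Assumes the text contains at least one word.
def minhash_signature(text, num_hashes = 64)
  words = text.downcase.scan(/\w+/).uniq
  (0...num_hashes).map do |salt|
    words.map { |w| Digest::MD5.hexdigest("#{salt}:#{w}")[0, 8].to_i(16) }.min
  end
end

# Fraction of matching slots; this estimates the overlap of the two word sets.
def estimated_similarity(sig_a, sig_b)
  sig_a.zip(sig_b).count { |a, b| a == b } / sig_a.size.to_f
end

a = minhash_signature("The cat ran away")
b = minhash_signature("The cat ran away from home")
c = minhash_signature("Stock prices fell sharply today")
puts estimated_similarity(a, b) # shares most words with a
puts estimated_similarity(a, c) # shares no words with a
```

With a single hash function (num_hashes = 1) this reduces to the rule on the slide: equal minimums mean "probably similar".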
Example:
Demo
The Best Part?
It can work on any file, not just text-based documents
The second best part?
One line of code:
  "The cat ran away".split(/\W+/).map(&:hash).min
The Harder, More Accurate Way
TF-IDF
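TF-IDF weights a term by how often it appears in a document (term frequency), scaled down by how many documents contain it (inverse document frequency). A minimal sketch, assuming relative term frequency and a natural-log IDF; many other weighting variants exist, and the function name tf_idf is chosen here for illustration.

```ruby
# docs: array of strings. Returns one { term => weight } hash per document.
def tf_idf(docs)
  tokenized = docs.map { |d| d.downcase.scan(/\w+/) }

  # Document frequency: in how many documents each term appears.
  doc_freq = Hash.new(0)
  tokenized.each { |words| words.uniq.each { |w| doc_freq[w] += 1 } }

  tokenized.map do |words|
    words.tally.to_h do |term, count|
      tf  = count.to_f / words.size
      idf = Math.log(docs.size.to_f / doc_freq[term])
      [term, tf * idf]
    end
  end
end

vectors = tf_idf(["the cat ran away", "the dog ran home", "the cat sat"])
puts vectors[0]["the"]  # 0.0: "the" appears in every document
puts vectors[0]["away"] # largest weight in the first document
```

Words that appear everywhere score zero, so the remaining weights highlight what is distinctive about each document.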


Pairwise comparisons with cosine similarity
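Cosine similarity is the dot product of two vectors divided by the product of their lengths; for non-negative weights it ranges from 0 (no shared terms) to 1 (same direction). A minimal sketch, assuming each document is a sparse { term => weight } hash; the function name is chosen here for illustration.

```ruby
# Cosine similarity between two sparse vectors stored as { term => weight }.
def cosine_similarity(a, b)
  dot    = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero? # empty or all-zero vector
  dot / (norm_a * norm_b)
end

puts cosine_similarity({ "cat" => 1.0, "ran" => 1.0 }, { "cat" => 2.0, "ran" => 2.0 })
puts cosine_similarity({ "cat" => 1.0 }, { "dog" => 1.0 })
```

Because the result depends only on direction, a long document and a short one with the same word proportions compare as nearly identical.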
A simple algorithm; you can get fancier
cluster(docs)
  clusters = [Cluster.new(docs.shift)]
  for doc in docs:
    match = first cluster in clusters where cosinesim(cluster, doc) > threshold
    if match
      match.add(doc)
    else
      clusters.add(Cluster.new(doc))
  return clusters
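A minimal runnable sketch of the clustering loop above, in Ruby. Several details are assumptions, not something the slides specify: each cluster is represented by its first document, vectors are raw term counts in place of TF-IDF weights, and the threshold defaults to 0.5.

```ruby
Cluster = Struct.new(:docs) do
  # Assumption: the first document stands in for the whole cluster.
  def representative
    docs.first
  end
end

# Sparse term-count vector for a piece of text.
def term_counts(text)
  text.downcase.scan(/\w+/).tally
end

def cosine_similarity(a, b)
  dot    = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Each document joins the first cluster it is similar enough to,
# or starts a new cluster of its own.
def cluster(texts, threshold = 0.5)
  clusters = []
  texts.each do |text|
    vec  = term_counts(text)
    home = clusters.find do |c|
      cosine_similarity(term_counts(c.representative), vec) > threshold
    end
    if home
      home.docs << text
    else
      clusters << Cluster.new([text])
    end
  end
  clusters
end

groups = cluster([
  "the cat ran away",
  "the cat ran away again",
  "stock prices fell sharply",
])
groups.each { |g| p g.docs }
```

The result depends on document order and on the threshold, which is part of why fancier methods exist.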
Thanks