📚 LDA vs Document Clustering

Sat 29 March 2014
Experiments
akuz

I was asked at an interview what the difference is between LDA and document clustering. I tried to answer by describing the different generative models that the two approaches assume. However, I now realise that a much simpler example would have been far more effective.

Bread Data

Imagine you have a dataset of objects that you can broadly classify as “plain bread” and “bread with seeds”. For this example, it is important that these objects share some similarity, but also have important differences:

With the document clustering approach, a model asked to group these objects into 2 clusters would assign each object to exactly one cluster, and you would end up with the following result:

Bread Cluster
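To make the "one cluster per object" idea concrete, here is a minimal sketch using plain k-means. The data is made up for illustration: each loaf is a hypothetical pair of counts (dough-related words, seed-related words), and both the numbers and the choice of k-means as the clustering method are my assumptions, not part of the original experiment.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: returns one hard cluster label per point."""
    centroids = list(centroids)  # work on a copy, leave the caller's list alone
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Hypothetical loaves as (dough word count, seed word count).
loaves = [(10, 0), (9, 0), (11, 0), (8, 3), (9, 4), (7, 3)]

# Start from one plain and one seeded loaf as the initial centroids.
labels = kmeans(loaves, centroids=[loaves[0], loaves[-1]])
print(labels)
```

Every loaf comes back with a single label. Nothing in the output says that a seeded loaf is mostly the same stuff as a plain one.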

However, with the LDA approach you would not be inferring document clusters. Instead, you would be inferring the “ingredients” of the objects, i.e. what they consist of. Running LDA on our dataset would give the following result:

Bread Ingredient

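You would also get the probability of each ingredient in each object (document), so each object is described as a mixture of ingredients rather than being assigned to a single cluster.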
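The "ingredients with per-object probabilities" picture can be sketched from the generative side. This is a simplified illustration, not an LDA implementation: the ingredient names, the word lists, and the proportions are all hypothetical, and the proportions are fixed by hand here, whereas LDA would treat them as unknowns to be inferred (with a Dirichlet prior).

```python
import random

# Hypothetical "ingredients" (topics): each is a distribution over words.
INGREDIENTS = {
    "dough": {"flour": 0.5, "water": 0.3, "yeast": 0.2},
    "seeds": {"sesame": 0.4, "poppy": 0.3, "sunflower": 0.3},
}


def generate_object(proportions, n_words, rng):
    """Generate one object word by word from its ingredient proportions."""
    names = list(proportions)
    words = []
    for _ in range(n_words):
        # Pick an ingredient for this word according to the object's own mix...
        ingredient = rng.choices(names, weights=[proportions[n] for n in names])[0]
        # ...then pick a word from that ingredient's word distribution.
        vocab = INGREDIENTS[ingredient]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable

# Plain bread is all dough; seeded bread is mostly dough plus some seeds.
plain = generate_object({"dough": 1.0, "seeds": 0.0}, n_words=20, rng=rng)
seeded = generate_object({"dough": 0.8, "seeds": 0.2}, n_words=20, rng=rng)
print(plain)
print(seeded)
```

Both kinds of bread share the "dough" ingredient and differ only in how much "seeds" they contain. LDA works in the opposite direction: given only the words, it tries to recover both the ingredients and each object's proportions.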