I actively researched Topic Models (a part of Natural Language Processing, NLP) during 2011-2014.
I've implemented a few enhancements to the Latent Dirichlet Allocation model:
- Simulated Annealing (for faster Gibbs sampling convergence)
- Topic Keywords (specify high-probability words for specific topics)
- Multiprocessor Parallelisation (you can specify thread count)
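To illustrate the first idea, here is a minimal sketch of annealed sampling inside a collapsed Gibbs sweep. This is not the library's actual code; the class and method names are hypothetical, and the only assumption taken from the list above is that the sampler tempers the conditional topic probabilities with a temperature that anneals over iterations, so early sweeps explore while later sweeps concentrate on high-probability topics.

```java
import java.util.Random;

public class AnnealedGibbsSketch {

    // Hypothetical helper: sample a topic index from unnormalized
    // conditional probabilities raised to the power 1/T. With a high
    // temperature T the draw is near-uniform (exploration); as T
    // anneals down, the draw increasingly favours the most likely
    // topic, which can speed up convergence of the Gibbs chain.
    static int sampleTopic(double[] probs, double temperature, Random rnd) {
        double[] tempered = new double[probs.length];
        double sum = 0.0;
        for (int k = 0; k < probs.length; k++) {
            tempered[k] = Math.pow(probs[k], 1.0 / temperature);
            sum += tempered[k];
        }
        // Standard inverse-CDF draw from the tempered distribution.
        double u = rnd.nextDouble() * sum;
        double acc = 0.0;
        for (int k = 0; k < probs.length; k++) {
            acc += tempered[k];
            if (u <= acc) {
                return k;
            }
        }
        return probs.length - 1; // guard against rounding
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Unnormalized conditionals p(z = k | rest) for three topics.
        double[] probs = { 0.1, 0.7, 0.2 };
        int hot = sampleTopic(probs, 100.0, rnd);  // near-uniform draw
        int cold = sampleTopic(probs, 0.01, rnd);  // near-greedy draw
        System.out.println(hot + " " + cold);
    }
}
```

In a full sampler the temperature would be decreased on a schedule across sweeps (for example geometrically towards 1) rather than passed per call as shown here.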
The first two enhancements are described in research notes, which you can find here.
By now, the code itself is probably useful only as a historical reference...
Download Source Code
- akuz-java - various Java libraries, including:
  - akuz-nlp - Natural Language Processing (NLP) library
  - akuz-nlp-run-lda - an example of how to run LDA Gibbs sampling
Download Test Data
The zip files below contain abstracts (or full texts, depending on the source) of news articles. Close duplicates from the same source have been removed. The data does not include source names or timestamps. The first line in each file is the article's title.
- news_1k.zip (416 Kb) — the first 1,000 news articles after 1 Jan 2013, 00:00:00
- news_10k.zip (4.3 Mb) — the first 10,000 news articles after 1 Jan 2013, 00:00:00
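Given the format described above (first line is the title, the rest is the text), a file from an unpacked archive can be read with a few lines of standard Java. The class and method names here are hypothetical, not part of the akuz-nlp API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class NewsFileSketch {

    // Split one unpacked news file into { title, body }, assuming the
    // layout described above: first line is the title, remaining
    // lines are the abstract or full text.
    static String[] readTitleAndBody(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        String title = lines.isEmpty() ? "" : lines.get(0);
        String body = String.join("\n",
                lines.subList(Math.min(1, lines.size()), lines.size()));
        return new String[] { title, body };
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate on a temporary file with the same two-part layout.
        Path tmp = Files.createTempFile("news", ".txt");
        Files.write(tmp, List.of("Example title", "Article text..."));
        String[] doc = readTitleAndBody(tmp);
        System.out.println(doc[0]); // prints "Example title"
        Files.delete(tmp);
    }
}
```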
To use this data with the algorithms from the NLP library, unpack an archive into a directory on your computer, and then specify that directory in the program's parameters (see the akuz-nlp-run-lda project for an example).