Software

You can find all my public code at github.com/akuz.
Below are just a few things I wanted to highlight.

The main code you will probably want to look at is the optimised LDA Gibbs sampling implementation in the akuz-nlp library, which includes the following enhancements over the standard implementations:

Download Source Code

Download Test Data

The zip files below contain abstracts (or full texts, depending on the source) of news articles. Close duplicates from the same source have been removed. The data does not include source names or timestamps. The first line of each file is the article title.

To use this data with algorithms from the NLP library, unpack the archive into a directory on your computer, and then specify that directory in the parameters to the program (see akuz-nlp-run-lda project for an example).
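
If you just want to inspect the unpacked data before running anything, the layout described above (one article per file, title on the first line) can be read with a few lines of Python. This is only a rough sketch and is not part of the akuz-nlp library: the function name load_articles is made up for illustration, and it assumes the files are plain text that can be decoded as UTF-8.

```python
from pathlib import Path
from typing import List, Tuple


def load_articles(data_dir: str) -> List[Tuple[str, str]]:
    """Return (title, body) pairs for every file found under data_dir.

    Hypothetical helper: assumes one article per plain-text file,
    with the title on the first line and the text on the lines after it.
    """
    articles = []
    # Walk the directory recursively, in a stable (sorted) order.
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        # UTF-8 is an assumption; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        # Split at the first line break: title before it, body after it.
        title, _, body = text.partition("\n")
        articles.append((title.strip(), body.strip()))
    return articles


# Example usage (the directory name is a placeholder):
# for title, body in load_articles("unpacked-test-data"):
#     print(title, len(body.split()))
```

A file that contains only a title comes back with an empty body, so such entries are easy to filter out before further processing.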