You can find all my public code at github.com/akuz. Below are just a few things I wanted to highlight.
The main code you will probably want to look at is the optimised LDA (Latent Dirichlet Allocation) Gibbs sampling implementation in the
akuz-nlp library, which includes the following enhancements over standard implementations:
- Multiprocessor Parallelisation (you can specify thread count)
- Simulated Annealing (for faster Gibbs sampling convergence)
- Topic Keywords (specify high probability words for specific topics)
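To illustrate the simulated annealing idea, here is a minimal Java sketch of one common way to anneal a collapsed Gibbs sampler: the per-topic sampling weights are raised to the power 1/temperature, and the temperature is cooled towards 1 during burn-in. This is a generic illustration under my own assumptions, not the akuz-nlp code or its API; the class name, method names, the linear cooling schedule, and the symmetric priors `alpha` and `beta` are all hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch of annealed collapsed Gibbs sampling; not the akuz-nlp API. */
final class AnnealedGibbsSketch {

    /**
     * Unnormalised topic weights for one token, raised to 1/temperature.
     * All counts are assumed to already exclude the token being resampled.
     * A temperature above 1 flattens the distribution; 1 gives standard Gibbs.
     */
    static double[] annealedWeights(int[] docTopicCounts, int[] topicWordCounts,
                                    int[] topicTotals, int vocabSize,
                                    double alpha, double beta, double temperature) {
        int topics = docTopicCounts.length;
        double[] weights = new double[topics];
        for (int k = 0; k < topics; k++) {
            // Standard collapsed Gibbs conditional for LDA (up to a constant).
            double p = (docTopicCounts[k] + alpha)
                     * (topicWordCounts[k] + beta)
                     / (topicTotals[k] + vocabSize * beta);
            weights[k] = Math.pow(p, 1.0 / temperature);
        }
        return weights;
    }

    /** Draws an index with probability proportional to its weight. */
    static int sample(double[] weights, Random rng) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double u = rng.nextDouble() * total;
        double acc = 0.0;
        for (int k = 0; k < weights.length; k++) {
            acc += weights[k];
            if (u < acc) {
                return k;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    /** Assumed linear cooling schedule: startTemp at iteration 0, 1.0 from burnIn onwards. */
    static double temperature(int iteration, int burnIn, double startTemp) {
        if (iteration >= burnIn) {
            return 1.0;
        }
        double fraction = (double) iteration / burnIn;
        return startTemp + (1.0 - startTemp) * fraction;
    }
}
```

In a full sampler, each sweep would compute `temperature(iteration, burnIn, startTemp)` once, then call `annealedWeights` and `sample` for every token, updating the counts after each draw.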
Download Source Code
- akuz-java - various Java libraries, including:
  - akuz-nlp - Natural Language Processing (NLP) library
  - akuz-nlp-run-lda - how to run LDA Gibbs sampling
Download Test Data
The zip files below contain abstracts (or full texts, depending on the source) of news articles. Close duplicates from the same source have been removed. The data does not include source names or timestamps. The first line in each file is the title.
- news_1k.zip (416 KB): the first 1,000 news articles after 1 Jan 2013, 00:00:00
- news_10k.zip (4.3 MB): the first 10,000 news articles after 1 Jan 2013, 00:00:00
To use this data with the algorithms from the NLP library, unpack the archive into a directory on your computer, and then pass that directory as a parameter to the program (see the akuz-nlp-run-lda project for an example).
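If you would rather read the unpacked files yourself, the sketch below shows one way to load them in plain Java, treating the first line of each file as the title and the rest as the body. It does not use the akuz-nlp API; the class and method names are hypothetical, and it assumes the files are UTF-8 encoded and sit directly in the unpacked directory (not in subdirectories).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical loader for the unpacked news files; not part of akuz-nlp. */
final class NewsCorpusSketch {

    /** One news article: the title (first line) and the remaining text. */
    static final class Article {
        final String title;
        final String body;

        Article(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Splits the lines of one file into a title and a body. */
    static Article parse(List<String> lines) {
        if (lines.isEmpty()) {
            return new Article("", "");
        }
        String title = lines.get(0).trim();
        String body = String.join("\n", lines.subList(1, lines.size())).trim();
        return new Article(title, body);
    }

    /** Reads every regular file in the directory, in sorted path order. */
    static List<Article> loadDirectory(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> stream = Files.list(dir)) {
            files = stream.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
        List<Article> articles = new ArrayList<>();
        for (Path file : files) {
            // Assumes UTF-8; readAllLines throws if the bytes are not valid UTF-8.
            articles.add(parse(Files.readAllLines(file, StandardCharsets.UTF_8)));
        }
        return articles;
    }
}
```

For example, `NewsCorpusSketch.loadDirectory(Paths.get("news_1k"))` would return one `Article` per file, assuming the archive was unpacked into a directory named `news_1k`.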