Software / Topic Models


Between 2011 and 2014, I actively explored topic modeling within the field of Natural Language Processing (NLP), focusing on enhancements to the Latent Dirichlet Allocation (LDA) algorithm. My work aimed to improve the efficiency, flexibility, and interpretability of LDA through several key innovations:

  • Simulated Annealing for Gibbs Sampling: Implemented simulated annealing to accelerate the convergence of Gibbs sampling, improving the efficiency of topic inference (illustrated in the first sketch below).
  • Topic Keyword Constraints: Introduced the ability to specify high-probability words for specific topics, allowing for guided topic formation based on domain knowledge (also illustrated in the first sketch below).
  • Multiprocessor Parallelization: Enabled multi-threaded execution of the LDA algorithm, allowing users to specify the number of threads to leverage multicore processors effectively (see the second sketch below).
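
To make the first two ideas concrete, here is a minimal sketch of one collapsed Gibbs sampling update with a temperature parameter and per-topic seed words. It is illustrative only: the class and member names (AnnealedLdaGibbs, temperature, addSeedWord, and so on) are not taken from the akuz libraries, and the cooling scheme shown is one common choice rather than necessarily the one described in the research notes.

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Minimal sketch of one collapsed Gibbs sampling update for LDA, combining
 * two of the enhancements above: simulated annealing (the conditional
 * distribution is raised to the power 1/temperature) and topic keyword
 * constraints (seed words get extra Dirichlet prior mass in their topic).
 * All names are illustrative; this is not the akuz API.
 */
public final class AnnealedLdaGibbs {

    final int numTopics;
    final double alpha;             // symmetric document-topic prior
    final double[][] beta;          // beta[k][w]: topic-word prior, boosted for seed words
    final double[] betaSum;         // betaSum[k] = sum over w of beta[k][w]
    final int[][] docTopicCounts;   // n(d, k): topic counts per document
    final int[][] topicWordCounts;  // n(k, w): word counts per topic
    final int[] topicCounts;        // n(k): total tokens assigned to each topic
    final Random rnd = new Random(1);

    AnnealedLdaGibbs(int numDocs, int numTopics, int vocabSize,
                     double alpha, double betaBase) {
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = new double[numTopics][vocabSize];
        this.betaSum = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
            Arrays.fill(beta[k], betaBase);
            betaSum[k] = betaBase * vocabSize;
        }
        this.docTopicCounts = new int[numDocs][numTopics];
        this.topicWordCounts = new int[numTopics][vocabSize];
        this.topicCounts = new int[numTopics];
    }

    /** Topic keyword constraint: give a seed word extra prior mass in a topic. */
    void addSeedWord(int topic, int word, double boost) {
        beta[topic][word] += boost;
        betaSum[topic] += boost;
    }

    /**
     * One annealed Gibbs update for a token of word w in document d that is
     * currently assigned oldTopic. temperature > 1 flattens the conditional
     * (more exploration early on); temperature = 1 recovers standard Gibbs
     * sampling. A typical schedule cools the temperature towards 1 over sweeps.
     */
    int sampleTopic(int d, int w, int oldTopic, double temperature) {
        // Remove the token's current assignment from the counts.
        docTopicCounts[d][oldTopic]--;
        topicWordCounts[oldTopic][w]--;
        topicCounts[oldTopic]--;

        // Collapsed conditional p(z = k | rest), raised to 1/temperature.
        double[] p = new double[numTopics];
        double total = 0.0;
        for (int k = 0; k < numTopics; k++) {
            double prob = (docTopicCounts[d][k] + alpha)
                    * (topicWordCounts[k][w] + beta[k][w])
                    / (topicCounts[k] + betaSum[k]);
            p[k] = Math.pow(prob, 1.0 / temperature);
            total += p[k];
        }

        // Draw the new topic from the annealed distribution.
        double u = rnd.nextDouble() * total;
        int newTopic = numTopics - 1;
        for (int k = 0; k < numTopics; k++) {
            u -= p[k];
            if (u <= 0.0) { newTopic = k; break; }
        }

        // Add the token back under its new assignment.
        docTopicCounts[d][newTopic]++;
        topicWordCounts[newTopic][w]++;
        topicCounts[newTopic]++;
        return newTopic;
    }
}
```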
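
And a sketch of the parallelization pattern: one Gibbs sweep split into contiguous document shards, one per thread. This shows only the document-partitioning idea; a real implementation must also make the updates to the shared topic-word counts thread-safe (by synchronizing them, using atomic counters, or merging thread-local copies after each sweep). The names here are again illustrative, not the akuz API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntConsumer;

/** Sketch of one multi-threaded Gibbs sweep over document shards. */
public final class ParallelSweep {

    /**
     * Runs one sweep, calling sweepDocument.accept(d) for every document d.
     * Documents are split into contiguous shards, one per thread, and the
     * shards are swept in parallel.
     */
    static void sweep(ExecutorService pool, int numThreads,
                      int numDocs, IntConsumer sweepDocument)
            throws InterruptedException {
        int shardSize = (numDocs + numThreads - 1) / numThreads;
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int from = t * shardSize;
            final int to = Math.min(numDocs, from + shardSize);
            tasks.add(() -> {
                for (int d = from; d < to; d++) {
                    sweepDocument.accept(d);
                }
                return null;
            });
        }
        pool.invokeAll(tasks); // blocks until every shard has finished
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            sweep(pool, threads, 10_000,
                    d -> { /* sample topics for all tokens of document d */ });
        } finally {
            pool.shutdown();
        }
    }
}
```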

These enhancements are detailed in my research notes, which you can find here.

Source Code on GitHub

While the codebase may now serve primarily as a historical reference, it includes several Java libraries developed during this research:

  • akuz-java - various Java libraries, including:
      • akuz-nlp - a library for Natural Language Processing tasks.
      • akuz-nlp-run-lda - an implementation of LDA with Gibbs sampling.

Download Test Data

The zip files below contain abstracts (or full texts, depending on the source) of news articles. Near-duplicates from the same source have been removed. The data does not include source names or timestamps. The first line of each file is the article title.

  • news_1k.zip (416 KB) — the first 1,000 news articles after 1 Jan 2013, 00:00:00
  • news_10k.zip (4.3 MB) — the first 10,000 news articles after 1 Jan 2013, 00:00:00

To use this data with the algorithms from the NLP library, unpack an archive into a directory on your computer, then pass that directory as a parameter to the program (see the akuz-nlp-run-lda project for an example). A sketch of reading files in this format follows below.
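
For illustration, here is a minimal sketch of reading the data in Java, assuming each article is a separate plain-text file in the unpacked directory (the exact file layout inside the archives may differ):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

/** Sketch: read news files where the first line is the title. */
public final class NewsReader {

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]); // directory where the archive was unpacked
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.filter(Files::isRegularFile).forEach(NewsReader::printArticle);
        }
    }

    private static void printArticle(Path file) {
        try {
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            if (lines.isEmpty()) {
                return;
            }
            String title = lines.get(0); // first line is the article title
            String body = String.join("\n", lines.subList(1, lines.size()));
            System.out.println(title + " (" + body.length() + " chars of text)");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```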