The main code you will probably want to look at is the optimised LDA Gibbs sampling implementation in the akuz-nlp library, which includes the following enhancements over standard implementations:

The zip files below contain abstracts (or full texts, depending on the source) of news articles. Near-duplicates from the same source have been removed. The data does not include source names or timestamps. The first line in each file is the article title.

To use this data with algorithms from the NLP library, unpack the archive into a directory on your computer, then pass that directory as a parameter to the program (see the akuz-nlp-run-lda project for an example).
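
If you want to inspect the data outside the library, the following is a minimal Python sketch of unpacking an archive and reading each file as a title plus body. It is not part of akuz-nlp. The function names, the UTF-8 encoding, and the choice to read every regular file in the directory are assumptions made for illustration only.

```python
import zipfile
from pathlib import Path


def unpack_archive(zip_path, target_dir):
    """Extract every file from the zip archive into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def split_title_and_body(text):
    """Split a document into (title, body); the first line is the title."""
    title, _, body = text.partition("\n")
    return title.strip(), body.strip()


def load_documents(data_dir, encoding="utf-8"):
    """Read every regular file under data_dir as a (title, body) pair.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    documents = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            text = path.read_text(encoding=encoding, errors="replace")
            documents.append(split_title_and_body(text))
    return documents
```

For example, `load_documents(unpack_archive("news.zip", "news"))` would return a list of `(title, body)` tuples, where `news.zip` is a placeholder name for one of the downloaded archives.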