Milestone II MADS

Topic Modeling

LDA and GSDMM: Short Text Clustering

Below is a list of visualizations built from a Wikipedia training dataset.

Explore the Assigned Topics and Original Text

We processed a Wikipedia training dataset and assigned topics to the original texts. To explore the features and create your own visualizations, visit our Heroku app.

We used Datasette to make the data easy to explore and to share with others.


Predict Topic Text

We ran the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) model on the original texts in a Wikipedia training dataset. GSDMM is a topic model related to LDA that assigns each document to a single topic, which makes it well suited to shorter texts. This produced 20 topic clusters, which we assigned back to the original texts. We then built a pipeline from a TfidfVectorizer and a Naive Bayes MultinomialNB classifier to predict topic probabilities for new text. Try it out for yourself below!
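
The prediction step described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the toy texts, the two topic labels, the `stop_words` setting, and the helper name `predict_topic_probabilities` are all assumptions made for the example. In the real project the labels would be the 20 GSDMM cluster assignments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the Wikipedia texts and the
# topic labels that GSDMM assigned to them (two topics here, not 20).
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the senate passed the new bill",
    "the president signed the law",
]
train_topics = [0, 0, 1, 1]

# TF-IDF features feed a multinomial Naive Bayes classifier.
topic_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
topic_clf.fit(train_texts, train_topics)


def predict_topic_probabilities(text):
    """Return a {topic_label: probability} mapping for one new text."""
    probs = topic_clf.predict_proba([text])[0]
    labels = topic_clf.named_steps["nb"].classes_
    return dict(zip(labels.tolist(), probs.tolist()))


probabilities = predict_topic_probabilities("the goal decided the match")
```

Wrapping both steps in one Pipeline means new text is vectorized with the vocabulary learned at training time, so the same object can be fitted once and reused for every prediction.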


Interactive Visualizations for Topic Modeling Clusters

LDA TSNE

LDA MMDS

LDA PCOA

GSDMM Topics