Explore the Assigned Topics and Original Text
We used a Wikipedia training dataset, processed the data, and assigned topics to the original texts. If you would like to explore the features and create some visualizations, you can visit our Heroku app.
We used Datasette to let others explore and share our data.
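As a rough illustration, data like ours can be packaged into a SQLite file for Datasette to serve; the file, table, and column names below are placeholders, not our actual schema.

```python
# Minimal sketch: write processed topic assignments to SQLite for Datasette.
# `topics.db` and `assigned_topics` are illustrative names.
import sqlite3

import pandas as pd

df = pd.DataFrame(
    {
        "original_text": ["Example Wikipedia paragraph ..."],
        "topic": [7],
    }
)

# Write the processed data to a SQLite file that Datasette can serve.
with sqlite3.connect("topics.db") as conn:
    df.to_sql("assigned_topics", conn, if_exists="replace", index=False)

# The database can then be explored locally with:
#   datasette serve topics.db
# or published, e.g.:
#   datasette publish heroku topics.db
```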
Predict Topics for New Text
We ran the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) model, an LDA-like topic model designed specifically for shorter texts, on the original texts in the Wikipedia training dataset to obtain 20 topic clusters, which we assigned back to the original texts. We then built a pipeline from a TfidfVectorizer and a naive Bayes classifier (MultinomialNB) to predict topic probabilities for new text. Try it out for yourself below!
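For reference, a minimal sketch of this two-step flow is shown below. It assumes the open-source `gsdmm` package's `MovieGroupProcess` class (not necessarily the exact implementation we used); the hyperparameters, tokenization, and placeholder texts are illustrative.

```python
# Sketch: cluster short texts with GSDMM, then train a TF-IDF + MultinomialNB
# pipeline to predict topic probabilities for unseen text.
from gsdmm import MovieGroupProcess
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["some wikipedia paragraph ...", "another paragraph ..."]  # placeholder

# Step 1: cluster the training texts into (up to) 20 topics with GSDMM.
tokenized = [text.lower().split() for text in train_texts]
vocab_size = len({tok for doc in tokenized for tok in doc})
mgp = MovieGroupProcess(K=20, alpha=0.1, beta=0.1, n_iters=30)
topic_labels = mgp.fit(tokenized, vocab_size)

# Step 2: train a supervised pipeline on the GSDMM-assigned labels so that
# new text can be scored against the topic clusters.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(train_texts, topic_labels)

# predict_proba returns one probability per topic cluster for each input.
topic_probs = pipeline.predict_proba(["some new sentence to classify"])
```

The second step is what makes prediction on new text cheap: GSDMM itself only labels the training corpus, so the TF-IDF/naive Bayes pipeline acts as a fast approximation of those cluster assignments.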