Getting precise insights through Topic Modeling in DEEP
August 5, 2018

Topic modeling is a Natural Language Processing technique that discovers the hidden topics present in a collection of documents, expressed as groups of words. It gives insight into what the documents are composed of and which words make up each topic. In short, topic modeling lets us understand and summarize large collections of textual data.

A large amount of textual data has been collected in DEEP by analysts around the world. Although all of these texts and documents are humanitarian or crisis related, we are always interested in, and in need of, more precise insight into them. In particular, we want to find the hierarchical composition of the collected documents.

The library we used for this purpose is gensim, which is very popular for topic modeling. We chose gensim for its elegant implementation of the algorithms and its simplicity. The following is sample code for performing topic modeling with LDA (Latent Dirichlet Allocation).

from gensim import corpora, models


def find_topics(documents, num_topics, num_words=10, passes=10):
    """
    Return the keywords for topics discovered
    @documents: documents to be analyzed
    @num_topics: number of topics we wish to find
    @num_words: number of keywords to keep per topic
    @passes: number of training passes over the corpus
    """
    # Tokenize each pre-processed document into a list of words
    texts = [
        pre_processor(document).split() for document in documents
    ]
    # Map each unique word to an integer id
    dictionary = corpora.Dictionary(texts)
    # Represent each document as a bag of words
    corpus = [dictionary.doc2bow(text) for text in texts]

    ldamodel = models.ldamodel.LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=passes,
    )
    lda_output = {}
    topics = ldamodel.get_topics()
    for i, topic_words in enumerate(topics):
        # sort keywords based on their weights/contributions to the topic
        sorted_words = sorted(
            enumerate(topic_words),
            key=lambda x: x[1],
            reverse=True,
        )
        lda_output['Topic {}'.format(i)] = {
            'keywords': [
                (dictionary.get(word_id), weight)
                for word_id, weight in sorted_words[:num_words]
            ]
        }
    return lda_output
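
Note that pre_processor is not shown here; it stands for whatever text cleaning fits the corpus. A minimal hypothetical sketch, assuming simple lowercasing and token filtering, could be:

import re


def pre_processor(document):
    # Lowercase the text, keep only alphabetic tokens, and drop very
    # short words (a stand-in for real, corpus-specific cleaning)
    words = re.findall(r'[a-z]+', document.lower())
    return ' '.join(word for word in words if len(word) > 2)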

Now, given some documents, running the above function produces output similar to the following:

{    "Topic 1": {        "keywords": ["kw11, "kw12", ...]    },    "Topic 2": {        "keywords": ["kw21, "kw22", ...]    }, ...}

The function thus returns the topics present in the documents as lists of important keywords. Note that the returned topics are not named: the algorithm yields keyword distributions, not topic labels. We needed actual topic names for our application, and we tackled this with a very simple but workable approach: taking the single most relevant keyword as the topic's name.
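
As a rough sketch of that approach (label_topics is a hypothetical helper; it assumes the output structure returned by find_topics above):

def label_topics(lda_output):
    # Rename each generic 'Topic N' entry after its highest-weighted keyword
    return {
        data['keywords'][0][0]: data
        for data in lda_output.values()
    }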

Hierarchical Topic Modeling

The next problem with the above result is that we are also interested in the hierarchical composition of topics in the documents, while this gives only a flat composition. Unfortunately, we could not find any library that fitted our purpose of hierarchical modeling, so we applied another simple approach to achieve the hierarchy: recursion.

Our approach to hierarchical topic modeling consists of the following steps:

  1. Perform topic modeling on the documents to get the initial topic composition.
  2. For each topic, group all the documents belonging to that topic.
  3. For each topic, run the topic modeling algorithm on the grouped documents to obtain the subtopic composition for that topic.
  4. Repeat from step 2 until the desired hierarchy depth is reached.

The following is the simplified code for the above steps.

def find_topics_and_subtopics(documents, num_topics, depth=5,
                              num_words=10, passes=10,
                              dictionary=None, corpus=None):
    """
    Return the keywords for topics discovered, along with their subtopics
    @documents: documents to be analyzed
    @num_topics: number of topics we wish to find
    @depth: depth of hierarchy
    """
    texts = [
        pre_processor(document).split() for document in documents
    ]
    if dictionary is None:
        dictionary = corpora.Dictionary(texts)
    if corpus is None:
        corpus = [dictionary.doc2bow(text) for text in texts]

    ldamodel = models.ldamodel.LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=passes,
    )
    lda_output = {}
    topics = ldamodel.get_topics()
    for i, topic_words in enumerate(topics):
        # sort keywords based on their weights/contributions to the topic
        sorted_words = sorted(
            enumerate(topic_words),
            key=lambda x: x[1],
            reverse=True,
        )
        lda_output['Topic {}'.format(i)] = {
            'keywords': [
                (dictionary.get(word_id), weight)
                for word_id, weight in sorted_words[:num_words]
            ],
            'subtopics': {},
        }
    if depth <= 1:
        return lda_output

    # Group documents by the topics they belong to
    topic_documents = {}
    for j, doc_bow in enumerate(corpus):
        # get_document_topics() returns (topic_id, probability) pairs for
        # each topic the document matches with non-negligible probability
        for topic_id, prob in ldamodel.get_document_topics(doc_bow):
            topic = 'Topic {}'.format(topic_id)
            topic_documents.setdefault(topic, set()).add(j)

    # Now the recursive call: model each topic's documents separately,
    # reusing the shared dictionary so that word ids stay consistent
    for topic in lda_output:
        if topic_documents.get(topic):
            topic_docs = [documents[x] for x in topic_documents[topic]]
            lda_output[topic]['subtopics'] = find_topics_and_subtopics(
                topic_docs,
                num_topics,
                depth - 1,
                num_words,
                passes,
                dictionary,
            )
    return lda_output
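
Under the same assumptions as before, the hierarchical version can be invoked in the same way (the parameter values here are illustrative):

# Reusing the hypothetical sample documents from the earlier example
hierarchy = find_topics_and_subtopics(docs, num_topics=2, depth=2)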

Running this function produces output similar to the following:

{    "Topic 1": {        "keywords": ["kw11, "kw12", ...],        "subtopics": {            "Topic 1": {                "keywords": ["kw1.11", "kw1.12", ... ],                "subtopics": { ... }            }, ...        }    },    "Topic 2": {        "keywords": ["kw21, "kw22", ...],        "subtopics": {            "Topic 1": {                "keywords": ["kw2.21", "kw2.22", ... ],                "subtopics": { ... }            }, ...        }    }, ...}

The Problems and Possible Solutions

The following are the problems that we’ve faced:

  1. The results are not stable: LDA is probabilistic, so different runs over the same documents can produce different topics.
  2. Obtaining more stable results requires more passes over the corpus, which slows down the calculation.
  3. Recomputing topics every time they are requested is expensive, even when the underlying documents have not changed.

The first two problems are inherent in the algorithm itself, and addressing them is just a matter of finding the right balance between stable results and faster calculation. For the third problem, we’ve implemented server-side caching to avoid running the algorithm again when the documents have not changed.
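
As a minimal in-process sketch of the caching idea (the cached_topics helper and cache structure here are our own illustration, not the production code):

import hashlib
import json

_topic_cache = {}


def cached_topics(documents, num_topics, depth=5):
    # Key the cache on the inputs, so unchanged documents reuse old results
    key = hashlib.sha256(
        json.dumps([documents, num_topics, depth]).encode('utf-8')
    ).hexdigest()
    if key not in _topic_cache:
        _topic_cache[key] = find_topics_and_subtopics(
            documents, num_topics, depth
        )
    return _topic_cache[key]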
