Topic modeling is a Natural Language Processing technique for discovering the hidden topics present in a collection of documents, expressed as groups of words. It gives insight into what the documents are composed of and which words make up each topic. In short, topic modeling lets us understand and summarize large collections of textual data.
A lot of textual data is collected in DEEP by analysts around the world. Although all of these texts and documents are humanitarian or crisis related, we are always interested in, and in need of, more precise insight into them. In particular, we want to find the hierarchical composition of the collected documents.
The library that we used for this purpose is gensim, which is very popular for topic modeling. We chose gensim because of its elegant implementation of the algorithms and its simplicity. The following is sample code for performing topic modeling.
from gensim import corpora, models


def find_topics(documents, num_topics, num_words=10, passes=10):
    """
    Return the keywords for topics discovered
    @documents: documents to be analyzed
    @num_topics: number of topics we wish to find
    @num_words: number of keywords to keep per topic
    @passes: number of passes over the corpus while training the LDA model
    """
    # pre_processor is our own text-cleaning helper (lowercasing,
    # stop-word removal, etc.) that returns a cleaned string
    texts = [pre_processor(document).split() for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    ldamodel = models.ldamodel.LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=passes,
    )
    lda_output = {}
    topics = ldamodel.get_topics()
    for i, topic_words in enumerate(topics):
        # sort keywords based on their weights/contributions to the topic
        sorted_words = sorted(
            enumerate(topic_words),
            key=lambda x: x[1],
            reverse=True,
        )
        lda_output['Topic {}'.format(i)] = {
            'keywords': [
                (dictionary.get(word_id), weight)
                for word_id, weight in sorted_words[:num_words]
            ]
        }
    return lda_output
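For instance, the function might be called like this (a sketch: the documents are made up, and this pre_processor is only a trivial stand-in for our actual text-cleaning helper):

import re

def pre_processor(document):
    # stand-in cleaning step: lowercase and keep letters only
    return re.sub(r'[^a-z\s]', ' ', document.lower())

documents = [
    "Flooding has displaced hundreds of families in the northern region.",
    "Cholera cases are rising in camps hosting displaced families.",
    "Food assistance has reached the drought affected districts.",
]
topics = find_topics(documents, num_topics=2)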
Now, given some documents, running the above function would produce output similar to the following:
{ "Topic 1": { "keywords": ["kw11, "kw12", ...] }, "Topic 2": { "keywords": ["kw21, "kw22", ...] }, ...}
This gives the topics present in the documents as lists of important keywords. One thing to notice is that the topics returned are not named: we only get keywords, not a label for each topic. We needed actual topic names for our application, so we tackled this with a very simple but good-enough approach: taking the most relevant keyword of each topic as its name, as sketched below.
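As an illustration, this is roughly how that naming step can be done on top of the output of find_topics (a sketch; label_topics is a hypothetical helper, not part of our codebase):

def label_topics(lda_output):
    # keywords are already sorted by weight, so the first entry names the topic
    return {
        topic: data['keywords'][0][0]
        for topic, data in lda_output.items()
    }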
The next problem with the above result is that we are also interested in the hierarchical composition of the topics in the documents, but this only gives a flat composition. Unfortunately, we could not find any library that fit our purpose of hierarchical modeling, so we applied another simple approach to achieve the hierarchy: recursion.
Our idea for achieving hierarchical topic modeling consists of the following steps: find the topics in the given documents, group the documents under the topics they belong to, and then recursively repeat the same topic modeling on each group of documents until the desired depth is reached.
The following is the simplified code for the above steps.
def find_topics_and_subtopics(documents, num_topics, depth=5,
                              dictionary=None, corpus=None,
                              num_words=10, passes=10):
    """
    Return the keywords for topics discovered
    @documents: documents to be analyzed
    @num_topics: number of topics we wish to find
    @depth: depth of hierarchy
    @dictionary, @corpus: passed in by the recursive calls
    """
    # Tokenize only when we still have raw documents; the recursive calls
    # already provide the dictionary and the bag-of-words corpus
    if dictionary is None or corpus is None:
        texts = [pre_processor(document).split() for document in documents]
    if dictionary is None:
        dictionary = corpora.Dictionary(texts)
    if corpus is None:
        corpus = [dictionary.doc2bow(text) for text in texts]
    ldamodel = models.ldamodel.LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=passes,
    )
    lda_output = {}
    topics = ldamodel.get_topics()
    for i, topic_words in enumerate(topics):
        # sort keywords based on their weights/contributions to the topic
        sorted_words = sorted(
            enumerate(topic_words),
            key=lambda x: x[1],
            reverse=True,
        )
        lda_output['Topic {}'.format(i)] = {
            'keywords': [
                (dictionary.get(word_id), weight)
                for word_id, weight in sorted_words[:num_words]
            ],
            'subtopics': {},
        }
    if depth <= 1:
        return lda_output
    # Group documents under the topics they belong to
    topic_documents = {}
    for j, doc_bow in enumerate(corpus):
        for topic_id, prob in ldamodel.get_document_topics(doc_bow):
            topic = 'Topic {}'.format(topic_id)
            curr_set = topic_documents.get(topic, set())
            curr_set.add(j)
            topic_documents[topic] = curr_set
    # Now the recursive call, run on each topic's own documents
    for topic in lda_output:
        if topic_documents.get(topic):
            topic_docs = [corpus[x] for x in topic_documents[topic]]
            lda_output[topic]['subtopics'] = find_topics_and_subtopics(
                topic_docs,
                num_topics,
                depth - 1,
                dictionary,
                topic_docs,
                num_words=num_words,
                passes=passes,
            )
    return lda_output
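Invocation looks the same as before, since dictionary and corpus are only passed internally by the recursive calls. A small sketch for inspecting the nested result (print_tree is a hypothetical helper, shown only for illustration):

hierarchy = find_topics_and_subtopics(documents, num_topics=2, depth=2)

def print_tree(output, indent=0):
    # walk the nested result, naming each topic after its top keyword
    for topic, data in output.items():
        top_keyword = data['keywords'][0][0]
        print('  ' * indent + '{} ({})'.format(topic, top_keyword))
        print_tree(data.get('subtopics', {}), indent + 1)

print_tree(hierarchy)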
Running find_topics_and_subtopics would produce output similar to the following:
{ "Topic 1": { "keywords": ["kw11, "kw12", ...], "subtopics": { "Topic 1": { "keywords": ["kw1.11", "kw1.12", ... ], "subtopics": { ... } }, ... } }, "Topic 2": { "keywords": ["kw21, "kw22", ...], "subtopics": { "Topic 1": { "keywords": ["kw2.21", "kw2.22", ... ], "subtopics": { ... } }, ... } }, ...}
The following are the problems that we’ve faced:
gensim implements Latent Dirichlet Allocation (LDA), which is a statistical method and therefore starts from a random initialization of topics and keywords. So, unless we run more iterations, running the algorithm multiple times on the same documents yields quite different results.
The first two problems are inherent in the algorithm itself, and dealing with them is just a matter of finding the balance between stable results and faster calculation. For the third problem, however, we've implemented server-side caching to avoid re-running the algorithm when the documents have not changed, roughly as sketched below.
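A simplified sketch of that caching idea (not our actual server code; the hash-based key and the module-level cache are illustrative choices):

import hashlib
import json

_topic_cache = {}

def cached_find_topics(documents, num_topics, depth=5):
    # key the cache on the document contents and the parameters
    key = hashlib.sha256(
        json.dumps([documents, num_topics, depth]).encode('utf-8')
    ).hexdigest()
    if key not in _topic_cache:
        _topic_cache[key] = find_topics_and_subtopics(documents, num_topics, depth)
    return _topic_cache[key]

Fixing the random_state argument of gensim's LdaModel (or simply increasing passes) also helps make repeated runs more reproducible.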