Analysis of ECS 236th meeting abstracts(2) - word embedding by Word2Vec and SCDV
Introduction
This is a series of articles about language analysis of the ECS 236th meeting abstracts.
In this series, I explain the techniques used in my webapp, ECS Meeting Explorer. An introduction to the app is available in the article below,
ECS Meeting Explorer - webapp for scientific conference
My previous article, about data scraping, is at the following link,
Analysis of ECS 236th meeting abstracts(1) - data scraping with BeautifulSoup4
In this article, I will explain word embedding, that is, the vectorization of the words used in all abstract texts.
Preparation
Throughout this series, I use Python. Please install these libraries.
numpy > 1.14.5
pandas > 0.23.1
matplotlib > 2.2.2
beautifulsoup4 > 4.6.0
gensim > 3.4.0
scikit-learn > 0.19.1
scipy > 1.1.0
Before the analysis, please download all ECS 236th meeting abstracts from the official site. Unzip the archive and place it in the same directory as the Jupyter notebook.
Data scraping with BeautifulSoup4 was explained in my previous article, so please read it first!
Word embedding by Word2Vec
Word2Vec (W2V) is a machine learning model used to produce word embeddings, which map words to points in a vector space.
Word2Vec is a kind of unsupervised learning, so we don't have to label the training data. I appreciate this, because labeling is always a hard job.
In this experiment, we use the Word2Vec implementation in Gensim, so we don't have to build the model ourselves. Further information about Word2Vec is below,
models.word2vec – Word2vec embeddings(Gensim documentation)
Word2vec Tutorial | RARE Technologies
The original Word2Vec paper:
Distributed Representations of Words and Phrases and their Compositionality
Now, we have a list that contains the details of every abstract (title, authors, affiliation, session name, and contents) as follows,
> dic_all
[{'num': '0001',
'title': 'The Impact of Coal Mineral Matter (alumina and silica) on Carbon Electrooxidation in the Direct Carbon Fuel Cell',
'author': ['Simin Moradmanda', 'Jessica A Allena', 'Scott W Donnea'],
'affiliation': 'University of Newcastle',
'session': 'A01',
'session_name': 'Battery and Energy Technology Joint General Session',
'contents': 'Direct carbon fuel cell DCFC as an electrochemical device...',
'mod_contents': ['direct','carbon','fuel','cell', ... ,'melting'],
'vector': 0,
'url': '1.html'}, ... ]
Then, let's collect the word lists that were preprocessed for language analysis.
# make word list for W2V learning
docs = [i['mod_contents'] for i in dic_all]
This is the code for training the Word2Vec model. Only a few lines!
#Word2Vec model learning and save it.
from gensim.models.word2vec import Word2Vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = Word2Vec(docs, sg=1, size=200, window=5, min_count=30, workers=4, sample=1e-6, negative=5, iter=1000)
print('corpus = ',model.corpus_count)
Line 5, 'model = Word2Vec(docs, ...)', trains the Word2Vec model. The parameter 'size' sets the dimension of the word vectors, 200 in this case. Please see the Gensim documentation for the other parameters of this function.
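The call above uses the parameter names of Gensim 3.x. In Gensim 4.0 and later, 'size' was renamed to 'vector_size' and 'iter' to 'epochs' (and 'model.wv.vocab' was replaced by 'model.wv.key_to_index'). The helper below is only a sketch of how the same settings could be translated; the names `GENSIM4_RENAMES` and `to_gensim4_kwargs` are hypothetical and not part of Gensim.

```python
# Hypothetical helper: translate Gensim 3.x keyword names to their Gensim 4.x names.
GENSIM4_RENAMES = {"size": "vector_size", "iter": "epochs"}

def to_gensim4_kwargs(kwargs):
    """Return a copy of kwargs with the renamed Word2Vec parameters."""
    return {GENSIM4_RENAMES.get(key, key): value for key, value in kwargs.items()}

# The settings used in this article, written with Gensim 3.x names
params_v3 = dict(sg=1, size=200, window=5, min_count=30, workers=4,
                 sample=1e-6, negative=5, iter=1000)
params_v4 = to_gensim4_kwargs(params_v3)
print(params_v4)
```

With Gensim 4.x installed, the model could then be trained with `Word2Vec(docs, **params_v4)`.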
After training, build a vocabulary and the word vectors from the Word2Vec model.
The vocabulary is saved as a .npy file in the same directory.
import numpy as np

# make word dictionary: word -> row index (model.wv.vocab is the Gensim 3.x API)
vocab = [i for i in model.wv.vocab]
dictionary = {}
for n, j in enumerate(vocab):
    dictionary[j] = n
np.save('dictionary.npy', np.array([dictionary]))
# make word vectors from the model, in the same order as the dictionary
word_vectors = [model.wv[i] for i in model.wv.vocab]
word_vectors = np.array(word_vectors)
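Because the dictionary is stored as a one-element object array, reading it back requires `allow_pickle=True` in recent NumPy versions. The following is a minimal sketch with a toy dictionary and a temporary directory (both are assumptions for illustration, not the real vocabulary or path).

```python
import os
import tempfile

import numpy as np

# Toy dictionary standing in for the real vocabulary
toy_dictionary = {'direct': 0, 'carbon': 1, 'fuel': 2}

# Save it the same way as in the article, but into a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'dictionary.npy')
np.save(path, np.array([toy_dictionary]))

# Load the object array and take its single element to get the dict back
loaded = np.load(path, allow_pickle=True)[0]
print(loaded['fuel'])
```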
Now, we have the word list and the corresponding vectors.
In this vector space, the similarity between words is expressed by how close their vectors are. For such a high-dimensional vector space, we usually use cosine similarity.
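For reference, the standard definitions of cosine similarity and the corresponding cosine distance between two word vectors u and v are given below (the symbols are chosen here for illustration).

```latex
\[
\mathrm{sim}(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert},
\qquad
d_{\cos}(\mathbf{u}, \mathbf{v}) = 1 - \mathrm{sim}(\mathbf{u}, \mathbf{v})
\]
```

A similarity of 1 means the two vectors point in the same direction, so a word compared with itself always scores 1.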
The function for calculating word similarity is below,
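import numpy as np
import pandas as pd

def cos_sim(v1, v2):
    # cos_sim is not defined in the original post; assumed to be the standard cosine similarity
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def CalcSim(target, vectors, dictionary):
    # rank every vocabulary word by its cosine similarity to the target word
    target_vec = vectors[dictionary[target]]
    words = list(dictionary.keys())
    search_results = []
    for n, vector in enumerate(vectors):
        sim = cos_sim(target_vec, vector)
        result = {'num': n, 'value': words[n], 'similarity': sim}
        search_results.append(result)
    # pd.DataFrame replaces the deprecated pd.io.json.json_normalize
    summary_pd = pd.DataFrame(search_results, columns=['num', 'similarity', 'value'])
    summary_sorted = summary_pd.sort_values('similarity', ascending=False)
    return summary_sorted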
Okay, let's search for the words similar to 'sustainable', a recent buzzword.
target='sustainable'
summary_sorted = CalcSim(target, word_vectors, dictionary)
summary_sorted[:10]
The top 10 words in the result are shown below,
num | similarity | value |
---|---|---|
588 | 1 | sustainable |
105 | 0.648442 | renewable |
100 | 0.552662 | energy |
1625 | 0.54727 | fuels |
862 | 0.541807 | efficient |
1624 | 0.53353 | fossil |
13 | 0.521877 | electricity |
607 | 0.480525 | technologies |
138 | 0.472065 | production |
108 | 0.471985 | wind |
The word most similar to 'sustainable' is 'renewable'.
It's a satisfactory result, isn't it?
2-dimensional visualization of word vectors
As I mentioned, the size of the word vectors is 200.
It is impossible for human beings to picture such high-dimensional data, so dimension reduction is needed for visualization.
In this case, we will use Principal Component Analysis (PCA) to go from 200 to 100 dimensions, and then t-distributed Stochastic Neighbor Embedding (t-SNE) to go from 100 to 2. Both methods are implemented in scikit-learn.
The function for dimension reduction is this,
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE
from tqdm import tqdm

def tsne_reduction(dataset):
    n = dataset.shape[0]
    batch_size = 500
    # PCA from 200 to 100 dimensions, fitted batch by batch
    # (as in the original, a trailing partial batch is not used for fitting)
    ipca = IncrementalPCA(n_components=100)
    for i in tqdm(range(n // batch_size)):
        ipca.partial_fit(dataset[i*batch_size:(i+1)*batch_size])
    r_dataset = ipca.transform(dataset)
    # t-SNE from 100 to 2 dimensions
    # (recent scikit-learn releases rename n_iter to max_iter)
    r_tsne = TSNE(n_components=2, random_state=0, perplexity=50.0, n_iter=3000).fit_transform(r_dataset)
    return r_tsne

w2v_tsne = tsne_reduction(word_vectors)
Now, we can plot the 2-dimensional word vectors.
The left panel shows a scatter plot of all word vectors; the right panel highlights some points with their corresponding words.
Precomputation of word-topic vectors by SCDV
We could estimate the document vector of each abstract by averaging its word vectors with certain weights (such as tf-idf). In this case, however, I will apply a method named SCDV (Sparse Composite Document Vectors) to modify the word vectors first.
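For comparison, the simple weighted-average approach mentioned above could look like the sketch below. It is not used in this article; the function name `weighted_doc_vector` and the toy numbers are assumptions for illustration. Words outside the vocabulary are skipped, and a repeated word contributes once per occurrence, which plays the role of the term frequency.

```python
import numpy as np

def weighted_doc_vector(tokens, word_vectors, dictionary, weights):
    """Average the vectors of the known tokens, weighted per word (e.g. by idf)."""
    idx = [dictionary[t] for t in tokens if t in dictionary]
    if not idx:
        return np.zeros(word_vectors.shape[1])
    w = weights[idx]
    return (w[:, None] * word_vectors[idx]).sum(axis=0) / w.sum()

# Toy data: 2 words with 2-dimensional vectors and made-up weights
toy_dictionary = {'fuel': 0, 'cell': 1}
toy_vectors = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
toy_weights = np.array([1.0, 3.0])

doc_vec = weighted_doc_vector(['fuel', 'cell', 'unknown'], toy_vectors, toy_dictionary, toy_weights)
print(doc_vec)
```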
SCDV builds a document vector in 2 steps.
- Precomputation of word-topic vectors.
- Building sparse document vectors from the word-topic vectors.
In this section, I will explain the former step.
This is a flow chart of computing word-topic vectors (image from here). It is divided into 3 steps.
- Word vectors are grouped into several clusters with a soft clustering algorithm, which allows each word to belong to every cluster with a certain probability.
- Word-cluster vectors are made by multiplying each word vector by its probability of belonging to each cluster.
- All word-cluster vectors of a word are concatenated and weighted by idf (inverse document frequency) to form the word-topic vector.
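The last two steps above can be illustrated on toy numbers. In this sketch the cluster probabilities and idf weights are made up by hand rather than produced by a clustering model, so it only shows how the shapes and values combine.

```python
import numpy as np

# Toy inputs (made-up numbers, for illustration only)
toy_word_vectors = np.array([[1.0, 2.0],
                             [3.0, 4.0]])    # 2 words, 2 dimensions
toy_cluster_proba = np.array([[0.9, 0.1],
                              [0.2, 0.8]])   # P(cluster k | word i), each row sums to 1
toy_idf = np.array([1.0, 2.0])               # one idf weight per word

# Word-cluster vectors: shape (words, clusters, dimensions)
word_cluster = toy_cluster_proba[:, :, None] * toy_word_vectors[:, None, :]

# Concatenate over clusters and scale by idf: shape (words, clusters * dimensions)
word_topic = toy_idf[:, None] * word_cluster.reshape(len(toy_word_vectors), -1)

print(word_topic.shape)
print(word_topic)
```

Each row holds one word: the first two values belong to cluster 0 and the last two to cluster 1.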
This is a function to transform word vectors to word-topic vectors.
import numpy as np
from sklearn.mixture import GaussianMixture
from tqdm import tqdm

def WordTopicVectors(word_vectors):
    # Soft clustering with a Gaussian mixture model
    num_clusters = 30
    clf = GaussianMixture(n_components=num_clusters, covariance_type="full")
    clf.fit(word_vectors)
    idx_proba = clf.predict_proba(word_vectors)

    # Calculate word idf from the document frequency of each vocabulary word
    # (uses the global `docs` and `dictionary` built earlier)
    doc_freq = np.zeros(len(dictionary), dtype=np.uint32)
    for doc in tqdm(docs):
        for w in set(doc):  # count each word once per document
            if w in dictionary:
                doc_freq[dictionary[w]] += 1
    word_idf = np.log(len(docs) / doc_freq) + 1

    # Concatenate the probability-weighted word vector of every cluster, scaled by idf
    gmm_word_vectors = np.empty((word_vectors.shape[0], word_vectors.shape[1] * num_clusters))
    for n, (vector, proba, idf) in enumerate(zip(word_vectors, idx_proba, word_idf)):
        cluster_vector = np.hstack([vector * p for p in proba])
        gmm_word_vectors[n] = idf * cluster_vector
    return gmm_word_vectors

# Calculate word-topic vectors
gmm_word_vectors = WordTopicVectors(word_vectors)
In this function, we use a Gaussian mixture model for clustering. The original paper recommends 60 or more clusters, but here I chose 30 (because of memory constraints in the webapp).
The dimension of the word-topic vectors is 200 (original word vector) × 30 (number of clusters) = 6000.
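To get a feeling for the memory concern, the size of the word-topic matrix can be estimated with simple arithmetic. The sketch below assumes 8-byte floats (the default of `np.empty`) and a hypothetical vocabulary of 2000 words; the real vocabulary size depends on the corpus and on `min_count`.

```python
def word_topic_matrix_size(vocab_size, dim=200, num_clusters=30, bytes_per_value=8):
    """Return the word-topic dimension and the dense matrix size in bytes."""
    topic_dim = dim * num_clusters
    return topic_dim, vocab_size * topic_dim * bytes_per_value

# vocab_size=2000 is a hypothetical value, used only for illustration
topic_dim, n_bytes = word_topic_matrix_size(vocab_size=2000)
print(topic_dim)            # 6000
print(n_bytes / 1e6, "MB")  # 96.0 MB
```

Doubling the number of clusters to 60 would double both the vector length and the memory needed.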
Then, let's visualize them with t-SNE dimension reduction!
Compared with the word vectors from Word2Vec, the clusters of words are clearly separated.
This means that these vectors represent the relationship between words and topics well.
Let's see the details of each cluster and the corresponding words.
This figure clearly shows that words on the same topic belong to the same cluster.
Conclusion
In this article, I demonstrated word embedding by W2V and its modification by SCDV.
In the next article, I will explain how to build document vectors from these word-topic vectors!