NLP Series: Word2Vec from Scratch with run哥

Posted on 2022-09-08 Edited on 2023-02-27 In NLP

Let's talk about word2vec, a classic model in NLP.

https://www.bilibili.com/video/BV1hU4y1q7Ru/?spm_id_from=333.788&vd_source=a2c222cd47523aaf38b1191c9089cd79

Background

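The earliest representation, one-hot encoding, does not capture relationships between words well: any two different words get orthogonal vectors.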
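To make that concrete, here is a minimal sketch with a made-up three-word vocabulary (the words are only for illustration). Any two distinct one-hot vectors have a dot product of 0, so cosine similarity cannot tell a related pair from an unrelated one.

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only
vocab = ["cat", "dog", "car"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_cat_dog = cosine(one_hot[0], one_hot[1])
sim_cat_car = cosine(one_hot[0], one_hot[2])
print(sim_cat_dog, sim_cat_car)  # both similarities are zero
```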

SVD-based decomposition is computationally expensive, and the result is hard to interpret.

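Word vectors were proposed after that.

Distributed representation / dense representation

On the timeline, word2vec arrived in 2013 and made training much faster, which is why word vectors became popular.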

Let's go straight to the code.

Learning resources

https://github.com/shouxieai/word_2_vec

Concepts

word_2_index,index_2_word,word_2_onehot

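Let's tell these three apart: word_2_index maps each word to an integer index, index_2_word is the list that maps an index back to its word, and word_2_onehot maps each word to its one-hot row vector.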
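Here is a small sketch of what the three structures look like. The two-sentence corpus is made up and already segmented, so treat it purely as an illustration.

```python
import numpy as np

# Hypothetical pre-segmented toy corpus, for illustration only
data = [["今天", "天气", "真好"], ["天气", "不错"]]

# index -> word: a list of distinct words in first-seen order
index_2_word = []
for words in data:
    for word in words:
        if word not in index_2_word:
            index_2_word.append(word)

# word -> index: the inverse lookup
word_2_index = {word: index for index, word in enumerate(index_2_word)}

# word -> one-hot row vector of shape (1, vocabulary size)
word_2_onehot = {}
for word, index in word_2_index.items():
    vec = np.zeros((1, len(index_2_word)))
    vec[0, index] = 1
    word_2_onehot[word] = vec

print(index_2_word)
print(word_2_index["天气"])
print(word_2_onehot["天气"])
```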

CBOW & skip-gram

For a short introduction, see this link: https://zhuanlan.zhihu.com/p/37477611

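Skip-gram uses the center word to predict the surrounding words. CBOW goes the other way: it uses the surrounding words to predict the center word.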

word_size is the number of distinct words in the corpus, i.e. the vocabulary size. embedding_num is the number of dimensions chosen to represent each word.

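For example, in the sentence 今天天气真好 ("the weather is really nice today"), we use 天气 ("weather") to predict 真好 ("really nice").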
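The pairing can be sketched like this. The helper name skipgram_pairs and the segmentation 今天 / 天气 / 真好 are assumptions made for illustration; a real segmenter may split the sentence differently.

```python
def skipgram_pairs(words, window):
    """Return (center, context) pairs, taking up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(words):
        context = words[max(i - window, 0):i] + words[i + 1:i + window + 1]
        pairs.extend((center, c) for c in context)
    return pairs


# Assumed segmentation of the example sentence
words = ["今天", "天气", "真好"]
pairs = skipgram_pairs(words, window=1)
print(pairs)
```

Each pair is one training sample: the first element is the input (center word) and the second is the target (context word).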

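The core idea of word -> vec: multiply a high-dimensional one-hot vector by a matrix to get a low-dimensional vector, which amounts to dimensionality reduction.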
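A quick sketch of why this works: multiplying a one-hot row vector by a matrix just picks out one row of that matrix, so the matrix is effectively a lookup table of word vectors. The sizes below are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
word_size, embedding_num = 5, 3  # toy sizes, for illustration only
w1 = rng.normal(size=(word_size, embedding_num))

index = 2
one_hot = np.zeros((1, word_size))
one_hot[0, index] = 1

hidden = one_hot @ w1  # shape (1, embedding_num)
print(hidden.shape)
print(np.allclose(hidden[0], w1[index]))  # the product equals row `index` of w1
```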

w1 and w2 are initialized randomly and then optimized iteratively by minimizing the loss function.

The output scores are normalized with softmax to get the predicted probabilities.
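For reference, here is a sketch of the forward pass and the gradients under the usual softmax plus cross-entropy setup. The symbols are my own notation, not the author's: $x$ is the one-hot row vector of the center word, $y$ the one-hot row vector of the context word, $V$ the vocabulary size and $d$ the embedding dimension.

```latex
\begin{aligned}
h &= x W_1, \qquad W_1 \in \mathbb{R}^{V \times d} \\
p &= h W_2, \qquad W_2 \in \mathbb{R}^{d \times V} \\
\hat{y}_i &= \frac{e^{p_i}}{\sum_{j=1}^{V} e^{p_j}} \\
L &= -\sum_{i=1}^{V} y_i \log \hat{y}_i \\
\frac{\partial L}{\partial p} &= \hat{y} - y \\
\frac{\partial L}{\partial W_2} &= h^{\top} (\hat{y} - y) \\
\frac{\partial L}{\partial h} &= (\hat{y} - y)\, W_2^{\top} \\
\frac{\partial L}{\partial W_1} &= x^{\top} \frac{\partial L}{\partial h}
\end{aligned}
```

Each weight matrix is then updated by subtracting the learning rate times its gradient.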

Code

First, segment the text with jieba, then remove the stop words.

import jieba
import numpy as np
import pandas as pd
import pickle
import os


def load_stop_words(file="stopwords.txt"):
    # Return a set so that "word not in stop_words" is a constant-time lookup
    with open(file,"r",encoding="utf-8") as f:
        return set(f.read().split("\n"))



def cut_words(file="数学原始数据.csv"):
    stop_words = load_stop_words()

    result = []
    all_data = pd.read_csv(file,encoding="gbk",names=["data"])["data"]
    for words in all_data:
        c_words = jieba.lcut(words)
        result.append([word for word in c_words if word not in stop_words])
    return result

Build the three mappings described above.

def get_dict(data):
    # Collect distinct words in first-seen order
    index_2_word = []
    for words in data:
        for word in words:
            if word not in index_2_word:
                index_2_word.append(word)

    word_2_index = {word:index for index,word in enumerate(index_2_word)}
    word_size = len(word_2_index)

    word_2_onehot = {}
    for word,index in word_2_index.items():
        one_hot = np.zeros((1,word_size))
        one_hot[0,index] = 1
        word_2_onehot[word] = one_hot

    return word_2_index,index_2_word,word_2_onehot

Training

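def softmax(x):
    # Row-wise softmax; subtracting the row max keeps np.exp from overflowing
    ex = np.exp(x - np.max(x, axis=1, keepdims=True))
    return ex / np.sum(ex, axis=1, keepdims=True)


if __name__ == '__main__':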
    data = cut_words()
    word_2_index, index_2_word, word_2_onehot = get_dict(data)
    
    word_size = len(word_2_index)
    embedding_num = 100
    lr = 0.01
    epoch = 10
    n_gram = 3 # context window: number of words taken on each side of the center word
    
    # Note: np.random.normal(loc, scale) draws from a normal with mean -1 and std 1 here
    w1 = np.random.normal(-1,1,size=(word_size,embedding_num))
    w2 = np.random.normal(-1,1,size=(embedding_num,word_size))
    
    for e in range(epoch):
        for words in data:
            for n_index, now_word in enumerate(words):
                now_word_onehot = word_2_onehot[now_word]
                other_words = words[max(n_index-n_gram,0):n_index] + words[n_index+1:n_index+n_gram+1]
                for other_word in other_words:
                    other_word_onehot = word_2_onehot[other_word]
                    
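                    # Forward pass: embedding lookup, output scores, softmax
                    hidden = now_word_onehot @ w1
                    p = hidden @ w2
                    pre = softmax(p)

                    # loss = -np.sum(other_word_onehot * np.log(pre))

                    # Backward pass for softmax + cross-entropy
                    G2 = pre - other_word_onehot
                    delta_w2 = hidden.T @ G2
                    G1 = G2 @ w2.T
                    delta_w1 = now_word_onehot.T @ G1

                    # Gradient descent update
                    w1 -= lr * delta_w1
                    w2 -= lr * delta_w2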
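Once training finishes, each row of w1 is the vector of one word. One common way to sanity-check the result is to look up a word's nearest neighbors by cosine similarity. The helper most_similar below is hypothetical (it is not part of the code above), and the embeddings are hand-written toy values rather than trained ones.

```python
import numpy as np


def most_similar(word, w1, word_2_index, index_2_word, top_k=3):
    """Hypothetical helper: rank the other words by cosine similarity of their rows in w1."""
    vec = w1[word_2_index[word]]
    norms = np.linalg.norm(w1, axis=1) * np.linalg.norm(vec)
    sims = (w1 @ vec) / np.maximum(norms, 1e-12)  # guard against zero vectors
    order = np.argsort(-sims)  # most similar first
    ranked = [(index_2_word[i], float(sims[i])) for i in order if index_2_word[i] != word]
    return ranked[:top_k]


# Toy embeddings written by hand for illustration only (not trained)
index_2_word = ["a", "b", "c"]
word_2_index = {w: i for i, w in enumerate(index_2_word)}
w1 = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

neighbors = most_similar("a", w1, word_2_index, index_2_word, top_k=2)
print(neighbors)
```

With real trained weights, you would pass the w1, word_2_index and index_2_word produced by the training script instead of the toy values.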