Natural Language Processing with Disaster Tweets: Study Notes

Nothing fancy here; I'm just reading through the code to get started.

Dataset

Dataset: https://www.kaggle.com/competitions/nlp-getting-started/data

For learning, I followed this notebook:

https://www.kaggle.com/code/airundatago/basic-eda-cleaning-and-glove/edit
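
The cleaning snippets below operate on a pandas DataFrame called df, which the notebook builds from the competition's train.csv. A minimal loading sketch (the local file path is my assumption):

import pandas as pd

# train.csv comes from the competition data page linked above
df = pd.read_csv('train.csv')
df.head()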

Data cleaning

Let's jump straight into the data cleaning and see what it covers:

spelling correction, removing punctuation, removing HTML tags and emojis, etc.

These are the aspects the text needs to be processed for.

They feel like fairly classic steps, so they are worth recording and studying.

Removing URLs

example="New competition launched :https://www.kaggle.com/c/nlp-getting-started"
def remove_URL(text):
url = re.compile(r'https?://\S+|www\.\S+')
return url.sub(r'',text)

remove_URL(example)
df['text'] = df['text'].apply(lambda x: remove_URL(x))

Removing HTML tags

example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""
def remove_html(text):
html=re.compile(r'<.*?>')
return html.sub(r'',text)
print(remove_html(example))

Removing Emojis

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

Removing punctuation

import string

def remove_punct(text):
    # translation table that maps every punctuation character to None
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

example = "I am a #king"
print(remove_punct(example))

Spelling Correction

!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            # correction() can return None in newer pyspellchecker versions,
            # so fall back to the original word in that case
            corrected = spell.correction(word)
            corrected_text.append(corrected if corrected else word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

text = "corect me plese"
correct_spellings(text)
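
Putting it together, the notebook applies these cleaners to the text column one after another, just like the remove_URL example earlier. A sketch of the full pass, assuming df is the training DataFrame loaded above (spell correction over the whole dataset is slow, so it is often left out or run on a sample):

df['text'] = df['text'].apply(remove_URL)
df['text'] = df['text'].apply(remove_html)
df['text'] = df['text'].apply(remove_emoji)
df['text'] = df['text'].apply(remove_punct)
# expensive; enable only if runtime allows
# df['text'] = df['text'].apply(correct_spellings)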

Use GloVe

Use a pretrained GloVe embedding model to represent our words as dense vectors.
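
A common way to do this with Keras (not necessarily the notebook's exact code): tokenize the cleaned tweets, read the pretrained GloVe vectors, and build an embedding matrix for an Embedding layer. The glove.6B.100d.txt file name, the 100-dimensional vectors, maxlen=50, and the variable names are all my assumptions.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# turn the cleaned tweets into padded integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
padded = pad_sequences(sequences, maxlen=50, padding='post', truncating='post')

# read the pretrained GloVe vectors into a word -> vector dictionary
embedding_dim = 100
embedding_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embedding_index[word] = np.asarray(values[1:], dtype='float32')

# one row per word in our vocabulary; words missing from GloVe stay as zero vectors
num_words = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

The embedding_matrix would then typically be passed as the initial weights of a Keras Embedding layer, often with trainable=False, so the model starts from the fixed pretrained GloVe vectors.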