Natural Language Processing with Disaster Tweets: Study Notes

Nothing fancy here; I'm just reading through the code to get started.

Dataset

Dataset: https://www.kaggle.com/competitions/nlp-getting-started/data

For learning, I followed this notebook:

https://www.kaggle.com/code/airundatago/basic-eda-cleaning-and-glove/edit
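
The cleaning snippets below operate on a pandas DataFrame called df, which the notebook builds from the competition's train.csv. A minimal loading sketch (the local file path is my assumption):

import pandas as pd

# train.csv comes from the competition data page linked above
df = pd.read_csv('train.csv')
df.head()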

Data cleaning

Let's jump straight into the data cleaning and see what it covers:

spelling correction, removing punctuation, removing HTML tags and emojis, etc.

These are the aspects the text needs to be processed for.

They feel like fairly classic steps, so they are worth recording and studying.

Removing URLs

example="New competition launched :https://www.kaggle.com/c/nlp-getting-started"
def remove_URL(text):
url = re.compile(r'https?://\S+|www\.\S+')
return url.sub(r'',text)

remove_URL(example)
df['text'] = df['text'].apply(lambda x: remove_URL(x))

Removing HTML tags

example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""
def remove_html(text):
html=re.compile(r'<.*?>')
return html.sub(r'',text)
print(remove_html(example))

Removing Emojis

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

Removing punctuation

import string

def remove_punct(text):
    # translation table that maps every punctuation character to None
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

example = "I am a #king"
print(remove_punct(example))

Spelling Correction

!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            # correction() can return None in newer pyspellchecker versions,
            # so fall back to the original word in that case
            corrected = spell.correction(word)
            corrected_text.append(corrected if corrected else word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

text = "corect me plese"
correct_spellings(text)
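
Putting it together, the notebook applies these cleaners to the text column one after another, just like the remove_URL example earlier. A sketch of the full pass, assuming df is the training DataFrame loaded above (spell correction over the whole dataset is slow, so it is often left out or run on a sample):

df['text'] = df['text'].apply(remove_URL)
df['text'] = df['text'].apply(remove_html)
df['text'] = df['text'].apply(remove_emoji)
df['text'] = df['text'].apply(remove_punct)
# expensive; enable only if runtime allows
# df['text'] = df['text'].apply(correct_spellings)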

Use GloVe

Use a pretrained GloVe embedding model to represent our words as dense vectors.
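
A common way to do this with Keras (not necessarily the notebook's exact code): tokenize the cleaned tweets, read the pretrained GloVe vectors, and build an embedding matrix for an Embedding layer. The glove.6B.100d.txt file name, the 100-dimensional vectors, maxlen=50, and the variable names are all my assumptions.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# turn the cleaned tweets into padded integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
padded = pad_sequences(sequences, maxlen=50, padding='post', truncating='post')

# read the pretrained GloVe vectors into a word -> vector dictionary
embedding_dim = 100
embedding_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embedding_index[word] = np.asarray(values[1:], dtype='float32')

# one row per word in our vocabulary; words missing from GloVe stay as zero vectors
num_words = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

The embedding_matrix would then typically be passed as the initial weights of a Keras Embedding layer, often with trainable=False, so the model starts from the fixed pretrained GloVe vectors.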