How do I remove stop words from a CSV file with Python?

Question

Stop words are common words that add little meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence, for example words such as the, he, and have. NLTK already collects such words in a corpus named stopwords. We first download it to our Python environment:

Table of Contents

Verifying the Stopwords
How do you remove stop words in Python without NLTK?
How do I remove stop words from Excel with Python?
How do you remove stop words and punctuation with Python?
Which Python module is used to remove stop words?

import nltk
nltk.download('stopwords')

This downloads the stopwords corpus, which includes the English stopword list.

Verifying the Stopwords

from nltk.corpus import stopwords

# Called without a language, words() returns the lists of all languages combined.
print(stopwords.words()[620:680])

When we run the above program, we get the following output (the exact slice depends on the version of the NLTK data installed):

['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at']

The languages for which NLTK provides stopword lists, including English, can be listed as follows:

from nltk.corpus import stopwords
print(stopwords.fileids())

When we run the above program, we get the following output:

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish',
'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian',
'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian',
'spanish', 'swedish', 'turkish']

Example

The example below shows how stopwords are removed from a list of words:

from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree','near','the','river']
for word in all_words: 
    if word not in en_stops:
        print(word)

When we run the above program, we get the following output:

There
tree
near
river
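
Note that 'There' survives the filter above: NLTK's English list is all lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. The small `en_stops` set is a hand-written stand-in (an assumption, so the snippet runs without the NLTK data); in practice it would be `set(stopwords.words('english'))`.

```python
# Stand-in for set(stopwords.words('english')); hand-written for illustration.
en_stops = {"there", "is", "a", "the"}

all_words = ["There", "is", "a", "tree", "near", "the", "river"]

# Lowercase each word only for the comparison, so the original casing
# of the words that are kept is preserved.
kept = [word for word in all_words if word.lower() not in en_stops]
print(kept)  # ['tree', 'near', 'river']
```

Whether to lowercase before comparing depends on the task; for proper nouns that collide with stop words it may be preferable to keep the case-sensitive check.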


You are trying to check whether a list (the result of the regex) is in a set. This does not work, because a list is unhashable and cannot be looked up in a set. You need to loop over the list, or use a set operation, e.g. set(tw).difference(stop_words).

Just for clarity:

>>> tw = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", initial.lower()).split()
>>> tw
['this', 'is', 'an', 'example']
>>> set(tw).difference(stop_words)
{'example'}

Then just append the difference to clean_tw. Something like:

clean_tw = []
df = pd.read_csv(self.file_name, usecols=col_list)
stop_words = set(stopwords.words('english'))
# tw is the text of one row; strip unwanted characters and split it into words
tw = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
# keep only the words that are not stop words
clean_tw.append(set(tw).difference(stop_words))

Finally, you can define stop_words outside the loop, since it will always be the same set; this slightly improves performance.
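
Putting the pieces together for a CSV file, a minimal sketch might look like the following. It uses the standard-library csv module rather than pandas so that it runs without extra dependencies, and a simplified regex that only strips non-alphanumeric characters. The column name `text`, the helper `remove_stop_words`, the sample rows, and the small `stop_words` set are all assumptions for illustration. A list comprehension is used instead of a set difference because a set discards word order and duplicates.

```python
import csv
import io
import re

# Stand-in for set(stopwords.words('english')); defined once, outside the loop.
stop_words = {"this", "is", "an", "a", "the", "of"}

def remove_stop_words(text, stop_words):
    """Lowercase, strip non-alphanumeric characters, and drop stop words."""
    cleaned = re.sub(r"[^0-9A-Za-z \t]", "", text.lower())
    return [word for word in cleaned.split() if word not in stop_words]

# In-memory CSV standing in for a real file opened with open(path, newline="").
sample_csv = io.StringIO("id,text\n1,This is an example!\n2,The end of the story\n")

clean_tw = []
for row in csv.DictReader(sample_csv):
    clean_tw.append(remove_stop_words(row["text"], stop_words))

print(clean_tw)  # [['example'], ['end', 'story']]
```

For a real file, replacing `sample_csv` with a file object from `open(path, newline="")` should be the only change needed, provided the file has a header row containing the chosen column name.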

The process of converting data into something a computer can understand is called pre-processing. One of the major forms of pre-processing is filtering out useless data. In natural language processing, useless words (data) are referred to as stop words.

What are stop words?

Stop Words: A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We do not want these words to take up space in our database or to consume valuable processing time. We can remove them easily by storing a list of the words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships stopword lists for multiple languages (16 at the time the original article was written; the exact number depends on the installed version of the NLTK data). You can find them in the nltk_data directory. home/pratima/nltk_data/corpora/stopwords is the directory address. (Do not forget to change the home directory name to your own.)

To check the list of stopwords, you can type the following commands in the Python shell.

import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', ...
Note: You can even modify the list by adding words of your choice to the english file in the stopwords directory.

Removing stop words with NLTK

The following program removes stop words from a piece of text.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab').
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words. The comparison is
# case-sensitive, so the capitalised 'This' is kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

Output

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 
'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']

Performing stop-word operations on a file

In the code below, text.txt is the original input file from which stopwords are to be removed, and filteredtext.txt is the output file. This can be done using the following code.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Read the whole input file and split it on whitespace.
with open("text.txt") as infile:
    words = infile.read().split()

# Write every word that is not a stop word to the output file.
with open("filteredtext.txt", "w") as outfile:
    for word in words:
        if word not in stop_words:
            outfile.write(" " + word)

This is how we make the processed content more efficient: by removing words that do not contribute to any later operation.

This article was contributed by Pratima Upadhyay.


How do you remove stop words in Python without NLTK?

There are several ways to do this: remove every trailing "s" from the words, or duplicate your stop words and add an "s" to each, or use the len() function to check whether a part is an exact match. A second thing you might want to consider (and this is best done before stemming) ...
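
Without NLTK, a plain Python set of your own stop words is enough. The sketch below illustrates the second option mentioned above (duplicating each stop word with an added "s"). The word list and the names `base_stops`, `stops`, and `filter_words` are hypothetical, chosen only for this example.

```python
import re

# Hand-picked stop words; any list you maintain yourself would do.
base_stops = {"the", "a", "an", "and", "it", "thing"}

# Duplicate every stop word with a trailing "s" so simple plurals match too.
stops = base_stops | {word + "s" for word in base_stops}

def filter_words(text):
    """Split text into lowercase words and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in stops]

result = filter_words("The things and a river")
print(result)  # ['river']
```

Doubling the set keeps lookups exact and fast, at the cost of a few nonsense entries such as "thes" that never match anything.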

How do I remove stop words from Excel with Python?

Finally, you can also remove stop words from NLTK's default stop-word list. To do so, use the remove() function and pass the stop word you want to remove.
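
As a rough illustration of customising the list, the sketch below removes one word and adds another. The `stop_list` here is a short stand-in for the list returned by `stopwords.words('english')` (an assumption, so the snippet runs on its own); `remove()` and `append()` are ordinary Python list methods.

```python
# Stand-in for stopwords.words('english'), which returns a plain Python list.
stop_list = ["i", "not", "the", "is"]

# Keep "not" in the text (useful for sentiment analysis) by removing it
# from the stop-word list; remove() raises ValueError if the word is absent.
stop_list.remove("not")

# Add a domain-specific word that should be filtered out as well.
stop_list.append("rt")

stop_set = set(stop_list)  # a set makes membership tests fast
words = ["rt", "this", "is", "not", "the", "end"]
remaining = [w for w in words if w not in stop_set]
print(remaining)  # ['this', 'not', 'end']
```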

How do you remove stop words and punctuation with Python?

To remove stopwords and punctuation using NLTK, we first download all the stopwords with nltk.download('stopwords'). We then specify the language whose stopwords we want to remove, so we call stopwords.words('english') and store the result in a variable.
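
A minimal sketch of filtering both stop words and punctuation in one pass. `string.punctuation` comes from the standard library; the `stop_words` set is a small stand-in for `set(stopwords.words('english'))`, and the token list is typed in by hand rather than produced by a tokenizer (both assumptions, so the snippet runs without NLTK).

```python
import string

# Stand-in for set(stopwords.words('english')).
stop_words = {"this", "is", "a", "off", "the"}

# Tokens as produced by a tokenizer such as nltk.tokenize.word_tokenize.
tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
          "off", "the", "stop", "words", "filtration", "."]

# Drop a token if it is a stop word (case-insensitive) or a punctuation mark.
punctuation = set(string.punctuation)
filtered = [t for t in tokens
            if t.lower() not in stop_words and t not in punctuation]
print(filtered)
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

This only drops tokens that consist of a single punctuation character; punctuation attached to a word (for example "sentence,") has to be split off by the tokenizer first.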