How do you remove stop words from a string in Python without NLTK?

Stopwords are common words (English ones in this example) that add little meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. Examples include words such as the, he, and have. NLTK already collects such words in a corpus named stopwords. We first download it to our Python environment

import nltk
nltk.download('stopwords')

This downloads the stopwords corpus, which includes the English stopword list

Verifying the Stopwords

from nltk.corpus import stopwords

# words() without an argument concatenates the lists of all languages;
# in the corpus version used here, this slice falls inside the English list
print(stopwords.words()[620:680])

When we run the above program, we get the following output:

['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at']

The languages for which stopword lists are available, English included, can be listed as follows

from nltk.corpus import stopwords
print(stopwords.fileids())

When we run the above program, we get the following output:

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish',
'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian',
'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian',
'spanish', 'swedish', 'turkish']

Example

The example below shows how stopwords are removed from a list of words

from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree','near','the','river']
for word in all_words: 
    if word not in en_stops:
        print(word)

When we run the above program, we get the following output. Note that 'There' is kept: the stopword list is all lowercase and the comparison is case-sensitive

There
tree
near
river
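
Since the question is how to do this without NLTK, here is a minimal sketch of the same filter in plain Python. The small `EN_STOPS` set and the `remove_stopwords` helper are assumptions made for illustration and are not part of any library; a real list would be much longer. Lowercasing each word before the lookup also drops 'There', which the case-sensitive loop above kept.

```python
# A tiny hand-written stop word set (illustrative only, far from complete).
EN_STOPS = {"there", "is", "a", "the", "an", "of", "at", "and"}


def remove_stopwords(words, stops=EN_STOPS):
    """Return the words whose lowercase form is not in the stop word set."""
    return [word for word in words if word.lower() not in stops]


all_words = ["There", "is", "a", "tree", "near", "the", "river"]
filtered = remove_stopwords(all_words)
print(filtered)  # ['tree', 'near', 'river']
```

Using a set rather than a list keeps each membership test constant-time, which matters once the stop word list grows to a few hundred entries.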

We are well aware that computers can easily process numbers if programmed well. 🧑🏻‍💻 However, most of the information we have is in the form of text. 📗 We communicate with each other by talking directly or by using text messages, social media posts, phone calls, video calls, etc. To build intelligent systems, we need to make use of this information that we have in abundance

Natural Language Processing (NLP) is the branch of Artificial Intelligence that allows machines to interpret human language. 👍🏼 However, raw text cannot be used directly by a machine; it has to be pre-processed first

Text pre-processing is the process of preparing text data so that machines can use it to perform tasks such as analysis and prediction. There are many different steps in text pre-processing, but in this article we will only look at stop words, why we remove them, and the different libraries that can be used to remove them

So, let's get started. 🏃🏽‍♀️

What are stop words?

The words that are generally filtered out before processing natural language are called stop words. These are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and they add little information to the text. Examples of stop words in English are “the”, “a”, “an”, “so”, and “what”

Why do we remove stop words?

Stop words are abundant in any human language. By removing them, we strip low-level information from the text so that more focus goes to the important information. In other words, for many tasks the removal of such words has no negative consequences for the model we train

Removing stop words reduces the dataset size and thus the training time, because fewer tokens are involved in training
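
The saving can be estimated by counting tokens before and after filtering. The sketch below uses a small hand-written stop word set and a sample sentence, both chosen only for illustration; the actual reduction depends on the corpus and on the list used.

```python
# Illustrative stop word set; real lists contain well over a hundred entries.
STOPS = {"the", "is", "on", "a", "and", "of"}

sentence = "The cat is on the roof of a house and the dog is on the lawn"
tokens = sentence.split()
kept = [tok for tok in tokens if tok.lower() not in STOPS]

# Fraction of tokens that the filter removed.
reduction = 1 - len(kept) / len(tokens)
print(len(tokens), len(kept), f"{reduction:.0%}")
```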

Do we always remove stop words?

The answer is no. 🙅‍♂️

We do not always remove stop words. Whether to remove them depends heavily on the task we are performing and the goal we want to achieve. For example, if we are training a model for sentiment analysis, we might not remove the stop words

Movie review: “The movie was not good at all.”

Text after removal of stop words: “movie good”

We can clearly see that the review of the movie was negative. After the removal of stop words, however, the review reads as positive, which is not the case. Thus, removing stop words can be problematic here
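
The effect is easy to reproduce. The sketch below uses a small illustrative stop word set that contains "not" (as NLTK's English list does), and then a variant with the negation words taken out. The set contents and the `strip_stopwords` helper are assumptions made for illustration.

```python
import string

# Illustrative subset of an English stop word list; note that it contains "not".
STOPS = {"the", "was", "not", "at", "all", "no", "nor"}
NEGATIONS = {"not", "no", "nor"}


def strip_stopwords(text, stops):
    """Lowercase, drop surrounding punctuation, and remove stop words."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return " ".join(w for w in words if w and w not in stops)


review = "The movie was not good at all."
print(strip_stopwords(review, STOPS))              # movie good
print(strip_stopwords(review, STOPS - NEGATIONS))  # movie not good
```

Keeping the negation words is one simple way to preserve polarity for sentiment tasks while still dropping the rest of the list.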

Tasks like text classification generally do not need stop words, because the other words in the dataset are more important and give the general idea of the text. So we usually remove stop words in such tasks

In a nutshell, NLP has many tasks that cannot be accomplished properly once stop words are removed, so think before performing this step. The catch here is that no rule is universal and no stop word list is universal. A list that conveys no important information for one task can convey a lot of information for another

A word of caution: before removing stop words, research your task and the problem you are trying to solve a little, and then make your decision

What are the different libraries for removing stop words?

NLP is one of the most researched areas today, and there have been many revolutionary developments in the field. NLP relies on advanced computational skills, and developers all over the world have created many different tools to handle human language. Of the many libraries out there, a few are quite popular and help a lot with various NLP tasks

Some of the libraries used for removing English stop words, along with their stop word lists and the corresponding code, are given below

Natural Language Toolkit (NLTK)

NLTK is an amazing library for playing with natural language. When you start your NLP journey, this is likely the first library you will use. The steps to import the library and the English stop word list are given below

import nltk
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')
print(sw_nltk)

Output

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Let us check how many stop words this library has

print(len(sw_nltk))

Output

179

Let us remove the stop words from the text

text = "When I first met her she was very quiet. She remained quiet during the entire two hour long journey from Stony Brook to New York."
words = [word for word in text.split() if word.lower() not in sw_nltk]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

The code above is quite simple, but I will still explain it for beginners. I split the text into words, because the stop words are a list of words. I then lowercase each word, because all the words in the stop word list are lowercase. Next I build a list of all the words that are not in the stop word list. Finally, the resulting list is joined to form the sentence again. Splitting on whitespace leaves punctuation attached, which is why tokens such as “quiet.” appear in the output

Output

first met quiet. remained quiet entire two hour long journey Stony Brook New York.
Old length:  129
New length:  82

We can clearly see that the removal of stop words reduced the length of the sentence from 129 to 82 characters

Please note that I will use similar code to explain stop words in each library

spaCy

spaCy is an open-source software library for advanced NLP. It is quite popular now, and NLP practitioners use it to get their work done in the best way

import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words
print(sw_spacy)

Output

{'those', 'on', 'own', '’ve', 'yourselves', 'around', 'between', 'four', 'been', 'alone', 'off', 'am', 'then', 'other', 'can', 'regarding', 'hereafter', 'front', 'too', 'used', 'wherein', '‘ll', 'doing', 'everything', 'up', 'onto', 'never', 'either', 'how', 'before', 'anyway', 'since', 'through', 'amount', 'now', 'he', 'was', 'have', 'into', 'because', 'not', 'therefore', 'they', 'n’t', 'even', 'whom', 'it', 'see', 'somewhere', 'thereupon', 'nothing', 'whereas', 'much', 'whenever', 'seem', 'until', 'whereby', 'at', 'also', 'some', 'last', 'than', 'get', 'already', 'our', 'once', 'will', 'noone', "'m", 'that', 'what', 'thus', 'no', 'myself', 'out', 'next', 'whatever', 'although', 'though', 'which', 'would', 'therein', 'nor', 'somehow', 'whereupon', 'besides', 'whoever', 'ourselves', 'few', 'did', 'without', 'third', 'anything', 'twelve', 'against', 'while', 'twenty', 'if', 'however', 'herself', 'when', 'may', 'ours', 'six', 'done', 'seems', 'else', 'call', 'perhaps', 'had', 'nevertheless', 'where', 'otherwise', 'still', 'within', 'its', 'for', 'together', 'elsewhere', 'throughout', 'of', 'others', 'show', '’s', 'anywhere', 'anyhow', 'as', 'are', 'the', 'hence', 'something', 'hereby', 'nowhere', 'latterly', 'say', 'does', 'neither', 'his', 'go', 'forty', 'put', 'their', 'by', 'namely', 'could', 'five', 'unless', 'itself', 'is', 'nine', 'whereafter', 'down', 'bottom', 'thereby', 'such', 'both', 'she', 'become', 'whole', 'who', 'yourself', 'every', 'thru', 'except', 'very', 'several', 'among', 'being', 'be', 'mine', 'further', 'n‘t', 'here', 'during', 'why', 'with', 'just', "'s", 'becomes', '’ll', 'about', 'a', 'using', 'seeming', "'d", "'ll", "'re", 'due', 'wherever', 'beforehand', 'fifty', 'becoming', 'might', 'amongst', 'my', 'empty', 'thence', 'thereafter', 'almost', 'least', 'someone', 'often', 'from', 'keep', 'him', 'or', '‘m', 'top', 'her', 'nobody', 'sometime', 'across', '‘s', '’re', 'hundred', 'only', 'via', 'name', 'eight', 'three', 'back', 'to', 'all', 
'became', 'move', 'me', 'we', 'formerly', 'so', 'i', 'whence', 'under', 'always', 'himself', 'in', 'herein', 'more', 'after', 'themselves', 'you', 'above', 'sixty', 'them', 'your', 'made', 'indeed', 'most', 'everywhere', 'fifteen', 'but', 'must', 'along', 'beside', 'hers', 'side', 'former', 'anyone', 'full', 'has', 'yours', 'whose', 'behind', 'please', 'ten', 'seemed', 'sometimes', 'should', 'over', 'take', 'each', 'same', 'rather', 'really', 'latter', 'and', 'ca', 'hereupon', 'part', 'per', 'eleven', 'ever', '‘re', 'enough', "n't", 'again', '‘d', 'us', 'yet', 'moreover', 'mostly', 'one', 'meanwhile', 'whither', 'there', 'toward', '’m', "'ve", '’d', 'give', 'do', 'an', 'quite', 'these', 'everyone', 'towards', 'this', 'cannot', 'afterwards', 'beyond', 'make', 'were', 'whether', 'well', 'another', 'below', 'first', 'upon', 'any', 'none', 'many', 'serious', 'various', 're', 'two', 'less', '‘ve'}

Quite a long list. Let us check how many stop words this library has

print(len(sw_spacy))

Output

326

Wow, 326. Let us remove the stop words from our previous text

words = [word for word in text.split() if word.lower() not in sw_spacy]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

Output

met quiet. remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  72

We can clearly see that the removal of stop words reduced the length of the sentence from 129 to 72, even shorter than with NLTK, because the spaCy library has more stop words than NLTK. The results are, however, quite similar in this case

Gensim

Gensim (Generate Similar) is an open-source software library that uses modern statistical machine learning. According to Wikipedia, Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.

from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS
print(STOPWORDS)

Quite a long list again. Let us check how many stop words this library has

print(len(STOPWORDS))

Hmm, a count similar to spaCy's. Let us remove the stop words from our text

new_text = remove_stopwords(text)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

We can see that it is quite easy to remove stop words using the Gensim library

Output

When I met quiet. She remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  83

Stop word removal reduced the length of the sentence from 129 to 83. We can see that even though the number of stop words in spaCy and Gensim is similar, the resulting texts are quite different
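
One plausible explanation for the gap between 72 and 83 characters, stated here as an assumption rather than as documented Gensim behaviour, is case handling: the list comprehensions in this article lowercase each word before the lookup, whereas a filter that compares words as written keeps capitalised words. The sketch below reproduces both lengths in plain Python with an illustrative subset of stop words (only those that occur in the sample sentence).

```python
text = ("When I first met her she was very quiet. She remained quiet during "
        "the entire two hour long journey from Stony Brook to New York.")

# Illustrative subset: only the stop words that occur in the sample sentence.
stops = {"when", "i", "first", "her", "she", "was", "very",
         "during", "the", "two", "from", "to"}

# Case-insensitive lookup, as in the list comprehensions used in this article.
insensitive = " ".join(w for w in text.split() if w.lower() not in stops)

# Case-sensitive lookup: capitalised words such as "When" are not matched.
sensitive = " ".join(w for w in text.split() if w not in stops)

print(len(insensitive), insensitive)
print(len(sensitive), sensitive)
```

If consistent results across libraries matter, lowercasing the text before filtering is a simple way to remove this source of difference.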

Scikit-Learn

Scikit-Learn needs no introduction. It is a free machine learning library for Python, and arguably one of the most powerful libraries for machine learning

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

Quite a long list again. Let us check how many stop words this library has

print(len(ENGLISH_STOP_WORDS))

Let us remove the stop words from our text

words = [word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

Output

met quiet. remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  72

Stop word removal reduced the length of the sentence from 129 to 72. We can see that Scikit-learn and spaCy produce the same result

Can I add my own stop words to the list?

Yes, we can also add custom stop words to the stop word lists available in these libraries to serve our purpose

Here is the code to add some custom stop words to NLTK's stop word list

# the four words below are illustrative placeholders; use whatever your task requires
sw_nltk.extend(['movie', 'film', 'cinema', 'review'])
print(len(sw_nltk))

Output

183

We can see that the length of the NLTK stop word list is now 183 instead of 179. We can now use the same code as before to remove stop words from our text

Can I remove stop words from the premade list?

Yes, if we want, we can also remove stop words from the lists available in these libraries

Here is the code using the NLTK library

sw_nltk.remove('not')

The stop word 'not' is now removed from the stop word list

Depending on the library you are using, you can carry out the relevant operations to add or remove stop words from the premade list. I am pointing this out because NLTK returns a list of stop words, whereas the other libraries return a set of stop words
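
The difference between the two container types can be sketched in plain Python. The contents below are illustrative stand-ins, and whether a given library hands back a mutable set or an immutable frozenset is something to check with `type(...)` in your installed version.

```python
# Stand-ins for the two container types (contents are illustrative).
stops_list = ["i", "me", "not"]               # NLTK-style: a mutable list
stops_frozen = frozenset({"i", "me", "not"})  # an immutable frozenset

# A list is edited in place with list methods.
stops_list.extend(["movie", "film"])
stops_list.remove("not")

# A frozenset cannot be edited; set operators build a new set instead.
custom_set = (stops_frozen | {"movie", "film"}) - {"not"}

print(sorted(stops_list))
print(sorted(custom_set))
```

Converting a list to a set once, with `set(stops_list)`, also makes the later membership tests faster.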

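If we do not want to use any of these libraries, we can also create our own custom stop word list and use it in our task. This is usually done when we have domain expertise in our field and know which words to avoid for the task at hand

The code below shows how simple this is

# a short illustrative list; replace it with the words that matter for your task
my_stop_words = ['her', 'me', 'i', 'she', 'it']
words = [word for word in text.split() if word.lower() not in my_stop_words]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

Output

When first met was very quiet. remained quiet during the entire two hour long journey from Stony Brook to New York.
Old length:  129
New length:  115

In the same way, you can create your own stop word list for your task and use it. 🤟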

We have observed in this article that different libraries have different sets of stop words, and we can clearly say that stop words are the most frequently used words in any language

Although you can use any of these libraries to remove stop words from your text, it is highly advisable to use the same library for your entire text pre-processing task

Thank you, everyone, for reading this. Do share your valuable feedback or suggestions about this post. Happy reading. 📗 🖌

How do you remove custom stop words in Python?

Use a regular-expression substitution function and replace every match with an empty string
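
A minimal sketch of that approach with the standard `re` module follows. The `custom_stops` list and the sample sentence are assumptions chosen for illustration.

```python
import re

# Illustrative custom stop words; choose your own for a real task.
custom_stops = ["her", "she", "the", "was"]

# \b anchors each alternative to word boundaries, so "the" does not match
# inside "other"; the trailing \s* also swallows the space after a match.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, custom_stops)) + r")\b\s*",
    flags=re.IGNORECASE,
)

text = "She was quiet when the other guests met her friend."
cleaned = pattern.sub("", text).strip()
print(cleaned)  # quiet when other guests met friend.
```

`re.escape` guards against stop words that contain characters with a special meaning in regular expressions.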

What are stopwords and how do you remove them?

Stop words are the words that are generally filtered out before processing natural language. They are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and add little information to the text.

How do you remove stop words and punctuation in Python?

To remove stopwords and punctuation using NLTK, we first have to download the stopwords with nltk.download('stopwords'). We then have to specify the language whose stopwords we want to remove, so we use stopwords.words('english') to select them and store the result in a variable
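
The punctuation step itself needs only the standard library. The sketch below is a plain-Python variant under stated assumptions: the small `STOPS` set is an illustrative stand-in for the NLTK list, and only ASCII punctuation from `string.punctuation` is stripped.

```python
import string

# Illustrative stop word set; with NLTK this would be set(stopwords.words('english')).
STOPS = {"i", "her", "she", "was", "very", "when", "the", "during", "from", "to"}

text = "When I first met her she was very quiet."

# str.translate with a deletion table strips every ASCII punctuation character.
no_punct = text.translate(str.maketrans("", "", string.punctuation))

# Filter the remaining words with a case-insensitive lookup.
words = [w for w in no_punct.split() if w.lower() not in STOPS]
result = " ".join(words)
print(result)  # first met quiet
```

Removing punctuation before the lookup also lets tokens such as "quiet." match a stop word list entry when one exists.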