Solution:
import spacy  # load spaCy
from nltk.corpus import stopwords

# 'en' is the spaCy 2.x shortcut; on spaCy 3.x use spacy.load("en_core_web_sm")
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
stops = stopwords.words("english")

def normalize(comment, lowercase, remove_stopwords):
    if lowercase:
        comment = comment.lower()
    comment = nlp(comment)
    lemmatized = list()
    for word in comment:
        lemma = word.lemma_.strip()
        if lemma:
            if not remove_stopwords or (remove_stopwords and lemma not in stops):
                lemmatized.append(lemma)
    return " ".join(lemmatized)

Data['Text_After_Clean'] = Data['Text'].apply(normalize, lowercase=True, remove_stopwords=True)
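As a quick sanity check, here is what normalize produces on a toy DataFrame; the one-row DataFrame and the sample sentence are made up for illustration:

import pandas as pd

Data = pd.DataFrame({'Text': ["The cats are running faster than the dogs."]})
Data['Text_After_Clean'] = Data['Text'].apply(normalize, lowercase=True, remove_stopwords=True)
print(Data['Text_After_Clean'][0])  # prints something like: "cat run fast dog ."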
It can be done easily with a few commands. Also note that spaCy does not support stemming. You can refer to this thread for that.
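If you do need stemming rather than lemmatization, NLTK provides it; here is a minimal sketch using NLTK's PorterStemmer (not part of the original answer):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # "run"
print(stemmer.stem("studies"))  # "studi" - stems are not guaranteed to be valid words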
import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; on spaCy 3.x use spacy.load('en_core_web_sm')
# sample text
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown
printer took a galley of type and scrambled it to make a type specimen book. It has survived not
only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages,
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration
in some form, by injected humour, or randomised words which don't look even slightly believable. If you are
going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the
middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary,
making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined
with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated
Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."""
# convert the text to a spacy document
document = nlp(text) # all spacy documents are tokenized. You can access them using document[i]
document[0:10] # = Lorem Ipsum is simply dummy text of the printing and
# The good thing about spaCy is that a lot of things, like lemmatization, are already
# done when you convert text to a spaCy document using nlp(text).
# You can access sentences using document.sents
list(document.sents)[0]
# lemmatized words can be accessed using document[i].lemma_ and you can check
# if a word is a stopword by checking the `.is_stop` attribute of the word.
# here I am extracting the lemmatized form of each word provided they are not a stop word
lemmas = [token.lemma_ for token in document if not token.is_stop]
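If you want the cleaned text back as a single string, you can join the lemmas (a trivial follow-up, not from the original answer):

clean_text = " ".join(lemmas)
print(clean_text[:80])  # first 80 characters of the lemmatized, stopword-free text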
The best pipeline I have found so far is from Maksym Balatsko's Medium article "Text preprocessing steps and universal reusable pipeline". The best part is that we can use it as part of a scikit-learn transformer pipeline and it supports multiprocessing:
import numpy as np
import pandas as pd  # needed for pd.concat below
import multiprocessing as mp
import string
import spacy
import en_core_web_sm
from nltk.tokenize import word_tokenize
from sklearn.base import TransformerMixin, BaseEstimator
from normalise import normalise

nlp = en_core_web_sm.load()

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self,
                 variety="BrE",
                 user_abbrevs={},
                 n_jobs=1):
        """
        Text preprocessing transformer includes steps:
        1. Text normalization
        2. Punctuation removal
        3. Stop words removal
        4. Lemmatization

        variety - date format (AmE - American format, BrE - British format)
        user_abbrevs - dict of user abbreviation mappings (from the normalise package)
        n_jobs - number of parallel jobs to run
        """
        self.variety = variety
        self.user_abbrevs = user_abbrevs
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        X_copy = X.copy()

        partitions = 1
        cores = mp.cpu_count()
        if self.n_jobs <= -1:
            partitions = cores
        elif self.n_jobs <= 0:
            # n_jobs == 0: process serially, without a pool
            return X_copy.apply(self._preprocess_text)
        else:
            partitions = min(self.n_jobs, cores)

        data_split = np.array_split(X_copy, partitions)
        pool = mp.Pool(cores)
        data = pd.concat(pool.map(self._preprocess_part, data_split))
        pool.close()
        pool.join()
        return data

    def _preprocess_part(self, part):
        return part.apply(self._preprocess_text)

    def _preprocess_text(self, text):
        normalized_text = self._normalize(text)
        doc = nlp(normalized_text)
        removed_punct = self._remove_punct(doc)
        removed_stop_words = self._remove_stop_words(removed_punct)
        return self._lemmatize(removed_stop_words)

    def _normalize(self, text):
        # the normalise package occasionally fails; fall back to the raw text
        try:
            return ' '.join(normalise(text, variety=self.variety, user_abbrevs=self.user_abbrevs, verbose=False))
        except Exception:
            return text

    def _remove_punct(self, doc):
        return [t for t in doc if t.text not in string.punctuation]

    def _remove_stop_words(self, doc):
        return [t for t in doc if not t.is_stop]

    def _lemmatize(self, doc):
        return ' '.join([t.lemma_ for t in doc])
You can use it like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# ... assuming data split X_train, X_test ...

clf = Pipeline(steps=[
    ('normalize', TextPreprocessor(n_jobs=-1)),
    ('features', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ('classifier', LogisticRegressionCV(cv=5, solver='saga', scoring='accuracy', n_jobs=-1, verbose=1))
])

clf.fit(X_train, y_train)
clf.predict(X_test)
X_train is the data that passes through TextPreprocessor; then we extract features and pass them to the classifier.
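For completeness, here is one way the assumed X_train / X_test split could be produced; the DataFrame and its 'text' and 'label' columns are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical labeled data; replace with your own corpus
df = pd.DataFrame({
    'text': ["great product", "terrible service", "would buy again", "never again"],
    'label': [1, 0, 1, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.25, random_state=42)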