A Complete Tutorial on Text Preprocessing in NLP

Author: 问答酱 | 2025.10.10 14:59

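Summary: This article walks through the full text preprocessing workflow in NLP, including core steps such as data cleaning, normalization, tokenization, and stemming, with Python code examples to help developers build efficient text processing pipelines.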

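Introduction

Natural language processing (NLP) is a core branch of artificial intelligence, and its performance depends heavily on data quality. As the first stage of an NLP task, text preprocessing directly affects training efficiency and final model quality. This article systematically covers the preprocessing workflow, including data cleaning, normalization, tokenization, and stemming, and provides Python code examples to help developers build efficient, robust text processing pipelines.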

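1. Data Cleaning: Removing Noise and Improving Data Quality

Data cleaning is the first step of text preprocessing. Its goal is to remove information in the raw data that is irrelevant to the task, such as HTML tags, special symbols, and stop words. Such noise can interfere with the model's ability to learn semantic features and reduce accuracy on classification or generation tasks.

1.1 Removing HTML Tags and Special Symbols

Data scraped from web pages often contains HTML tags (such as <div> and <p>) and special symbols (such as @ and #). These can be cleaned with regular expressions or with a dedicated library such as BeautifulSoup.

Code example: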

from bs4 import BeautifulSoup
import re
def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    clean_text = soup.get_text()
    # Keep only ASCII letters, digits, and whitespace (this also drops non-English text)
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', clean_text)
    return clean_text
raw_text = "<p>Hello, world! @NLP</p>"
print(clean_html(raw_text))  # Output: "Hello world NLP"
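
The character class in the example above keeps only ASCII letters and digits, so it would also delete Chinese or accented text. As a rough sketch for multilingual data, using only the standard library, tags can be stripped with a simple pattern and punctuation removed with the Unicode-aware `\w` class. The function name `clean_text_unicode` is made up for illustration, and the tag pattern is deliberately naive.

```python
import html
import re

def clean_text_unicode(text):
    """Strip tags and punctuation while keeping letters and digits of any script."""
    text = re.sub(r"<[^>]+>", " ", text)      # naive tag removal; a real parser is more robust
    text = html.unescape(text)                # turn entities such as &amp; back into characters
    text = re.sub(r"[^\w\s]", " ", text)      # \w is Unicode-aware for str patterns in Python 3
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(clean_text_unicode("<p>Hello, world! @NLP 自然语言处理</p>"))  # Hello world NLP 自然语言处理
```

Note that `\w` also matches the underscore, so underscores survive this cleaning step.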

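1.2 Handling Stop Words

Stop words (such as "的", "是", and "and") carry little semantic content on their own, yet they occur so frequently that they consume computing resources. They can be filtered out with NLTK's stop word list or with a custom list.

Code example: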

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for instead of 'punkt'
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)
text = "This is an example sentence."
print(remove_stopwords(text))  # Output: "example sentence ." (punctuation tokens are kept)
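
For the custom-list route, a minimal sketch might look like the following. The set `CUSTOM_STOPWORDS` and the function `remove_custom_stopwords` are hypothetical names, and the input is assumed to be already tokenized.

```python
# Hypothetical custom stop word list; extend it with terms that are noise for your task.
CUSTOM_STOPWORDS = frozenset({"the", "is", "an", "and", "的", "是"})

def remove_custom_stopwords(tokens, stopwords=CUSTOM_STOPWORDS):
    """Drop tokens whose lowercase form appears in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

print(remove_custom_stopwords(["This", "is", "an", "example"]))  # ['This', 'example']
print(remove_custom_stopwords(["自然语言", "是", "有趣", "的"]))   # ['自然语言', '有趣']
```

Keeping the list as a set makes each membership check constant-time, which matters when the filter runs over a large corpus.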

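2. Text Normalization: Unifying Formats and Reducing Variation

Normalization converts text into a uniform format through steps such as case conversion, lemmatization, and spelling correction, which reduces data sparsity.

2.1 Case Conversion

Converting text to a single case prevents features from being split across case variants. For example, "Apple" and "apple" should be treated as the same word.

Code example: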

def lowercase_text(text):
    return text.lower()
text = "Hello World!"
print(lowercase_text(text))  # Output: "hello world!"
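
For text that is not purely English, `str.lower()` does not always make case variants compare equal. The sketch below uses Python's built-in `str.casefold()` for this; the helper name `normalize_case` is an assumption for illustration.

```python
def normalize_case(text):
    """Aggressive, Unicode-aware lowercasing intended for caseless comparison."""
    return text.casefold()

print("Straße".lower())          # straße
print(normalize_case("Straße"))  # strasse
print(normalize_case("STRASSE") == normalize_case("Straße"))  # True
```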

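2.2 Lemmatization and Stemming

Lemmatization reduces a word to its dictionary form (for example, "running" → "run" when the word is treated as a verb), whereas stemming truncates words with heuristic rules and may produce non-words (for example, "studies" → "studi"). Lemmatization is more accurate but computationally more expensive.

Code example (lemmatization):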

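import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    words = word_tokenize(text)
    # lemmatize() treats every word as a noun unless a pos argument (e.g. pos='v') is given
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmas)
text = "running dogs are barking"
print(lemmatize_text(text))  # Output: "running dog are barking"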

Code example (stemming):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
def stem_text(text):
    words = word_tokenize(text)
    stems = [stemmer.stem(word) for word in words]
    return ' '.join(stems)
text = "running dogs are barking"
print(stem_text(text))  # Output: "run dog are bark"
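
To make the idea of rule-based truncation concrete, here is a deliberately tiny suffix-stripping stemmer. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the function `toy_stem`, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy rules, checked in order: (suffix, replacement). Not the Porter rule set.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first suffix rule that leaves a stem of at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word  # no rule applied

for w in ["studies", "barking", "dogs", "running", "is"]:
    print(w, "->", toy_stem(w))
# studies -> studi
# barking -> bark
# dogs -> dog
# running -> runn
# is -> is
```

The result "runn" shows how blunt such rules are. The Porter stemmer includes additional rules (for example, for doubled consonants) and returns "run" for the same input.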

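3. Tokenization and Vectorization: Building Model Input

Tokenization splits text into words or subword units, and vectorization converts text into a numeric form that a model can process.

3.1 Tokenization Techniques

Tokenization methods include whitespace splitting, regular-expression tokenization, and statistics-based subword methods such as BPE. Chinese text has no spaces between words and needs a dedicated segmenter such as jieba.

Code example (English tokenization):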

from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fun."
tokens = word_tokenize(text)
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'fun', '.']
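
For comparison, whitespace and regular-expression tokenization can be sketched with the standard library alone. The regular expression below is one simple choice, not a standard.

```python
import re

text = "Natural Language Processing is fun."

# Whitespace tokenization: punctuation stays attached to the neighboring word.
whitespace_tokens = text.split()

# Regex tokenization: runs of word characters, or single non-space punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Natural', 'Language', 'Processing', 'is', 'fun.']
print(regex_tokens)       # ['Natural', 'Language', 'Processing', 'is', 'fun', '.']
```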

Code example (Chinese word segmentation):

import jieba
text = "自然语言处理很有趣"
seg_list = jieba.cut(text)
print(" ".join(seg_list))  # Prints the segments separated by spaces; the exact split depends on jieba's dictionary
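
Section 3.1 also mentions BPE (byte-pair encoding). Its core training loop repeatedly merges the most frequent adjacent symbol pair. The following is a toy sketch of that loop under simplifying assumptions: there is no end-of-word marker, ties go to the pair seen first, and the function name `learn_bpe` and the sample corpus are made up for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, no end-of-word marker)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus_words = ["low", "low", "lower", "lowest", "newer", "newest"]
merges, vocab = learn_bpe(corpus_words, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while rarer words are still split into smaller pieces.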

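3.2 Vectorization Methods

Vectorization methods include the bag-of-words model, TF-IDF, and embeddings (such as Word2Vec and BERT). TF-IDF weights each term by its frequency in a document multiplied by its inverse document frequency, which highlights terms that are distinctive for a document.

Code example (TF-IDF):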

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # Prints the list of feature terms
print(X.toarray())  # Prints the TF-IDF matrix
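
To show what the weighting does, here is a from-scratch sketch of the textbook formula, tf × ln(N / df). It is not a reimplementation of scikit-learn: by default `TfidfVectorizer` uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ. The function name `tfidf` and the pre-tokenized input are assumptions.

```python
import math
from collections import Counter

def tfidf(corpus_tokens):
    """Textbook TF-IDF: tf = count / document length, idf = ln(N / df)."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    weights = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "this is the first document".split(),
    "this document is the second document".split(),
    "and this is the third one".split(),
]
weights = tfidf(docs)
for w in weights:
    print({term: round(score, 3) for term, score in w.items()})
```

Terms that appear in every document, such as "this", receive a weight of 0 under this formula, while a term unique to one document, such as "first", receives the largest idf.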

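4. Advanced Preprocessing Techniques

4.1 Handling Spelling Errors

Spelling errors introduce noise; the textblob library can correct them automatically.

Code example: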

from textblob import TextBlob
text = "I havv a good speling!"
blob = TextBlob(text)
corrected_text = str(blob.correct())
print(corrected_text)  # Expected output: "I have a good spelling!"
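
When an extra dependency is not an option, a rough approximation is to match each word against a known vocabulary with the standard library's `difflib`. This is a sketch only and is not how textblob works internally. The vocabulary `VOCABULARY`, the helper names, and the 0.6 similarity cutoff are assumptions.

```python
import difflib
import re

# Hypothetical vocabulary; in practice this would come from a dictionary or corpus.
VOCABULARY = ["i", "have", "a", "good", "spelling"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    if word.lower() in vocabulary:
        return word  # already a known word; keep its original casing
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    # Replace each alphabetic run; punctuation and spacing are left untouched.
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group(0)), text)

print(correct_text("I havv a good speling!"))  # I have a good spelling!
```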

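4.2 Handling Abbreviations and Slang

Abbreviations (such as "u" → "you") and slang need to be handled with a mapping table or context-aware parsing.

Code example (custom mapping table):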

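import re
abbreviation_map = {
    "u": "you",
    "r": "are",
    "ur": "your",
    "lol": "laugh out loud"
}
def expand_abbreviations(text):
    # Match whole words only, so the "u" inside a word such as "but" is left untouched
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, abbreviation_map)) + r')\b')
    return pattern.sub(lambda m: abbreviation_map[m.group(0)], text)
text = "u r lol!"
print(expand_abbreviations(text))  # Output: "you are laugh out loud!"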

5. Building a Preprocessing Pipeline

Combining the steps above into a preprocessing pipeline improves code reuse.

Code example:

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('wordnet')
class TextPreprocessor:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
    def clean(self, text):
        # Keep only ASCII letters, digits, and whitespace
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        return text
    def remove_stopwords(self, text):
        # Lowercase, drop stop words, and lemmatize the remaining tokens (as nouns)
        words = word_tokenize(text)
        filtered_words = [self.lemmatizer.lemmatize(word.lower()) for word in words if word.lower() not in self.stop_words]
        return ' '.join(filtered_words)
    def preprocess(self, text):
        cleaned = self.clean(text)
        processed = self.remove_stopwords(cleaned)
        return processed
preprocessor = TextPreprocessor()
text = "Hello, world! This is an example sentence with stopwords."
# Prints the lowercased, lemmatized text without stop words; tokens that WordNet
# does not know (likely "stopwords" here) are left unchanged
print(preprocessor.preprocess(text))
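
An alternative to a class is to treat every step as a plain text-to-text function and compose them, which makes it easy to reorder or swap steps per task. The sketch below uses only the standard library; `make_pipeline` and the three step functions are hypothetical names.

```python
import re

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def lowercase(text):
    return text.lower()

def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def make_pipeline(*steps):
    """Compose text -> text functions into one callable, applied left to right."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

preprocess = make_pipeline(strip_punctuation, lowercase, collapse_whitespace)
print(preprocess("Hello,   world! This is an example."))  # hello world this is an example
```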

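Conclusion

Text preprocessing is an indispensable part of NLP work, and its quality directly affects model performance. Systematic data cleaning, normalization, tokenization, and vectorization can substantially improve data quality. Developers should choose preprocessing techniques according to the needs of the task and build reusable pipelines to cope with different scenarios. Looking ahead, as pretrained models continue to develop, preprocessing will become even more prominent as the key bridge between raw data and models.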
