Sentiment Analysis of the Harry Potter Novels with Python

Preparing the data

The data consists of one txt file per novel. We want to analyze each novel chapter by chapter (the first item of a list holds the content of chapter 1, the second item holds chapter 2, and so on), so we need regular expressions to tidy the data.

Let's first look at how chapters appear in 01-Harry Potter and the Sorcerers Stone.txt. Opening the file and searching through it shows that every chapter heading follows the same form:

    [Chapter][space][integer][newline \n][English title that may contain spaces][newline \n]

To get familiar with regular expressions, we turn this template into a pattern and extract the chapter headings:

    import re
    import nltk

    raw_text = open('data/01-Harry Potter and the Sorcerers Stone.txt').read()
    # "Chapter", a space, one or more digits, a newline, then a title line
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    re.findall(pattern, raw_text)

    ['Chapter 1\nThe Boy Who Lived\n',
     'Chapter 2\nThe Vanishing Glass\n',
     'Chapter 3\nThe Letters From No One\n',
     'Chapter 4\nThe Keeper Of The Keys\n',
     'Chapter 5\nDiagon Alley\n',
     'Chapter 7\nThe Sorting Hat\n',
     'Chapter 8\nThe Potions Master\n',
     'Chapter 9\nThe Midnight Duel\n',
     'Chapter 10\nHalloween\n',
     'Chapter 11\nQuidditch\n',
     'Chapter 12\nThe Mirror Of Erised\n',
     'Chapter 13\nNicholas Flamel\n',
     'Chapter 14\nNorbert the Norwegian Ridgeback\n',
     'Chapter 15\nThe Forbidden Forest\n',
     'Chapter 16\nThrough the Trapdoor\n',
     'Chapter 17\nThe Man With Two Faces\n']

Note that Chapter 6 is missing from this output. Its title ("The Journey from Platform Nine and Three-Quarters") contains a hyphen, which the character class [a-zA-Z ] does not match. Any title with punctuation is skipped in the same way, so chapter counts obtained with this pattern can be slightly too low.

Now that we are familiar with the pattern, we want to be more precise. I prepared a short test text whose chapter headings look like those in the real novel, only shorter and easier to follow. The test data has 5 chapters, so we expect the resulting list to have length 5, with chapter 1 as the first item, chapter 2 as the second, and so on.

    import re

    test = """Chapter 1
    The Boy Who Lived
    Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.
    Mr. Dursley was the director of a firm called Grunnings,
    Chapter 2
    The Vanishing Glass
    For a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.
    Chapter 3
    The Letters From No One
    The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.
    Mr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.
    Chapter 4
    The Keeper Of The Keys
    He didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.
    Chapter 5
    Diagon Alley
    It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak."""

    # The string above is shown indented only for layout; its lines start at column 0.
    # Build the list of chapter contents (item 1 is chapter 1, item 2 is chapter 2, ...).
    # re.split returns an empty string before the first heading, so filter out
    # empty items to keep the list length equal to the number of chapters.
    chapter_contents = [c for c in re.split(r'Chapter \d+\n[a-zA-Z ]+\n', test) if c]
    chapter_contents

    ['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,\n',
     'For a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.\n',
     'The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.\n',
     'He didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.\n',
     'It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak.']

Once we can get a list of chapter contents for a Harry Potter novel, we can start the real text analysis.

Data analysis

Comparing the number of chapters

    import os
    import re
    import matplotlib.pyplot as plt

    colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']
    harry_potters = ['Harry Potter and the Sorcerers Stone.txt',
                     'Harry Potter and the Chamber of Secrets.txt',
                     'Harry Potter and the Prisoner of Azkaban.txt',
                     'Harry Potter and the Goblet of Fire.txt',
                     'Harry Potter and the Order of the Phoenix.txt',
                     'Harry Potter and the Half-Blood Prince.txt',
                     'Harry Potter and the Deathly Hallows.txt']

    # x-axis: book titles without the common prefix and the ".txt" suffix
    harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                          for n in harry_potters]

    # y-axis: number of chapters per book
    chapter_nums = []
    for harry_potter in harry_potters:
        file = 'data/' + harry_potter
        raw_text = open(file).read()
        pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
        chapter_contents = [c for c in re.split(pattern, raw_text) if c]
        chapter_nums.append(len(chapter_contents))

    # figure size
    plt.figure(figsize=(20, 10))
    # title, font size, bold
    plt.title('Chapter Number of Harry Potter', fontsize=25, weight='bold')
    # colored bar chart
    plt.bar(harry_potter_names, chapter_nums, color=colors)
    # font size and rotation of the tick labels
    plt.xticks(rotation=25, fontsize=16, weight='bold')
    plt.yticks(fontsize=16, weight='bold')
    # axis labels
    plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
    plt.ylabel('Chapter Number', rotation=25, fontsize=20, weight='bold')
    plt.show()

The chart shows that the last four books have more chapters than the first three. (This analysis is not very useful in itself; it is mainly practice.)

Vocabulary richness

The measure used here is the total number of words divided by the number of distinct words. If a 100-word sentence never repeats a word, the value is 100/100 = 1. If a sentence of the same length uses only 20 distinct words, the value is 100/20 = 5. With this definition a lower value means a more varied vocabulary, and a higher value means more repetition.

    import os
    import re
    import matplotlib.pyplot as plt
    from nltk import word_tokenize
    from nltk.stem.snowball import SnowballStemmer

    plt.style.use('fivethirtyeight')

    colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']
    harry_potters = ['Harry Potter and the Sorcerers Stone.txt',
                     'Harry Potter and the Chamber of Secrets.txt',
                     'Harry Potter and the Prisoner of Azkaban.txt',
                     'Harry Potter and the Goblet of Fire.txt',
                     'Harry Potter and the Order of the Phoenix.txt',
                     'Harry Potter and the Half-Blood Prince.txt',
                     'Harry Potter and the Deathly Hallows.txt']

    # x-axis: book titles
    harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                          for n in harry_potters]

    # y-axis: total words / distinct stemmed words, per book
    richness_of_words = []
    stemmer = SnowballStemmer('english')
    for harry_potter in harry_potters:
        file = 'data/' + harry_potter
        raw_text = open(file).read()
        words = word_tokenize(raw_text)
        # lowercase and stem so that inflected forms count as one word
        words = [stemmer.stem(w.lower()) for w in words]
        wordset = set(words)
        richness = len(words) / len(wordset)
        richness_of_words.append(richness)

    # figure size
    plt.figure(figsize=(20, 10))
    # title, font size, bold
    plt.title('The Richness of Word in Harry Potter', fontsize=25, weight='bold')
    # colored bar chart
    plt.bar(harry_potter_names, richness_of_words, color=colors)
    # font size and rotation of the tick labels
    plt.xticks(rotation=25, fontsize=16, weight='bold')
    plt.yticks(fontsize=16, weight='bold')
    # axis labels
    plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
    plt.ylabel('Richness of Words', rotation=25, fontsize=20, weight='bold')
    plt.show()

Sentiment analysis

Sentiment trend across the Harry Potter series

Here we use VADER, which is available as the ready-made library vaderSentiment. Its polarity_scores function returns four values:

    neg: negative score
    neu: neutral score
    pos: positive score
    compound: overall sentiment score

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    test = 'i am so sorry'
    analyzer.polarity_scores(test)

    {'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.1513}

We now score every sentence of every chapter and plot the average compound score per chapter across all seven books:

    import os
    import re
    import matplotlib.pyplot as plt
    from nltk.tokenize import sent_tokenize
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    harry_potters = ['Harry Potter and the Sorcerers Stone.txt',
                     'Harry Potter and the Chamber of Secrets.txt',
                     'Harry Potter and the Prisoner of Azkaban.txt',
                     'Harry Potter and the Goblet of Fire.txt',
                     'Harry Potter and the Order of the Phoenix.txt',
                     'Harry Potter and the Half-Blood Prince.txt',
                     'Harry Potter and the Deathly Hallows.txt']

    # x-axis: running chapter index across the whole series
    chapter_indexes = []
    # y-axis: average sentiment score of each chapter
    compounds = []
    analyzer = SentimentIntensityAnalyzer()
    chapter_index = 1
    for harry_potter in harry_potters:
        file = 'data/' + harry_potter
        raw_text = open(file).read()
        pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
        chapters = [c for c in re.split(pattern, raw_text) if c]
        # compute the sentiment score of each chapter
        for chapter in chapters:
            compound = 0
            sentences = sent_tokenize(chapter)
            for sentence in sentences:
                score = analyzer.polarity_scores(sentence)
                compound += score['compound']
            compounds.append(compound / len(sentences))
            chapter_indexes.append(chapter_index)
            chapter_index += 1

    # figure size
    plt.figure(figsize=(20, 10))
    # title, font size, bold
    plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
    # line chart
    plt.plot(chapter_indexes, compounds, color='#A040A0')
    # font size and rotation of the tick labels
    plt.xticks(rotation=25, fontsize=16, weight='bold')
    plt.yticks(fontsize=16, weight='bold')
    # axis labels
    plt.xlabel('Chapter', fontsize=20, weight='bold')
    plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
    plt.show()

The curve is not smooth enough. To iron out the fluctuations, we define a moving-average function and plot the smoothed series on top of the raw one:

    import numpy as np
    import os
    import re
    import matplotlib.pyplot as plt
    from nltk.tokenize import sent_tokenize
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # curve-smoothing function: moving average over window_size points
    def movingaverage(value_series, window_size):
        window = np.ones(int(window_size)) / float(window_size)
        return np.convolve(value_series, window, 'same')

    harry_potters = ['Harry Potter and the Sorcerers Stone.txt',
                     'Harry Potter and the Chamber of Secrets.txt',
                     'Harry Potter and the Prisoner of Azkaban.txt',
                     'Harry Potter and the Goblet of Fire.txt',
                     'Harry Potter and the Order of the Phoenix.txt',
                     'Harry Potter and the Half-Blood Prince.txt',
                     'Harry Potter and the Deathly Hallows.txt']

    # x-axis: running chapter index across the whole series
    chapter_indexes = []
    # y-axis: average sentiment score of each chapter
    compounds = []
    analyzer = SentimentIntensityAnalyzer()
    chapter_index = 1
    for harry_potter in harry_potters:
        file = 'data/' + harry_potter
        raw_text = open(file).read()
        pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
        chapters = [c for c in re.split(pattern, raw_text) if c]
        # compute the sentiment score of each chapter
        for chapter in chapters:
            compound = 0
            sentences = sent_tokenize(chapter)
            for sentence in sentences:
                score = analyzer.polarity_scores(sentence)
                compound += score['compound']
            compounds.append(compound / len(sentences))
            chapter_indexes.append(chapter_index)
            chapter_index += 1

    # figure size
    plt.figure(figsize=(20, 10))
    # title, font size, bold
    plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
    # raw line in red, smoothed line dotted in black
    plt.plot(chapter_indexes, compounds, color='red')
    plt.plot(movingaverage(compounds, 10), color='black', linestyle=':')
    # font size and rotation of the tick labels
    plt.xticks(rotation=25, fontsize=16, weight='bold')
    plt.yticks(fontsize=16, weight='bold')
    # axis labels
    plt.xlabel('Chapter', fontsize=20, weight='bold')
    plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
    plt.show()
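The effect of the moving-average smoothing is easier to see on a tiny series. The following is a minimal sketch rather than the article's own code: the name `moving_average` and the sample values are made up for illustration. It uses `np.convolve` with mode `'same'`, which implicitly pads the series with zeros at both ends.

```python
import numpy as np


def moving_average(values, window_size):
    """Smooth `values` with a simple moving average of `window_size` points."""
    # A window of equal weights that sum to 1.
    window = np.ones(int(window_size)) / float(window_size)
    # mode='same' returns an output as long as the longer of the two inputs,
    # treating positions outside the series as zeros.
    return np.convolve(values, window, mode='same')


# Hypothetical per-chapter scores, chosen only to make the arithmetic easy.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = moving_average(scores, 3)
print(smoothed)
```

With a window of 3, the interior values are true three-point averages (2, 3, 4), while the first and last values come out as 1 and 3 because each is averaged with a padded zero. On the chapter plot with a window of 10, this means roughly the first and last five points of the dotted line are pulled toward zero and should not be read as a real change in sentiment.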