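Disclaimer: the following content is for study and reference only; any commercial use is prohibited.
I have wanted to learn web crawling for a long time but never had the chance, and this time I finally got one.
I mainly followed the last chapter of 《疯狂python讲义》.
First, install Scrapy:
pip install scrapy
Then create the crawler project:
scrapy startproject <project name>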
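The generated project looks roughly like this. __pycache__ is Python's bytecode cache and can be ignored.
scrapy.cfg is the configuration file that Scrapy generates; it needs no changes for this project.
settings.py holds the crawler settings: the browser User-Agent the crawler imitates, the delay between page requests (to avoid triggering anti-crawler measures), whether to obey robots.txt, whether to use cookies, and so on.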
# Scrapy settings for exp1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "exp1"

SPIDER_MODULES = ["exp1.spiders"]
NEWSPIDER_MODULE = "exp1.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "exp1 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "exp1.middlewares.Exp1SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "exp1.middlewares.Exp1DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "exp1.pipelines.Exp1Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

The fields the crawler scrapes are defined in items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Exp1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie title
    film_name = scrapy.Field()
    # user name of the commenter
    user_name = scrapy.Field()
    # time the comment was posted
    comment_time = scrapy.Field()
    # text of the comment
    comment = scrapy.Field()
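middlewares.py presumably defines the middlewares. I don't fully understand how it works, and it shouldn't need changes either.
pipelines.py is the pipeline that handles the scraped content: output, saving, and some pre- and post-processing, for example saving the data in JSON format.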
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class Exp1Pipeline(object):
    def __init__(self):
        # Open the output file in binary mode and start the JSON array
        self.json_file = open("data6.json", "wb")
        self.json_file.write("[\n".encode("utf-8"))

    def close_spider(self, spider):
        print("--------closing file--------")
        # Step back over the trailing ",\n" of the last item, then close the array
        self.json_file.seek(-2, 1)
        self.json_file.write("\n]".encode("utf-8"))
        self.json_file.close()

    def process_item(self, item, spider):
        print("Film: ", item["film_name"])
        print("User: ", item["user_name"])
        print("Time: ", item["comment_time"])
        print("Comment: ", item["comment"])
        # ensure_ascii=False keeps the Chinese text readable in the file
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.json_file.write(text.encode("utf-8"))
        # Return the item so that any later pipeline still receives it
        return item

The spiders folder mainly holds the spider itself.
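One note on the pipeline above before moving on: the seek(-2, 1) trick relies on at least one item having been written, because it erases the last two bytes in order to drop the final ",\n". A minimal sketch of an alternative, using only the standard library, writes the separator before each record instead. JsonArrayWriter and the sample records are hypothetical names and data made up for illustration; they are not part of Scrapy or of the project.

```python
import json
import os
import tempfile


class JsonArrayWriter:
    """Write dicts into a JSON array one at a time (hypothetical helper)."""

    def __init__(self, path):
        self._file = open(path, "w", encoding="utf-8")
        self._file.write("[")
        self._first = True

    def write(self, record):
        # Emit the separator before every record except the first,
        # so nothing has to be erased when the file is closed.
        self._file.write("\n" if self._first else ",\n")
        self._file.write(json.dumps(record, ensure_ascii=False))
        self._first = False

    def close(self):
        self._file.write("\n]")
        self._file.close()


# Usage with made-up records, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
writer = JsonArrayWriter(demo_path)
writer.write({"film_name": "Film A", "comment": "很好看"})
writer.write({"film_name": "Film B", "comment": "一般"})
writer.close()

with open(demo_path, encoding="utf-8") as f:
    print(json.load(f))
```

With this layout a crawl that yields no items still leaves a valid, empty JSON array in the file.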
Inside a spider, the next page can be visited directly with scrapy.Request(url, callback, optional data to pass along).
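scrapy.Request needs an absolute URL, while the next-page links found on a page are often relative, sometimes just a query string. A minimal sketch of resolving such a link with the standard library's urljoin follows. The helper name next_page_url, the href values and the subject ID are hypothetical examples, not values taken from the real site.

```python
from urllib.parse import urljoin


def next_page_url(current_url: str, href: str) -> str:
    """Resolve a possibly relative next-page href against the current page URL."""
    return urljoin(current_url, href)


# Hypothetical hrefs, shaped like the query-only links a paginator might contain
rank_next = next_page_url("https://movie.douban.com/top250", "?start=25&filter=")
comments_next = next_page_url(
    "https://movie.douban.com/subject/0000000/comments?sort=time&status=P",
    "?start=20&limit=20&status=P&sort=time",
)
print(rank_next)
print(comments_next)
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response.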
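Next comes the fun part: parsing the site.
I am new to crawling, my skills are limited, and time was tight, so I only scraped the short comments of the Douban Top 250 movies.
First open the Top 250 page and press F12 to inspect the source. On the ranking page we only need two things: the links to all movies on the current page and the link to the next page.
scrapy shell can help us analyze the site.
First, type in a terminal:
scrapy shell -s USER_AGENT='Mozilla/5.0' <URL>
A rather cool scrapy terminal then appears. Next, let's learn a little XPath.
In the panel that F12 opens on the right, click the element-picker button, pick a lucky page element and click it. The right-hand panel jumps to the corresponding source.
Following the downward-pointing triangles from top to bottom shows exactly which level the selected content sits at.
In the end, the movie URL turns out to be located at:
body - div(id=wrapper) - div(id=content) - div - div(class=article) - ol - li - div - div(class=info) - div(class=hd) - the href attribute of the a node
Converted to XPath, this becomes:
response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/ol/li/div/div[@class="info"]/div[@class="hd"]/a/@href')
In detail:
// matches nodes at any position in the document
/ starts from the root node (inside a path it steps down to a child node)
. matches the current node
.. matches the parent node
@ matches an attribute
[] holds a constraint on the current node's attributes; the nodes that need one are usually those that share a tag name with siblings on the same level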
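The symbols above can be tried offline with the standard library. This is only a sketch: xml.etree.ElementTree implements a limited subset of XPath (no absolute paths and no /@href step, so the attribute is read with .get()), whereas Scrapy's selectors support the full expression shown earlier. The SAMPLE markup and its example.com links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mimics the nesting described above
SAMPLE = """
<html><body>
  <div id="wrapper"><div id="content"><div><div class="article">
    <ol>
      <li><div><div class="info"><div class="hd">
        <a href="https://example.com/film/1">Film 1</a>
      </div></div></div></li>
      <li><div><div class="info"><div class="hd">
        <a href="https://example.com/film/2">Film 2</a>
      </div></div></div></li>
    </ol>
  </div></div></div></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)  # root is the <html> element

# "." is the current node, "/" steps to a child, [@attr='value'] filters by attribute
anchors = root.findall(
    "./body/div[@id='wrapper']/div[@id='content']/div/div[@class='article']"
    "/ol/li/div/div[@class='info']/div[@class='hd']/a"
)
hrefs = [a.get("href") for a in anchors]
print(hrefs)

# "//" matches at any depth below the current node
print(len(root.findall(".//a")))

# ".." steps up to the parent of each matched node
parent_classes = [p.get("class") for p in root.findall(".//a/..")]
print(parent_classes)
```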
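Then type the XPath expression into scrapy shell.
It returns a pile of odd-looking output. That is only because the content has not been unpacked yet: call .extract() on the XPath result, and the result turns out to be a list.
From here on we just follow the same pattern and parse the remaining parts of the site one by one.
As for why I did not scrape the long reviews: I don't know how to handle pages rendered by JavaScript, whereas the short comments can be parsed directly.
selenium can handle that, though; how exactly is something I still need to study (a hole to fill later).
selenium does seem to be somewhat slower, however.
Finally, some silly mistakes I made while writing the crawler.
At first I did not really understand how yield works. After a lot of fiddling the crawl order was still badly wrong, and the cause turned out to be global variables: they led to read/write conflicts on the data, and everything fell over. In the end I passed the value along through the Request and used a dict to store the next-page location obtained in each callback, which solved the problem. As for cnt: Douban appears to limit short comments to the first 10 pages and denies access to later ones, so I added a counter. It is not a big deal, but here too, take care not to use a global variable.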
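The global-variable problem described above can be reproduced without Scrapy. The sketch below uses a toy FIFO scheduler, which is an assumption made for illustration (Scrapy's real scheduling is asynchronous and more involved), and hypothetical class names. The point it shows is that a callback runs later than the code that queued it, so shared state may have changed in between, while state attached to the request has not.

```python
from collections import deque


def crawl(start_request):
    """Toy scheduler: run queued (callback, arg, meta) requests in FIFO order."""
    queue = deque([start_request])
    items = []
    while queue:
        callback, arg, meta = queue.popleft()
        for out in callback(arg, meta):
            if isinstance(out, tuple):  # a follow-up "request"
                queue.append(out)
            else:                       # a scraped "item"
                items.append(out)
    return items


class SharedStateSpider:
    current_film = None  # shared by every callback

    def parse(self, films, meta):
        for film in films:
            self.current_film = film  # overwritten on every iteration
            yield (self.parse_film, f"{film}-page", None)

    def parse_film(self, page, meta):
        # Runs only after parse has finished, so it sees the last film written
        yield f"{self.current_film}: {page}"


class PerRequestSpider:
    def parse(self, films, meta):
        for film in films:
            # The film travels with the request instead of living in shared state
            yield (self.parse_film, f"{film}-page", {"film": film})

    def parse_film(self, page, meta):
        yield f"{meta['film']}: {page}"


buggy = crawl((SharedStateSpider().parse, ["A", "B"], None))
fixed = crawl((PerRequestSpider().parse, ["A", "B"], None))
print(buggy)
print(fixed)
```

In the shared-state version both pages are attributed to film B, because parse has already overwritten the variable by the time the callbacks run. Carrying the value on the request, as meta does in the spider below, keeps each page tied to its own film.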
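A friendly warning: this code gets banned after scraping roughly 20,000 short comments. To meet the assignment's requirement of 50,000 comments I ended up getting two IPs banned; my anti-anti-crawler skills are still too weak. Douban does not seem to detect crawlers by how frequently an IP sends requests, but rather by the total number of requests from that IP.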
dbspd.py:
import scrapy
from exp1.items import Exp1Item


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250?start=0&filter="]
    comment_page_sub = "/comments?sort=time&status=P"
    # Both dicts are keyed by film name, so callbacks of different films do not clash
    cnt = {}       # number of follow-up comment pages requested per film
    film_pre = {}  # URL prefix of each film's page

    def parse_film(self, response):
        film_name = (response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/h1/text()').extract())[0].split(" ")[0]
        self.film_pre[film_name] = response.meta["fp"]
        user_name_list = response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/div[@id="comments"]/div/div[@class="comment"]/h3/span[@class="comment-info"]/a/text()').extract()
        comment_time_list = response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/div[@id="comments"]/div/div[@class="comment"]/h3/span[@class="comment-info"]/span[@class="comment-time "]/text()').extract()
        comment_list = response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/div[@id="comments"]/div/div[@class="comment"]/p/span/text()').extract()
        for (user_name, comment_time, comment) in zip(user_name_list, comment_time_list, comment_list):
            item = Exp1Item()
            item["film_name"] = film_name
            item["user_name"] = user_name
            # The raw text is padded with spaces; pieces 20 and 21 hold the date and the time
            item["comment_time"] = comment_time.split(" ")[20] + " " + comment_time.split(" ")[21][0:8]
            item["comment"] = comment
            yield item
        new_links_sub = response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/div[@id="comments"]/div[@id="paginator"]/a[@class="next"]/@href').extract()
        this_film_pre = self.film_pre.get(film_name, self.film_pre)
        this_film_cnt = self.cnt.get(film_name, 0)
        # Stop before Douban's page limit (the comparison operator was lost in the source; "<" is assumed),
        # and skip the request when the page has no next link
        if new_links_sub and this_film_cnt < 8:
            print("next_page url:", this_film_pre + "comments" + new_links_sub[0])
            self.cnt[film_name] = this_film_cnt + 1
            yield scrapy.Request(this_film_pre + "comments" + new_links_sub[0], callback=self.parse_film, meta={"fp": self.film_pre[film_name]})

    def parse(self, response):
        film_list = response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/ol/li/div/div[@class="info"]/div[@class="hd"]/a/@href').extract()
        for film_pre in film_list:
            yield scrapy.Request(film_pre + self.comment_page_sub, callback=self.parse_film, meta={"fp": film_pre})
        new_links_sub = response.xpath('//body/div[@id="wrapper"]/div[@id="content"]/div/div[@class="article"]/div[@class="paginator"]/span[@class="next"]/a/@href').extract()
        # The last ranking page has no next link
        if new_links_sub:
            print("next_rank_page url:", "https://movie.douban.com/top250" + new_links_sub[0])
            yield scrapy.Request("https://movie.douban.com/top250" + new_links_sub[0], callback=self.parse)

Finally, run the command
scrapy crawl douban
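and our crawler is up and running.
One very last thing: a word about word segmentation.
First install the jieba library:
pip install jieba
Then simply call lcut(). The stop-word list lives in stop.txt, one word per line.
For some reason, newlines and other whitespace characters also show up in the segmentation result, so they have to be handled separately.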
import jieba
import json

# Load the stop words, one per line
with open("stop.txt", "r", encoding="utf-8") as f:
    stop = set(f.read().splitlines())
# print(stop)

# Load the scraped comments (a JSON array of items)
with open("film_comment_data.txt", "r", encoding="utf-8") as f:
    d = json.load(f)

cnt = 0
with open("word.txt", "wb") as word_file:
    for item in d:
        words = jieba.lcut(item["comment"])
        print(cnt)  # progress indicator
        cnt += 1
        for word in words:
            # Skip stop words and bare newlines
            if word in stop or word == "\n":
                continue
            word_file.write((word + "\n").encode("utf-8"))
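The script above only filters out "\n", while other whitespace tokens (spaces, "\r\n", tabs) can still slip through, presumably because the segmenter returns every piece of the input text, whitespace included. A minimal sketch of a broader filter follows. It does not call jieba: keep_tokens is a hypothetical helper, and the token list is made up to resemble a segmenter's output.

```python
def keep_tokens(tokens, stop_words):
    """Drop stop words and tokens that are empty or whitespace-only."""
    stop_set = set(stop_words)
    # t.strip() is an empty string (falsy) for whitespace-only tokens
    return [t for t in tokens if t.strip() and t not in stop_set]


# Made-up token list, shaped like what a segmenter might return
tokens = ["这部", "电影", " ", "真的", "\n", "很", "好看", "\r\n", "的"]
kept = keep_tokens(tokens, ["的", "很"])
print(kept)
```

In the script above, the same condition could replace the check against "\n" in the inner loop.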