Contents

1 Introduction
2 Installation
3 Command-line tools
4 Project structure and spider overview
5 Spiders
6 Selectors
7 Items
8 Item Pipeline
9 Downloader Middleware
10 Spider Middleware
11 settings.py
12 Scraping Amazon product information

1 Introduction

Scrapy is an open-source, collaborative framework that was originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used far more widely: for data mining, monitoring, and automated testing, for fetching data returned by APIs (for example Amazon Associates Web Services), and as a general-purpose web crawler.

Scrapy is built on Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.

The data flow in Scrapy is controlled by the execution engine, and goes like this:

1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.

Components

Engine (ENGINE): controls the data flow between all components of the system and triggers events when certain actions occur. See the data flow section above for details.

Scheduler (SCHEDULER): accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks for them again.
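It can be thought of as a priority queue of URLs: it decides which URL to crawl next and also removes duplicate URLs.

Downloader (DOWNLOADER): downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.

Spiders (SPIDERS): classes written by the developer to parse responses and either extract items or send new requests.

Item Pipelines (ITEM PIPELINES): process items after they have been extracted. Typical work includes cleaning, validation, and persistence (for example storing them in a database).

Downloader Middlewares: sit between the Scrapy engine and the downloader. They process requests travelling from the ENGINE to the DOWNLOADER and responses travelling from the DOWNLOADER back to the ENGINE. You can use this middleware to:

1. process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
2. change a received response before passing it to a spider;
3. send a new Request instead of passing a received response to a spider;
4. pass a response to a spider without fetching a web page;
5. silently drop some requests.

Spider Middlewares: sit between the ENGINE and the SPIDERS. Their main job is to process spider input (responses) and output (requests).

Official docs: https://docs.scrapy.org/en/latest/topics/architecture.html

2 Installation

#Windows
1. pip3 install wheel  #enables installing packages from wheel files; wheel files: https://www.lfd.uci.edu/~gohlke/pythonlibs
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6. Run pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy

#Linux
1. pip3 install scrapy

3 Command-line tools

#1 Getting help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands. Project-only commands must be run from inside a project folder; Global commands can be run anywhere.

Global commands:
    startproject  #create a project
    genspider     #create a spider
    settings      #inside a project directory, shows that project's settings
    runspider     #run a standalone Python file without creating a project
    shell         #scrapy shell <url>; interactive debugging, e.g. checking whether a selector rule is correct
    fetch         #fetch a single page independently of any spider; --headers prints the response headers
    view          #download a page and open it in the browser; helps identify which data is loaded via ajax
    version       #scrapy version shows the Scrapy version; scrapy version -v also shows the versions of its dependencies
    bench         #scrapy bench; a quick benchmark / stress test

Project-only commands:
    crawl         #run a spider; requires a project; make sure ROBOTSTXT_OBEY = False in the settings file
    check         #run contract checks on the project's spiders (this also surfaces syntax errors)
    list          #list the spiders contained in the project
    edit          #open a spider in an editor; rarely used
    parse         #scrapy parse <url> --callback <callback>; verifies that a callback behaves correctly

#3 Official docs: https://docs.scrapy.org/en/latest/topics/commands.html

#1. Running global commands: make sure you are not inside a project directory, so that the project's settings do not interfere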
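    scrapy startproject MyProject
    cd MyProject
    scrapy genspider baidu www.baidu.com
    scrapy settings --get XXX  #inside a project directory this shows that project's value
    scrapy runspider baidu.py
    scrapy shell https://www.baidu.com
        response
        response.status
        response.body
        view(response)
    scrapy view https://www.taobao.com  #if the page looks incomplete, the missing parts are loaded via ajax; a quick way to locate the problem
    scrapy fetch --nolog --headers https://www.taobao.com
    scrapy version     #Scrapy version
    scrapy version -v  #versions of the dependencies

#2. Running project commands: switch to the project directory first
    scrapy crawl baidu
    scrapy check
    scrapy list
    scrapy parse http://quotes.toscrape.com/ --callback parse
    scrapy bench

4 Project structure and spider overview

    project_name/
        scrapy.cfg
        project_name/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                spider1.py
                spider2.py
                spider3.py

File descriptions

scrapy.cfg: the project's main configuration, used when deploying Scrapy. Spider-related settings live in settings.py.
items.py: templates for the scraped data, used to structure it (similar to Django's Model).
pipelines.py: data processing, typically persisting the structured data.
settings.py: configuration such as recursion depth, concurrency, and download delay. Note: setting names must be uppercase, otherwise they are ignored. Correct form: USER_AGENT = 'xxxx'
spiders/: the spider directory, where you create files and write crawl rules.

Note: spider files are usually named after the site's domain.

#Create entrypoint.py in the project directory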
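    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'xiaohua'])

By default a spider can only be started from the command line. To run it from PyCharm, run the entry-point script above.

On Windows, the console encoding may also need to be set:

    import sys, os, io
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

5 Spiders

1. Introduction

#1. Spiders are classes that define how a site (or a group of sites) will be crawled, including how to perform the crawl and how to extract structured data from the pages.
#2. In other words, a Spider is where you define the custom crawling and parsing behaviour for a particular site (or group of sites).

2. Spiders repeat the following cycle

#1. Generate the initial Requests to crawl the first URLs, and specify a callback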
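The first requests are defined in start_requests(), which by default takes the URLs in the start_urls list and builds a Request for each, with parse as the default callback. The callback is triggered automatically when the download finishes and a response is returned.
#2. In the callback, parse the response and return a value
There are four kinds of return value: a dict of parsed data, an Item object, a new Request object (which also needs a callback), or an iterable containing Items and/or Requests.
#3. In the callback, parse the page content
This is usually done with Scrapy's built-in Selectors, but you can just as well use BeautifulSoup, lxml, or whatever else you prefer.
#4. Finally, the returned Items are persisted
Either stored in a database through the Item Pipeline component: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
or exported to files through Feed exports: https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports

3. Spiders provides five classes

#1. scrapy.spiders.Spider  #scrapy.Spider is the same as scrapy.spiders.Spider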
#2. scrapy.spiders.CrawlSpider
#3. scrapy.spiders.XMLFeedSpider
#4. scrapy.spiders.CSVFeedSpider
#5. scrapy.spiders.SitemapSpider

4. Importing and using them

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.spiders import Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider

    class AmazonSpider(scrapy.Spider):  # custom class inheriting from a base class provided by Spiders
        name = 'amazon'
        allowed_domains = ['www.amazon.cn']
        start_urls = ['http://www.amazon.cn/']

        def parse(self, response):
            pass

5. class scrapy.spiders.Spider

This is the simplest spider class. Every other spider class inherits from it, including the ones you write yourself.

It provides no special functionality. It only supplies a default start_requests method that reads the URLs in start_urls and sends a request for each, with parse as the default callback.

    class AmazonSpider(scrapy.Spider):
        name = 'amazon'
        allowed_domains = ['www.amazon.cn']
        start_urls = ['http://www.amazon.cn/']

        custom_settings = {
            'BOT_NAME': 'Egon_Spider_Amazon',
            # DEFAULT_REQUEST_HEADERS is the Scrapy setting for default headers
            'DEFAULT_REQUEST_HEADERS': {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en',
            },
        }

        def parse(self, response):
            pass

#1. name = 'amazon'
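The spider's name. Scrapy uses this value to locate the spider, so it is required and must be unique. In Python 2 this must be ASCII only.
#2. allowed_domains = ['www.amazon.cn']
The domains the spider is allowed to crawl. If OffsiteMiddleware is enabled (it is by default), URLs that belong neither to a domain in this list nor to one of its subdomains will not be crawled.
If the target URL is https://www.example.com/1.html, add 'example.com' to the list.
#3. start_urls = ['http://www.amazon.cn/']
If no URLs are specified explicitly, the first requests are generated from this list.
#4. custom_settings
A dict of settings that override the project-level configuration while this spider runs.
It must be defined as a class attribute, because the settings are loaded before the class is instantiated.
#5. settings
self.settings['SETTING_NAME'] gives access to the values in settings.py. If you defined custom_settings, those values take precedence.
#6. logger
A logger whose name defaults to the spider's name.
    self.logger.debug('%s' % self.settings['BOT_NAME'])
#7. crawler (for reference)
This attribute is set by the from_crawler class method.
#8. from_crawler(crawler, *args, **kwargs) (for reference)
You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.
#9. start_requests()
This method issues the first Requests and must return an iterable. Scrapy calls it once, when the spider is opened.
By default it generates Request(url, dont_filter=True) for each URL in start_urls (for the dont_filter parameter, see the custom de-duplication rules below).
To change the initial Requests, override this method. For example, to start with a POST request:
    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]

        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass
#10. parse(response)
This is the default callback. Every callback must return an iterable of Request and/or dicts or Item objects.
#11. log(message[, level, component]) (for reference)
Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.
#12. closed(reason)
Triggered automatically when the spider finishes.

De-duplication rules should be shared across spiders: once one spider has crawled a URL, the others should not crawl it again. There are several ways to implement this.

#Method 1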
1. Add a class attribute
    visited = set()  # class attribute

2. In the parse callback
    def parse(self, response):
        if response.url in self.visited:
            return None
        .......
        self.visited.add(response.url)

#Method 1, improved: URLs can be long, so store a hash of the URL instead
    import hashlib

    def parse(self, response):
        # md5 needs bytes, so encode the URL first
        url = hashlib.md5(response.request.url.encode('utf-8')).hexdigest()
        if url in self.visited:
            return None
        .......
        self.visited.add(url)

#Method 2: Scrapy's built-in de-duplication
In the settings file:
    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the default filter; de-duplication state is kept in memory
    DUPEFILTER_DEBUG = False
    JOBDIR = "directory for the record of visited requests, e.g. /root/"  # final path is /root/requests.seen; the state is then kept in a file

Scrapy's default filter is RFPDupeFilter, so all we have to do is pass
    Request(..., dont_filter=False)
If dont_filter=True, Scrapy excludes this URL from de-duplication.

#Method 3
We can also write our own rule modelled on RFPDupeFilter.

    from scrapy.dupefilters import RFPDupeFilter  # read the source and model yours on BaseDupeFilter

#Step 1: create dup.py in the project directory
    class UrlFilter(object):
        def __init__(self):
            self.visited = set()  # or keep this in a database

        @classmethod
        def from_settings(cls, settings):
            return cls()

        def request_seen(self, request):
            if request.url in self.visited:
                return True
            self.visited.add(request.url)
            return False

        def open(self):  # can return deferred
            pass

        def close(self, reason):  # can return a deferred
            pass

        def log(self, request, spider):  # log that a request has been filtered
            pass

#Step 2: settings.py
    DUPEFILTER_CLASS = '<project name>.dup.UrlFilter'

# Reading the source
    from scrapy.core.scheduler import Scheduler
See the enqueue_request method of Scheduler, which calls self.df.request_seen(request).

#Example 1
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            self.logger.info('A response from %s just arrived!', response.url)

#Example 2: one callback returning several Requests and Items
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield {"title": h3}

            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

#Example 3: specify the start URLs directly in start_requests(); start_urls is then unused
    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']

        def start_requests(self):
            yield scrapy.Request('http://www.example.com/1.html', self.parse)
            yield scrapy.Request('http://www.example.com/2.html', self.parse)
            yield scrapy.Request('http://www.example.com/3.html', self.parse)

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield MyItem(title=h3)

            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

Passing arguments

We may need to pass arguments to a spider from the command line, for example the initial URL:

#On the command line
    scrapy crawl myspider -a category=electronics

#The __init__ method can receive arguments passed in from outside
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, category=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = ['http://www.example.com/categories/%s' % category]
            #...

#Note: every argument arrives as a string. If you need structured data, convert it with something like json.loads

6. Other generic spiders: https://docs.scrapy.org/en/latest/topics/spiders.html#generic-spiders

6 Selectors

#1 // versus /
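#2 text
#3 extract and extract_first: getting the content out of selector objects
#4 Attributes: in XPath, attributes are prefixed with @
#5 Nested lookups
#6 Default values
#7 Lookup by attribute
#8 Fuzzy lookup by attribute
#9 Regular expressions
#10 Relative XPath
#11 XPath with variables

    response.selector.css()
    response.selector.xpath()
can be shortened to
    response.css()
    response.xpath()

#1 // versus /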
    response.xpath('//body/a')
    response.css('div a::text')

    >>> response.xpath('//body/a')  # the leading // searches the whole document; the / after body means direct children of body
    []
    >>> response.xpath('//body//a')  # the leading // searches the whole document; the // after body means all descendants of body
    [<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
    >>> response.xpath('//body//a/text()')
    >>> response.css('body a::text')

#3 extract and extract_first: getting the content out of selector objects
    >>> response.xpath('//div/a/text()').extract()
    ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
    >>> response.css('div a::text').extract()
    ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

    >>> response.xpath('//div/a/text()').extract_first()
    'Name: My image 1 '
    >>> response.css('div a::text').extract_first()
    'Name: My image 1 '

#4 Attributes: in XPath, attributes are prefixed with @
    >>> response.xpath('//div/a/@href').extract_first()
    'image1.html'
    >>> response.css('div a::attr(href)').extract_first()
    'image1.html'

#5 Nested lookups
    >>> response.xpath('//div').css('a').xpath('@href').extract_first()
    'image1.html'

#6 Default values
    >>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
    'not found'

#7 Lookup by attribute
    response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
    response.css('#images a[href="image3.html"]::text').extract()

#8 Fuzzy lookup by attribute
    response.xpath('//a[contains(@href,"image")]/@href').extract()
    response.css('a[href*="image"]::attr(href)').extract()

    response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    response.css('a[href*="imag"] img::attr(src)').extract()

    response.xpath('//*[@href="image1.html"]')
    response.css('*[href="image1.html"]')

#9 Regular expressions
    response.xpath('//a/text()').re(r'Name: (.*)')
    response.xpath('//a/text()').re_first(r'Name: (.*)')

#10 Relative XPath
    >>> res = response.xpath('//a[contains(@href,"3")]')[0]
    >>> res.xpath('img')
    [<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
    >>> res.xpath('./img')
    [<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
    >>> res.xpath('.//img')
    [<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
    >>> res.xpath('//img')  # this scans from the document root again
    [<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#11 XPath with variables
    >>> response.xpath('//div[@id=$xxx]/a/text()', xxx='images').extract_first()
    'Name: My image 1 '
    >>> response.xpath('//div[count(a)=$yyy]/@id', yyy=5).extract_first()  # id of the div that contains 5 a tags
    'images'

https://docs.scrapy.org/en/latest/topics/selectors.html

7 Items

https://docs.scrapy.org/en/latest/topics/items.html

8 Item Pipeline

#I. You can write several Pipeline classes
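#1. Whatever the process_item of a higher-priority Pipeline returns (a value or None) is passed automatically to the process_item of the next pipeline.
#2. If you want only the first Pipeline to run, have its process_item raise an exception: raise DropItem()
#3. You can test spider.name == '<spider name>' to control which spiders use which pipelines.

II. Example

    from scrapy.exceptions import DropItem

    class CustomPipeline(object):
        def __init__(self, v):
            self.value = v

        @classmethod
        def from_crawler(cls, crawler):
            """
            Scrapy first uses getattr to check whether we defined from_crawler;
            if so, it calls it to create the instance.
            """
            val = crawler.settings.getint('MMMM')
            return cls(val)

        def open_spider(self, spider):
            """Runs once, when the spider starts."""
            print('000000')

        def close_spider(self, spider):
            """Runs once, when the spider closes."""
            print('111111')

        def process_item(self, item, spider):
            # process and persist the item

            # returning the item lets later pipelines keep processing it
            return item

            # raising DropItem discards the item; later pipelines never see it
            # raise DropItem()

Custom pipeline

#1. settings.py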
    HOST = "127.0.0.1"
    PORT = 27017
    USER = "root"
    PWD = "123"
    DB = "amazon"
    TABLE = "goods"

    ITEM_PIPELINES = {
        'Amazon.pipelines.CustomPipeline': 200,
    }

#2. pipelines.py
    from pymongo import MongoClient

    class CustomPipeline(object):
        def __init__(self, host, port, user, pwd, db, table):
            self.host = host
            self.port = port
            self.user = user
            self.pwd = pwd
            self.db = db
            self.table = table

        @classmethod
        def from_crawler(cls, crawler):
            """
            Scrapy first uses getattr to check whether we defined from_crawler;
            if so, it calls it to create the instance.
            """
            HOST = crawler.settings.get('HOST')
            PORT = crawler.settings.get('PORT')
            USER = crawler.settings.get('USER')
            PWD = crawler.settings.get('PWD')
            DB = crawler.settings.get('DB')
            TABLE = crawler.settings.get('TABLE')
            return cls(HOST, PORT, USER, PWD, DB, TABLE)

        def open_spider(self, spider):
            """Runs once, when the spider starts."""
            self.client = MongoClient('mongodb://%s:%s@%s:%s' % (self.user, self.pwd, self.host, self.port))

        def close_spider(self, spider):
            """Runs once, when the spider closes."""
            self.client.close()

        def process_item(self, item, spider):
            # process and persist the item
            # (Collection.save was removed in PyMongo 4; insert_one is used instead)
            self.client[self.db][self.table].insert_one(dict(item))
            return item

https://docs.scrapy.org/en/latest/topics/item-pipeline.html

9 Downloader Middleware

    class DownMiddleware1(object):
        def process_request(self, request, spider):
            """
            Called for every request that needs downloading, by the process_request of each downloader middleware.
            :param request:
            :param spider:
            :return:
                None: continue with the remaining middlewares and download
                Response object: stop running process_request and start running process_response
                Request object: stop the middleware chain and send the Request back to the scheduler
                raise IgnoreRequest: stop running process_request and start running process_exception
            """
            pass

        def process_response(self, request, response, spider):
            """
            Called with the response on its way back from the downloader.
            :param request:
            :param response:
            :param spider:
            :return:
                Response object: handed to the process_response of the other middlewares
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            """
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            """
            Called when a download handler or a process_request() (downloader middleware) raises an exception.
            :param request:
            :param exception:
            :param spider:
            :return:
                None: keep passing the exception to the remaining middlewares
                Response object: stop calling the remaining process_exception methods
                Request object: stop the middleware chain; the request is rescheduled for download
            """
            return None

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

A variant that takes its proxies from a proxy pool (the return values of process_request are the same as described above):

    import requests

    class DownMiddleware1(object):
        @staticmethod
        def get_proxy():
            return requests.get("http://127.0.0.1:5010/get/").text

        @staticmethod
        def delete_proxy(proxy):
            requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

        def process_request(self, request, spider):
            if not hasattr(DownMiddleware1, 'proxy_addr'):
                DownMiddleware1.proxy_addr = self.get_proxy()

            request.meta['download_timeout'] = 5
            request.meta['proxy'] = "http://" + self.proxy_addr
            print('meta', request.meta)
            # swap the proxy once the depth or retry count reaches its threshold
            if request.meta.get('depth') == 10 or request.meta.get('retry_times') == 2:
                request.meta['depth'] = 0
                request.meta['retry_times'] = 0
                self.delete_proxy(self.proxy_addr)
                DownMiddleware1.proxy_addr = self.get_proxy()
                request.meta['proxy'] = "http://" + self.proxy_addr
                print('meta after switching proxy', request.meta)
                return request

            return None

10 Spider Middleware

    class SpiderMiddleware(object):
        def process_spider_input(self, response, spider):
            """
            Called after the download completes, before the response is handed to parse.
            :param response:
            :param spider:
            :return:
            """
            pass

        def process_spider_output(self, response, result, spider):
            """
            Called with the results the spider returns.
            :param response:
            :param result:
            :param spider:
            :return: must return an iterable of Request or Item objects
            """
            return result

        def process_spider_exception(self, response, exception, spider):
            """
            Called when an exception is raised.
            :param response:
            :param exception:
            :param spider:
            :return: None to keep passing the exception to the remaining middlewares,
                     or an iterable of Request or Item objects, which is handed to the scheduler or the pipelines
            """
            return None

        def process_start_requests(self, start_requests, spider):
            """
            Called when the spider starts.
            :param start_requests:
            :param spider:
            :return: an iterable of Request objects
            """
            return start_requests

https://docs.scrapy.org/en/latest/topics/spider-middleware.html

11 settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. Bot name
BOT_NAME = 'step8_king'

# 2. Spider module paths
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey the site's robots.txt restrictions
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. Concurrent requests per domain; the download delay is then applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are supported; they are handled through a cookiejar
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. Telnet console, used to inspect and control the running crawler
#    connect with: telnet ip port, then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Pipelines that process the scraped items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }

# 12. Custom extensions, invoked through signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. Maximum crawl depth; the current depth is available through meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO

# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# First in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# 15. Scheduler
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL de-duplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
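"""
17. Auto-throttle algorithm
    from scrapy.extensions.throttle import AutoThrottle
    How auto-throttling is set up:
    1. Get the minimum delay: DOWNLOAD_DELAY
    2. Get the maximum delay: AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay: AUTOTHROTTLE_START_DELAY
    4. When a request finishes downloading, take its latency, i.e. the time between opening the connection and receiving the response headers
    5. Compute the new delay using AUTOTHROTTLE_TARGET_CONCURRENCY:
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
"""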
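To get a feel for the update rule in step 5, here is a small standalone sketch that applies the same four lines outside of Scrapy. The function name next_delay and its parameters are made up for illustration; they simply stand in for slot.delay, latency, and the AUTOTHROTTLE_* / DOWNLOAD_DELAY values, so treat it as a rough model rather than Scrapy's own code.

```python
def next_delay(current_delay, latency, target_concurrency, min_delay, max_delay):
    """Apply the delay update rule quoted above and return the new delay.

    All names here are hypothetical stand-ins:
    current_delay ~ slot.delay, min_delay ~ DOWNLOAD_DELAY,
    max_delay ~ AUTOTHROTTLE_MAX_DELAY.
    """
    # The delay that would yield the target concurrency at this latency
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target
    new_delay = (current_delay + target_delay) / 2.0
    # Never go below the target delay
    new_delay = max(target_delay, new_delay)
    # Clamp into [min_delay, max_delay]
    return min(max(min_delay, new_delay), max_delay)


if __name__ == "__main__":
    delay = 5.0  # plays the role of AUTOTHROTTLE_START_DELAY
    for latency in (1.0, 1.0, 1.0, 30.0):
        delay = next_delay(delay, latency, 1.0, 0.0, 10.0)
        print("latency=%.1f -> delay=%.2f" % (latency, delay))
```

With a steady latency the delay drifts down towards the target by halving the gap each time, while a single slow response pushes it straight up to the configured maximum.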
# Enable auto-throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. HTTP cache
    Caches the requests and responses that have already been sent so they can be reused later
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Whether to enable the cache
# HTTPCACHE_ENABLED = True

# Cache policy: cache every request; later requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Cache expiration time in seconds
# HTTPCACHE_EXPIRATION_SECS = 0

# Cache directory
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that should not be cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
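The one cache setting that tends to surprise people is HTTPCACHE_EXPIRATION_SECS: according to the Scrapy docs, a value of 0 means cached responses never expire. The sketch below models that rule as a plain function. The name is_cache_entry_fresh and its arguments are invented for this illustration and are not part of Scrapy.

```python
import time


def is_cache_entry_fresh(stored_at, expiration_secs, now=None):
    """Return True if a cached response may still be used.

    stored_at: UNIX timestamp at which the response was cached (hypothetical input).
    expiration_secs: plays the role of HTTPCACHE_EXPIRATION_SECS.
    now: current UNIX timestamp; defaults to time.time().
    """
    if now is None:
        now = time.time()
    # 0 (or a negative value) disables expiration: the entry is always usable
    if expiration_secs <= 0:
        return True
    # Otherwise the entry is usable while its age does not exceed the limit
    return (now - stored_at) <= expiration_secs


if __name__ == "__main__":
    print(is_cache_entry_fresh(stored_at=100, expiration_secs=0, now=10**9))
    print(is_cache_entry_fresh(stored_at=100, expiration_secs=60, now=130))
    print(is_cache_entry_fresh(stored_at=100, expiration_secs=60, now=200))
```

So with the value 0 used in this settings file, a page is downloaded once and then always served from the httpcache directory until you delete it.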
"""
19. Proxies (configured through environment variables)
    from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

    Option 1: use the default middleware
        os.environ
        {
            http_proxy: http://root:woshiniba@192.168.11.11:9999/
            https_proxy: http://192.168.11.11:9999/
        }
    Option 2: use a custom downloader middleware

    import base64
    import random

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, str):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            # an empty user_pass means the proxy needs no authentication
            if proxy['user_pass']:
                request.meta['proxy'] = "http://%s" % proxy['ip_port']
                encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
                print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
            else:
                print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
                request.meta['proxy'] = "http://%s" % proxy['ip_port']

    DOWNLOADER_MIDDLEWARES = {
        'step8_king.middlewares.ProxyMiddleware': 500,
    }
"""
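The part of the custom proxy middleware that is easiest to get wrong is the Proxy-Authorization header, so here is that step on its own, written for Python 3. Plain dicts stand in for request.meta and request.headers, and the helper names and the sample proxy entry are assumptions made for this sketch, not Scrapy APIs.

```python
import base64


def proxy_auth_header(user_pass):
    """Build a Proxy-Authorization value from a 'user:password' string."""
    token = base64.b64encode(user_pass.encode("utf-8"))
    return b"Basic " + token


def apply_proxy(meta, headers, proxy):
    """Fill meta/headers the way the middleware above fills the request.

    meta and headers are plain dicts standing in for request.meta and
    request.headers; proxy is one entry of the PROXIES list.
    """
    meta["proxy"] = "http://%s" % proxy["ip_port"]
    # Only send credentials when the proxy entry actually has some
    if proxy["user_pass"]:
        headers["Proxy-Authorization"] = proxy_auth_header(proxy["user_pass"])
    return meta, headers


if __name__ == "__main__":
    # Hypothetical proxy entry, for illustration only
    meta, headers = apply_proxy({}, {}, {"ip_port": "10.0.0.1:8080", "user_pass": "user:pass"})
    print(meta)
    print(headers)
```

Checking the truthiness of user_pass matters here: with the empty strings used in the PROXIES list, a test against None would always take the authenticated branch.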
"""
20. HTTPS
    There are two cases when crawling over HTTPS:
    1. The site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. The site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # PKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other related classes
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    Related settings
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""
"""
21. Spider middlewares
    The SpiderMiddleware class is identical to the one shown in section 10 (Spider Middleware):
    process_spider_input, process_spider_output, process_spider_exception and process_start_requests.

    Built-in spider middlewares:
        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
        'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
        'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
"""
# from scrapy.spidermiddlewares.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}
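The numbers next to each middleware decide where it sits in the chain. As documented by Scrapy, the custom SPIDER_MIDDLEWARES dict is merged with the built-in defaults, entries set to None are disabled, and the rest are sorted by their number; process_spider_input then runs in increasing order and process_spider_output in decreasing order. The sketch below only models that merge-and-sort step. The function name is hypothetical and this is not Scrapy's actual implementation.

```python
def build_middleware_order(base, custom):
    """Merge default and custom middleware dicts and return paths sorted by order.

    A simplified model: custom entries override defaults,
    and a value of None disables a middleware.
    """
    merged = dict(base)
    merged.update(custom)
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


if __name__ == "__main__":
    base = {
        "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
        "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
    }
    custom = {
        "step8_king.middlewares.SpiderMiddleware": 543,
        # Setting a built-in middleware to None switches it off
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    }
    for path in build_middleware_order(base, custom):
        print(path)
```

With 543, the custom middleware lands between the offsite and referer middlewares of the default chain, which is why that value shows up in the generated settings template.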
"""
22. Downloader middlewares
    The DownMiddleware1 class is identical to the one shown in section 9 (Downloader Middleware):
    process_request, process_response and process_exception.

    Default downloader middlewares:
    {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
        'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
        'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    }
"""
# from scrapy.downloadermiddlewares.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }

12 Scraping Amazon product information

1. Create the project and the spider
    scrapy startproject Amazon
    cd Amazon
    scrapy genspider spider_goods www.amazon.cn

2. settings.py
    ROBOTSTXT_OBEY = False
    # request headers
    DEFAULT_REQUEST_HEADERS = {
        'Referer': 'https://www.amazon.cn/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
    }
    # uncomment these
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 0
    HTTPCACHE_DIR = 'httpcache'
    HTTPCACHE_IGNORE_HTTP_CODES = []
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. items.py
    import scrapy

    class GoodsItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # product name
        goods_name = scrapy.Field()
        # price
        goods_price = scrapy.Field()
        # delivery method
        delivery_method = scrapy.Field()

4. spider_goods.py
    # -*- coding: utf-8 -*-
    import scrapy
    from Amazon.items import GoodsItem
    from scrapy.http import Request
    from urllib.parse import urlencode

    class SpiderGoodsSpider(scrapy.Spider):
        name = 'spider_goods'
        allowed_domains = ['www.amazon.cn']
        # start_urls = ['http://www.amazon.cn/']

        # note: the method must be named __init__ (not __int__), and super() needs self
        def __init__(self, keyword=None, *args, **kwargs):
            super(SpiderGoodsSpider, self).__init__(*args, **kwargs)
            self.keyword = keyword

        def start_requests(self):
            url = 'https://www.amazon.cn/s/ref=nb_sb_noss_1?'
            paramas = {
                '__mk_zh_CN': '亚马逊网站',
                'url': 'search-alias=aps',
                'field-keywords': self.keyword
            }
            url = url + urlencode(paramas, encoding='utf-8')
            yield Request(url, callback=self.parse_index)

        def parse_index(self, response):
            print('Parsing index page: %s' % response.url)

            urls = response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
            for url in urls:
                yield Request(url, callback=self.parse_detail)

            # only follow the next page when the link is actually present
            next_href = response.xpath('//*[@id="pagnNextLink"]/@href').extract_first()
            if next_href:
                next_url = response.urljoin(next_href)
                print('Next page url', next_url)
                yield Request(next_url, callback=self.parse_index)

        def parse_detail(self, response):
            print('Parsing detail page: %s' % response.url)

            item = GoodsItem()
            # product name
            item['goods_name'] = response.xpath('//*[@id="productTitle"]/text()').extract_first().strip()
            # price
            item['goods_price'] = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first().strip()
            # delivery method
            item['delivery_method'] = ''.join(response.xpath('//*[@id="ddmMerchantMessage"]//text()').extract())
            return item

5. Custom pipelines
    #sql.py
    import pymysql
    from Amazon import settings

    MYSQL_HOST = settings.MYSQL_HOST
    MYSQL_PORT = settings.MYSQL_PORT
    MYSQL_USER = settings.MYSQL_USER
    MYSQL_PWD = settings.MYSQL_PWD
    MYSQL_DB = settings.MYSQL_DB

    conn = pymysql.connect(
        host=MYSQL_HOST,
        port=int(MYSQL_PORT),
        user=MYSQL_USER,
        password=MYSQL_PWD,
        db=MYSQL_DB,
        charset='utf8'
    )
    cursor = conn.cursor()

    class Mysql(object):
        @staticmethod
        def insert_tables_goods(goods_name, goods_price, deliver_mode):
            sql = 'insert into goods(goods_name,goods_price,delivery_method) values(%s,%s,%s)'
            cursor.execute(sql, args=(goods_name, goods_price, deliver_mode))
            conn.commit()

        @staticmethod
        def is_repeat(goods_name):
            sql = 'select count(1) from goods where goods_name=%s'
            cursor.execute(sql, args=(goods_name,))
            if cursor.fetchone()[0] >= 1:
                return True
            return False

    if __name__ == '__main__':
        cursor.execute('select * from goods;')
        print(cursor.fetchall())

    #pipelines.py
    from Amazon.mysqlpipelines.sql import Mysql

    class AmazonPipeline(object):
        def process_item(self, item, spider):
            goods_name = item['goods_name']
            goods_price = item['goods_price']
            delivery_mode = item['delivery_method']
            if not Mysql.is_repeat(goods_name):
                # the method name must match the one defined in sql.py
                Mysql.insert_tables_goods(goods_name, goods_price, delivery_mode)
            return item

6. Create the database table
    create database amazon charset utf8;
    create table goods(
        id int primary key auto_increment,
        goods_name char(30),
        goods_price char(20),
        delivery_method varchar(50)
    );

7. settings.py
    MYSQL_HOST = 'localhost'
    MYSQL_PORT = '3306'
    MYSQL_USER = 'root'
    MYSQL_PWD = '123'
    MYSQL_DB = 'amazon'

    # the number is the priority (1-1000, pick any value); the lower the number, the earlier the component runs
    ITEM_PIPELINES = {
        'Amazon.mysqlpipelines.pipelines.AmazonPipeline': 1,
    }

8. Create entrypoint.py in the project directory
    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'spider_goods', '-a', 'keyword=iphone8'])

https://pan.baidu.com/s/1boCEBT1

Reposted from: https://www.cnblogs.com/buyisan/p/8289624.html
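If you want to try the "check for a duplicate, then insert" logic from step 5 without setting up MySQL, the same idea can be sketched with the standard library's sqlite3 module. This swaps MySQL for SQLite purely for illustration; the GoodsStore class and its method names are made up here and are not part of the original project. Note that SQLite uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class GoodsStore:
    """Tiny stand-in for sql.py, backed by SQLite instead of MySQL (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "create table if not exists goods("
            "id integer primary key autoincrement,"
            "goods_name text, goods_price text, delivery_method text)"
        )

    def is_repeat(self, goods_name):
        # Parameterized query: the driver handles quoting of goods_name
        row = self.conn.execute(
            "select count(1) from goods where goods_name=?", (goods_name,)
        ).fetchone()
        return row[0] >= 1

    def save_once(self, item):
        """Insert the item unless a row with the same goods_name exists."""
        if self.is_repeat(item["goods_name"]):
            return False
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "insert into goods(goods_name,goods_price,delivery_method) values(?,?,?)",
                (item["goods_name"], item["goods_price"], item["delivery_method"]),
            )
        return True


if __name__ == "__main__":
    store = GoodsStore()
    item = {"goods_name": "iphone8", "goods_price": "5888", "delivery_method": "Amazon"}
    print(store.save_once(item))  # first insert succeeds
    print(store.save_once(item))  # second call is skipped as a duplicate
```

The pipeline's process_item would call something like save_once and then return the item, so that any later pipelines still receive it.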