

news 2025/11/6 9:33:19

Web Crawlers

I. Introduction

1. What is a crawler

1.1 The concept of a crawler (spider)

A crawler collects data and is also called a data-collection program. The data comes from the network and may be served by web servers (Nginx/Apache), database servers (MySQL, Redis), search indexes (Elasticsearch), big-data stores (HBase/Hive), video/image repositories (FTP), or cloud storage (OSS). Data that is crawled should be public and used for non-profit purposes.

1.2 Python crawlers

- Crawler scripts (programs) written in Python.
- They can crawl a specified target website on a schedule and in a fixed quantity.
- They mainly rely on single/multi-threading and multi-processing, network request libraries, data parsing, data storage, and task scheduling.
- A Python crawler engineer can also carry out interface testing, functional testing, performance testing, and integration testing.

2. How a crawler relates to a web back-end service

A crawler uses a network request library and therefore acts as the client; the web back end returns data in response to each request. In other words, a crawler:

- specifies a URL,
- sends an HTTP request to the web server,
- receives the response correctly,
- parses and stores the data according to its type (Content-Type).

Before sending a request, a crawler usually has to pose as a browser (UA spoofing):

    # UA: User-Agent, the identity of the program that carries the request.
    # UA detection: a site's server inspects the identity carried by each request. If it
    # belongs to a browser, the request is treated as normal. If it does not, the request
    # is treated as abnormal (a crawler) and the server is likely to reject it.
    # UA spoofing: make the crawler identify itself as a browser before it sends the
    # request; the chance of getting a 200 response is then much higher.

3. Python libraries used for crawling

Network requests:
- urllib (built in)
- requests / urllib3 (third party)
- tornado client (asynchronous requests)
- selenium (UI automation testing) / Splash (WebKit based), for pages rendered dynamically by JavaScript
- appium (crawling or UI testing of mobile apps)

Data parsing:
- re (regular expressions)
- xpath
- bs4
- json (data from RESTful interfaces)

Data storage:
- pymysql
- mongodb
- elasticsearch (the ES search engine)
- JavaScript: ECMAScript (ES) scripts, BOM, DOM

Multi-tasking libraries:
- multi-threading (threading, the thread queue module queue)
- coroutines (asyncio, gevent/eventlet)

Crawler frameworks:
- scrapy
- scrapy-redis (distributed, multi-machine crawling)

4. Common anti-crawler measures and the matching strategies

- UA (User-Agent) strategy
- login restrictions (Cookie/Token) strategy
- request frequency (IP proxy) strategy
- CAPTCHA strategy (image recognition via a cloud solving service; clicking text or objects in an image; sliders)
- dynamic JavaScript strategy (Selenium/Splash/API endpoints)

II. The urllib library [important]

2.1 urllib

    from urllib.request import urlopen

    # Send the network request
    resp = urlopen('http://www.hao123.com')
    assert resp.code == 200
    print('Request succeeded')
    # Save the page that was requested.
    # f receives the value returned by __enter__() of the object that open() returns.
    with open('a.html', 'wb') as f:
        f.write(resp.read())

2.1.1 The urllib.request module

urlopen(url | request: Request, data=None), where data is of type bytes.

    from urllib.request import urlopen  # import the urlopen function from urllib.request

urlopen(url, data=None) sends a request to the URL directly. If data is not None the request is a POST; otherwise it is a GET. urlopen() returns a response object.

urlretrieve(url, filename) downloads the resource at url into the given file.

    from urllib.request import urlretrieve, url2pathname  # url2pathname turns a URL into a local path
    import hashlib
    import os

    # Download an image
    def download_img(url):
        # File name = md5 of the URL + the extension taken from the URL
        filename = hashlib.md5(url.encode('utf-8')).hexdigest() + os.path.splitext(url2pathname(url))[-1]
        urlretrieve(url, filename)  # save the file (image)

    if __name__ == '__main__':
        url = 'http://xxx.xxx.xxx/1.png'
        # print(url2pathname(url))  # P:\xxx.xxx.xxx\1.png
        # print(os.path.splitext(url2pathname(url))[-1])  # the extension of the file in the URL: .png
        download_img('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1598329718420&di=f35bcbfa8ce0d360f4ac5ee60efe0b39&imgtype=0&src=http%3A%2F%2Fe.hiphotos.baidu.com%2Fzhidao%2Fpic%2Fitem%2Fb3119313b07eca8031a31eeb902397dda1448313.jpg')

build_opener(*handlers) builds a browser-like opener object; opener.open(url | request, data=None) sends the request.

Request is the class used to construct requests. It lets you customise a request object so that it imitates a browser, for example when logging in.

Request(url, data=None, headers={}, method=None)

    from urllib.request import Request  # import the Request class from urllib.request
    from collections import namedtuple

    # Wrap the request in an object
    request = Request(url, headers=default_headers)  # create the request instance
    resp: HTTPResponse = urlopen(request)  # open the request

Handlers:
- HTTPHandler: handles requests over the HTTP protocol.
- ProxyHandler(proxies={'http': 'http://proxy_ip:port'}): handles proxies.
- HTTPCookieProcessor(CookieJar()): handles cookies, using the http.cookiejar.CookieJar class.

2.1.2 urllib.parse (URL parsing)

    from urllib.parse import quote, urlencode

- quote() percent-encodes a single string, which is what you need for a piece of Chinese text inside a URL.
- urlencode(query: dict) URL-encodes all values of the parameter dictionary and produces key=value&key=value, that is, the application/x-www-form-urlencoded encoding. It can encode several parameters at once.

[Tip]
- In query: dict, the colon after the parameter name introduces the parameter's type.
- json: when uploading JSON, Content-Type must be set to application/json.
- data: the default Content-Type of the request is application/x-www-form-urlencoded.

2.1.3 The response

    from http.client import HTTPResponse  # import the HTTPResponse class from http.client

    response: HTTPResponse = urlopen(url, timeout=30)  # open the request

- response.read() returns binary data that has to be converted: bytes to str is decoding, as in response.read().decode(); str to bytes is encoding, with encode().
- response.code / response.status / response.getcode() return the response status code.
- readline() reads the current line.
- readlines() reads all lines as bytes and returns a list such as [b'', b''].
- geturl() returns the URL that was requested.
- headers / getheaders() / info() return the response headers.

2.1.4 Solving SSL problems

This works around limited SSL certificate support in older Python versions. Note that it turns off certificate verification.

    import ssl  # import the ssl package
    ssl._create_default_https_context = ssl._create_unverified_context  # set the https context before requesting

2.2 Examples

spider01.py

Requirement: use the page http://hao123.com to try out the methods above.

    from http.client import HTTPResponse, HTTPMessage
    from urllib.request import urlopen, Request, urlretrieve

    # Work around SSL certificate support in older Python versions; not needed on 3.7
    import ssl
    ssl._create_default_https_context = ssl._create_unverified_context

    url = 'https://www.baidu.com'  # https = http + SSL certificate (TCP three-way handshake, then the SSL handshake)
    url2 = 'http://hao123.com'  # HTTP versions: HTTP/1.0, HTTP/1.1, HTTP/2.0

    # Send the HTTP request.
    # urlopen() uses GET by default; when data is not None the request is a POST.
    response: HTTPResponse = urlopen(url2, timeout=30)
    print(type(response))  # type of the response object
    body = response.read()
    # print(response.read().decode())  # read the response bytes and convert them to a string
    print(response.code, response.status, response.getcode())  # response status code
    print(response.geturl())  # URL that responded successfully
    print(response.headers)

    print('*' * 30)
    print(response.info())

    print('--' * 30)
    # Reads bytes line by line into a list [b'', b'']; the list is empty here
    # because read() above has already consumed the body.
    print(response.readlines())

spider02.py

    import json
    from urllib.request import urlopen, Request
    from urllib.parse import quote, urlencode, urljoin

    from http.client import HTTPResponse, HTTPMessage
    from collections import namedtuple

    # Declare an immutable tuple class whose fields are accessed by name:
    # resp = Response(text, body, charset, ...)
    Response = namedtuple('Response',
                          ('text', 'body', 'charset', 'mimetype', 'json', 'headers', 'status_code'))

    default_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
    }

    def download(url, data=None, headers=None):
        # Merge into a copy so that per-call headers do not leak into later calls
        req_headers = {**default_headers, **(headers or {})}
        # Wrap the request with the Request class
        request = Request(url, data=data, headers=req_headers)
        resp: HTTPResponse = urlopen(request)
        body = resp.read()  # response data as bytes
        # resp.headers is an HTTPMessage
        charset = resp.headers.get_content_charset() or 'utf-8'
        text = body.decode(charset)
        params = {
            'body': body,
            'charset': charset,
            'mimetype': resp.headers.get_content_type(),
            'headers': dict(resp.headers),
            'text': text,
            'json': json.loads(text) if resp.headers.get_content_type() == 'application/json' else {},
            'status_code': resp.status
        }
        return Response(**params)

    def test_hao123():
        resp = download('http://hao123.com')
        print(resp.status_code)
        print(resp.headers)
        print(resp.text)

    def test_baidu():
        resp = download('https://www.baidu.com')
        print(resp.status_code)
        # print(resp.text)
        with open('baidu.html', 'w', encoding=resp.charset) as f:
            f.write(resp.text)
            print('file created')

    def baidu_search(wd):
        # quote() encodes a single string;
        # urlencode() encodes several parameters passed as a dictionary
        url = 'https://www.baidu.com/s?wd=%s' % quote(wd)  # escape
        headers = {
            # Paste the Cookie header copied from a logged-in browser session here
            'Cookie': '<Cookie string copied from the browser>'
        }
        resp = download(url, headers=headers)
        if resp.status_code == 200:
            with open('%s.html' % wd, 'w', encoding=resp.charset) as f:
                f.write(resp.text)
            print('Save %s.html File OK' % wd)

    def baidu_fanyi(word):
        url = 'https://fanyi.baidu.com/sug'  # post
        # urlencode escapes (URL-encodes) several parameters in one go
        data = urlencode({'kw': word})  # kw=%E4%BC%AF%E7%88%B5
        # Send the POST request
        resp = download(url, data.encode('utf-8'))
        print(resp.status_code)
        print(resp.json)

    def xjgg(page=1, page_size=20):
        print('--download page %s ---' % page)
        url = 'http://www.ccgp-xinjiang.gov.cn/front/search/category'
        data = json.dumps({
            'categoryCode': 'ZcyAnnouncement2',
            'pageNo': page,
            'pageSize': page_size,
            'utm': 'sites_group_front.5b1ba037.0.0.3e2f7230e5f011ea817d43a31c8c15bd'
        })
        # The POST body is a JSON string
        resp = download(url, data.encode('utf-8'),
                        headers={'Content-Type': 'application/json;charset=utf-8'})
        print(resp.status_code, resp.mimetype)
        print(resp.json)
        # Save as a JSON file
        ret = []
        for item in resp.json['hits']['hits']:
            item['_source']['id'] = item['_id']
            item['_source']['url'] = urljoin('http://www.ccgp-xinjiang.gov.cn', item['_source']['url'])
            ret.append(item['_source'])
        with open('top_%s.json' % page, 'w', encoding=resp.charset) as f:
            json.dump(ret, f)
        if page < 10:
            xjgg(page + 1)

    if __name__ == '__main__':
        # test_hao123()
        # test_baidu()
        # baidu_search('基督山伯爵')
        # baidu_fanyi('雎')
        xjgg()

spider03.py

III. The requests library

requests is also a network request library; it is a convenient wrapper built on top of urllib3.

Typical uses:
- interface testing
- wrapping operations on RESTful web services, for example an SDK for the Elasticsearch (ES) search engine
- requesting third-party network resources (for example Aliyun SMS verification codes)
- downloading site resources (images, pages, audio, and video)

3.1 Installation

    pip install requests -i https://mirrors.aliyun.com/pypi/simple

3.2 Core functions

requests.request() is the base of all the request methods. Its parameters are:

- method: str. The request method: GET, POST, PUT, DELETE, OPTIONS, HEAD, ...
- url: str. The resource endpoint (API); in REST terms this is the URI (Uniform Resource Identifier).
- params: dict. Query-string parameters for GET requests, as in /s?wd=python3.
- data: dict. Form parameters for POST/PUT/PATCH requests, placed in the request body. The default Content-Type is application/x-www-form-urlencoded, and the dictionary is serialized in the same way as urllib.parse.urlencode() does.
- json: dict. JSON data to upload, placed in the request body. Content-Type is set to application/json, and the dictionary is serialized to a JSON string with json.dumps().
- files: dict, of the form {'name': file-like-object | tuple}. A tuple can take three shapes:
  - ('filename', file-like-object), where the file-like object is the stream returned by open()
  - ('filename', file-like-object, content_type), where content_type is the MIME type of the file (image/png, image/gif, image/webp)
  - ('filename', file-like-object, content_type, custom-headers)
  files is used for uploads, normally with POST; the default Content-Type is multipart/form-data.
- headers / cookies: dict.
- proxies: dict. Proxy settings.
- auth: tuple. User name and password for authorization, as ('username', 'pwd').

requests.get() sends a GET request to query data. Usable parameters:
- url
- params (when the request path carries query parameters)
- json
- headers / cookies / auth

    # Baidu search
    import requests

    def search(wd):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
            # The header name is "Cookie"; paste the value copied from the browser
            'Cookie': '<Cookie string copied from the browser>'
        }
        resp = requests.get('http://www.baidu.com/s', params={'wd': wd}, headers=headers)
        print(resp.status_code)
        print(resp.text)
        with open('%s.html' % wd, 'wb') as f:
            f.write(resp.content)

    if __name__ == '__main__':
        search('python3.8')

requests.post() sends a POST request to upload or add data. Usable parameters:
- url
- data / files
- json
- headers / cookies / auth

    # Baidu translate
    import requests

    def fanyi(kw):
        url = 'https://fanyi.baidu.com/sug'
        resp = requests.post(url, data={'kw': kw})
        # The response object offers: status_code/headers/cookies/encoding/text/content/json()
        print(resp.json())

    if __name__ == '__main__':
        fanyi('good')

requests.put() sends a PUT request to modify or update data.

requests.patch() is also used to update data. PATCH is not guaranteed to be idempotent, so a repeated request may be processed twice; it is not recommended.

requests.delete() sends a DELETE request to delete data.

3.3 requests.Response
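The request methods above all return a Response object. Its commonly used attributes are:

- status_code: the response status code.
- url: the URL that was requested.
- headers: dict. The response headers, the counterpart of getheaders() on urllib's response object; cookies are exposed separately.
- cookies: an iterable whose elements are Cookie objects (name, value, path).
- text: the response as text.
- content: the response as bytes.
- encoding: the character set of the response, such as utf-8, gbk, gb2312.
- json(): if the response type is application/json, deserializes the data into a Python list or dict.

Aside: serialization and deserialization in JavaScript
- JSON.stringify(obj) serializes.
- JSON.parse(text) deserializes.

IV. Parsing data with re (regular expressions)

Characters:
- . any single character except a newline
- [a-f] any single character in the range
- \w any character made up of letters, digits, and the underscore
- \W \d \D \s \S

Quantifiers:
- * zero or more
- + one or more
- ? zero or one
- {n} exactly n
- {n,} at least n
- {n,m} from n to m

Groups:
- ( ) is a plain group. With several groups, search().groups() returns a tuple.
- (?P<name>pattern) is a named group. With several groups, search().groupdict() returns a dictionary whose keys are the group names.

The regex module in Python:
- re.compile() builds a regex object once so that it can be matched many times
- re.match(pattern, string)
- re.search()
- re.findall()
- re.sub()
- re.split()

4.1 Example: crawling the image section of Qiushibaike

    import requests
    import re
    import os

    if __name__ == '__main__':
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
        }
        # Create a folder that holds all images
        if not os.path.exists('./qiutuLibs'):
            os.mkdir('./qiutuLibs')
        # A URL template shared by all pages
        url = 'https://www.qiushibaike.com/imgrank/page/%d/'
        for pageNum in range(1, 3):
            # URL of this page number
            new_url = url % pageNum
            # General crawl: fetch the whole page behind the URL
            page_text = requests.get(url=new_url, headers=headers).text
            # Focused crawl: parse out every image on the page
            ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
            img_src_list = re.findall(ex, page_text, re.S)
            # print(img_src_list)
            for src in img_src_list:
                # Build the complete URL
                src = 'https:' + src
                # Binary data of the image
                img_data = requests.get(url=src, headers=headers).content
                # Image file name
                img_name = src.split('/')[-1]
                # Path the image is stored under
                imgPath = './qiutuLibs/' + img_name
                with open(imgPath, 'wb') as fp:
                    fp.write(img_data)
                    print(img_name, 'downloaded')

V. BS4 (BeautifulSoup)

5.1 Introduction

What is BeautifulSoup?

Like lxml, BeautifulSoup is an HTML parser whose main job is to parse documents and extract data.

Official site: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Pros and cons: it is slower than lxml, but its interface is friendly and easy to use.

5.2 Usage

5.2.1.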
How data parsing works

1. Locate the tags.
2. Extract the values stored in the tags and in their attributes.

- How bs4 parses data:
  - 1. Instantiate a BeautifulSoup object and load the page source into it.
  - 2. Call the object's attributes and methods to locate tags and extract data.

Installation
- pip install bs4
- pip install lxml (also needed for xpath)

Import the BeautifulSoup class

    from bs4 import BeautifulSoup

Create (instantiate) the object:

    # From a string downloaded from the web:
    soup = BeautifulSoup(downloaded_string, 'lxml')
    # From a local file:
    soup = BeautifulSoup(open('1.html'), 'lxml')

5.2.2. Parsing methods and attributes

soup.tagname returns the first tagname tag that appears in the HTML.

soup.find():
- find(tagName) is equivalent to soup.div
- Locating by attribute:
  - soup.find('div', class_/id/attr='xxx')
  - soup.find_all(tagName) returns every tag that matches (a list)

select:
- select('some selector (id, class, tag, ...)') returns a list
- Hierarchical selectors:
  - soup.select('.xxx > ul > li > a'): > means exactly one level (parent and child)
  - soup.select('.xxx > ul a'): a space means any number of levels (not necessarily adjacent)

Getting the text between tags
- soup.a.text / string / get_text()
  - text / get_text(): all text content inside a tag
  - string: only the text that belongs directly to the tag

Getting an attribute value
- soup.a['href']

5.2.3 Example

Requirement: crawl the title and content of every chapter of the novel Romance of the Three Kingdoms.

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    # Requirement: crawl every chapter title and chapter text of Romance of the
    # Three Kingdoms, http://www.shicimingju.com/book/sanguoyanyi.html
    if __name__ == '__main__':
        # Fetch the index page
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
        }
        url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
        page_text = requests.get(url=url, headers=headers).text
        # Parse chapter titles and detail-page URLs out of the index page.
        # 1. Instantiate BeautifulSoup and load the page source into it
        soup = BeautifulSoup(page_text, 'lxml')
        # Chapter titles and detail-page URLs
        li_list = soup.select('.book-mulu > ul > li')
        with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
            for li in li_list:
                title = li.a.string
                detail_url = 'http://www.shicimingju.com' + li.a['href']
                # Request the detail page and parse out the chapter text
                detail_page_text = requests.get(url=detail_url, headers=headers).text
                detail_soup = BeautifulSoup(detail_page_text, 'lxml')
                div_tag = detail_soup.find('div', class_='chapter_content')
                # The chapter text
                content = div_tag.text
                fp.write(title + ':' + content + '\n')
                print(title, 'crawled')

A later example

    import requests
    from bs4 import BeautifulSoup

    from utils.ua_ import get_ua  # the author's own helper module (not shown here)

    def get(url, callback=None):
        headers = {
            'User-Agent': get_ua()
        }
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            resp.encoding = 'utf-8' if not resp.encoding else resp.encoding
            if callback is None:
                parse(url, resp.text)
            else:
                callback(url, resp.text)

    def start_spider():
        get('http://www.qiushibaike.com/text')

    def parse(url, html):
        # print(html)
        soup = BeautifulSoup(html, 'lxml')
        # All articles: <div class="article block untagged mb15 typs_hot" ...>
        for article in soup.find_all('div', class_='article'):
            # article is an instance of bs4.element.Tag
            print(article.get('class'), type(article))
            img_src = 'https:' + article.find('img').get('src')
            # print(img_src)
            # URL of the detail page
            info_url = 'http://www.qiushibaike.com' + article.find('a', class_='contentHerf').get('href')
            print(info_url)
            get(info_url, parse_content)  # send the request
        # Task 3: get the URL of the next page and request it
        # next_url =
        # get(next_url)

    def parse_content(url, html):
        soup = BeautifulSoup(html, 'lxml')
        # select() returns a bs4.element.ResultSet, which can be indexed or iterated
        author_a = soup.select('.side-left-userinfo')[0]
        img_tag = author_a.find('img')
        author_img = 'https:' + img_tag.get('src')
        author_name = img_tag.get('alt')  # or author_a.find('span').string / .text / .get_text()
        content = soup.find('div', class_='content')
        item_pipeline(dict(author_img=author_img, name=author_name, content=content.text))

    def item_pipeline(item):
        print(item)
        # Task 2: save the item

    if __name__ == '__main__':
        start_spider()

VI. Parsing data with xpath

xpath is one way of parsing XML/HTML data. It works on the tree of elements (Node, Element) and selects an element by its path; for example /html/head/title selects the title tag.

- XML is designed to transport and store data. XML stands for EXtensible Markup Language.
- HTML is designed to display data.

6.1. How xpath parsing works
- 1. Instantiate an etree object and load the page source to be parsed into it.
- 2. Call the xpath method of that object with an xpath expression to locate tags and capture their content.

6.2. Installation
- pip install lxml

6.3. Instantiating an etree object
- 1. Import: from lxml import etree
- 2. Load the source of a local HTML document: etree.parse(filePath)
- 3. Load source fetched from the internet: etree.HTML(page_text)

6.4. xpath('xpath expression')
- / : start from the root node; it stands for one level.
- // : stands for several levels; matching can start anywhere.
- Locating by attribute: //div[@class="song"], in general tag[@attrName="attrValue"]
- Locating by index: //div[@class="song"]/p[3]; indexes start at 1.
- Getting text:
  - /text() returns the text that belongs directly to the tag
  - //text() returns all text inside the tag, including text of descendants
- Getting an attribute: /@attrName, for example img/@src

[Figure: path queries (image not available)]

6.5.
Example 1

Requirement: crawl listing titles and prices from the second-hand housing section of 58.com.

    #!/usr/bin/python3
    # Requirement: crawl listing titles and prices from 58.com second-hand housing
    import requests
    from lxml import etree

    if __name__ == '__main__':
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
        }
        # Fetch the page source
        url = 'https://xa.58.com/ershoufang/'
        page_text = requests.get(url=url, headers=headers).text
        # Parse the data
        tree = etree.HTML(page_text)
        # li_list holds the li elements
        li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
        with open('58二手房', 'w', encoding='utf-8') as fp:
            for li in li_list:
                # Parse relative to this li
                title = li.xpath('./div[2]/h2/a/text()')[0]
                price_parts = li.xpath('./div[3]/p//text()')
                price, price2, price3 = price_parts[0], price_parts[1], price_parts[2]
                fp.write(title + '\n' + price + price2 + price3 + '\n')
                print(title + '\n' + price, price2, price3)

Example 2

Requirement: parse and download the images at http://pic.netbian.com/4kmeinv/

    import requests
    from lxml import etree
    import os

    if __name__ == '__main__':
        # Create a folder that holds all images
        os.makedirs('E:/03-图片/03-picture/picbz', exist_ok=True)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
        }
        # url = 'http://www.netbian.com/meinv/'
        url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
        for pageNum in range(1, 10):
            new_url = url % pageNum
            # Request the page for this page number (the original passed the bare template)
            response = requests.get(url=new_url, headers=headers)
            # Set the response encoding by hand if needed
            # response.encoding = 'utf-8'
            page_text = response.text
            # Parse the src and alt attribute values
            tree = etree.HTML(page_text)
            li_list = tree.xpath('//ul[@class="clearfix"]/li')
            for li in li_list:
                img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
                img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
                # A general fix for garbled Chinese text
                img_name = img_name.encode('iso-8859-1').decode('gbk')
                # Request the image and store it
                img_data = requests.get(url=img_src, headers=headers).content
                img_path = 'E:/03-图片/03-picture/picbz/' + img_name
                # print(img_name, img_src)
                with open(img_path, 'wb') as f:
                    f.write(img_data)
                    print(img_name, 'downloaded')

6.6 Examples

6.6.1 Requirement: collect photos from sc.chinaz.com

    # xpath is used to extract data from HTML pages
    import requests
    from lxml import etree

    url = 'http://sc.chinaz.com/tupian/'
    resp = requests.get(url)
    if resp.status_code == 200:
        print('--request succeeded--')
        # Start using xpath
        # 1. Get the root element of the page
        resp.encoding = 'utf-8'  # the charset can be set before reading the text
        root = etree.HTML(resp.text)  # turn the downloaded HTML text into the root Element
        # with open('a.html', 'wb') as f:
        #     f.write(resp.content)  # write the page to a file to make tags easier to find
        # Find child elements, directly or indirectly, by attribute conditions
        img_elements = root.xpath('//div[@class="pic_wrap"]//img')  # list: [<Element img at 0x24f3f4befc8>, ...]
        print(img_elements)
        for img_element in img_elements:
            print(img_element)  # <Element img at 0x20fa767ed88>
            print(img_element.get('alt'))  # 霸气手持香烟美女图片
            print(img_element.xpath('./@alt')[0])  # 霸气手持香烟美女图片
            data = {}
            # data['title'] = img_element.xpath('./@alt')[0]  # xpath returns list[str, ...]
            # data['src'] = img_element.xpath('./@src2')[0]
            # data['src'], data['title'] = img_element.xpath('./@alt | ./@src2')
            data['title'] = img_element.get('alt')  # read an attribute of the element
            data['src'] = img_element.get('src2')
            print(data)
        # Link to the next page; xpath() returns a list
        page_next_url = url + root.xpath('//a[@class="nextpage"]/@href')[0]
        print(page_next_url)

6.6.2 Refactoring: wrap the script in functions and follow the next-page link

    import os
    import random
    import time
    from urllib.request import urlretrieve

    import requests
    from lxml import etree

    from utils.es import ESIndex  # the author's own helper module (not shown here)

    index = ESIndex('chinaz_zhb', '10.36.172.79', 9200)
    index.create_index()

    # 1. Entry point
    def start_spider(url, **kwargs):
        print('downloading', url)
        resp = requests.get(url, **kwargs)
        if resp.status_code == 200:
            print('downloaded', url)
            # Parse with xpath in a separate function
            resp.encoding = 'utf-8'
            parse(url, resp.text)
            print(resp.text)  # //div[@class="pic_wrap"]//img

    # index = 1

    # 2. Parsing function
    def parse(url, html):
        root = etree.HTML(html)  # root node of the current page
        # global index
        # with open('%s.html' % index, 'w', encoding='utf-8') as f:
        #     f.write(html)
        #
        # index += 1
        xpath_str = '//div[@class="pic_wrap"]//img' if url.endswith('/') \
            else '//div[@id="container"]//img'
        print(xpath_str)
        for img in root.xpath(xpath_str):  # list: [<Element img at 0x24f3f4befc8>, ...]
            item = {}
            item['title'] = img.get('alt')  # read an attribute of the element
            item['src'] = img.get('src2')
            # Save the data
            item_pipeline(item)
        # URL of the next page
        if url.endswith('/'):
            base_url = url
        else:
            base_url = url[:url.rfind('/') + 1]
        next_links = root.xpath('//a[@class="nextpage"]/@href')
        if not next_links:
            return  # no next-page link: stop instead of raising IndexError
        next_page_url = base_url + next_links[0]
        print(next_page_url)
        time.sleep(random.uniform(0.2, 3.5))  # sleep for a random time
        # Keep fetching data
        start_spider(next_page_url, headers={
            'Referer': url,
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
        })

    # 3. Handle saving the data
    def item_pipeline(item):
        print(item)
        # item['src'][item['src'].rfind('.'):] is the file extension
        filename = item['title'] + item['src'][item['src'].rfind('.'):]
        download_img(item['src'], filename)
        # Save the data to the ES search engine
        index.add_doc('images', **item)

    # Download an image
    def download_img(url, filename):
        urlretrieve(url, os.path.join('images', filename))

    if __name__ == '__main__':
        start_spider('http://sc.chinaz.com/tupian/')

6.6.3 Collecting data from the classical poetry site gushiwen.org

    # Collect data from the classical poetry site
    import requests
    from lxml import etree
    from lxml.etree import _Element

    class GswSpider():
        # Start URLs
        start_url = ['https://so.gushiwen.org/shiwen/']
        base_url = 'https://so.gushiwen.org'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
        }
        DEBUG = True

        # Logging helper
        def log(self, *args, sep='\n'):
            if self.DEBUG:
                print(*args, sep=sep)

        # Start the crawler
        def start_spider(self):
            for url in self.start_url:
                self.get(url)

        # Request (download) helper
        def get(self, url, parse_func=None, **kwargs):
            resp = requests.get(url, headers=self.headers)  # send the GET request
            self.log(url, resp.status_code, sep=' ')  # e.g. https://so.gushiwen.org/shiwen/ 200
            if resp.status_code == 200:
                resp.encoding = 'utf-8'  # charset handling
                if parse_func is None:
                    self.parse(url, resp.text)
                else:
                    parse_func(url, resp.text, **kwargs)

        # Parse the category list on the right-hand side
        def parse(self, url, html):
            root = etree.HTML(html)  # root node of the current page
            for a_element in root.xpath('//div[@class="main3"]/div[last()]/div[1]//a'):
                name = a_element.text  # text of the link, e.g. 小学古诗
                href = a_element.get('href')  # https://so.gushiwen.org/gushi/xiaoxue.aspx
                # self.log(name, href, sep='---')
                self.get(href, self.parse_gsw_list)  # second request

        # Parse all poems of one category
        def parse_gsw_list(self, url, html):
            root = etree.HTML(html)
            left_div: _Element = root.xpath('//div[@class="main3"]/div[1]')[0]
            # self.log(type(left_div))  # <class 'lxml.etree._Element'>
            cate_name = left_div.xpath('.//h1/text()')[0]  # e.g. 小学古诗文
            for subtype_element in left_div.xpath('.//div[@class="typecont"]'):
                try:
                    subtype_name = subtype_element.xpath('.//strong/text()')[0]  # e.g. 一年级上册
                except IndexError:
                    subtype_name = ''
                # self.log(cate_name, subtype_name, sep='')
                print(subtype_element.xpath('.//span/a'))
                for span_a in subtype_element.xpath('.//span/a'):
                    href = span_a.get('href')  # https://so.gushiwen.org/shiwenv_ef9cd9ba44bb.aspx
                    print(href)
                    if href:
                        href = href if href.startswith('http') else self.base_url + href
                        book_name = span_a.text  # poem title, e.g. 江南
                        self.log(cate_name, subtype_name, book_name, href, sep='')
                        # third request
                        self.get(href, self.parse_gsw, category=cate_name, subtype=subtype_name)

        # Parse the detail page of one poem
        def parse_gsw(self, url, html, category=None, subtype=None):
            self.log('parse_gsw', category, subtype, sep=' ')
            root = etree.HTML(html)
            # Extract the poem
            left_div = root.xpath('//div[@class="main3"]/div[1]')[0]
            gsw_element = left_div.xpath('./div[2]/div[1]')[0]
            name = gsw_element.xpath('.//h1/text()')[0]
            era, author = gsw_element.xpath('./p/a/text()')
            content = '\n'.join([c.replace('\n', '').strip()
                                 for c in gsw_element.xpath('./div[last()]/text()')])
            # self.log(name, author, era, sep='')
            # print(content)
            self.item_pipeline(dict(name=name, author=author, era=era,
                                    category=category, subtype=subtype, content=content))

        def item_pipeline(self, item):
            print(item)
            # Write to ES

    if __name__ == '__main__':
        spider = GswSpider()
        # spider.DEBUG = False
        spider.start_spider()

[Tip]

1. Example tags for parsing the category data

[Figure: example tags for the poetry categories (image not available)]

2. Example tags for parsing one category of poems

[Figure: example tags for one poetry category (image not available)]

6.6.4 Coroutine version of the poetry crawler

The original used the @asyncio.coroutine decorator with yield from, which was removed in Python 3.11; the version below uses async def and await instead. requests.get() is a blocking call, so the requests still run one after another.

    # Collect data from the classical poetry site
    import asyncio

    import requests
    from lxml import etree
    from lxml.etree import _Element

    class GswSpider():
        start_urls = ['https://so.gushiwen.org/shiwen/']
        base_url = 'https://so.gushiwen.org'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
        }
        DEBUG = True

        def log(self, *args, sep='\n'):
            if self.DEBUG:
                print(*args, sep=sep)

        async def start_spider(self):
            for url in self.start_urls:
                await self.get(url)

        async def get(self, url, parse_func=None, **kwargs):
            resp = requests.get(url, headers=self.headers)
            self.log(url, resp.status_code, sep=' ')
            if resp.status_code == 200:
                resp.encoding = 'utf-8'
                if parse_func is None:
                    await self.parse(url, resp.text)
                else:
                    await parse_func(url, resp.text, **kwargs)

        async def parse(self, url, html):
            root = etree.HTML(html)
            for a_element in root.xpath('//div[@class="main3"]/div[last()]/div[1]//a'):
                name = a_element.text
                href = a_element.get('href')
                self.log(name, href, sep='-')
                await self.get(href, self.parse_gsw_list)

        async def parse_gsw_list(self, url, html):
            # Parse all poems of one category
            root = etree.HTML(html)
            left_div: _Element = root.xpath('//div[@class="main3"]/div[1]')[0]
            # self.log(type(left_div))
            cate_name = left_div.xpath('.//h1/text()')[0]
            for subtype_element in left_div.xpath('.//div[@class="typecont"]'):
                try:
                    subtype_name = subtype_element.xpath('.//strong/text()')[0]
                except IndexError:
                    subtype_name = ''
                for span_a in subtype_element.xpath('.//span/a'):
                    href = span_a.get('href')
                    if href:
                        href = href if href.startswith('http') else self.base_url + href
                        book_name = span_a.text
                        self.log(cate_name, subtype_name, book_name, href, sep='')
                        await self.get(href, self.parse_gsw, category=cate_name, subtype=subtype_name)

        async def parse_gsw(self, url, html, category=None, subtype=None):
            self.log('parse_gsw', category, subtype, sep=' ')
            root = etree.HTML(html)
            left_div = root.xpath('//div[@class="main3"]/div[1]')[0]
            # Extract the poem
            gsw_element = left_div.xpath('./div[2]/div[1]')[0]
            name = gsw_element.xpath('.//h1/text()')[0]
            era, author = gsw_element.xpath('./p/a/text()')
            content = '\n'.join([c.replace('\n', '').strip()
                                 for c in gsw_element.xpath('./div[last()]//text()')])
            # self.log(name, author, era, sep='')
            await self.item_pipeline(dict(name=name, author=author, era=era,
                                          category=category, subtype=subtype, content=content))

        async def item_pipeline(self, item):
            print(item)
            # Write to ES

    if __name__ == '__main__':
        spider = GswSpider()
        spider.DEBUG = False
        # Start a single coroutine on the event loop
        asyncio.run(spider.start_spider())
        # To start several coroutines, gather them inside one async function:
        # async def main():
        #     await asyncio.gather(spider.start_spider(), spider.start_spider())
        # asyncio.run(main())

VII. CAPTCHAs

7.1 Cookie

HTTP and HTTPS are stateless. The reason a request fails to return the expected page is this: when the second request, for the personal home page, is sent, the server does not know that it comes from a logged-in state.

Cookies let the server record state about the client.

- Manual handling: capture the cookie value with a packet-capture tool and put it into headers. (Not recommended.)
- Automatic handling:
  - Where does the cookie value come from? The server creates it after the simulated login POST request.
  - The session object:
    - 1. It can send requests.
    - 2. If a cookie is produced during a request, it is stored in and carried by the session object automatically.
  - Create a session object: session = requests.Session()
  - Send the simulated login POST request through the session object; the cookie is then stored in the session.
  - Send the GET request for the personal home page through the same session object; it carries the cookie.

VIII. Proxies

Proxies defeat the anti-crawling measure of banning IP addresses.

What is a proxy?
- A proxy server.

What a proxy does:
- It gets around restrictions placed on your own IP.
- It hides your real IP.

Proxy sites:
- Kuaidaili (快代理)
- Xici proxy (西祠代理)
- www.goubanjia.com

Types of proxy IP:
- http: used for URLs on the http protocol
- https: used for URLs on the https protocol

Anonymity levels of a proxy IP:
- Transparent: the server knows the request went through a proxy and also knows the real IP.
- Anonymous: the server knows a proxy was used but not the real IP.
- High anonymity: the server knows neither that a proxy was used nor the real IP.

IX. Multi-task crawlers

Using asynchronous techniques in a crawler gives high-performance data crawling.

1.
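Processes and threads

The multiprocessing module (processes):
- Process: the process class
- Queue: a queue for communication between processes
  - put(item, timeout)
  - item = get(timeout)

Where processes are used:
- To separate server programs from client programs, such as the MySQL server and the MySQL client; services like Docker, Redis, and Elasticsearch are all processes.
- Inside service frameworks, such as scrapy, Django, and Flask.
- In separated business tasks, such as scheduled jobs and task orchestration on a process pool (the Celery framework).

[Figure: where processes are used (image not available)]

The threading module (threads):
- Thread: the thread class
- Communication between threads (shared objects): queue.Queue, the thread queue
- Callback functions (declared by the main thread, called by child threads)

2. Processes

2.1 The life cycle of a process

    # Life-cycle states of a process
    # 1 - New:        Xxx()
    #     1 -> 2:     started with .start()
    # 2 - Ready:      waiting to run
    # 3 - Running:    run() is executing; the CPU has granted a time slice
    #     3 -> 2:     the time slice is used up and the process returns to Ready
    #     3 -> 5:     finished
    #     3 -> 4:     blocked
    # 4 - Blocked:    run() hit an IO operation (file reads/writes, network streams, network requests)
    #     4 -> 2:     when blocking ends the process returns to Ready
    # 5 - Terminated: the code has run to completion and the program ends

[Figure: the life cycle of a process (image not available)]

2.2 Communication between processes

    # Ways for processes to communicate (the memory of each process is independent)
    # 1. multiprocessing.Queue    process queue
    # 2. multiprocessing.Pipe     process pipe
    # 3. multiprocessing.Manager  shared objects; the shared-memory types Value/Array hold C data types
    # 4. socket.AF_UNIX sockets on Linux
    # 5. signals (keyboard event listening and so on)

2.3 Designing crawler processes

[Figure: crawler process design (image not available)]

2.4 The process queue

2.4.1 Description

A queue shared between processes, implemented with a pipe and a few locks/semaphores. When a process first puts an item on the queue, a feeder thread is started that transfers objects from a buffer into the pipe.

2.4.2 Methods of Queue
- Queue(maxsize=0): maximum number of items; 0 means unlimited
- Put: put(obj[, block[, timeout]])
- Get: get([block[, timeout]])
- Size: qsize()
- Is it empty: empty()
- Close: close()

    # Requirement: the boss hands out 20 jobs to be done by 5 workers
    from multiprocessing import Process, Queue
    from queue import Empty
    import time
    import os

    def boss(q):  # the boss assigns the tasks
        for i in range(20):
            msg = 'Task %d from the boss' % i
            q.put(msg)  # stored at once if the queue is not full, otherwise waits
            print(time.strftime('%x %X', time.localtime()), msg)  # 08/29/20 09:12:06 Task 0 from the boss
            time.sleep(0.5)
        print('--boss has handed out all tasks----')

    def worker(q):
        while True:
            try:
                msg = q.get(timeout=5)  # no message within 5 seconds means the work is done
            except Empty:
                break
            print('at {} worker {} received: {}'.format(time.time(), os.getpid(), msg))
            time.sleep(2)

    if __name__ == '__main__':
        q = Queue(maxsize=2)  # maximum number of messages
        workers = []
        for i in range(5):
            p = Process(target=worker, args=(q, ))
            p.start()
            workers.append(p)  # keep track of all worker processes
        boss(q)  # the boss starts handing out work
        for p in workers:
            p.join()  # wait for the workers to drain the queue rather than terminating them
        q.close()
        print('--work finished---over--')

2.4.3 Multi-task crawling of the Shaile-ma (晒了吗) site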
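The source gives no code under this heading. As a hedged sketch only, and not the author's crawler, the queue pattern from 2.4.2 could be adapted to a multi-process download job roughly as follows. The names `fake_download`, `crawl`, and `SENTINEL`, and the example.com URLs, are hypothetical; `fake_download` stands in for a real HTTP request so the sketch runs without network access. Workers stop when they receive a sentinel value instead of relying on a timeout.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # hypothetical end-of-work marker, one per worker


def fake_download(url):
    """Hypothetical stand-in for a real request: returns (url, len(url))."""
    return url, len(url)


def worker(task_q, result_q):
    """Take URLs from task_q until the sentinel arrives; push results back."""
    while True:
        url = task_q.get()
        if url is SENTINEL:
            break
        result_q.put(fake_download(url))


def crawl(urls, n_workers=3):
    """Fan a list of URLs out to n_workers processes and collect the results."""
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    for url in urls:
        task_q.put(url)
    for _ in workers:
        task_q.put(SENTINEL)  # tell every worker to stop
    # Drain the results before joining so no worker blocks on a full pipe
    results = [result_q.get() for _ in urls]
    for p in workers:
        p.join()
    return dict(results)


if __name__ == "__main__":
    pages = ["http://example.com/page/%d" % i for i in range(1, 6)]
    print(crawl(pages))
```

In a real crawler `fake_download` would be replaced by a function that requests and parses the page; the queue handling would stay the same.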
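2.5 Process pipes (half/full duplex)

A pipe, also called an unnamed pipe, is the oldest form of IPC (Inter-Process Communication) on UNIX systems. A pipe connects the data streams of different processes.

- pipe.recv() receives a message
- pipe.send(obj) sends a message

2.5.1.1 Half duplex

    import time
    from multiprocessing import Process, current_process as cp, Pipe

    def send_msg(conn):
        print('--send msg--', cp().name)
        time.sleep(3)
        conn.send((1, 2, 3))  # a pipe can send any picklable object
        # conn.send('are you ok?')
        print('--msg sent--', cp().name)

    def receive_msg(conn):
        print('--receive msg--', cp().name)
        msg = conn.recv()  # blocks until a message arrives
        print(cp().name, 'receives msg->', msg)

    if __name__ == '__main__':
        # duplex=False means one-way (half duplex):
        # conn1 only receives (read-only), conn2 only sends
        conn1, conn2 = Pipe(duplex=False)
        p1 = Process(target=send_msg, args=(conn2,))
        p2 = Process(target=receive_msg, args=(conn1,))
        p2.start()
        p1.start()
        p1.join()
        p2.join()
        print('--over--')

2.5.1.2 Full duplex

    import time
    from multiprocessing import Process, current_process as cp, Pipe

    def a(conn):
        print(cp().name, '--a msg--')
        time.sleep(1)
        conn.send('are you ok?')
        print(cp().name, 'a->b sent msg')
        msg = conn.recv()  # wait for a message to arrive
        print(cp().name, 'received message from b', msg)

    def b(conn):
        print(cp().name, '--b msg--')
        msg = conn.recv()  # wait for a message to arrive
        if msg.find('are you ok?') > -1:
            conn.send('ok!')
        else:
            conn.send('Glad to get your message')

    if __name__ == '__main__':
        # duplex=True means full duplex:
        # conn1 and conn2 can both read and write
        conn1, conn2 = Pipe(duplex=True)
        p1 = Process(target=a, args=(conn1,))
        p2 = Process(target=b, args=(conn2,))
        p2.start()
        p1.start()
        p1.join()
        p2.join()
        print('--over--')

3. Threads

3.1 The concept of a thread

1. A process contains at least one thread. The process is only an abstraction; what the CPU actually schedules is the threads inside the process.
2. Threads do the real work. A thread uses the resources contained in its process; the thread itself is only a unit of scheduling and holds no resources.
3. Every process creates one thread by default when it starts. This thread is called the main thread (MainThread).
4. One process task may consist of several subtasks. If the process starts only one thread, those subtasks effectively run serially, that is, the program has a single path of execution.

3.2 How threads and processes relate

Once a program has started it has at least one process, and that process has at least one thread.

3.2.1 Function

Processes can perform multiple tasks, such as running several QQ instances on one computer at the same time. Threads can perform multiple tasks too, such as several chat windows inside one QQ instance.

3.2.2 Different definitions

- A process is an independent unit of resource allocation and scheduling in the system.
- A thread is an entity within a process and is the basic unit of CPU scheduling and dispatch. It is a smaller unit than a process that can run independently. A thread owns almost no system resources of its own, only the few that are indispensable while running (such as a program counter, a set of registers, and a stack), but it shares all resources of its process with the other threads of the same process.

3.2.3 Differences

- A program has at least one process, and a process has at least one thread.
- Threads are finer-grained than processes (they hold fewer resources), which gives multi-threaded programs a high degree of concurrency.
- A process has its own independent memory while it runs, whereas several threads share memory, which greatly improves the running efficiency of the program.
- A thread cannot run on its own; it must live inside a process.
- A process can be pictured as an assembly line in a factory, and its threads as the workers on that line.

Pros and cons: threads and processes each have strengths and weaknesses. Threads are cheap to run but make resources harder to manage and protect; processes are the opposite. In practice the most common combination is multiple processes plus coroutines.

- Processes: use more resources, relatively less efficient, easier to manage and maintain.
- Threads: use fewer resources, relatively more efficient, harder to manage and maintain.

There is a balance to strike, for example:
- Start 4 processes. As a rule, start as many processes as there are CPUs; with 4 cores and two hardware threads per core you can start 8. This gives parallelism.
- Start 400 threads. This gives concurrency, with a CPU switching between tasks all the time.

3.3. Synchronous and asynchronous

Synchronous means that when a process issues a request that takes some time to return, the process waits until it receives the result and only then continues.

Asynchronous is the opposite: the process does not wait but carries on with the operations that follow, regardless of the state of other processes. When a result comes back, the system notifies the process so that it can handle it. This can improve efficiency.

3.4. Serial and concurrent

The role of the CPU: whether tasks run in parallel or concurrently, to the user they appear to run at the same time. A process or a thread is only a task; the real work is done by the CPU, and one CPU (a single core) can execute only one task at any instant.

- Serial: when several tasks are executed, they run one after another; the next task starts only after the previous one has finished.
- Parallel: several tasks run at the same moment. This is possible only with several CPUs; the number of CPUs is the number of tasks that can run at the same instant.
- Concurrent: pseudo-parallel. The tasks appear to run at the same time, but in fact a single CPU switches back and forth between several programs.

3.5. A thread example

    import random
    import time
    from threading import Thread, current_thread as ct, Lock
    from queue import Queue

    class DownloadThread(Thread):
        def __init__(self):
            super(DownloadThread, self).__init__()

        def run(self) -> None:
            global total
            print(ct().name, 'running...')
            time.sleep(random.uniform(0.5, 3.5))
            n = random.randint(1, 100)
            # When a lock is used as a context manager:
            # entering the context acquires it, lock.acquire()
            # leaving the context releases it, lock.release()
            with lock:
                print(ct().name, 'produced', n, 'current total ->', total)
                total += n
                time.sleep(0.2)
                print(ct().name, 'current total ->', total)

    if __name__ == '__main__':
        total = 100  # shared by several threads (named sum in the original, which shadows the builtin)
        lock = Lock()
        # Create 10 threads
        ts = [DownloadThread() for i in range(10)]
        # Start the threads
        for t in ts:
            t.start()
        # Wait for all threads to finish
        for t in ts:
            t.join()  # blocking call

3.5.1. Locks

- Create a lock: lock = threading.Lock() or lock = threading.RLock()
- Acquire: lock.acquire()
- Release: lock.release()

3.5.2.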
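Thread-local variables

1. Understanding ThreadLocal

ThreadLocal is called a thread-local variable by some and thread-local storage by others; the meaning is the same. A ThreadLocal creates a separate copy of its data for every thread, and each thread can access only its own copy.

2. Why ThreadLocal exists

In a multithreaded environment every thread can use the global variables of its process. If one thread modifies a global variable, the change affects every other thread that computes with it, and the data can become inconsistent (dirty data). To stop threads from modifying a variable at the same time, thread synchronization was introduced: mutexes, condition variables, or read-write locks control access to global variables.

Global variables alone do not meet the needs of multithreaded programs. Threads often need private data that is invisible to other threads. Threads can therefore use local variables, which only the owning thread can access. Local variables are sometimes inconvenient, so Python also provides ThreadLocal variables. A ThreadLocal is itself a global variable, but each thread can use it to store private data that other threads cannot see. ThreadLocal isolates data between threads and makes a thread's data private.

    import time
    from threading import Thread, current_thread as ct, local


    class Download(Thread):
        def __init__(self, url):
            super(Download, self).__init__()
            self.url = url

        def run(self) -> None:
            # initialise the request headers,
            # set a proxy, cookies, and so on
            print(ct().name, '---set user-agent---', self.url)
            # headers is a thread-local variable. Conceptually, attributes are
            # stored under a key derived from the current thread, so the data
            # looks like {id(thread): {attribute name: attribute value}}
            headers.user_agent = '%s Firefox 66.72' % self.url
            time.sleep(1)
            print(ct().name, headers.user_agent)
            time.sleep(1)


    if __name__ == '__main__':
        headers = local()  # thread-local variable
        # Python provides the threading.local class. Instantiating it gives a
        # global object, but the data each thread stores in it is invisible to
        # other threads (in essence, each thread gets its own dict).
        urls = ('http://www.baidu.com', 'http://hao123.com', 'http://jd.com')
        ts = [Download(url) for url in urls]
        for t in ts:
            t.start()
        for t in ts:
            t.join()

    # Output: every thread reads its own headers, so the threads are isolated.
    # (Lines may interleave because the threads print concurrently.)
    # Thread-1 ---set user-agent--- http://www.baidu.com
    # Thread-2 ---set user-agent--- http://hao123.com
    # Thread-3 ---set user-agent--- http://jd.com
    # Thread-1 http://www.baidu.com Firefox 66.72
    # Thread-3Thread-2 http://hao123.com Firefox 66.72
    # http://jd.com Firefox 66.72

3.5.3.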
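Thread condition variables

Condition variable (Condition)

Purpose: keep data safe across threads. When the data does not satisfy a condition, a thread can suspend itself; when it does, the thread wakes the other waiting threads. A thread lock is used internally.

Principle: a mutex protects a shared resource during concurrent access and prevents dirty data. A Python Condition is also associated with a mutex, and it adds wait/notify/notifyAll methods to block threads or to tell other threads that they may access the shared resource. In other words, Condition is a communication mechanism between threads: if thread 1 needs data, it blocks and waits; thread 2 produces the data and then notifies thread 1, which goes on to fetch it.

Usage: a condition variable is a synchronization mechanism built on state shared between threads. It involves two actions:

1. One thread suspends itself while waiting for the condition to become true.
2. Another thread makes the condition true.

It is used together with a mutex:

- to prevent races and deadlocks;
- a thread must lock the mutex (Lock) before changing the condition state;
- wait() adds the thread to the list of threads waiting on the condition
- and unlocks the mutex.

Common functions:

    # create a condition variable
    cond = threading.Condition(threading.Lock())

- acquire(*args): acquire the lock. Note that all condition-variable functions must be called between acquire() and release().
- release(): release the lock.
- wait([timeout]): wait to be woken; timeout is the maximum waiting time.
- notify(n=1): wake at most n waiting threads.
- notifyAll() / notify_all(): wake all waiting threads.

Example

    import time
    from threading import Thread, Condition, Lock, current_thread as ct
    from queue import Queue


    class ConQueue(Queue):
        def __init__(self, maxsize=10):
            super(ConQueue, self).__init__(maxsize)
            self.cond = Condition(Lock())  # condition variable built on a lock

        def consume(self, **kwargs):
            # consumer side
            name = '%s-%s' % (ct().name, ct().ident)  # thread name and id
            with self.cond:  # acquire on entry, release on exit
                while self.empty():  # the warehouse is empty
                    print(name, 'the warehouse is empty')
                    self.cond.wait()  # wait, suspended
                item = self.get_nowait()  # take an item without waiting
                # Producers and consumers wait on the same condition,
                # so wake all of them and let each re-check its own state.
                self.cond.notify_all()
            return item

        def product(self, item):
            # producer side
            name = '%s-%s' % (ct().name, ct().ident)
            with self.cond:
                while self.full():
                    print(name, 'the warehouse is full')
                    self.cond.wait()
                self.put_nowait(item)  # store the item without waiting
                self.cond.notify_all()  # wake the waiting threads


    class ProducterThread(Thread):  # producer thread
        def __init__(self, con_queue):
            super(ProducterThread, self).__init__(name='Producer')
            self.queue: ConQueue = con_queue

        def run(self) -> None:
            name = '%s-%s' % (ct().name, ct().ident)
            global num
            while True:
                with lock:
                    item = 'bread %s' % num
                    self.queue.product(item)
                    print(name, 'produced', item)
                    num += 1
                time.sleep(2)


    class ConsumThead(Thread):  # consumer thread
        def __init__(self, con_queue):
            super(ConsumThead, self).__init__(name='Consumer')
            self.queue = con_queue

        def run(self) -> None:
            name = '%s-%s' % (ct().name, ct().ident)
            while True:
                item = self.queue.consume()  # may wait
                print(name, 'consumed', item)
                time.sleep(1)


    def start(*threads):
        for t in threads:
            t.start()


    def join(*threads):
        for t in threads:
            t.join()


    if __name__ == '__main__':
        queue = ConQueue(20)
        cs = [ConsumThead(queue) for _ in range(5)]
        ps = [ProducterThread(queue) for _ in range(2)]
        num = 1  # serial number of the bread
        lock = Lock()
        start(*cs, *ps)  # * unpacks a list or tuple
        # start(*ps)
        join(*cs, *ps)
        # join(*ps)
        print('--over--')

4. Coroutines

Coroutines are an alternative to threads. The difference is that threads are scheduled by the operating system, while coroutines are scheduled by the user program itself. Coroutines need an event-listening model (an event loop), which uses I/O multiplexing to schedule between coroutines.

4.1 Definition and principle

1. Coroutines are cooperatively scheduled within a single thread. They are also called "micro-threads": scheduling between functions or subroutines happens inside one thread.
2. Ordinary functions (subroutines) are called hierarchically: A calls B, B calls C, and results are returned in reverse order.
3. Function calls are implemented with a stack, which holds the local variables of each function.
4. A function or subroutine call has one entry and one return, and the order of calls is fixed.
5. A coroutine can be interrupted inside a subroutine or function, run another subroutine, and resume later at a suitable point.

4.2 Event model of coroutines

Event model (asynchronous I/O model):

- select: polling
- poll: event callbacks
- kqueue/epoll: enhanced event callbacks

4.3. Three ways to write coroutines

- Generator-based (transitional): yield and send().
- Python 3 introduced the asyncio module. asyncio.coroutine is a decorator that turns a function into a coroutine. Inside a coroutine function, yield from suspends the current coroutine and hands control to the coroutine object that follows yield from. asyncio.get_event_loop() returns the event loop, which runs until all coroutine objects have completed. Note that asyncio.coroutine was deprecated in Python 3.8 and removed in Python 3.11, so this style only runs on older interpreters.
- Python 3.5 introduced two keywords: async replaces asyncio.coroutine, and await replaces yield from.

Running a coroutine object:

- loop = asyncio.get_event_loop()
- loop.run_until_complete(coroutine object)
- A user-defined coroutine function is a function decorated with asyncio.coroutine; calling it returns a coroutine object.
- asyncio.wait() wraps several user-defined coroutine objects.

4.3.1. Generator-based

    # Coroutine implemented with a generator
    # yield/send

    # Fibonacci sequence 1,1,2,3,5,8,.....
    import random
    import time


    def fib(n):
        a, b = 0, 1
        index = 0
        while index < n:
            # hand b to the caller, then wait for the caller to send in a delay
            wait = yield b
            print(wait, 'seconds until the next value')
            time.sleep(wait)
            a, b = b, a + b
            index += 1


    def main():
        f = fib(10)
        n = next(f)  # get the first number from fib
        while True:
            try:
                print('---', n)
                wait_second = random.uniform(0.1, 2.0)
                # send the delay into the generator and wait for the next number
                n = f.send(wait_second)
            except StopIteration:  # the generator is exhausted
                break


    if __name__ == '__main__':
        # for n in fib(10):
        #     print(n, end=' ')
        main()

4.3.2. Using the asyncio module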
    # Note: asyncio.coroutine exists only before Python 3.11.
    import random
    import time

    import asyncio
    from asyncio import coroutine
    from utils.ua_ import *
    import requests


    @coroutine
    def get(url):
        print('--sending GET request--', url)
        # requests.get() is a blocking call; it holds the event loop while it runs
        resp = requests.get(url, headers={'User-Agent': get_ua()})
        if resp.status_code == 200:
            resp.encoding = 'utf-8'
            items = yield from parse(url, resp.text)
            print(url, 'parsed', items)


    def download(*urls):
        # entry point of the download job
        loop = asyncio.get_event_loop()  # get the event loop
        # loop.run_until_complete(coroutine object) runs a single coroutine
        # build a batch of coroutine tasks and run the loop until all finish
        loop.run_until_complete(asyncio.wait([get(url) for url in urls]))


    @coroutine
    def parse(url, html):
        print(url, 'parsing')
        # time.sleep() would suspend the whole thread
        yield from asyncio.sleep(random.uniform(0.1, 3.0))  # suspends only this coroutine
        return {'url': url, 'data': time.localtime()}


    if __name__ == '__main__':
        # download('http://www.baidu.com', 'http://hao123.com', 'http://jd.com')
        # run a single coroutine
        coroutine_1 = get('http://www.baidu.com')
        asyncio.get_event_loop().run_until_complete(coroutine_1)

4.3.3. async and await

    import random
    import time

    import asyncio
    from utils.ua_ import *
    import requests


    async def get(url):
        print('--sending GET request--', url)
        # requests.get() is a blocking call; it holds the event loop while it runs
        resp = requests.get(url, headers={'User-Agent': get_ua()})
        if resp.status_code == 200:
            resp.encoding = 'utf-8'
            items = await parse(url, resp.text)
            print(url, 'parsed', items)


    def download(*urls):
        # entry point of the download job
        loop = asyncio.get_event_loop()  # get the event loop
        # asyncio.wait() must be given tasks, not bare coroutines, on Python 3.11+
        tasks = [loop.create_task(get(url)) for url in urls]
        loop.run_until_complete(asyncio.wait(tasks))


    async def parse(url, html):
        print(url, 'parsing')
        # time.sleep() would suspend the whole thread
        await asyncio.sleep(random.uniform(0.1, 3.0))  # suspends only this coroutine
        return {'url': url, 'data': time.localtime()}


    if __name__ == '__main__':
        # download('http://www.baidu.com', 'http://hao123.com', 'http://jd.com')
        # run a single coroutine
        coroutine_1 = get('http://www.baidu.com')
        asyncio.get_event_loop().run_until_complete(coroutine_1)

10. selenium

Selenium drives a browser (Chrome, Firefox, IE) to perform browser operations: opening a URL, clicking buttons and links on a page, and entering text.

10.1. What is selenium

A module for browser automation. Through the various drivers (FirefoxDriver, InternetExplorerDriver, OperaDriver, ChromeDriver) it drives a real browser to carry out tests.
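selenium also supports headless browsers, for example HtmlUnit and PhantomJS.

10.2. Why use selenium

- It behaves like a real browser and automatically runs the JavaScript in a page, so dynamically loaded content is rendered. When the browser requests a page from the server it executes the page's JavaScript, and the JavaScript turns data into DOM elements (HTML tags).
- UI automation testing: locate a DOM node and type into it, or click a DOM node (a button or an a tag).

10.3 Using selenium

Set up the environment:

- pip install selenium
- Download a driver for your browser (Chrome):
  - Download page: http://chromedriver.storage.googleapis.com/index.html
  - Driver/browser version mapping: http://blog.csdn.net/huilan_same/article/details/51896672

Import the module: from selenium import webdriver

Instantiate a browser object, then write the automation code:

- Send a request: get(url)
- Locate elements: the find family of methods
- Interact with an element: send_keys('xxx')
- Run JavaScript: execute_script('jsCode')
- Back and forward: back(), forward()
- Close the browser: quit()

Summary: locating elements

1. find_element_by_id: returns a WebElement
2. find_elements_by_name
   - finds several DOM elements by the name attribute
   - form fields have a name attribute
   - iframe tags have a name attribute
3. find_elements_by_xpath
   - finds DOM elements by XPath; the expression is evaluated by the browser
4. find_elements_by_tag_name
   - finds several DOM elements by tag name
5. find_elements_by_class_name
6. find_elements_by_css_selector
   - #id
   - .class_name
   - div
   - div p>a
7. find_elements_by_link_text
   - finds a tags by their link text

Note: the find_element_by_* helpers belong to the Selenium 3 API. Selenium 4 replaces them with find_element(By.ID, ...) and find_elements(By.XPATH, ...).

Example. Requirement: open Taobao, search for iPhone, then open Baidu, go back, go forward.

    # Requirement: open Taobao, search for iPhone, then open Baidu -- back -- forward
    from selenium import webdriver
    from time import sleep

    # instantiate the browser (Chrome)
    path = r'D:\01-soft\12-spider_chromedriver\chromedriver.exe'
    bro = webdriver.Chrome(executable_path=path)

    # send the request
    bro.get('https://www.taobao.com/')

    # locate the element
    search_input = bro.find_element_by_id('q')
    # interact with the element
    search_input.send_keys('Iphone')

    # run some JavaScript
    # scrollTo(x, y) scrolls the window: x is horizontal, y is vertical
    # document.body.scrollHeight is the full height of the page, so this scrolls to the bottom
    bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    sleep(2)
    # click the search button
    btn = bro.find_element_by_css_selector('.btn-search')
    btn.click()

    bro.get('https://www.baidu.com')
    sleep(2)
    # back
    bro.back()
    sleep(2)
    # forward
    bro.forward()

    sleep(5)
    # quit (closes the browser)
    bro.quit()

A second example takes a screenshot:

    from selenium.webdriver import Chrome, Firefox

    path = r'D:\01-soft\12-spider_chromedriver\chromedriver.exe'
    chrome = Chrome(path)
    # chrome = Firefox(executable_path=r'D:\01-soft\12-spider_chromedriver\geckodriver.exe')

    # open Bing
    chrome.get('http://cn.bing.com')

    # take a screenshot
    chrome.save_screenshot('bing.png')

    chrome.close()  # closes the tab; if it is the only tab, the browser exits too
    chrome.quit()   # quit (closes the browser)

10.4.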
selenium and iframes

- Handling iframes
  - If the element you want is inside an iframe, you must first call switch_to.frame(id).
- Action chains (dragging): from selenium.webdriver import ActionChains
  - Create an action chain: action = ActionChains(bro)
  - click_and_hold(div): press and hold on the element
  - move_by_offset(x, y)
  - perform(): run the queued actions immediately
  - action.release(): release the held mouse button

Example 1

    from selenium import webdriver
    from time import sleep
    # the action chain class
    from selenium.webdriver import ActionChains

    path = r'D:\01-soft\12-spider_chromedriver\chromedriver.exe'
    bro = webdriver.Chrome(executable_path=path)

    bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

    # The element is inside an iframe, so switch the lookup scope first
    bro.switch_to.frame('iframeResult')
    div = bro.find_element_by_id('draggable')

    # action chain
    action = ActionChains(bro)
    # press and hold the element
    action.click_and_hold(div)

    for i in range(5):
        # perform() runs the queued actions immediately
        # move_by_offset(x, y): x is horizontal, y is vertical
        action.move_by_offset(17, 0).perform()
        sleep(0.5)

    # release the mouse button; release() only queues the action, perform() runs it
    action.release().perform()

    bro.quit()

Example 2. Requirement: simulate logging in to QQ Zone.

    from selenium import webdriver
    from time import sleep

    path = r'D:\01-soft\12-spider_chromedriver\chromedriver.exe'
    bro = webdriver.Chrome(executable_path=path)

    bro.get('https://qzone.qq.com/')

    bro.switch_to.frame('login_frame')

    a_tag = bro.find_element_by_id('switcher_plogin')
    a_tag.click()

    userName_tag = bro.find_element_by_id('u')
    password_tag = bro.find_element_by_id('p')
    sleep(1)
    userName_tag.send_keys('your_account')    # placeholder: use your own account
    sleep(1)
    password_tag.send_keys('your_password')   # placeholder: use your own password
    sleep(1)
    btn = bro.find_element_by_id('login_button')
    btn.click()

    sleep(3)

    bro.quit()

10.5. Interaction

1. Click: click()
2. Type: send_keys()
3. Simulate scrolling with JavaScript: var q = window.document.documentElement.scrollTop = 10000, run through execute_script().

Example (Toutiao):

    # ......
    # simulate scrolling
    # document.documentElement is the root element of the page (html)
    # print the window size
    print(chrome.get_window_rect(), chrome.get_window_size())
    for n in range(50):
        script = 'var q=window.document.documentElement.scrollTop=%s' % ((n + 1) * 500)
        chrome.execute_script(script)
        time.sleep(0.5)
    # ......

10.6.
Handling asynchronous AJAX pages

Cause: the page contains JavaScript that issues asynchronous AJAX calls, so looking up an element right after driver.get() raises NoSuchElementException.

Imports:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import ui
    from selenium.webdriver.support import expected_conditions as EC

Solution:

    # Block until the element appears; the second argument is the timeout in seconds
    ui.WebDriverWait(driver, 60).until(
        EC.visibility_of_all_elements_located((By.CLASS_NAME, 'soupager')))

10.7. Using switch

Cause:

- The page shows an alert dialog or contains an embedded iframe.
- If the element you are looking for is inside the alert or iframe, you must switch into it first.

Solution:

1. Find the iframe element: iframe = driver.find_element_by_id('login_frame')
2. Switch into the iframe: driver.switch_to.frame(iframe)

10.8. Browser tabs

    browser.window_handles[0]  # the first tab, which normally exists
    browser.window_handles[1]  # available when the browser opened a resource in a new tab; raises an error if there is no second tab

Quit: browser.quit()

Example: logging in to a mailbox (embedded iframe)

    import time

    from selenium.webdriver import Chrome

    chrome = Chrome(r'D:\01-soft\12-spider_chromedriver\chromedriver.exe')

    chrome.get('https://mail.qq.com')  # blocks until the page has loaded

    # the two input boxes below are inside an iframe
    login_frame = chrome.find_element_by_id('login_frame')
    chrome.switch_to.frame(login_frame)  # switch into the iframe

    user_input = chrome.find_element_by_id('u')
    pwd_input = chrome.find_element_by_id('p')

    user_input.send_keys('your_account@qq.com')  # placeholder: use your own account
    pwd_input.send_keys('your_password')         # placeholder: use your own password

    # find the login button and click it
    chrome.find_element_by_id('login_button').click()

10.9 Headless browser

    from selenium import webdriver
    from time import sleep
    from selenium.webdriver import ChromeOptions

    # One options object carries both groups of settings. Passing chrome_options
    # and options together makes Selenium 3 use only chrome_options.
    options = ChromeOptions()
    # run without a visible window
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    # reduce the risk of selenium being detected
    options.add_experimental_option('excludeSwitches', ['enable-automation'])

    path = r'D:\01-soft\12-spider_chromedriver\chromedriver.exe'
    bro = webdriver.Chrome(executable_path=path, options=options)

    # headless browsing (the older alternative was PhantomJS)
    bro.get('https://www.baidu.com')

    print(bro.page_source)
    sleep(2)
    bro.quit()

10.10 Simulating a 12306 login

10.10.1. Using the Chaojiying captcha service

11. The scrapy framework

11.1.
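Introduction

What is scrapy? Scrapy is an application framework written for crawling websites and extracting structured data. It can be used for data mining, information processing, archiving historical data, and similar programs.

Features: high-performance persistent storage, asynchronous downloading, high-performance parsing, and distributed crawling.

Official sites:

- https://doc.scrapy.org/en/latest/
- http://www.scrapyd.cn/doc/ (Chinese)
- http://scrapy-chs.readthedocs.io/zh_CN/latest/ (Chinese)

11.2. Basic use of scrapy

- Installation
  - macOS or Linux: pip install scrapy
  - Windows:
    - pip install wheel
    - Download twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    - Install twisted: pip install Twisted‑17.1.0‑cp36‑cp36m‑win_amd64.whl
    - pip install pywin32
    - pip install scrapy
  - Test: type scrapy in a terminal; if no error is reported, the installation succeeded.
- Create a project
  - scrapy startproject xxxPro
  - cd xxxPro
- Create a spider file in the spiders subdirectory
  - scrapy genspider spiderName www.xxx.com
- Run the project
  - scrapy crawl spiderName

11.3. Components of the framework

11.3.1. The five core components

engine: coordinates the other four components and communicates with each of them; it is the core of scrapy. It runs automatically and needs no attention: it organizes all request objects and hands them to the downloader.

spider: the spider class, where the crawling code lives and where requests originate. Requests issued by the spider pass through the engine into the scheduler. The spider also parses the data once a request has succeeded.

- scrapy.Spider: an ordinary spider
- scrapy.CrawlSpider
  - a spider class with configurable rules
  - Rule: the rule class
- Entry function
  - start_requests()

scheduler: schedules all requests; requests with a higher priority run first. When a request is due, the engine passes it to the downloader.

downloader: executes the request, fetches the data from the network, wraps it in a response object, and returns the response to the engine. The engine passes the response back to the spider through a callback for parsing.

item pipeline: when the spider has finished parsing, the data passes through the engine into the item pipeline, which processes it according to its type (images, text):

1. clean up HTML data
2. validate the scraped data (check that the item contains certain fields)
3. check for duplicates (and drop them)
4. save the scraped result to a database
5. download image data

Data flow of the scrapy framework

1. The engine gets the initial requests and starts crawling.
2. The engine hands the requests to the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, passing through the downloader middlewares.
5. When the page has been downloaded, the downloader returns the response to the engine.
6. The engine sends the response through the middlewares to the spider for processing.
7. The spider processes the response and returns the scraped items and new requests to the engine through the middlewares.
8. The engine sends the items to the item pipelines and the new requests to the scheduler, which plans the next request to crawl.
9. The process repeats from step 1 until every URL request has been crawled.

11.3.2. Using scrapy

- Create a project: scrapy startproject <project name>
- Create a spider: scrapy genspider <spider name> <domain>
- Run a spider: scrapy crawl <spider name>
- Debug a spider: scrapy shell <url>, or scrapy shell followed by fetch(url)

Directory layout:

- spiders
  - __init__.py
  - your spider files (.py)
- __init__.py
- items.py: where the data structures are defined. Each is a class that inherits from scrapy.Item, and its fields are of type scrapy.Field(). Note: a scrapy.Item behaves like a dict, so the elements yielded by the spider's parse() function should be dict-like objects whose keys match the item's Field() names.
- middlewares.py: middlewares, used to adjust the processing logic.
- pipelines.py: the pipeline file; by default it contains one class, which post-processes the downloaded data.
- settings.py: the configuration file, for example whether to obey the robots protocol and how User-Agent is defined.

11.3.3.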
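Spider files

Parse function: parse_detail(self, response: Response)

This is the callback that parses the data. response holds the downloaded data, which can be parsed inside the function, usually with XPath. If parse() returns a value, it must be an iterable.

Members of Response:

- selector
- css(): CSS selector; returns an iterable (list-like) of Selector objects
  - scrapy.selector.SelectorList: the selector list
    - x()/xpath()
  - scrapy.selector.Selector: a single selector
  - Extracting attributes or text with CSS selectors
    - ::text extracts the text
    - ::attr(attribute name) extracts an attribute
- xpath(): takes an XPath expression, written the same way as for lxml's xpath()
- Common selector methods
  - css()/xpath()
  - extract(): extracts the content of every selected node; returns a list
  - extract_first()/get(): extracts the first result; returns text
- response is an instance of scrapy.http.response.HtmlResponse
- response.css('.class') selects the tags with that class
- response.css('.contentHerf::attr(href)') gets the href attribute of the tag
- response.xpath() returns Selector objects (scrapy.selector.Selector); internally it calls self.selector.xpath()
- extract()/getall(): Selector methods that return the content of the Selector, i.e. the data attribute
  - response.xpath('//title/text()').extract() returns a list
  - response.css().xpath(): select elements with CSS first, then extract content with XPath
- extract_first()/get(): extracts the first result
- Strings accepted by css(): #id, .class, div, div>p, ::attr('attribute name'), ::text (tag text)

The Request class

Attributes of a scrapy.http.Request object:

- url
- callback: the callback function that parses the data
- headers: request headers
- priority: the priority of the request

meta in Request() passes data (metadata) to the next parse function. Note that meta is a dict, and its values must not be reference objects (scrapy 1.5).

priority in Request() is the priority of the request in the scheduler: the higher the value, the higher the priority and the earlier the download.

dont_filter in Request(): False means duplicate requests are filtered out; True means they are not.

11.3.4.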
Example

    # Example
    import scrapy
    from scrapy import Request
    from scrapy.http import Response, HtmlResponse
    from scrapy.selector import SelectorList

    from qsbk import cookie_


    class TxtSpider(scrapy.Spider):  # inherit from the base class
        name = 'jokes'  # jokes from Qiushibaike
        allowed_domains = ['qiushibaike.com']  # only URLs whose host is in these domains are downloaded
        start_urls = ['https://www.qiushibaike.com/text/']  # initial request URLs
        BASE_URL = 'https://www.qiushibaike.com'

        def parse(self, response: HtmlResponse):
            # get all articles
            for article_div_selector in response.css('.article'):
                author_item = article_div_selector.css('.author img')[0].attrib
                author_item['name'] = author_item.pop('alt')
                author_item['detail_href'] = article_div_selector.css('.contentHerf::attr(href)').get()
                # yield author_item

                # Request the detail page.
                # meta passes data (metadata) to the next parse function.
                # Note: meta is a dict and its values must not be reference objects (scrapy 1.5).
                # priority is the priority in the scheduler: higher values download first.
                # dont_filter=False filters out duplicate requests; True does not.
                yield Request(self.BASE_URL + author_item['detail_href'],
                              callback=self.parse_detail,
                              headers={'Referer': response.url},
                              cookies=cookie_.get_cookies(),
                              meta={'author': author_item['name'],
                                    'author_head': author_item['src']},
                              priority=100,
                              dont_filter=False)

        def parse_detail(self, response: Response):
            # response.request
            # print('parse_detail-->', response.request.meta)
            # print('parse_detail-->', response.meta)
            item = {'author': response.meta['author'],
                    'author_head': response.meta['author_head']}
            item['title'] = response.css('.article-title::text').get()
            item['publish_time'] = response.css('.stats-time::text').get()
            item['content'] = '\n'.join([c.replace('\xa0', '')
                                         for c in response.css('.content::text').getall()])
            yield item

        def parse_test(self, response: HtmlResponse):
            # scrapy.http.response.html.HtmlResponse
            # print(type(response), response)
            # class contentHerf -> the a tag
            # css()/xpath() -> scrapy.selector.unified.SelectorList
            # scrapy.selector.SelectorList/Selector
            #   - xpath()/css() query descendant elements
            #   - get()/extract_first() and getall()/extract() return the data attribute of the Selector
            #   - attrib exists only on Selector instances and requires an element, not an attribute, to be selected
            # Strings accepted by css(): #id, .class, div, div>p, ::attr('attribute name'), ::text
            a_elements: SelectorList = response.css('.contentHerf::attr(href)')  # list[Selector, ...]
            author_elements = response.css('.author').xpath('.//img')
            for i, author_element in enumerate(author_elements):
                item: dict = author_element.attrib  # dict with the src and alt attributes
                # rename the alt key to name
                item['name'] = item.pop('alt')
                item['detail_href'] = a_elements[i].get()
                yield item

11.3.5. scrapy shell, the terminal debugging tool

- In a terminal, run scrapy shell http://www.baidu.com. This gives you a response object that can be used directly.
- response.xpath(): query elements with an XPath expression; returns a list of selector objects.
- response.css(): query elements with a CSS selector; returns a list of selector objects.

11.4. Persistent storage in scrapy

11.4.1 Storage from the command line

- Command-line storage
  - Requirement: only the return value of the parse method can be stored, in a local text file.
  - Note: the file type must be one of json, jsonlines, jl, csv, xml, marshal, pickle.
  - Command: scrapy crawl xxx -o filePath
  - Advantage: concise, efficient, convenient.
  - Disadvantage: quite limited, since the data can only go into text files with the listed extensions.

11.4.2. Storage through pipelines

- Pipeline-based storage
  - Coding steps
    - Parse the data.
    - Define the relevant attributes in the item class.
    - Wrap the parsed data in an item object.
    - Submit the item object to the pipeline for persistent storage.
    - In process_item of the pipeline class, persist the data stored in the received item object.
    - Enable the pipeline in the settings file.
  - Advantage
    - General-purpose.

11.4.3. Example

Spider file: qiubai.py

    import scrapy
    from qiubaiPro.items import QiubaiproItem


    class QiubaiSpider(scrapy.Spider):
        name = 'qiubai'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.qiushibaike.com/text/']

        # Command-line storage
        # def parse(self, response):
        #     # parse the author name and the joke content
        #     div_list = response.xpath('//div[@id="content-left"]/div')
        #     all_data = []  # holds all parsed data
        #     for div in div_list:
        #         # xpath returns a list whose elements are always Selector objects
        #         # extract() pulls out the string stored in the Selector's data
        #         # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
        #         author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
        #         # calling extract() on the list extracts the data string of every Selector in it
        #         content = div.xpath('./a[1]/div/span//text()').extract()
        #         content = ''.join(content)
        #
        #         dic = {
        #             'author': author,
        #             'content': content
        #         }
        #
        #         all_data.append(dic)
        #
        #     return all_data

        def parse(self, response):
            # parse the author name and the joke content
            div_list = response.xpath('//div[@id="content-left"]/div')
            for div in div_list:
                # xpath returns a list whose elements are always Selector objects
                # extract() pulls out the string stored in the Selector's data
                # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
                author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span/h2/text()').extract_first()
                # calling extract() on the list extracts the data string of every Selector in it
                content = div.xpath('./a[1]/div/span//text()').extract()
                content = ''.join(content)

                item = QiubaiproItem()
                item['author'] = author
                item['content'] = content

                yield item  # submit the item to the pipeline

items.py: define the attributes in the item class

    import scrapy


    class QiubaiproItem(scrapy.Item):
        # define the fields for your item here like:
        author = scrapy.Field()
        content = scrapy.Field()
        # pass

pipelines.py

    import pymysql


    class QiubaiproPipeline(object):
        fp = None

        # hook method, called only once when the spider starts
        def open_spider(self, spider):
            print('spider started......')
            self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

        # handles item objects:
        # receives the items submitted by the spider file and
        # is called once for every item received
        def process_item(self, item, spider):
            author = item['author']
            content = item['content']
            self.fp.write(author + ':' + content + '\n')
            return item  # passed on to the next pipeline class to run

        def close_spider(self, spider):
            print('spider finished')
            self.fp.close()


    # One pipeline class in the pipeline file stores one set of data on one platform or medium
    class mysqlPileLine(object):
        conn = None
        cursor = None

        def open_spider(self, spider):
            self.conn = pymysql.Connect(host='116.85.7.220', port=3307, user='root',
                                        password='root', db='qiubai', charset='utf8')

        def process_item(self, item, spider):
            self.cursor = self.conn.cursor()
            try:
                # parameterized query: the driver escapes the values
                self.cursor.execute('insert into qiubai values(%s,%s)',
                                    (item['author'], item['content']))
                self.conn.commit()
            except Exception as e:
                print(e)
                self.conn.rollback()
            return item

        def close_spider(self, spider):
            if self.cursor:
                self.cursor.close()
            self.conn.close()

Enable the pipelines in settings.py

    # ...
    ITEM_PIPELINES = {
        'qiubaiPro.pipelines.QiubaiproPipeline': 300,
        'qiubaiPro.pipelines.mysqlPileLine': 301,
        # 300 is the priority: the smaller the number, the higher the priority
    }
    # ...

Extension

- Interview question: how do you store one copy of the scraped data locally and another copy in a database?
  - Each pipeline class in the pipeline file stores the data on one kind of platform.
  - The item submitted by the spider file is received only by the first pipeline class to run.
  - return item in process_item passes the item to the next pipeline class to run.

11.4.4.
Full-site crawling

- Full-site crawling with Spider
  - This means crawling the page data for every page number under one section of a site.
  - Requirement: scrape the names of the photos on the Xiaohua site.
  - Ways to do it
    - Add the URL of every page to start_urls (not recommended).
    - Send the requests manually (recommended).
  - Sending a request manually
    - yield scrapy.Request(url, callback): callback is dedicated to parsing the data.

Example: spider.py

    import scrapy


    class XiaohuaSpider(scrapy.Spider):
        name = 'xiaohua'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.521609.com/meinvxiaohua/']

        # a general URL template (immutable)
        url = 'http://www.521609.com/meinvxiaohua/list12%d.html'
        page_num = 2

        def parse(self, response):
            li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
            for li in li_list:
                img_name = li.xpath('./a[2]/b/text() | ./a[2]/text()').extract_first()
                print(img_name)

            if self.page_num <= 11:
                new_url = format(self.url % self.page_num)
                self.page_num += 1
                # manual request: the callback is dedicated to parsing the data
                yield scrapy.Request(url=new_url, callback=self.parse)

11.5. Passing data between requests

- Passing data with requests
  - When to use it: the data to be scraped is not all on one page (deep crawling).
  - Requirement: scrape job titles and job descriptions from Boss.

Example: boss.py

    import scrapy
    from bossPro.items import BossproItem


    class BossSpider(scrapy.Spider):
        name = 'boss'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=']

        url = 'https://www.zhipin.com/c101010100/?query=python&page=%d'
        page_num = 2

        # the callback receives the item
        # parses the detail page
        def parse_detail(self, response):
            item = response.meta['item']

            job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
            job_desc = ''.join(job_desc)
            # print(job_desc)
            item['job_desc'] = job_desc

            yield item

        # parses the job titles on the first page
        def parse(self, response):
            li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li')
            for li in li_list:
                item = BossproItem()

                job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div[1]/text()').extract_first()
                item['job_name'] = job_name
                # print(job_name)
                detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href').extract_first()
                # request the detail page to get its source
                # manual request
                # meta={} passes the dict to the callback of this request
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

            # pagination
            if self.page_num <= 3:
                new_url = format(self.url % self.page_num)
                self.page_num += 1

                yield scrapy.Request(new_url, callback=self.parse)

11.4.
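Image data: ImagesPipeline

- Scraping images with ImagesPipeline
  - The difference between scraping string data and scraping image data with scrapy:
    - Strings: parse with XPath and submit to a pipeline for persistent storage.
    - Images: XPath yields the src attribute of the image; a separate request to that address fetches the binary image data.
  - ImagesPipeline
    - You only parse the src attribute of the img tag and submit it to the pipeline. The pipeline requests the src, fetches the binary image data, and also stores it for you.
  - Steps
    - Parse the image addresses.
    - Submit the item holding the image address to the designated pipeline class.
    - In the pipeline file, write a custom pipeline class based on ImagesPipeline:
      - get_media_requests
      - file_path
      - item_completed  # returns the item to the next pipeline
    - In the settings file:
      - Set the image storage directory: IMAGES_STORE = './imgs_zhb'
      - Enable the custom pipeline class.

Example. Requirement: scrape the high-resolution images from the Chinaz material site.

1. Spider script (parsing): img.py

    import scrapy
    from imgsPro.items import ImgsproItem


    class ImgSpider(scrapy.Spider):
        name = 'img'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://sc.chinaz.com/tupian/']

        def parse(self, response):
            div_list = response.xpath('//div[@id="container"]/div')
            for div in div_list:
                # note: use the pseudo attribute src2 (lazy loading)
                src = div.xpath('./div/a/img/@src2').extract_first()
                # print(src)
                # create the item object
                item = ImgsproItem()
                item['src'] = src

                yield item  # submit the item to the pipeline

2. items.py: define the attributes in the item class

    import scrapy


    class ImgsproItem(scrapy.Item):
        # define the fields for your item here like:
        src = scrapy.Field()
        # pass

3. A custom pipeline class based on ImagesPipeline in pipelines.py

    from scrapy.pipelines.images import ImagesPipeline
    import scrapy


    class ImgsPileLine(ImagesPipeline):

        # requests the image data for the given image address
        def get_media_requests(self, item, info):
            yield scrapy.Request(item['src'])

        # sets the file name under which the image is stored
        def file_path(self, request, response=None, info=None):
            imgName = request.url.split('/')[-1]
            return imgName

        def item_completed(self, results, item, info):
            return item  # returned to the next pipeline class to run

4. In the settings file settings.py

    ...
    LOG_LEVEL = 'ERROR'

    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko)'
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    ...
    ITEM_PIPELINES = {
        'imgsPro.pipelines.ImgsPileLine': 300,
    }
    ...
    # directory where the images are stored
    IMAGES_STORE = './imgs_zhb'

11.5. Middleware

- Spider middleware
- Downloader middleware (important)
  - Position: between the engine and the downloader.
  - Purpose: intercept all requests and responses of the whole project.
  - Intercepting requests
    - User-Agent spoofing: process_request
    - Proxy IP: process_exception, return request
  - Intercepting responses
    - Tamper with the response data or the response object.
- Requirement: scrape news data (title and content) from NetEase News.
  - 1. Parse the URLs of the five main sections from the NetEase News home page (not dynamically loaded).
  - 2. The news titles of each section are loaded dynamically.
  - 3. Parse the URL of each news detail page, fetch its source, and parse out the news content.

11.5.1.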
Intercepting requests

middlewares.py

    from scrapy import signals

    import random


    class MiddleproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        # a pool of User-Agent strings
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]
        # two proxy pools
        PROXY_http = [
            '153.180.102.104:80',
            '195.208.131.189:56055',
        ]
        PROXY_https = [
            '120.83.49.90:9000',
            '95.189.112.214:35508',
        ]

        # intercepts requests
        def process_request(self, request, spider):
            # User-Agent spoofing
            request.headers['User-Agent'] = random.choice(self.user_agent_list)

            # to verify that the proxy takes effect
            request.meta['proxy'] = 'http://183.146.213.198:80'
            return None

        # intercepts all responses
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.

            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

        # intercepts requests that raised an exception
        def process_exception(self, request, exception, spider):
            if request.url.split(':')[0] == 'http':
                # switch to a proxy
                request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
            else:
                request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)

            return request  # resend the corrected request

11.5.2. Intercepting responses. Requirement: scrape NetEase News data

wangyi.py

    import scrapy
    from selenium import webdriver
    from wangyiPro.items import WangyiproItem


    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.cccom']
        start_urls = ['https://news.163.com/']
        models_urls = []  # URLs of the five section pages

        # create a browser object
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.bro = webdriver.Chrome(executable_path=r'D:\01-soft\12-spider_chromedriver\chromedriver.exe')

        # parses the URLs of the five section pages
        def parse(self, response):
            li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
            alist = [3, 4, 6, 7, 8]
            for index in alist:
                model_url = li_list[index].xpath('./a/@href').extract_first()
                self.models_urls.append(model_url)

            # request the page of every section in turn
            for url in self.models_urls:
                yield scrapy.Request(url, callback=self.parse_model)

        # the news titles of every section are loaded dynamically
        def parse_model(self, response):  # parses the titles and detail-page URLs in a section page
            # response.xpath()
            div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
            for div in div_list:
                title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
                new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()

                item = WangyiproItem()
                item['title'] = title

                # request the news detail page
                yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

        def parse_detail(self, response):  # parses the news content
            content = response.xpath('//*[@id="endText"]//text()').extract()
            content = ''.join(content)
            item = response.meta['item']
            item['content'] = content

            yield item

        # quit the browser when the spider closes
        def closed(self, reason):
            self.bro.quit()

middlewares.py

    from scrapy.http import HtmlResponse
    from time import sleep


    class WangyiproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        # intercepts the responses of the five section pages and replaces them
        def process_response(self, request, response, spider):  # spider is the spider object
            bro = spider.bro  # the browser object defined in the spider class

            # pick out the responses to replace:
            # the URL identifies the request,
            # and the request identifies the response
            if request.url in spider.models_urls:
                bro.get(request.url)  # request the section URL in the browser
                sleep(3)
                page_text = bro.page_source  # contains the dynamically loaded news data

                # response is the original response of a section page.
                # Build a new response that contains the dynamically loaded
                # news data (obtained conveniently through selenium) and
                # return it in place of the old one.
                new_response = HtmlResponse(url=request.url, body=page_text,
                                            encoding='utf-8', request=request)

                return new_response
            else:
                # responses of all other requests pass through unchanged
                return response

        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.

            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass

items.py

    import scrapy


    class WangyiproItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        content = scrapy.Field()

settings.py

    # enable the downloader middleware
    ...
    DOWNLOADER_MIDDLEWARES = {
        'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
    }

11.5.3. Spider middleware (SpiderMiddleware)

- @classmethod from_crawler(cls, crawler): after the spider has been created, creates the instance of this middleware class and connects the handler for the spider_opened signal.
- process_spider_input(self, response, spider): may return None or raise an exception.
  - Returning None lets the response through to be parsed.
  - Raising an exception leads to process_spider_exception().
- process_spider_output(self, response, result, spider): may return items and requests. Default: for r in result: yield r
- process_spider_exception(self, response, exception, spider): may return None or an iterable of Request/Item objects.
- process_start_requests(self, start_requests, spider): must return Requests.

11.5.4. Downloader middleware

process_request(self, request, spider)

1. What can it return? None (continue processing this request), a Response object, a Request object, or it can raise IgnoreRequest.
2. When is it called? For every request that passes through the downloader middleware on its way to the downloader.

process_response(self, request, response, spider)

1. What can it return? A Request, a Response, or it can raise IgnoreRequest.
2. When is it called? When the downloader returns a response.

process_exception(self, request, exception, spider)

Returns None, a Response, or a Request.

11.6.
总结核心模块和类 scrapy.Spider 普通爬虫类的父类- name 爬虫名, 在scrapy crawl 命令中使用- start_urls 起始的请求URL资源列表- allowed_domains 允许访问的服务器域名列表- start_requests() 方法,爬虫启动后执行的第一个方法(流程中的第一步:发起请求)- logger 当前爬虫的日志记录器- parse() 默认请求成功后,对响应的数据默认解析的方法scrapy.Spider 普通爬虫类的父类- name 爬虫名称- start ——urls 起始的请求URL资源列表scrapy.Request-初始化时参数: url, method, body, encoding, callback, headers, cookiesn dont_filter, priority, meta- meta dict格式, 可以设置proxy 代理scrapy.http.Response/TextResponse/HtmlResponse- status 响应状态码- meta 响应的原信息包含request中的meta信息- url 请求的URL- request 请求对象- headers 响应头- body 字节数据- text 文本数据- css()/xpath() 提取HTML元素信息(基于lxml/bs4) scrapy.Item 类, 类似于dict, 作用解析出不同结构的数据时使用不同的Item类便于数据管道处理。 scrapy.Filed 类用于Item子类中声明字段属性数据属性scrapy.signals 信号- spider_opened 打开爬虫- spider_closed 关闭爬虫- spider_error 爬虫出现异常优先级 - 请求优先级值高优先级大值低优先级小配置settings中的优先级 - 管道优先级值高优先级小值低优先级大 - 中间件优先级值高优先级小值低优先级大十二、规则爬虫 crawlspider CrawlSpider是一个类它的父类就是scrapy.Spider所以CrawlSpider不仅有Spider的功能还有自己独有的功能 CrawlSpider可以定义规则再解析html内容的时候可以根据链接规则提取出指定的链接然后再向这些链接发送请求所以如果有需要跟进链接的需求就可以使用CrawlSpider来实现 12.1. 
流程 - CrawlSpider:类Spider的一个子类- 全站数据爬取的方式- 基于Spider手动请求- 基于CrawlSpider- CrawlSpider的使用- 创建一个工程- cd XXX- 创建爬虫文件CrawlSpider- scrapy genspider -t crawl xxx www.xxxx.com- 链接提取器- 作用根据指定的规则allow进行指定链接的提取- 规则解析器- 作用将链接提取器提取到的链接进行指定规则callback的解析#需求爬取sun网站中的编号新闻标题新闻内容标号- 分析爬取的数据没有在同一张页面中。- 1.可以使用链接提取器提取所有的页码链接- 2.让链接提取器提取所有的新闻详情页的链接#需求爬取sun网站中的编号新闻标题新闻内容标号 sun.py文件 import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule from sunPro.items import SunproItem,DetailItem#需求爬取sun网站中的编号新闻标题新闻内容标号 class SunSpider(CrawlSpider):name sun# allowed_domains [www.xxx.com]start_urls [http://wz.sun0769.com/index.php/question/questionType?type4page]#链接提取器根据指定规则allow正则进行指定链接的提取link LinkExtractor(allowrtype4page\d)link_detail LinkExtractor(allowrquestion/\d/\d\.shtml)rules (#规则解析器将链接提取器提取到的链接进行指定规则callback的解析操作Rule(link, callbackparse_item, followTrue),#followTrue可以将链接提取器继续作用到连接提取器提取到的链接所对应的页面中Rule(link_detail,callbackparse_detail))#http://wz.sun0769.com/html/question/201907/421001.shtml#http://wz.sun0769.com/html/question/201907/420987.shtml#解析新闻编号和新闻的标题#如下两个解析方法中是不可以实现请求传参#如法将两个解析方法解析的数据存储到同一个item中可以以此存储到两个itemdef parse_item(self, response):#注意xpath表达式中不可以出现tbody标签tr_list response.xpath(//*[idmorelist]/div/table[2]//tr/td/table//tr)for tr in tr_list:new_num tr.xpath(./td[1]/text()).extract_first()new_title tr.xpath(./td[2]/a[2]/title).extract_first()item SunproItem()item[title] new_titleitem[new_num] new_numyield item#解析新闻内容和新闻编号def parse_detail(self,response):new_id response.xpath(/html/body/div[9]/table[1]//tr/td[2]/span[2]/text()).extract_first()new_content response.xpath(/html/body/div[9]/table[2]//tr[1]//text()).extract()new_content .join(new_content)# print(new_id,new_content)item DetailItem()item[content] new_contentitem[new_id] new_idyield itemitems.py import scrapyclass SunproItem(scrapy.Item):# define the fields for your item here like:title scrapy.Field()new_num scrapy.Field()class DetailItem(scrapy.Item):new_id 
scrapy.Field()content scrapy.Field()pipeline.py class SunproPipeline(object):def process_item(self, item, spider):#如何判定item的类型#将数据写入数据库时如何保证数据的一致性if item.__class__.__name__ DetailItem:print(item[new_id],item[content])passelse:print(item[new_num],item[title])return itemsettings.py ... ITEM_PIPELINES {sunPro.pipelines.SunproPipeline: 300, } ...十三、分布式爬虫 - 分布式爬虫- 概念我们需要搭建一个分布式的机群让其对一组资源进行分布联合爬取。- 作用提升爬取数据的效率- 如何实现分布式- 安装一个scrapy-redis的组件- 原生的scarapy是不可以实现分布式爬虫必须要让scrapy结合着scrapy-redis组件一起实现分布式爬虫。- 为什么原生的scrapy不可以实现分布式- 调度器不可以被分布式机群共享- 管道不可以被分布式机群共享- scrapy-redis组件作用- 可以给原生的scrapy框架提供可以被共享的管道和调度器- 实现流程- 创建一个工程- 创建一个基于CrawlSpider的爬虫文件- 修改当前的爬虫文件- 导包from scrapy_redis.spiders import RedisCrawlSpider- 将start_urls和allowed_domains进行注释- 添加一个新属性redis_key sun 可以被共享的调度器队列的名称- 编写数据解析相关的操作- 将当前爬虫类的父类修改成RedisCrawlSpider- 修改配置文件settings- 指定使用可以被共享的管道ITEM_PIPELINES {scrapy_redis.pipelines.RedisPipeline: 400}- 指定调度器# 增加了一个去重容器类的配置, 作用使用Redis的set集合来存储请求的指纹数据, 从而实现请求去重的持久化DUPEFILTER_CLASS scrapy_redis.dupefilter.RFPDupeFilter# 使用scrapy-redis组件自己的调度器SCHEDULER scrapy_redis.scheduler.Scheduler# 配置调度器是否要持久化, 也就是当爬虫结束了, 要不要清空Redis中请求队列和去重指纹的set。如果是True, 就表示要持久化存储, 就不清空数据, 否则清空数据SCHEDULER_PERSIST True- 指定redis服务器- redis相关操作配置- 配置redis的配置文件- linux或者macredis.conf- windows:redis.windows.conf- 代开配置文件修改- 将bind 127.0.0.1进行删除- 关闭保护模式protected-mode yes改为no- 结合着配置文件开启redis服务- redis-server 配置文件- 启动客户端- redis-cli- 执行工程- scrapy runspider xxx.py- 向调度器的队列中放入一个起始的url- 调度器的队列在redis的客户端中- lpush xxx www.xxx.com- 爬取到的数据存储在了redis的proName:items这个数据结构中示例 day10 — dm530项目存储MondoDB —NoSQL仓库.xmind 十四、增量式爬虫增量式爬虫- 概念监测网站数据更新的情况只会爬取网站最新更新出来的数据。- 分析- 指定一个起始url- 基于CrawlSpider获取其他页码链接- 基于Rule将其他页码链接进行请求- 从每一个页码对应的页面源码中解析出每一个电影详情页的URL- 核心检测电影详情页的url之前有没有请求过- 将爬取过的电影详情页的url存储- 存储到redis的set数据结构- 对详情页的url发起请求然后解析出电影的名称和简介- 进行持久化存储movie.py # -*- coding: utf-8 -*- import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rulefrom redis import Redis from moviePro.items import MovieproItem 


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.ccc.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    # Parse the movie detail page URLs out of each listing page
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            # Build the detail page URL
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            # Add the detail page URL to a Redis set; sadd returns 1 only if it was not there yet
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This URL has not been crawled yet; crawling its data')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('No update yet; there is no new data to crawl')

    # Parse the movie name and description from the detail page for persistent storage
    def parse_detail(self, response):
        item = MovieproItem()
        item['name'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        item['desc'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]//text()').extract()
        item['desc'] = ''.join(item['desc'])
        yield item

pipelines.py

import json


class MovieproPipeline(object):
    conn = None

    def open_spider(self, spider):
        # Reuse the Redis connection created on the spider class
        self.conn = spider.conn

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'desc': item['desc']
        }
        # print(dic)
        # redis-py 3.x and later only accepts bytes, str and numbers, so serialize the dict first
        self.conn.lpush('movieData', json.dumps(dic))
        return item
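
The deduplication step in the spider relies on one property of the Redis set: adding a member returns 1 only the first time that member is seen. The sketch below reproduces just that check without a Redis server. InMemorySetStore and select_new_urls are hypothetical names introduced for illustration, and the in-memory class only imitates the single-member form of sadd; it is not part of redis-py or of the original project.

```python
class InMemorySetStore:
    """Hypothetical stand-in for a Redis connection; it only imitates sadd()."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Like Redis SADD with a single member:
        # return 1 if the member was newly added, 0 if it was already present.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def select_new_urls(store, key, urls):
    """Return only the URLs that had not been recorded under `key` before."""
    return [url for url in urls if store.sadd(key, url) == 1]


store = InMemorySetStore()
# First crawl: both detail pages are new.
first_run = select_new_urls(store, "urls", ["/a.html", "/b.html"])
# Second crawl: /b.html was seen already, so only /c.html is new.
second_run = select_new_urls(store, "urls", ["/b.html", "/c.html"])
print(first_run)
print(second_run)
```

Because the set lives in Redis in the real spider, the record of visited URLs survives between runs, which is what makes the crawl incremental; the in-memory version here forgets everything when the process exits.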
