Background: today I suddenly decided to refactor my proxy pool and add more proxies to it, so I set out to scrape some proxy IPs. That is how the following story began.

I started by firing off a request without much thought:

def get_xxdaili(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        'Host': 'www.66ip.cn',
        'Referer': 'http://www.66ip.cn/index.html',
        'Upgrade-Insecure-Requests': '1',
    }
    res = requests.get(url=url, headers=headers)

Then, without even looking at the status code, I went straight to extracting data with XPath. A moment later I was completely baffled: why was the result empty? I opened the page source in the browser and everything was there. Maybe my XPath was wrong, so I tweaked it, and it still matched nothing. This site was starting to feel sneaky. I went over the page source and the Network panel again, and only then looked at the returned status code: 521. After some analysis, this is the explanation of the error that I found:

"A 521 error occurs because the origin server refuses connections from Cloudflare. More specifically, Cloudflare tries to connect to your origin server on port 80 or 443 but receives a connection-refused error."

I noticed that the cookie parameters looked suspicious, so I guessed the cookie was the problem (I had never run into a 521 before, so at first I had no idea where to look). After going through material online, it turned out that the site generates an encrypted cookie with JavaScript. From there the plan was clear: analyze the JS and obtain the encrypted value. So I grabbed the response body directly.

The body is a single packed script of the following shape. The word list in x and the payload in y are different for every request, and their separators and operators were lost when this page was captured, so both strings are elided as "…" here. The word list holds identifiers such as attachEvent, fromCharCode, createElement, firstChild, href, DOMContentLoaded, captcha, challenge and __jsl_clearance.

<script>var x="…".replace(/@*$/,"").split("@"),y="…",f=function(x,y){var a=0,b=0,c=0;x=x.split("");y=y||99;while((a=x.shift())&&(b=a.charCodeAt(0)-77.5))c=(Math.abs(b)<13?(b+48.5):parseInt(a,36))+y*c;return c},z=f(y.match(/\w/g).sort(function(x,y){return f(x)-f(y)}).pop());while(z++)try{eval(y.replace(/\b\w+\b/g,function(y){return x[f(y,z)-1]||("_"+y)}));break}catch(_){}</script>

After formatting the JS (this sample comes from a different request, so its word list and payload differ):

<script>
var x = "…".replace(/@*$/, "").split("@"),
    y = "…",
    f = function (x, y) {
        var a = 0, b = 0, c = 0;
        x = x.split("");
        y = y || 99;
        while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5))
            c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
        return c
    },
    z = f(y.match(/\w/g).sort(function (x, y) { return f(x) - f(y) }).pop());
while (z++) try {
    eval(y.replace(/\b\w+\b/g, function (y) { return x[f(y, z) - 1] || ("_" + y) }));
    break
} catch (_) {
}
</script>

After consulting references and doing my own digging, I found the key spot: the eval call. So I replaced eval with console.log. After tidying up the output I got the following (the declarations in the screenshot above differ from the JS code below, but they are essentially the same). The two arrays are long runs of obfuscated expressions built from [], {}, ~ and !, which did not survive capture and are elided here:

var _3a = function () {
    setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500);
    document.cookie = '__jsl_clearance=1558947273.79|0|' + (function () {
        var _3a = [ /* obfuscated index expressions, elided */ ],
            _4h = Array(_3a.length);
        for (var _28 = 0; _28 < _3a.length; _28++) {
            // Obfuscated string fragments, elided; one of them reads window['callP' + 'hantom'].
            _4h[_3a[_28]] = [ /* ... */ ][_28]
        };
        return _4h.join('')
    })() + ';Expires=Mon, 27-May-19 09:54:33 GMT;Path=/;';
};
if ((function () {
    try { return !!window.addEventListener; } catch (e) { return false; }
})()) {
    document.addEventListener('DOMContentLoaded', _3a, false)
} else {
    document.attachEvent('onreadystatechange', _3a)
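}

From the code above we can see that, after the cookie is obtained, the site applies a second layer of encryption. So once we evaluate the value assigned to document.cookie in the code above, we have the cookie we want:

__jsl_clearance=1558954345.795|0|V%2Bp1UYNNA%2Fc4wboCF4SQoA%2Fy9j0%3D;Expires=Mon, 27-May-19 11:52:25 GMT;Path=/;

This is the data we need. Add the cookie returned by the first request, join the two together, and the result is the cookie to send.

There are many packages for running JS from Python. Here we use js2py and execjs (either one works):

pip install Js2Py
pip install PyExecJS

The two versions are largely the same. Because I was short on time, many parts are not optimized and only implement the basic functionality, so please bear with me (I will optimize them later).

js2py implementation:

#!/usr/bin/env python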
# -*- coding: utf-8 -*-
# Time : 2019/5/27 15:19
# Author : yhl
# Software: PyCharm

import re
import time
import js2py
import random
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
    'Host': 'www.66ip.cn',
    # 'Referer': 'http://www.66ip.cn/index.html',
    'Upgrade-Insecure-Requests': '1',
}


def get_521_content(url):
    # First request: expected to return 521 plus the packed challenge script.
    req = requests.get(url=url, headers=headers)
    cookies = req.cookies
    cookies = '; '.join(['='.join(item) for item in cookies.items()])
    txt_521 = req.text
    txt_521 = ''.join(re.findall('<script>(.*?)</script>', txt_521))
    return (txt_521, cookies, req)


def fixed_fun(function, url):
    print(function)
    # Capture the unpacked source in a variable instead of eval-ing it.
    js = function.replace("<script>", "").replace("</script>", "").replace("{eval(", "{var my_data_1 = (")
    # print(js)
    # Use js2py's JS interop to read back the my_data_1 value assigned above.
    context = js2py.EvalJs()
    context.execute(js)
    js_temp = context.my_data_1
    print(js_temp)
    # Keep only the document.cookie assignment and store its value in my_data_2.
    index1 = js_temp.find("document.")
    index2 = js_temp.find("};if((")
    js_temp = js_temp[index1:index2].replace("document.cookie", "my_data_2")
    # There is no DOM here, so replace the DOM lookup with the page URL as a string literal.
    new_js_temp = re.sub(r'document.create.*?firstChild.href', '"{}"'.format(url), js_temp)
    # print(new_js_temp)
    # print(type(new_js_temp))
    context.execute(new_js_temp)
    data = context.my_data_2
    # print(data)
    __jsl_clearance = str(data).split(';')[0]
    return __jsl_clearance


def get_66daili(url):
    txt_521, cookies, req = get_521_content(url)
    print(req.status_code)
    if req.status_code == 521:
        __jsl_clearance = fixed_fun(txt_521, url)
        headers['Cookie'] = __jsl_clearance + '; ' + cookies
        res1 = requests.get(url=url, headers=headers)
    else:
        res1 = req
    res1.encoding = 'gb2312'
    html = etree.HTML(res1.text)
    tr_list = html.xpath('//table//tr')
    for num, tr in enumerate(tr_list, 1):
        proxy_ip_dict = {}
        if num != 1:  # skip the table header row
            proxy_ip_dict['proxy_ip'] = ''.join(tr.xpath('.//td[1]/text()'))
            proxy_ip_dict['proxy_port'] = ''.join(tr.xpath('.//td[2]/text()'))
            proxy_ip_dict['proxy_local'] = ''.join(tr.xpath('.//td[3]/text()'))
            proxy_ip_dict['proxy_anonymous'] = ''.join(tr.xpath('.//td[4]/text()'))
            print(proxy_ip_dict)  # the page has no proxy_type; add your own proxy check


def main():
    for i in range(1, 2000):
        get_66daili('http://www.66ip.cn/%s.html' % (i))


if __name__ == '__main__':
    for i in range(2, 2000):
        get_66daili('http://www.66ip.cn/%s.html' % (i))
        time.sleep(random.uniform(1, 2))

execjs implementation (bug-prone, but it does produce a result; I tested it myself):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Time : 2019/5/27 17:18
# Author : yhl
# Software: PyCharm

import re
import execjs
import js2py
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}


def get_521_content():
    # First request: expected to return 521 plus the packed challenge script.
    req = requests.get('http://www.66ip.cn/1.html', headers=headers)
    cookies = req.cookies
    cookies = '; '.join(['='.join(item) for item in cookies.items()])
    txt_521 = req.text
    txt_521 = ''.join(re.findall('<script>(.*?)</script>', txt_521))
    return (txt_521, cookies)


def fixed_fun(function):
    print(function)
    # Return the unpacked source instead of eval-ing it.
    func_return = function.replace('eval', 'return')
    resHtml = "function getClearance(){" + func_return + "};"
    ctx = execjs.compile(resHtml)
    temp1 = ctx.call('getClearance')
    print(temp1)
    # Cut out the expression assigned to document.cookie and return it from a new function.
    s = 'var a' + temp1.split('document.cookie')[1].split("Path=/;'")[0] + "Path=/;';return a;"
    # There is no DOM here, so replace the DOM lookup with the page URL as a string literal.
    s = re.sub(r'document.create.*?firstChild.href', '"{}"'.format("http://www.66ip.cn/1.html"), s)
    print("s---", s)
    resHtml = "function getnewClearance(){" + s + "};"
    ctx = execjs.compile(resHtml)
    jsl_clearance = ctx.call('getnewClearance')
    __jsl_clearance = str(jsl_clearance).split(';')[0]
    print(jsl_clearance)
    return __jsl_clearance


if __name__ == '__main__':
    func = get_521_content()
    content = func[0]
    cookie_id = func[1]
    cookie_id1 = fixed_fun(content)
    headers['Cookie'] = cookie_id + '; ' + cookie_id1
    res1 = requests.get(url='http://www.66ip.cn/1.html', headers=headers)
    res1.encoding = 'gb2312'
    print(res1.text)

Result of the execjs-based implementation (screenshot not preserved).

Reposted from: https://www.cnblogs.com/yhll/p/10932349.html
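The cookie-joining step that both scripts perform inline can also be pulled out into a small pure function, which makes it easy to check without touching the network. The following is only a minimal sketch under a few assumptions: the helper names first_cookie_pair and build_cookie_header are mine and do not appear in the scripts above, and the "__jsluid" / "abc123" pair is a placeholder for whatever cookie the 521 response actually sets. The __jsl_clearance string is the sample value quoted earlier in this post.

```python
from typing import Mapping


def first_cookie_pair(document_cookie: str) -> str:
    """Return only the leading name=value pair of a document.cookie string.

    Attributes such as Expires and Path are dropped, because a Cookie
    request header carries name=value pairs only.
    """
    return document_cookie.split(";", 1)[0].strip()


def build_cookie_header(first_response_cookies: Mapping[str, str],
                        document_cookie: str) -> str:
    """Join the cookies set by the 521 response with the JS-computed cookie."""
    pairs = [f"{name}={value}" for name, value in first_response_cookies.items()]
    pairs.append(first_cookie_pair(document_cookie))
    return "; ".join(pairs)


if __name__ == "__main__":
    # Sample document.cookie value quoted in the post.
    js_cookie = (
        "__jsl_clearance=1558954345.795|0|V%2Bp1UYNNA%2Fc4wboCF4SQoA%2Fy9j0%3D;"
        "Expires=Mon, 27-May-19 11:52:25 GMT;Path=/;"
    )
    # Placeholder cookie name and value, not taken from the post.
    print(build_cookie_header({"__jsluid": "abc123"}, js_cookie))
```

With requests, the first argument could be something like dict(req.cookies.items()) from the 521 response, and the returned string would go into headers['Cookie'] before the second request. The order of the pairs should not matter to the server, so this sketch simply puts the first-response cookies first.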