This article comes from the NetEase Cloud community. Author: Wang Tao.

Here we give several common code examples, covering GET, POST (with JSON and with form data), and access with a certificate. All of the examples share the following imports:

import json
import traceback
from tornado import gen
from tornado.ioloop import IOLoop
from tornado.httpclient import HTTPRequest
from tornado.curl_httpclient import CurlAsyncHTTPClient

GET request

@gen.coroutine
def fetch_url():
    try:
        c = CurlAsyncHTTPClient()  # create an HTTP client
        myheaders = {
            "Host": "weixin.sogou.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        url = "http://weixin.sogou.com/weixin?type=1&s_from=input&query=%E4%BA%BA%E6%B0%91%E6%97%A5%E6%8A%A5&ie=utf8&_sug_=n&_sug_type_="
        req = HTTPRequest(url=url, method="GET", headers=myheaders, follow_redirects=True,
                          request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888)
        response = yield c.fetch(req)  # issue the request
        print response.code
        print response.body
        IOLoop.current().stop()  # stop the IOLoop
    except:
        print traceback.format_exc()

(Figure: the request headers captured by Fiddler.)

POST request with JSON data

@gen.coroutine
def fetch_url():
    """Fetch the url."""
    try:
        c = CurlAsyncHTTPClient()  # create an HTTP client
        myheaders = {
            "Host": "weixin.sogou.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Content-Type": "application/json",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        url = "http://127.0.0.1?type=1&s_from=input&query=%E4%BA%BA%E6%B0%91%E6%97%A5%E6%8A%A5&ie=utf8&_sug_=n&_sug_type_="
        body = json.dumps({"key1": "value1", "key2": "value2"})  # JSON-formatted body
        req = HTTPRequest(url=url, method="POST", headers=myheaders, follow_redirects=True,
                          request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888, body=body)
        response = yield c.fetch(req)  # issue the request
        print response.code
        print response.body
        IOLoop.current().stop()  # stop the IOLoop
    except:
        print traceback.format_exc()

(Figure: the request headers captured by Fiddler.)

POST request with form data

@gen.coroutine
def fetch_url():
    """Fetch the url."""
    try:
        c = CurlAsyncHTTPClient()  # create an HTTP client
        myheaders = {
            "Host": "weixin.sogou.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            # "Content-Type": "application/json",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        import urllib  # for urlencode (Python 2)
        url = "http://127.0.0.1?type=1&s_from=input&query=%E4%BA%BA%E6%B0%91%E6%97%A5%E6%8A%A5&ie=utf8&_sug_=n&_sug_type_="
        body = urllib.urlencode({"key1": "value1", "key2": "value2"})  # urlencode the form fields
        req = HTTPRequest(url=url, method="POST", headers=myheaders, follow_redirects=True,
                          request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888, body=body)
        response = yield c.fetch(req)  # issue the request
        print response.code
        print response.body
        IOLoop.current().stop()  # stop the IOLoop
    except:
        print traceback.format_exc()

(Figure: the request headers captured by Fiddler.)

Access with a certificate

@gen.coroutine
def fetch_url():
    """Fetch the url."""
    try:
        c = CurlAsyncHTTPClient()  # create an HTTP client
        myheaders = {
            "Host": "www.amazon.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        url = "https://www.amazon.com/"
        req = HTTPRequest(url=url, method="GET", headers=myheaders, follow_redirects=True,
                          request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888,
                          ca_certs="FiddlerRoot.pem")  # bind the CA certificate (Fiddler's root cert)
        response = yield c.fetch(req)  # issue the request
        print response.code
        print response.body
        IOLoop.current().stop()  # stop the IOLoop
    except:
        print traceback.format_exc()

(Figure: the traffic captured by Fiddler, showing the site can be accessed normally.)
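A note on running these examples: each fetch_url above stops the IOLoop when it finishes, so it assumes the caller has already started the loop. A minimal driver for any one of them (my sketch, not from the original article; assumes Tornado 4+) would be:

if __name__ == "__main__":
    IOLoop.current().spawn_callback(fetch_url)  # schedule the coroutine on the loop
    IOLoop.current().start()  # blocks until fetch_url calls IOLoop.current().stop()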
4. Summary

When the crawl volume is small, requests is recommended: it is simple and easy to use. When concurrency is high, tornado is recommended: single-threaded, highly concurrent, efficient, and easy to program.

The sections above describe the commonly used interfaces and parameters of requests and tornado, which are enough to solve most of the problems a crawler runs into, including concurrent fetching, routine anti-crawling countermeasures, and fetching HTTPS sites.

Attached is the fetch-loop skeleton I usually use:

import random
import time

import tornado.util
from tornado import gen
from tornado.ioloop import IOLoop, PeriodicCallback
from tornado.queues import Queue
from tornado.httpclient import HTTPRequest
from tornado.curl_httpclient import CurlAsyncHTTPClient

TASK_QUE = Queue(maxsize=1000)

def response_handler(res):
    """Handle the response: typically parse out new urls, add them to the task queue, and extract the target data."""
    pass

@gen.coroutine
def url_fetcher_without_param():
    pass

@gen.coroutine
def url_fetcher(*args, **kwargs):
    global TASK_QUE
    c = CurlAsyncHTTPClient()
    while 1:
        # console_show_log("Let's spider")
        try:
            param = yield TASK_QUE.get(time.time() + 300)  # 5-minute deadline
        except tornado.util.TimeoutError:
            yield gen.sleep(random.randint(10, 100))
            continue
        try:
            req = HTTPRequest(param)  # configure method, headers, etc. as needed
            response = yield c.fetch(req)
            if response.code == 200:
                response_handler(response.body)
        except Exception:
            yield gen.sleep(10)
            continue
        finally:
            print "I am a slow spider"
            yield gen.sleep(random.randint(10, 100))

@gen.coroutine
def period_callback():
    pass

def main():
    io_loop = IOLoop.current()
    # add the concurrent fetchers
    io_loop.spawn_callback(url_fetcher, 1)
    io_loop.spawn_callback(url_fetcher, 2)
    io_loop.spawn_callback(url_fetcher_without_param)  # the arguments are optional
    # for periodic work, use PeriodicCallback
    PERIOD_CALLBACK_MILSEC = 10  # period, in ms
    PeriodicCallback(period_callback, PERIOD_CALLBACK_MILSEC).start()
    io_loop.start()

if __name__ == "__main__":
    main()

That's all; questions and discussion are welcome.
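A postscript on the skeleton: as written, nothing ever puts work into TASK_QUE (the fetchers would just cycle through get timeouts), and response_handler is left as a stub. Below is one possible sketch of how both gaps could be filled, assuming the queued tasks are plain URL strings, the pages are HTML, and lxml is installed; the XPath rule and the seed URL are placeholders of mine, not part of the original:

from lxml import etree  # assumed third-party dependency for this sketch

def response_handler(res):
    """Parse the page, queue newly discovered links, and extract the target fields."""
    doc = etree.HTML(res)
    for href in doc.xpath("//a/@href"):  # placeholder extraction rule; adjust per site
        if href.startswith("http"):
            TASK_QUE.put_nowait(href)  # feed new URLs back to the fetchers (raises QueueFull when full)
    # ... extract and persist the target data here ...

# and in main(), before io_loop.start():
TASK_QUE.put_nowait("http://example.com/seed")  # placeholder seed URL so the fetchers have work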