Scraping Banciyuan (半次元) cosplay images with Python 3

Author: 葉耀榮
Source: https://zhuanlan.zhihu.com/p/42728565

Any cosplay fans out there? Today I'm sharing how to scrape cosplay images. Emmmm, first a few pictures to warm things up!

Ahem... now let's get down to business!

First, open Banciyuan (bcy.net), click COS, then Hot Recommendations.

Press F12 to open the developer tools window.

Let's analyze the markup of the first COS photo... er... the second one looks better, so let's start with the second.

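The red box shows the HTML for this image. If we then click the image the way a normal visitor would and open its page, we can see that the image there has a higher resolution.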

Now compare that image's URL with the image URL in the HTML we looked at earlier.

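As you can see, the URL ending in 2X3 is the one we got for the COS photo when we first entered the listing page, and the URL ending in w650 is the one we got on the photo's detail page. The only difference between them is the last segment of the URL.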

The same holds for the other COS photos.
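The URL comparison above can be sketched as a small helper. This is only an illustration: the function names `to_hires_url` and `filename_from_url` and the example URL are hypothetical, and it assumes every thumbnail URL ends in `/2X3` as observed above.

```python
def to_hires_url(thumb_url, thumb_suffix="/2X3", hires_suffix="/w650"):
    """Swap the trailing size segment of a thumbnail URL for the high-resolution one."""
    if thumb_url.endswith(thumb_suffix):
        # Cut off exactly the suffix; str.rstrip() would strip a set of
        # characters instead and could eat into the file name.
        base = thumb_url[:-len(thumb_suffix)]
    else:
        base = thumb_url  # no recognised suffix: keep the URL as it is
    return base + hires_suffix


def filename_from_url(thumb_url, thumb_suffix="/2X3"):
    """Return the last path segment (e.g. xxx.jpg) once the size suffix is removed."""
    if thumb_url.endswith(thumb_suffix):
        thumb_url = thumb_url[:-len(thumb_suffix)]
    return thumb_url.rsplit("/", 1)[-1]


# Hypothetical thumbnail URL, used only to show the transformation.
example = "https://img.example.com/coser/abc123.jpg/2X3"
print(to_hires_url(example))
print(filename_from_url(example))
```

Cutting the suffix with `endswith` and slicing keeps a file name that happens to end in `2`, `3`, `X`, or `/` intact.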

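Keep scrolling down on the listing page and you'll find that it loads more content automatically when you reach the bottom, which tells us the page uses AJAX. Go back to the top and refresh the page; once it has loaded, press F12 to open the developer tools and do as shown in the figure.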

Click XHR.

Keep scrolling down. When the page updates, a new entry appears; click it.

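Scroll down in the Headers tab and you'll see X-Requested-With: XMLHttpRequest, which indicates an AJAX request. Find Query String Parameters: these are the parameters sent with the AJAX request. The Preview tab shows the data the AJAX request returned.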

Keep scrolling so the page loads more data, and several new entries appear in Network.

Comparing the Query String Parameters of these AJAX requests, we find that

1. grid_type: flow
2. sort: hot
3. tag_id: 399

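These three parameters are identical across the Network entries, but since differs. Comparing the entries, the value has the form 25853.xxx, where the three digits xxx follow no regular pattern; the fractional parts after 25853 were 565, 523, 483, and 428 (for reference only; your actual values will differ). This means that for the scraping below, the since values have to be entered by hand.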
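To make the parameter comparison concrete, here is a minimal sketch that builds one request per since value without sending anything. The endpoint is the one used in this article's script; the function name `build_ajax_request` is hypothetical, and adding the X-Requested-With header simply mirrors what the browser sent (whether the server requires it is an assumption).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the script in this article.
AJAX_ENDPOINT = "https://bcy.net/circle/timeline/showtag"


def build_ajax_request(since):
    """Build (but do not send) the AJAX request for one since value."""
    params = {
        "since": since,       # changes on every page load, copied by hand
        "grid_type": "flow",  # constant across requests
        "sort": "hot",        # constant across requests
        "tag_id": "399",      # constant across requests
    }
    url = AJAX_ENDPOINT + "?" + urlencode(params)
    # Mirror the header seen in the browser's Headers tab.
    return Request(url, headers={"X-Requested-With": "XMLHttpRequest"})


# Values observed in the article; real values must be read from the Network tab.
for since in ["25853.565", "25853.523", "25853.483", "25853.428"]:
    print(build_ajax_request(since).full_url)
```

Only the since part of each printed URL changes, which matches what the Network tab showed.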

import re
import requests
from pyquery import PyQuery as pq
import time
from urllib.parse import urlencode

Filepath='cosplay' # output directory name; can be changed, but must not contain "/"

def get_html(url):
    # Proxy for HTTPS traffic (a dict can hold only one entry per scheme)
    proxies = {"https": "https://114.215.95.188:3128"}
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    try:
        # The request itself can fail, so it belongs inside the try block
        getweb = requests.get(str(url),headers=headers,proxies=proxies)
        return getweb.text
    except requests.RequestException as e:
        print(e)

def DownloadFileWithFilename(url,filename,path):
    import os
    try:
        # Create the target directory if it does not exist yet
        if not os.path.exists(path):
            os.mkdir(path)
        r = requests.get(url)
        with open(str(path) + "/"+str(filename), "wb") as code:
            code.write(r.content)
            print('Downloaded!',str(path) + "/"+str(filename))
    except IOError as e:
        print("Download Failed!")
        print(e)
    except Exception as e1:
        print(e1)
        print("Download Failed!")


def getStaticHtmlImage(): # fetch the COSPLAY images present on the page before any AJAX update
    global Filepath
    web_static_state='https://bcy.net/coser'
    doc = pq(url=web_static_state)
    image = doc('li.js-smallCards._box a.db.posr.ovf img.cardImage').items()

    for i in image:  # walk through the images on the static page
        i = str(i.attr('src'))
        if i.endswith('/2X3'):  # drop the trailing /2X3 segment (rstrip would strip characters, not the suffix)
            i = i[:-len('/2X3')]
        filename = str(re.search('[^/]+(?!.*/)', i).group(0))  # filename is the last URL segment: xxx.jpg
        i2 = i + str('/w650')  # i2 is the high-resolution image URL: xxxxx/w650
        DownloadFileWithFilename(i2, filename, Filepath)
        time.sleep(3) # sleep three seconds to avoid an IP ban

def getDynamicHtmlImage(since1): # fetch the COSPLAY images loaded by an AJAX update
    global Filepath
    ajax_get_data = {
        'since':since1,
        'grid_type':'flow',
        'sort':'hot',
        'tag_id':'399',
    }

    # Proxy for HTTPS traffic (a dict can hold only one entry per scheme)
    proxies = {"https": "https://114.215.95.188:3128"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

    web_dynamic = requests.get('https://bcy.net/circle/timeline/showtag?'+urlencode(ajax_get_data),headers=headers, proxies=proxies).text
    doc = pq(web_dynamic)
    image = doc('li.js-smallCards._box a.db.posr.ovf img.cardImage').items()

    for i in image:  # walk through the images returned by the AJAX request
        i = str(i.attr('src'))
        if i.endswith('/2X3'):  # drop the trailing /2X3 segment (rstrip would strip characters, not the suffix)
            i = i[:-len('/2X3')]
        filename = str(re.search('[^/]+(?!.*/)', i).group(0))  # filename is the last URL segment: xxx.jpg
        i2 = i + str('/w650')  # i2 is the high-resolution image URL: xxxxx/w650
        DownloadFileWithFilename(i2, filename, Filepath)
        time.sleep(3) # sleep three seconds to avoid an IP ban

getStaticHtmlImage()

list_since = ['25861.565','25861.523','25861.483','25861.428'] # since values of the AJAX requests

for i in list_since:
    print(i)
    getDynamicHtmlImage(i)

The downloaded cosplay photos

Pretty exciting...
