Buckle Up! Scraping Meizitu with Python - 知識星球

Environment: Python 3.6

Modules: requests, beautifulsoup4

Steps

1. Fetch the page source

Pressing F12 in a browser shows a page's source, but here we want to get the same thing with Python.
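
A minimal sketch of this step with requests might look like the following. The helper name fetch_html, the 10-second timeout, and the shortened User-Agent string are assumptions made for illustration; the `get` parameter exists only so the function can be tried without network access.

```python
import requests

# Assumed browser-like User-Agent (shortened for illustration);
# some sites reject the default one sent by requests.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}


def fetch_html(url, get=requests.get):
    """Return the HTML source of `url` as text.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in can be supplied when no network is available.
    """
    response = get(url, headers=HEADERS, timeout=10)
    # Raise an exception on 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling print(fetch_html('http://www.mzitu.com')) would then print the source of the home page, provided the site is reachable.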

If all goes well, the output is a long block of HTML: the page's source code.

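2. Find the information we need

Switch to the browser, right-click and choose "View page source", and you will find many entries like this: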

<li><a href="http://www.mzitu.com/89334" target="_blank"><img width='236' height='354' class='lazy' alt='I am the title I am the title I am the title I am the ti

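So all we need to do is extract these entries. But I can't write a regular expression that complicated! No problem: this is where our secret weapon, BeautifulSoup, comes in.

The procedure is: (1) convert the fetched source into a BeautifulSoup object; (2) search it for the data we need.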
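
The two steps can be sketched roughly as follows. SAMPLE_HTML is a hypothetical fragment shaped like the entries seen in the page source, and the function name gallery_links is an assumption; the postlist class and the target="_blank" filter are the ones used by the finished script in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the list entries in the page source.
SAMPLE_HTML = """
<div class="postlist">
  <ul>
    <li><a href="http://www.mzitu.com/89334" target="_blank"><img class="lazy" alt="Title A"></a>
        <span><a href="http://www.mzitu.com/89334" target="_blank">Title A</a></span></li>
    <li><a href="http://www.mzitu.com/26685" target="_blank"><img class="lazy" alt="Title B"></a>
        <span><a href="http://www.mzitu.com/26685" target="_blank">Title B</a></span></li>
  </ul>
</div>
"""


def gallery_links(html):
    """Return (title, url) pairs for every titled gallery link in `html`."""
    # Step 1: turn the raw source into a BeautifulSoup object.
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: search it for the links we care about.
    postlist = soup.find("div", class_="postlist")
    results = []
    for a in postlist.find_all("a", target="_blank"):
        title = a.get_text().strip()
        if title:  # links that only wrap a thumbnail carry no text
            results.append((title, a["href"]))
    return results


for title, url in gallery_links(SAMPLE_HTML):
    print(title, url)
```

Each gallery appears twice in the markup (once around the thumbnail, once around the title text), which is why links without text are skipped.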

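Doing so yields the title of every gallery on the current page.

3. Download a gallery

Clicking into a gallery, we find that each page displays a single image.

For example, http://www.mzitu.com/26685 is the first page of one gallery, and each later page appends a slash and a number, such as http://www.mzitu.com/26685/2 (the second page). That makes things simple: we only need to find out how many pages the gallery has, then build each page URL in a loop.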
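
As a small illustration of that loop, the sketch below builds the page URLs for a gallery. The function name page_urls is hypothetical, and it assumes, as the finished script in this article does, that appending /1 also resolves to the first page.

```python
def page_urls(gallery_url, page_count):
    """Return the URL of every page in a gallery.

    Pages are numbered from 1 and addressed as <gallery_url>/<number>.
    """
    base = gallery_url.rstrip("/")
    return [f"{base}/{num}" for num in range(1, page_count + 1)]


for url in page_urls("http://www.mzitu.com/26685", 3):
    print(url)
```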

Next we need to find the image's address and save it. Right-click the image and choose "Inspect" to see the <img> element that holds it.

The src attribute of that <img> element is the image's actual address, so all that remains is to save it.
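
One way to sketch this step is shown here. The function names image_src and save_image are hypothetical; looking the image up by an alt text equal to the gallery title, and naming the file after the last segment of the URL, both follow the finished script in this article.

```python
import os

from bs4 import BeautifulSoup


def image_src(page_html, title):
    """Return the src of the <img> whose alt text equals the gallery title."""
    soup = BeautifulSoup(page_html, "html.parser")
    img = soup.find("img", alt=title)
    if img is None:
        raise ValueError("no <img> with alt=%r found" % title)
    return img["src"]


def save_image(content, src, folder="."):
    """Write raw image bytes into `folder`, named after the last URL segment."""
    file_name = src.split("/")[-1]
    target = os.path.join(folder, file_name)
    with open(target, "wb") as f:  # binary mode: image data is not text
        f.write(content)
    return target
```

With requests, the bytes would come from the content attribute of the response, for example save_image(requests.get(src, headers=header).content, src).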

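If all goes well, 12 images will appear in the directory containing your script.

4. Batch download

The above covered how to download a single gallery. Downloading every gallery only requires wrapping that in two more loops. The finished script: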

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os
import sys

'''
# Needed on Android
reload(sys)
sys.setdefaultencoding('utf-8')
'''

if(os.name == 'nt'):
    print(u'You are on Windows')
else:
    print(u'You are on Linux')

# HTTP request headers
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

all_url = 'http://www.mzitu.com'
start_html = requests.get(all_url,headers = header)

# Where to save the images
path = 'D:/mzitu/'

# Find the number of listing pages
soup = BeautifulSoup(start_html.text,"html.parser")
page = soup.find_all('a',class_='page-numbers')
max_page = page[-2].text

same_url = 'http://www.mzitu.com/page/'
for n in range(1,int(max_page)+1):
    ul = same_url+str(n)
    start_html = requests.get(ul, headers=header)
    soup = BeautifulSoup(start_html.text,"html.parser")
    all_a = soup.find('div',class_='postlist').find_all('a',target='_blank')
    for a in all_a:
        title = a.get_text() # extract the link text
        if(title != ''):
            print("Preparing to scrape: "+title)
            # Windows cannot create a directory whose name contains '?'
            folder = path+title.strip().replace('?','')
            if(os.path.exists(folder)):
                #print('Directory already exists')
                flag=1
            else:
                os.makedirs(folder)
                flag=0
            os.chdir(folder)
            href = a['href']
            html = requests.get(href,headers = header)
            mess = BeautifulSoup(html.text,"html.parser")
            pic_max = mess.find_all('span')
            pic_max = pic_max[10].text # number of pages in this gallery
            if(flag == 1 and len(os.listdir(folder)) >= int(pic_max)):
                print('Already saved, skipping')
                continue
            for num in range(1,int(pic_max)+1):
                pic = href+'/'+str(num)
                html = requests.get(pic,headers = header)
                mess = BeautifulSoup(html.text,"html.parser")
                pic_url = mess.find('img',alt = title)
                html = requests.get(pic_url['src'],headers = header)
                file_name = pic_url['src'].split(r'/')[-1]
                with open(file_name,'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page',n,'done')
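
The script removes only the question mark from a title before using it as a directory name. Windows rejects several other characters as well, so a slightly more thorough cleanup could look like the sketch below. The helper name safe_dir_name is hypothetical and is not part of the original script.

```python
import re

# Characters that Windows does not allow in file or directory names.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_dir_name(title):
    """Return `title` with forbidden characters removed and edges trimmed."""
    cleaned = _FORBIDDEN.sub("", title).strip()
    # Windows also rejects names that end in a dot or a space.
    return cleaned.rstrip(". ")


print(safe_dir_name(' What? A/B: "C" '))
```

In the script, this would stand in for each title.strip().replace('?','') expression.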

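Author: 嵐漾憶雨

Source: http://blog.csdn.net/baidu_35085676/article/details/68958267

Notice: Copyright of this article belongs to the author. In case of infringement, please contact the editor to have it removed.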
