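A Simple Python Web Scraper Example

Scraping data from web pages with Python is one of the most common ways to build a crawler. Pages are generally written in HTML and contain a large number of tags, and the content we want lives inside those tags. Besides knowing basic Python syntax, you also need a simple understanding of HTML structure and of how to select tags. The example below, which scrapes the fl novel site, is meant as a way into the world of web scraping.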

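I. Implementation steps

1.1 Importing dependencies

Dependency for fetching page content

Import it with import requests. If the package is not installed yet, run pip install requests in a terminal and pip will download and install it.

Dependencies for parsing content

Commonly used choices are BeautifulSoup, parsel, and re. (re is part of the Python standard library and needs no installation.)

As in the previous step, if a package is missing, install it from the terminal.

Install BeautifulSoup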

pip install bs4

Install parsel

pip install parsel

Import the dependencies

from bs4 import BeautifulSoup
import requests
import parsel
import re

1.2 Fetching data

The simplest way to fetch a page and get its text:

response = requests.get(url).text
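The one-liner above keeps going even when the server answers with an error page. As a minimal sketch (the helper name `fetch_text` and the 10-second timeout are assumptions for illustration, not part of the original article), the same fetch can be wrapped so that HTTP errors surface early and the text is decoded sensibly:

```python
import requests


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Return the decoded body of `url`, raising on HTTP errors."""
    # Without a timeout, requests can wait indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of parsing an error page
    response.raise_for_status()
    # requests falls back to ISO-8859-1 for text responses that declare no charset;
    # in that case, use the encoding guessed from the body instead
    if response.encoding is None or response.encoding.lower() == 'iso-8859-1':
        response.encoding = response.apparent_encoding
    return response.text
```

Calling `fetch_text(url)` then returns the same kind of string as `requests.get(url).text`, which matters mostly for Chinese-language pages, where a wrong encoding guess produces garbled text.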

Many sites require a logged-in user. In that case, disguise the request with headers; the values can be copied from the browser's developer tools (F12).

# Minimal version: a cookie plus a browser User-Agent
headers = {
    'Cookie': 'cookie value (placeholder, not a real one)',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

# Fuller version: the complete set of request headers copied from the browser
headers = {
    'Host': 'www.qidian.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate'
}
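Typing such a dict by hand is error-prone. Browsers show request headers as "Name: value" lines, so one possible shortcut is to paste them as text and convert them. The helper `parse_raw_headers` below is a hypothetical name introduced for this sketch, and the sample text is shortened for illustration:

```python
def parse_raw_headers(raw: str) -> dict[str, str]:
    """Convert 'Name: value' lines, as copied from the browser, into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so values containing ':' stay intact
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers


raw = """
Host: www.qidian.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""
headers = parse_raw_headers(raw)
print(headers['Host'])
```

The resulting dict can be passed to `requests.get` through the `headers` keyword argument.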

Fetch the page data with the disguised request:

response = requests.get(url=url, headers=headers).text

Some sites also run into SSL-certificate-related issues, in which case proxies need to be set as well:

proxies = {
    'http': 'http://127.0.0.1:9000',
    'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text

1.3 Parsing data

There are several ways to parse the data, such as xpath, css, and re.

css, as the name suggests, selects HTML elements with CSS selectors.

re parses the page with regular expressions.
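A regular-expression version of the same idea might look like the sketch below. The HTML string is made up for illustration, and patterns like these only suit simple, predictable markup:

```python
import re

html = '<h1>Chapter 1</h1><p>First line.</p><p>Second line.</p>'

# Non-greedy (.*?) stops at the nearest closing tag instead of the last one
title_match = re.search(r'<h1>(.*?)</h1>', html)
title = title_match.group(1) if title_match else None
# findall returns the captured group of every match, in document order
paragraphs = re.findall(r'<p>(.*?)</p>', html)
content = '\n'.join(paragraphs)
```

For nested or irregular HTML, a selector-based parser is usually the less fragile choice.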

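1.4 Writing to a file

with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)

The open function opens the file for I/O. The with statement means you do not have to close the stream by hand, similar to Java's try-with-resources, where the stream is declared inside try().

The first argument is the file name, mode is the open mode ('w' writes, replacing any existing content), and encoding is the text encoding. There are more parameters that you can explore on your own.

write writes the text to the file.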
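The following sketch shows these pieces working together. It writes into a temporary directory so nothing is left behind, and `title_name` and the sample text are placeholders chosen for illustration:

```python
import os
import tempfile

title_name = 'chapter-1'  # hypothetical chapter title
content = 'First line.\nSecond line.'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, title_name + '.txt')
    # mode='w' creates the file, or truncates it if it already exists
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(content)
    # The file is already closed here, because the with block has ended
    file_closed = f.closed
    # mode='a' appends to the end instead of overwriting
    with open(path, mode='a', encoding='utf-8') as f:
        f.write('\nThird line.')
    # The default mode is 'r', which reads text
    with open(path, encoding='utf-8') as f:
        saved = f.read()
```

Passing encoding='utf-8' explicitly avoids depending on the platform default, which matters for Chinese text on Windows.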

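II. Complete example

import requests
import parsel

# Starting URL of the novel; the real address is withheld for legal reasons
link = 'starting URL of the novel'
# Request headers as described in section 1.2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

# Fetch the table-of-contents page and collect the chapter links
link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
href = link_selector.css('.DivTr a::attr(href)').getall()

for index in href:
    # The hrefs presumably start with //, so the scheme is prepended
    url = f'https:{index}'
    print(url)
    # headers must be passed by keyword; the second positional argument is params
    response = requests.get(url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)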
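Two spots in the loop above are fragile: building the URL by string concatenation only works for links that start with //, and a title that is missing or contains characters such as : or ? breaks the open call. A possible hardening, using only the standard library, is sketched below; `absolute_url` and `safe_filename` are hypothetical helper names and the example.com addresses are placeholders.

```python
import re
from urllib.parse import urljoin


def absolute_url(page_url: str, href: str) -> str:
    """Resolve href (relative, protocol-relative, or absolute) against page_url."""
    return urljoin(page_url, href)


def safe_filename(title, fallback='untitled'):
    """Replace characters that are not allowed in file names on common systems."""
    if not title:
        # selector.css(...).get() returns None when the title element is missing
        return fallback
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or fallback


base = 'https://example.com/book/1/'
print(absolute_url(base, '//example.com/chapter/5'))
print(safe_filename('Chapter 1: What?'))
```

In the loop, `url = absolute_url(link, index)` and `open(safe_filename(title) + '.txt', ...)` would take the place of the f-string and the bare title.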

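The example above can fetch the free chapters of the fl novel site. What about the paid chapters?

Paid chapters are served as images. One would locate the images and then use Baidu Cloud to recognize the text in them. Scraping paid content is illegal, however, so that part of the code cannot be provided.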

Author: 天道佩恩
Link: https://juejin.cn/post/7385350484411056154
