Published: 2023-07-04 06:48:02 | Source: Website Operations
How to Scrape Web Pages with Python and Turn Them into an E-book. About the author: 孫亖, a software engineer who has long worked on enterprise information systems, specializing in the design and development of back-end business features.
This article is based on the author's GitChat session on the topic "How to Scrape Web Pages with Python to Make an E-book".
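The walkthrough below starts from a freshly generated Scrapy spider. As a minimal setup sketch (assuming Scrapy is installed and usable from the command line; the project name ebook is my own placeholder, while the spider name xzxzb and the domain qidian.com come from the code that follows), the project and spider can be created like this, with the generated file landing under the project's spiders/ directory:

pip install scrapy
scrapy startproject ebook
cd ebook
scrapy genspider xzxzb qidian.com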
# -*- coding: utf-8 -*-
import scrapy


class XzxzbSpider(scrapy.Spider):
    name = 'xzxzb'
    allowed_domains = ['qidian.com']
    start_urls = ['http://qidian.com/']

    def parse(self, response):
        pass
start_urls is the table-of-contents address; the spider crawls it automatically and hands the result to parse below. Now let's write the code that processes the catalogue data: first crawl the novel's main page and get the chapter list:

    def parse(self, response):
        pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
        for page in pages:
            url = page.xpath('./child::a/attribute::href').extract()
            print(url)
        pass
There are two ways to pull DOM data out of a page: CSS selectors and XPath queries. Here we locate the chapter list items with XPath:

pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
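For comparison, a rough CSS-selector equivalent (a sketch only; it assumes the same markup that the XPath above targets):

pages = response.css('div#j-catalogWrap ul.cf > li')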
Next, iterate over those nodes, query the href attribute of the a child inside each li, and print the result:

for page in pages:
    url = page.xpath('./child::a/attribute::href').extract()
    print(url)
With that, a small spider that crawls the chapter URLs is essentially done. Run the xzxzb spider with the following command to see the result:

scrapy crawl xzxzb
At this point the program may fail with an error like:

... ImportError: No module named win32api ...
Running the following command fixes it:

pip install pypiwin32
The screen output looks like this:

...
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/wrrduN6auIlOBDFlr9quQA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/Jh-J5usgyW62uJcMpdsVgA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/5YXHdBvg1ImaGfXRMrUjdw2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/fw5EBeKat-76ItTi_ILQ7A2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/KsFh5VutI6PwrjbX3WA1AA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/-mpKJ01gPp1p4rPq4Fd4KQ2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/MlZSeYOQxSPM5j8_3RRvhw2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/5TXZqGvLi-3M5j8_3RRvhw2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/sysD-JPiugv4p8iEw--PPw2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/xGckZ01j64-aGfXRMrUjdw2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/72lHOJcgmedOBDFlr9quQA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/cZkHZEYnPl22uJcMpdsVgA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/vkNh45O3JsRMs5iq0oQwLQ2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/ge4m8RjJyPH6ItTi_ILQ7A2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/Y33PuxrKT4dp4rPq4Fd4KQ2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/MDQznkrkiyXwrjbX3WA1AA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/A2r-YTzWCYj6ItTi_ILQ7A2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/Ng9CuONRKei2uJcMpdsVgA2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/Q_AxWAge14pMs5iq0oQwLQ2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/ZJshvAu8TVVp4rPq4Fd4KQ2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/hYD2P4c5UB2aGfXRMrUjdw2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/muxiWf_jpqTgn4SMoDUcDQ2']
[u'//read.qidian.com/chapter/MuRzJqCY6MyoLoerY3WDhg2/OQQ5jbADJjVp4rPq4Fd4KQ2']
...
The little spider that crawls the chapter URLs is done, but that is not our end goal. Next we use these addresses to fetch the chapter content:

# -*- coding: utf-8 -*-
import scrapy


class XzxzbSpider(scrapy.Spider):
    name = 'xzxzb'
    allowed_domains = ['qidian.com']
    start_urls = ['https://book.qidian.com/info/1010780117/']

    def parse(self, response):
        pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
        for page in pages:
            url = page.xpath('./child::a/attribute::href').extract_first()
            # yield scrapy.Request('https:' + url, callback=self.parse_chapter)
            yield response.follow(url, callback=self.parse_chapter)
        pass

    def parse_chapter(self, response):
        title = response.xpath('//div[@class="main-text-wrap"]//h3[@class="j_chapterName"]/text()').extract_first().strip()
        content = response.xpath('//div[@class="main-text-wrap"]//div[@class="read-content j_readContent"]').extract_first().strip()
        print(title)
        # print(content)
        pass
In the previous step we obtained the chapter addresses, and from the output we can see they are relative (protocol-relative) paths. That is why we use yield response.follow(url, callback=self.parse_chapter); the second argument is a callback that handles the chapter page. Once a chapter page has been fetched, we parse out its title and body and save them to a file. If you use scrapy.Request instead, you first have to build an absolute URL:

next_page = response.urljoin(url)
yield scrapy.Request(next_page, callback=self.parse_chapter)
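As an aside, here is a small illustration (my own sketch with a made-up chapter path) of what urljoin does with a protocol-relative URL like the ones printed above: it fills in the scheme of the page it is resolved against.

from urllib.parse import urljoin  # Python 3 standard library; response.urljoin builds on this

print(urljoin('https://book.qidian.com/info/1010780117/',
              '//read.qidian.com/chapter/xxx'))
# -> https://read.qidian.com/chapter/xxx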
Unlike response.follow, scrapy.Request needs an absolute URL, so the relative path must be converted first; response.follow accepts the relative path directly, so no urljoin call is needed.

    def parse_chapter(self, response):
        title = response.xpath('//div[@class="main-text-wrap"]//h3[@class="j_chapterName"]/text()').extract_first().strip()
        content = response.xpath('//div[@class="main-text-wrap"]//div[@class="read-content j_readContent"]').extract_first().strip()
        # print(title)
        # print(content)
        filename = './down/%s.html' % (title)  # the ./down directory must already exist
        with open(filename, 'wb') as f:
            f.write(content.encode('utf-8'))
        pass
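Two practical details the snippet above glosses over: the ./down directory has to exist, and chapter titles may contain characters that are not legal in file names. A minimal sketch of handling both (my own addition, not part of the original code):

import os
import re

# create the output directory if it is missing (Python 3)
os.makedirs('./down', exist_ok=True)

def safe_name(title):
    # replace characters that are illegal in Windows/Unix file names
    return re.sub(r'[\\/:*?"<>|]', '_', title)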
At this point we have successfully scraped our data, but it cannot be used as-is yet; it still needs to be organized. Because chapter pages are crawled concurrently, the files do not arrive in reading order, so here we read each li's data-rid attribute as the chapter index, pass it to the callback through the request's meta, and prefix the file name with it:

    def parse(self, response):
        pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
        for page in pages:
            url = page.xpath('./child::a/attribute::href').extract_first()
            idx = page.xpath('./attribute::data-rid').extract_first()
            # yield scrapy.Request('https:' + url, callback=self.parse_chapter)
            req = response.follow(url, callback=self.parse_chapter)
            req.meta['idx'] = idx
            yield req
        pass

    def parse_chapter(self, response):
        idx = response.meta['idx']
        title = response.xpath('//div[@class="main-text-wrap"]//h3[@class="j_chapterName"]/text()').extract_first().strip()
        content = response.xpath('//div[@class="main-text-wrap"]//div[@class="read-content j_readContent"]').extract_first().strip()
        # print(title)
        # print(content)
        filename = './down/%s_%s.html' % (idx, title)
        cnt = '<h1>%s</h1> %s' % (title, content)
        with open(filename, 'wb') as f:
            f.write(cnt.encode('utf-8'))
        pass
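To check the ordering, here is a small sketch (my own addition; it assumes data-rid is a numeric chapter ordinal and that every saved file follows the idx_title.html pattern used above) that lists the downloaded chapters in reading order:

import os
import re

files = os.listdir('./down')
# file names look like '<idx>_<title>.html'; sort by the numeric prefix
files.sort(key=lambda name: int(re.match(r'(\d+)_', name).group(1)))
for name in files:
    print(name)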