Time: 2023-05-08 05:54:01 | Source: Website Operations
A hands-on guide to scraping and storing web page data with Python!
https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3
Now start Jupyter Notebook and run the following code:

import requests

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
res = requests.get(url)  # send a GET request to the ranking page
print(res.status_code)
# 200
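The request above has no timeout and does not check for failure. The following is a minimal sketch of a more defensive fetch, not the article's own method: fetch_html and its get parameter are hypothetical names introduced here, and the User-Agent string is a placeholder rather than a value the site is known to require.

```python
def fetch_html(url, get=None, timeout=10):
    """Return the page body as text, raising if the request fails.

    `get` is a hypothetical injection point (it defaults to requests.get)
    so the helper can be exercised without network access.
    """
    if get is None:
        import requests  # imported lazily; only needed for real requests
        get = requests.get
    # Placeholder header, an assumption for illustration only.
    headers = {"User-Agent": "Mozilla/5.0 (tutorial example)"}
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises an HTTPError on 4xx/5xx responses
    return response.text
```

Calling fetch_html(url) with the url defined earlier would return the HTML as a string, or raise instead of silently continuing with an error page.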
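The requests snippet that prints the status code does three things: it imports requests, sends a GET request to the page with requests.get, and prints the response's status code (200 means the request succeeded).

Next, parse the returned page and read its title:

from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')  # parse the raw HTML bytes
title = soup.title.text
print(title)
# 热门视频排行榜 - 哔哩哔哩 (゜-゜)つロ 干杯~-bilibili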
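In the parsing code above, the BeautifulSoup class from bs4 converts the HTML returned in the previous step into a BeautifulSoup object. Note that you need to specify a parser when constructing it; here we use html.parser.

With the page parsed, extract the fields for each video:

all_products = []

products = soup.select('li.rank-item')  # one <li> per ranked video
for product in products:
    rank = product.select('div.num')[0].text
    name = product.select('div.info > a')[0].text.strip()
    play = product.select('span.data-box')[0].text
    comment = product.select('span.data-box')[1].text
    up = product.select('span.data-box')[2].text
    # Named video_url so it does not overwrite the page url defined earlier
    video_url = product.select('div.info > a')[0].attrs['href']
    all_products.append({
        "视频排名": rank,       # video rank
        "视频名": name,         # video title
        "播放量": play,         # play count
        "弹幕量": comment,      # bullet-comment count
        "up主": up,             # uploader
        "视频链接": video_url   # video link
    })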
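To see how these selectors behave without hitting the network, here is a minimal sketch that runs the same CSS queries against a hand-written HTML fragment. The fragment is an assumption that only mimics the class names used in the loop (it is not copied from Bilibili), and parse_items and text_of are hypothetical helpers. The sketch uses select_one, which returns None when nothing matches, so a missing element yields an empty string rather than the IndexError that indexing with [0] would raise.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not real page source).
SAMPLE_HTML = """
<ul>
  <li class="rank-item">
    <div class="num">1</div>
    <div class="info">
      <a href="//www.example.com/video/1"> First video </a>
      <span class="data-box">100</span>
      <span class="data-box">20</span>
      <span class="data-box">uploader-a</span>
    </div>
  </li>
  <li class="rank-item">
    <div class="num">2</div>
    <div class="info"><a href="//www.example.com/video/2">Second video</a></div>
  </li>
</ul>
"""


def text_of(node):
    """Return the stripped text of a tag, or '' if the tag is missing."""
    return node.text.strip() if node is not None else ""


def parse_items(html):
    """Extract one dictionary per li.rank-item element."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.rank-item"):
        link = item.select_one("div.info > a")
        # Pad with empty strings so items with fewer data boxes still parse.
        boxes = [text_of(b) for b in item.select("span.data-box")] + [""] * 3
        rows.append({
            "rank": text_of(item.select_one("div.num")),
            "name": text_of(link),
            "play": boxes[0],
            "comment": boxes[1],
            "up": boxes[2],
            "url": link.get("href", "") if link is not None else "",
        })
    return rows


rows = parse_items(SAMPLE_HTML)
```

The second sample item deliberately lacks the data-box spans, which is the kind of variation that makes a tolerant lookup worth considering on real pages.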
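In the extraction loop, soup.select('li.rank-item') returns a list with one element per video. We then iterate over that list, use CSS selectors again to extract the fields we want, and append them as a dictionary to the empty list defined at the start.

The collected data can then be written to a CSV file with the csv module:

import csv

keys = all_products[0].keys()  # column names come from the first record
with open('B站视频热榜TOP100.csv', 'w', newline='', encoding='utf-8-sig') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)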
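The utf-8-sig encoding writes a byte order mark at the start of the file, which helps spreadsheet programs such as Excel detect UTF-8 and display the Chinese headers correctly. A small sketch that checks this on a temporary file follows; the sample rows and the write_rows helper are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write a list of dictionaries to a CSV file with a UTF-8 BOM."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Made-up records standing in for the scraped data.
sample_rows = [
    {"视频排名": "1", "视频名": "example-a"},
    {"视频排名": "2", "视频名": "example-b"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    write_rows(csv_path, sample_rows)
    # Read the raw bytes to inspect the start of the file.
    with open(csv_path, "rb") as f:
        raw = f.read()
    # Reading with utf-8-sig strips the BOM again.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        read_back = list(csv.DictReader(f))

has_bom = raw.startswith(b"\xef\xbb\xbf")
```

Reading the file back with plain utf-8 instead would leave the mark attached to the first header name, so the same encoding is used on both sides.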
If you are familiar with pandas, it is even easier: the list of dictionaries can be converted to a DataFrame and written out in a single line:

import pandas as pd

keys = all_products[0].keys()
pd.DataFrame(all_products, columns=keys).to_csv('B站视频热榜TOP100.csv', encoding='utf-8-sig')
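One detail worth knowing: by default to_csv also writes the DataFrame's row index as an unnamed first column. If that column is not wanted, passing index=False omits it. A minimal sketch with made-up rows, writing to a string instead of a file:

```python
import pandas as pd

# Made-up records standing in for the scraped data.
sample_rows = [
    {"rank": "1", "name": "example-a"},
    {"rank": "2", "name": "example-b"},
]
df = pd.DataFrame(sample_rows)

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
```

Here the first header starts with an empty column name for the index, while the second contains only the two data columns.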
The complete script:

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

all_products = []

products = soup.select('li.rank-item')  # one <li> per ranked video
for product in products:
    rank = product.select('div.num')[0].text
    name = product.select('div.info > a')[0].text.strip()
    play = product.select('span.data-box')[0].text
    comment = product.select('span.data-box')[1].text
    up = product.select('span.data-box')[2].text
    # Named video_url so it does not overwrite the page url defined above
    video_url = product.select('div.info > a')[0].attrs['href']
    all_products.append({
        "视频排名": rank,       # video rank
        "视频名": name,         # video title
        "播放量": play,         # play count
        "弹幕量": comment,      # bullet-comment count
        "up主": up,             # uploader
        "视频链接": video_url   # video link
    })

# Write the data with the csv module
keys = all_products[0].keys()
with open('B站视频热榜TOP100.csv', 'w', newline='', encoding='utf-8-sig') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

# Alternative: write the data with pandas (overwrites the file written above)
pd.DataFrame(all_products, columns=keys).to_csv('B站视频热榜TOP100.csv', encoding='utf-8-sig')