Time: 2023-05-20 08:42:02 | Source: Website Operations
Sina Weibo crawler implementation (with core Python code): how do you crawl Sina Weibo data?

The first approach is to simulate a login by posting the username and password to the login page.

    import requests

    # Login credentials (masked)
    payload = {
        'username': '156****1997',
        'password': '**********'}
    # Request header information
    header_init = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'close',
        'Referer': 'https://weibo.com/askcliff?is_all=1'
    }
    # Weibo login page URL
    url_login = 'https://passport.weibo.cn/signin/login'
    # Create a session object
    s = requests.Session()
    # Submit the login username and password via POST
    s.post(url=url_login, data=payload, headers=header_init)
The other approach is to simulate a browser session by reusing the browser's Cookies. This information is easy to find in the browser's developer console; note that it must be stored in a variable as a dictionary.

[Code: Cookies example (found under Request Headers of the XHR/JS requests)]

    # URL of the Weibo content page to crawl
    # (url_base is not defined in this article)
    url_init = url_base+'&page={}'
    # Cookie contents
    cookie = {
        'MLOGIN': '1',
        'M_WEIBOCN_PARAMS': 'luicode%3D10000011%26lfid%3D100103type%253D3%2526q%253D%25E5%25AF%2585%25E5%25AD%2590%2526t%253D0%26featurecode%3D20000320%26oid%3D3900004206849232%26fid%3D1005053628359543%26uicode%3D10000011&page={}',
        'SCF': 'AkhONeuuFcGAlAz0kgavz1wRbp1fz7ZGn0Xn_zPHzoa0B_VbPTNxInVDSaycKttiCUPGwlxaxxqJG',
        'SUB': '_2A252C9fsDeRhGeNI41QZ-CrEyzqIHXVV9_mkrDV6PUJbkdAKLW_CkW1NSDUIJok_9iLiEAocyWlucWgHT-UKNQiO',
        'SUHB': '0A0JidXol5dQPP',
        'WEIBOCN_FROM': '1110006030',
        '_T_WM': '28789df2dacda9b86d0a2ffa60adbfe8',
    }
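Since the cookies have to end up in a dictionary, a small helper can save retyping them by hand. The following is only a sketch: the function name `cookie_header_to_dict` and the sample header value are made up for illustration, and it assumes the raw `Cookie:` header was copied from the browser as a single `name=value; name=value` string.

```python
def cookie_header_to_dict(raw_cookie):
    """Turn a raw 'Cookie:' header value into a dict for requests' cookies= argument."""
    cookies = {}
    for pair in raw_cookie.split(';'):
        # Skip fragments that are not name=value pairs
        if '=' not in pair:
            continue
        # Split on the first '=' only, because values may contain '=' themselves
        name, value = pair.strip().split('=', 1)
        cookies[name] = value
    return cookies


# Made-up header value, for illustration only
raw = 'MLOGIN=1; WEIBOCN_FROM=1110006030; SUHB=example-value'
cookie = cookie_header_to_dict(raw)
print(cookie)
```

The resulting dictionary can be passed as `cookies=cookie` to `requests.get`, in the same way as the handwritten one.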
Generally speaking, simulating the browser takes less time than simulating a login.

    # Poster
    screen_name = jd['data']['cards'][count]['mblog']['user']['screen_name']
    result['微博用戶'].append(screen_name)

This part retrieves the Weibo user's name as text.

    # Post time: count the '-' separators to tell the date formats apart
    a = jd['data']['cards'][count]['mblog']['created_at'].count('-')
    if a == 2:
        # Full date, e.g. 2017-12-31
        date = datetime.strptime(jd['data']['cards'][count]['mblog']['created_at'], '%Y-%m-%d')
        datesrt = date.strftime('%Y-%m-%d')
    elif a == 1:
        # Month and day only; the year 2018 is hardcoded
        date = datetime.strptime(jd['data']['cards'][count]['mblog']['created_at'], '%m-%d')
        datesrt = date.strftime('2018' + '-%m-%d')
    elif a == 0:
        # Relative time: "today or yesterday"
        datesrt = '今天或昨天'
    result['發博日期'].append(datesrt)
This part obtains the post date. The many conditionals are needed because of how Weibo presents time: it uses non-standard expressions such as "1 hour ago", "today", and "5 days ago". (The standard form is a full year-month-day date.)

    import json
    from datetime import datetime

    import requests
    from bs4 import BeautifulSoup

    def PageData(url, cookie, header):
        # Submit the request for the page to crawl
        res = requests.get(url=url, cookies=cookie, headers=header)
        # Parse the page content as JSON
        jd = json.loads(res.text)
        result = {}
        result['id'] = []  # note: never filled in this function
        result['微博用戶'] = []      # user
        result['發博日期'] = []      # post date
        result['微博文本'] = []      # post text
        result['附帶圖片鏈接'] = []  # attached image link
        result['微博鏈接'] = []      # post link
        result['點贊數'] = []        # likes
        result['評論數'] = []        # comments
        result['轉發數'] = []        # reposts
        for card in jd['data']['cards']:
            # card_type 9 marks a regular post
            if card['card_type'] == 9:
                mblog = card['mblog']
                # Poster
                screen_name = mblog['user']['screen_name']
                result['微博用戶'].append(screen_name)
                # Post time: count the '-' separators to tell the formats apart
                a = mblog['created_at'].count('-')
                if a == 2:
                    date = datetime.strptime(mblog['created_at'], '%Y-%m-%d')
                    datesrt = date.strftime('%Y-%m-%d')
                elif a == 1:
                    # Month and day only; the year 2018 is hardcoded
                    date = datetime.strptime(mblog['created_at'], '%m-%d')
                    datesrt = date.strftime('2018' + '-%m-%d')
                elif a == 0:
                    # Relative time: "today or yesterday"
                    datesrt = '今天或昨天'
                result['發博日期'].append(datesrt)
                # Post content: strip emoji, then strip HTML tags
                # (filter_emoji is defined later in this article)
                text = mblog['text']
                text = filter_emoji(text, restr='')
                soup = BeautifulSoup(text, 'html.parser')
                text = soup.get_text()
                result['微博文本'].append(text)
                # Link to the attached image, if any
                if 'original_pic' in mblog:
                    original_pic = mblog['original_pic']
                else:
                    original_pic = '無圖片鏈接'  # "no image link"
                result['附帶圖片鏈接'].append(original_pic)
                # Link to the post's web page
                html = card['scheme']
                result['微博鏈接'].append(html)
                # Number of likes
                attitudes_count = mblog['attitudes_count']
                result['點贊數'].append(attitudes_count)
                # Number of comments
                comments_count = mblog['comments_count']
                result['評論數'].append(comments_count)
                # Number of reposts
                reposts_count = mblog['reposts_count']
                result['轉發數'].append(reposts_count)
        return result
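The date handling above collapses every relative time into a single label and hardcodes the year. As a possible refinement, the sketch below tries to resolve each form to an actual date. The function name `normalize_created_at` is hypothetical, and the exact strings Weibo returns are an assumption: only "N hours ago", "today" and "N days ago" are mentioned in this article, while the "minutes ago" and "yesterday" forms, and accepting both simplified and traditional spellings, are guesses. Month-day dates are assumed to belong to the current year.

```python
import re
from datetime import datetime, timedelta

# Assumed relative-time suffixes (simplified and traditional spellings)
UNITS = {
    '分钟前': 'minutes', '分鐘前': 'minutes',
    '小时前': 'hours', '小時前': 'hours',
    '天前': 'days',
}


def normalize_created_at(raw, now=None):
    """Best-effort conversion of a 'created_at' string to 'YYYY-MM-DD'."""
    now = now or datetime.now()
    raw = raw.strip()

    # Full date, e.g. '2017-12-31'
    if re.fullmatch(r'\d{4}-\d{1,2}-\d{1,2}', raw):
        return datetime.strptime(raw, '%Y-%m-%d').strftime('%Y-%m-%d')

    # Month and day only, e.g. '05-20': assume the current year
    m = re.fullmatch(r'(\d{1,2})-(\d{1,2})', raw)
    if m:
        return '{:04d}-{:02d}-{:02d}'.format(now.year, int(m.group(1)), int(m.group(2)))

    # Relative forms such as '1小时前' or '5天前'
    m = re.fullmatch(r'(\d+)\s*(' + '|'.join(UNITS) + ')', raw)
    if m:
        delta = timedelta(**{UNITS[m.group(2)]: int(m.group(1))})
        return (now - delta).strftime('%Y-%m-%d')

    if raw.startswith('昨天'):  # "yesterday"
        return (now - timedelta(days=1)).strftime('%Y-%m-%d')
    if raw.startswith('今天'):  # "today"
        return now.strftime('%Y-%m-%d')

    # Unknown format: return it unchanged rather than guess
    return raw
```

Passing `now` explicitly keeps the function deterministic, which makes it easier to check against known inputs before wiring it into `PageData`.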
    import pymysql

    def to_sql(weibo_total, dbname):
        # Write to the database (dbname is used as the table name)
        conn = pymysql.connect(user='spider', password='****', host='***.***.**.**', port=3306, database='spider', use_unicode=True, charset="utf8")
        cs = conn.cursor()
        keys = list(weibo_total.keys())
        # The table name cannot be bound as a parameter, so it is formatted in;
        # the values are bound by the driver, which handles quoting
        sql = "INSERT INTO {} VALUES ({})".format(dbname, ','.join(['%s'] * len(keys)))
        # Organize the dictionary data row by row
        for i in range(len(weibo_total['微博用戶'])):
            row = [weibo_total[k][i] for k in keys]
            # Execute the SQL statement
            cs.execute(sql, row)
        conn.commit()
        cs.execute("SELECT * FROM {}".format(dbname))
        print(cs.fetchall())
        conn.close()
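Building the INSERT statement by string concatenation breaks as soon as a post contains a quote character. The sketch below shows the same dictionary-to-rows step with bound parameters instead. It uses the standard-library `sqlite3` with an in-memory database purely so that it runs without a server; the table layout, the sample data and the helper name `insert_rows` are invented for illustration. With pymysql the placeholder would be `%s` rather than `?`.

```python
import sqlite3


def insert_rows(conn, table, columns):
    """Insert a dict of equal-length lists ({'col': [v0, v1, ...]}) as rows."""
    # zip(*values) turns the column lists into one tuple per row
    rows = list(zip(*columns.values()))
    placeholders = ', '.join('?' for _ in columns)
    # The table name cannot be a bound parameter, so it is formatted in;
    # it must come from trusted code, never from user input
    sql = 'INSERT INTO "{}" VALUES ({})'.format(table, placeholders)
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE weibo (user TEXT, text TEXT, likes INTEGER)')

# Invented sample data; note the apostrophe in the first text
sample = {
    'user': ['alice', 'bob'],
    'text': ["it's fine", 'plain'],
    'likes': [3, 0],
}
inserted = insert_rows(conn, 'weibo', sample)
stored = conn.execute('SELECT * FROM weibo ORDER BY rowid').fetchall()
print(stored)
```

Note that `zip` stops at the shortest list, so a column that was never filled would silently produce zero rows; checking the list lengths first may be worthwhile.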
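As the code above shows, it first connects to the database, then organizes the data stored in the dictionary, and finally writes the data to the database. This code works for crawling all data from the mobile web version of Weibo.

    import re

    # Filter out emoji
    def filter_emoji(desstr, restr=''):
        try:
            # Match characters outside the Basic Multilingual Plane
            co = re.compile(u'[\U00010000-\U0010ffff]')
        except re.error:
            # Fallback for narrow builds: match surrogate pairs
            co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
        return co.sub(restr, desstr)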
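The article does not show how the per-page results become the `weibo_total` dictionary that `to_sql` receives. One way this might be done is sketched below: fetch each page and extend the lists key by key. The names `crawl_pages` and `fake_fetch` and the example URL are hypothetical; `fake_fetch` stands in for the real page fetcher so that the sketch runs without network access.

```python
def crawl_pages(fetch_page, url_template, pages):
    """Fetch each page and merge the per-page dicts of lists into one."""
    total = {}
    for page in pages:
        result = fetch_page(url_template.format(page))
        for key, values in result.items():
            # Create the list on first sight of a key, then append the page's values
            total.setdefault(key, []).extend(values)
    return total


def fake_fetch(url):
    """Stand-in fetcher: returns one made-up post per page."""
    page = url.rsplit('=', 1)[-1]
    return {'微博用戶': ['user' + page], '點贊數': [int(page)]}


weibo_total = crawl_pages(fake_fetch, 'https://example.invalid/api?page={}', range(1, 4))
print(weibo_total)
```

In real use, `fetch_page` could be something like `lambda url: PageData(url, cookie, header_init)`, with `url_init` as the template.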
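If there are mistakes or places where my understanding falls short, please point them out. The code is fairly old, after all, and I no longer remember some of the details.

Next update preview: face recognition with convolution (with core Python code)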
Keywords: core, implementation, crawler
Copyright © 億企邦 1997-2025. All rights reserved.