

Crawler Tutorial Series 5: Dynamic Page API Analysis Example - Crawling PDFs from Dropbox

Date: 2023-05-24 05:18:01 | Source: Website Operations


A dynamic-page API analysis example: crawling PDFs from Dropbox

Today my professor asked me to download the PDF materials for an online course. There are quite a few PDFs, and some of them are hosted on Dropbox, which turns out to be a dynamic page. I realized I had not updated this crawler tutorial series in a long while, so I decided to finish the task with a crawler and write a tutorial along the way. The approach I chose this time is to analyze the API; the next time I write about a dynamic page I will use a JS engine (consider that a promise). Without further ado, let's get started.

Task

  1. Page to crawl: https://aisecure.github.io/TEACHING/cs598.html



  2. Requirement: create a folder for each lecture date and put all of that day's PDF files into it (a small sketch of the folder-naming step follows this list).
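A minimal sketch of the date-to-folder logic used later in the full script, assuming the course table stores dates as "month/day" strings; the sample value below is hypothetical:

import os

def folder_for(date_text):
    # "9/7" -> "0907": zero-pad month and day, then concatenate
    month, day = date_text.split('/')[:2]
    return month.zfill(2) + day.zfill(2)

month_day = folder_for("9/7")            # hypothetical date string
os.makedirs(month_day, exist_ok=True)    # one folder per lecture date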

Analysis

  1. After getting the task, I first looked at the files to be crawled: one category is papers, the other is slides. The papers are easy to fetch, but the slides are hosted on Dropbox, which takes some extra work.
  2. The main page is static, so there is no real difficulty: fetch the HTML, parse its structure, and search it to pick out the content we need.
  3. The paper URLs are easy to handle; the PDF URL comes straight from the href attribute of the corresponding element:



  4. The download links for the PDFs hosted on Dropbox sit behind a dynamic page, so they take a bit more effort. The approach used here is to analyze the API: capture the traffic, and after clicking the "Download now" button you will find two requests that matter (a sketch of this two-step flow is shown below).

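Those two requests can be replayed directly from Python. The sketch below is a minimal, hedged version of the flow, assuming the response body of the first request contains the intermediate download URL followed by a query string, as seen in the captured traffic; the token and cookie values are placeholders that you must take from your own capture:

import requests

# Placeholders: copy these from your own captured "Download now" request.
T_TOKEN = '<csrf token from your session>'
COOKIES = {'t': T_TOKEN, '__Host-js_csrf': T_TOKEN}

def fetch_dropbox_pdf(share_link, out_path):
    # Step 1: POST the share link to Dropbox's endpoint to get the real content URL.
    resp = requests.post(
        'https://www.dropbox.com/sharing/fetch_user_content_link',
        data={'is_xhr': 'true', 't': T_TOKEN, 'url': share_link},
        cookies=COOKIES,
    )
    content_url = resp.text.split('?')[0]  # keep only the bare URL, drop the query string

    # Step 2: GET the content URL with dl=1 to receive the PDF bytes.
    pdf = requests.get(content_url, params={'dl': '1'})
    with open(out_path, 'wb') as f:
        f.write(pdf.content)

The full script below issues the same two requests inline for every Dropbox slide link it finds in the course table.
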
Writing the code and debugging

  1. Notes
  2. Full code:

import os
import requests
from bs4 import BeautifulSoup

# use proxies to speed up
proxies = {
    "http": "socks5://127.0.0.1:10808",
    "https": "socks5://127.0.0.1:10808",
}

# Session-specific Dropbox values captured from the browser; replace them with your own.
DROPBOX_T = '-VB7vYgNnBuMG3LhS_GfEzTL'
DROPBOX_COOKIES = {
    '__Host-ss': 'bcD4Chza3M',
    'locale': 'zh_CN',
    'gvc': 'MTQxMzI3NDU0NjU2NzAyODExNDM4MzQ3NTk2NDExMjgyNjc2MzI2',
    't': DROPBOX_T,
    '__Host-js_csrf': DROPBOX_T,
    'seen-sl-signup-modal': 'VHJ1ZQ%3D%3D',
    'seen-sl-download-modal': 'VHJ1ZQ%3D%3D',
}


def download_readings(readings_cell, month_day):
    # Download the paper PDFs linked in a "readings" cell of the course table.
    for link in readings_cell.find_all('a'):
        pdf_link = link.get('href')
        if 'http' not in pdf_link:
            break
        elif 'openreview' in pdf_link:
            # OpenReview forum pages have a matching /pdf URL
            pdf_link = pdf_link.replace('forum', 'pdf')
        elif 'arxiv' in pdf_link and 'pdf' not in pdf_link:
            pdf_link = pdf_link + '.pdf'
        print(pdf_link)
        pdf_name = pdf_link.split('/')[-1]
        if '.pdf' not in pdf_name:
            pdf_name = pdf_name.split('=')[-1] + '.pdf'
        print(pdf_name)
        pdf_data = requests.get(pdf_link, proxies=proxies)
        with open(month_day + '/' + pdf_name, 'wb') as f:
            f.write(pdf_data.content)


def download_slides(slides_cell, month_day, k):
    # Download the slide PDFs hosted on Dropbox via its fetch_user_content_link API.
    for link in slides_cell.find_all('a'):
        pdf_link = link.get('href')
        print(pdf_link)
        # k == 9: the slide file is on Google Drive and we have no access to it;
        # k == 10: the file does not exist.
        if 'dropbox' in pdf_link and k != 9 and k != 10:
            url = 'https://www.dropbox.com/sharing/fetch_user_content_link'
            data = {
                'is_xhr': 'true',
                't': DROPBOX_T,
                'url': pdf_link,
            }
            slide_data = requests.post(url, data=data, proxies=proxies,
                                       cookies=DROPBOX_COOKIES)
            middle_url = str(slide_data.content)
            print(middle_url)
            middle_url = middle_url.split('?')[0]
            middle_url = middle_url[2:]  # drop the leading "b'" of the bytes repr
            data_2 = {
                '_download_id': '013885563736029338651059959499724834269999834692877836324471532568',
                '_notify_domain': 'www.dropbox.com',
                'dl': '1',
            }
            pdf_data = requests.get(middle_url, data=data_2, proxies=proxies)
            pdf_name = pdf_link.split('/')[-1]
            pdf_name = pdf_name.split('?')[0]
            print(pdf_name)
            with open(month_day + '/' + pdf_name, 'wb') as f:
                f.write(pdf_data.content)


data = requests.get("https://aisecure.github.io/TEACHING/cs598.html")
soup = BeautifulSoup(data.content, 'html.parser')
entries = list(soup.find_all("tr"))

for k, entry in enumerate(entries[1:]):
    entry_data = BeautifulSoup(str(entry), 'html.parser')
    date_and_sides = entry_data.find_all(class_="tg-0pky")   # date cell (+ slides cell)
    readings = entry_data.find_all(class_="tg-reading")      # readings cell
    if date_and_sides != []:
        # The first cell is the lecture date, e.g. "9/7"; build a zero-padded name like "0907".
        date_list = str.split(date_and_sides[0].string, '/')
        print(date_list)
        month, day = date_list[0], date_list[1]
        if len(month) == 1:
            month = '0' + month
        if len(day) == 1:
            day = '0' + day
        month_day = month + day
        print(month_day)
        if not os.path.exists(month_day):
            os.mkdir(month_day)
        if readings != []:
            download_readings(BeautifulSoup(str(readings[0]), 'html.parser'), month_day)
        if len(date_and_sides) >= 2:
            # The second cell holds the slide links (Dropbox).
            download_slides(BeautifulSoup(str(date_and_sides[1]), 'html.parser'), month_day, k)

On another note, this site's editor is really lacking: pasting a Markdown document in directly still breaks, and a year after the previous article the problem has not been fixed, so this column will be written on my own blog from now on.
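Keep in mind that the cookies, the t / __Host-js_csrf token, and the _download_id value in the script are tied to the session captured while writing this post; to run it yourself, capture a download request in your own browser and substitute your own values, and adjust or drop the SOCKS5 proxy settings to match your environment.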
