
Hand-Writing a Vertical-Domain Search Engine


Preface: For everyday searching, Google and Baidu are already good enough. The search engine implemented here is just a small exercise to make my own follow-up work easier.

Theoretical Architecture

To implement a search engine, the first step is to think through a complete architecture.

Page Crawling

First, for page crawling I plan to use the simplest approach: HttpClient. Some may point out that this will miss the many Web 2.0 sites that render their content with JavaScript. True, but at this stage I deliberately skip those harder cases; the goal is just to validate that the architecture is workable.
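The project's own SendReq helper used later in the post is not shown, so as a rough idea of the kind of fetch it wraps, here is a minimal sketch using the standard Apache HttpClient 4.x API (the SimpleFetcher class name is illustrative, not part of the project):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SimpleFetcher {

    // Fetch the raw HTML of a page with a plain GET; no JavaScript is executed,
    // which is exactly the limitation discussed above.
    public static String fetch(String url) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
    }
}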

Storage

Next, storage. I plan to persist everything directly on the file system and, at search time, load all results into memory. Some may say that this eats a lot of memory. True, but I can allocate a large swap space and trade performance for memory.

Analysis

For the analysis part, I plan to run a word-segmentation algorithm, compute term frequencies, and build an inverted index for each page. I will not index every term of a page, since the read/write performance of an unoptimized file system has to be taken into account. My approach is to take the terms whose frequency falls roughly in the 20 to 50 range, plus the segmented words of the page title, as the page's keywords, and build the inverted index from those. To make this less abstract, here is the final structure:

The file name is the segmented term itself, and the file stores the domains of every site whose keywords contain that term, as follows:
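The screenshot from the original post is not reproduced here. As a made-up illustration (the domains below are invented), a keyword file such as 搜索.md would contain one entry per site; the loader shown later only picks up the lines starting with "#### ", and ignores whatever else the writer puts in the file:

搜索.md
#### example-site.com
#### tech-blog-demo.cn
#### some-other-site.org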

This is somewhat similar to the underlying storage principle of Elasticsearch, except that I have not done any optimization.

Search Implementation

For the search part, I plan to load the files above into memory and keep them in a HashMap, which makes lookups straightforward.
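As a preview of how that in-memory lookup works (the real controller code appears later in the post), the sketch below shows the core idea: one HashMap from term to the set of domains containing it, with multi-keyword search done as a set intersection. The class and variable names are illustrative only.

import java.util.*;

public class IndexLookupSketch {

    // term -> set of domains whose keywords contain that term
    private final Map<String, Set<String>> siteMaps = new HashMap<>();

    // AND-search: intersect the domain sets of all query terms
    public Set<String> search(List<String> terms) {
        Set<String> hits = null;
        for (String term : terms) {
            Set<String> domains = siteMaps.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new HashSet<>(domains);
            } else {
                hits.retainAll(domains);
            }
        }
        return hits == null ? Collections.emptySet() : hits;
    }
}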

Presentation

To make it handy to use anywhere, I plan to present the results as a Chrome browser extension.

OK, the theoretical architecture is roughly in place, so let's get hands-on and implement it.

Hands-On Implementation

Page Crawling

As mentioned above, pages are fetched directly with HttpClient. Besides that, we also need to parse the outbound links on each page. Before getting to link extraction, let me explain the crawling strategy.

Imagine the entire internet as a giant web in which sites are connected to each other by links. A large number of sites are isolated islands, but that does not prevent us from reaching the vast majority. So the strategy here is a breadth-first traversal starting from multiple seed nodes: for each site, only the home page is fetched, all outbound links on it are extracted, and those become the next crawl targets.

The crawling code is as follows:

import com.chaojilaji.auto.autocode.generatecode.GenerateFile;
import com.chaojilaji.auto.autocode.standartReq.SendReq;
import com.chaojilaji.auto.autocode.utils.Json;
import com.chaojilaji.moneyframework.model.OnePage;
import com.chaojilaji.moneyframework.model.Word;
import com.chaojilaji.moneyframework.service.Nlp;
import com.chaojilaji.moneyframework.utils.DomainUtils;
import com.chaojilaji.moneyframework.utils.HtmlUtil;
import com.chaojilaji.moneyframework.utils.MDUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;

import java.io.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

public class HttpClientCrawl {

    private static Log logger = LogFactory.getLog(HttpClientCrawl.class);

    // Domains already seen, pages fetched so far, domains to skip,
    // and the keyword -> domains index that is built on the fly.
    public Set<String> oldDomains = new ConcurrentSkipListSet<>();
    public Map<String, OnePage> onePageMap = new ConcurrentHashMap<>(400000);
    public Set<String> ignoreSet = new ConcurrentSkipListSet<>();
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>(50000);

    public String domain;

    public HttpClientCrawl(String domain) {
        this.domain = DomainUtils.getDomainWithCompleteDomain(domain);
        String[] ignores = {"gov.cn", "apac.cn", "org.cn", "twitter.com"
                , "baidu.com", "google.com", "sina.com", "weibo.com"
                , "github.com", "sina.com.cn", "sina.cn", "edu.cn", "wordpress.org", "sephora.com"};
        ignoreSet.addAll(Arrays.asList(ignores));
        loadIgnore();
        loadWord();
    }

    private Map<String, String> defaultHeaders() {
        Map<String, String> ans = new HashMap<>();
        ans.put("Accept", "application/json, text/plain, */*");
        ans.put("Content-Type", "application/json");
        ans.put("Connection", "keep-alive");
        ans.put("Accept-Language", "zh-CN,zh;q=0.9");
        ans.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36");
        return ans;
    }

    public SendReq.ResBody doRequest(String url, String method, Map<String, Object> params) {
        String urlTrue = url;
        SendReq.ResBody resBody = SendReq.sendReq(urlTrue, method, params, defaultHeaders());
        return resBody;
    }

    public void loadIgnore() {
        File directory = new File(".");
        try {
            String file = directory.getCanonicalPath() + "/moneyframework/generate/ignore/demo.txt";
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(file))));
            String line = "";
            while ((line = reader.readLine()) != null) {
                String x = line.replace("[", "").replace("]", "").replace(" ", "");
                String[] y = x.split(",");
                ignoreSet.addAll(Arrays.asList(y));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void loadDomains(String file) {
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "//" + file);
            logger.info(directory.getCanonicalPath() + "//" + file);
            if (!file1.exists()) {
                file1.createNewFile();
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file1)));
            String line = "";
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                OnePage onePage = new OnePage(line);
                if (!oldDomains.contains(onePage.getDomain())) {
                    onePageMap.put(onePage.getDomain(), onePage);
                    oldDomains.add(onePage.getDomain());
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void handleWord(List<String> s, String domain, String title) {
        // Each entry of s is "word count"; index the words whose count is high enough,
        // plus every word segmented from the title.
        for (String a : s) {
            String x = a.split(" ")[0];
            String y = a.split(" ")[1];
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (Integer.parseInt(y) >= 10) {
                if (z.contains(domain)) continue;
                z.add(domain);
                siteMaps.put(x, z);
                GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                        MDUtils.getMdContent(domain, title, s.toString()));
            }
        }
        Set<Word> xxxx = Nlp.separateWordAndReturnUnit(title);
        for (Word word : xxxx) {
            String x = word.getWord();
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                    MDUtils.getMdContent(domain, title, s.toString()));
        }
    }

    public void loadWord() {
        // Rebuild the in-memory keyword index from the markdown files written earlier.
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "//moneyframework/domain/markdown");
            if (file1.isDirectory()) {
                int fileCnt = 0;
                File[] files = file1.listFiles();
                for (File file : files) {
                    fileCnt++;
                    try {
                        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
                        String line = "";
                        siteMaps.put(file.getName().replace(".md", ""), new ConcurrentSkipListSet<>());
                        while ((line = reader.readLine()) != null) {
                            line = line.trim();
                            if (line.startsWith("####")) {
                                siteMaps.get(file.getName().replace(".md", "")).add(line.replace("#### ", "").trim());
                            }
                        }
                    } catch (Exception e) {
                    }
                    if ((fileCnt % 1000) == 0) {
                        logger.info((fileCnt * 100.0) / files.length + "%");
                    }
                }
            }
            for (Map.Entry<String, Set<String>> xxx : siteMaps.entrySet()) {
                oldDomains.addAll(xxx.getValue());
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void doTask() {
        // Breadth-first crawl starting from this.domain: fetch only the home page of each site,
        // extract outbound domains, and enqueue the ones not seen or ignored yet.
        String root = "http://" + this.domain + "/";
        Queue<String> urls = new LinkedList<>();
        urls.add(root);
        Set<String> tmpDomains = new HashSet<>();
        tmpDomains.addAll(oldDomains);
        tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
        int cnt = 0;
        while (!urls.isEmpty()) {
            String url = urls.poll();
            SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
            cnt++;
            if (html.getCode().equals(0)) {
                // Unreachable site: remember it so it is not tried again.
                ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
                try {
                    GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                continue;
            }
            OnePage onePage = new OnePage();
            onePage.setUrl(url);
            onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
            onePage.setCode(html.getCode());
            String title = HtmlUtil.getTitle(html.getResponce()).trim();
            if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("?")) {
                title = "沒有"; // placeholder meaning "none" when no usable title is found
            }
            onePage.setTitle(title);
            String content = HtmlUtil.getContent(html.getResponce());
            Set<Word> words = Nlp.separateWordAndReturnUnit(content);
            List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
            handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
            onePage.setContent(wordStr.toString());
            if (html.getCode().equals(200)) {
                List<String> domains = HtmlUtil.getUrls(html.getResponce());
                for (String domain : domains) {
                    int flag = 0;
                    String[] aaa = domain.split("\\."); // split on dots; skip overly deep subdomains
                    if (aaa.length >= 4) {
                        continue;
                    }
                    for (String i : ignoreSet) {
                        if (domain.endsWith(i)) {
                            flag = 1;
                            break;
                        }
                    }
                    if (flag == 1) continue;
                    if (StringUtils.hasText(domain.trim())) {
                        if (!tmpDomains.contains(domain)) {
                            tmpDomains.add(domain);
                            urls.add("http://" + domain + "/");
                        }
                    }
                }
                logger.info(this.domain + " queue size: " + urls.size());
                if (cnt >= 2000) {
                    break;
                }
            } else {
                if (url.startsWith("http:")) {
                    // Retry over https when the plain-http fetch did not return 200.
                    urls.add(url.replace("http:", "https:"));
                }
            }
        }
    }
}

Here, SendReq.sendReq is a page-download method I implemented myself on top of HttpClient. If you want to crawl Web 2.0 sites, you could consider wrapping Playwright inside it.
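For reference, a minimal sketch of such a JavaScript-rendering fetch with the Playwright Java binding might look like the following; the RenderedFetcher class is an illustrative name and not part of the project:

import com.microsoft.playwright.Browser;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;

public class RenderedFetcher {

    // Return the DOM after JavaScript has run, for the Web 2.0 sites
    // that a plain HttpClient GET cannot handle.
    public static String fetchRendered(String url) {
        try (Playwright playwright = Playwright.create()) {
            Browser browser = playwright.chromium().launch();
            Page page = browser.newPage();
            page.navigate(url);
            String html = page.content(); // serialized DOM after rendering
            browser.close();
            return html;
        }
    }
}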

Next comes HtmlUtil, the utility class that normalizes the HTML, strips tags, and removes the garbled text caused by special characters.

import org.apache.commons.lang3.StringEscapeUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlUtil {

    public static String getContent(String html) {
        String ans = "";
        try {
            html = StringEscapeUtils.unescapeHtml4(html);
            html = delHTMLTag(html);
            html = htmlTextFormat(html);
            return html;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return ans;
    }

    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?</script>"; // regex for <script> blocks
        String regEx_style = "<style[^>]*?>[\\s\\S]*?</style>";    // regex for <style> blocks
        String regEx_html = "<[^>]+>";                             // regex for any remaining HTML tag

        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // drop script blocks

        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // drop style blocks

        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // drop the remaining tags

        return htmlStr.trim();
    }

    public static String htmlTextFormat(String htmlText) {
        // Collapse whitespace; the original also chains many more replaceAll calls for
        // non-breaking / full-width space variants and stray punctuation (not reproduced here).
        return htmlText
                .replaceAll(" +", " ")
                .replaceAll("\n", " ")
                .replaceAll("\r", " ")
                .replaceAll("\t", " ");
    }

    public static List<String> getUrls(String htmlText) {
        Pattern pattern = Pattern.compile("(http|https)://[A-Za-z0-9_\\-\\+.:?&@=/%#,;]*");
        Matcher matcher = pattern.matcher(htmlText);
        Set<String> ans = new HashSet<>();
        while (matcher.find()) {
            ans.add(DomainUtils.getDomainWithCompleteDomain(matcher.group()));
        }
        return new ArrayList<>(ans);
    }

    public static String getTitle(String htmlText) {
        Pattern pattern = Pattern.compile("(?<=title>).*(?=</title)");
        Matcher matcher = pattern.matcher(htmlText);
        while (matcher.find()) {
            return matcher.group();
        }
        return "";
    }
}

Besides stripping tags and special characters as described above, it also implements methods for extracting all URLs and the page title (some Java libraries provide the same functionality out of the box).
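jsoup is one such library. A hedged sketch of an equivalent utility built on it could look like this; JsoupHtmlUtil is illustrative and not part of the project:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupHtmlUtil {

    // Roughly the same responsibilities as HtmlUtil, delegated to jsoup's HTML parser.
    public static String getTitle(String html) {
        return Jsoup.parse(html).title();
    }

    public static String getContent(String html) {
        return Jsoup.parse(html).text(); // text with tags and scripts removed
    }

    public static List<String> getUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.attr("abs:href")); // resolve relative links against baseUri
        }
        return urls;
    }
}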

Storage

The code above already contains the calls responsible for storage and analysis; let's pull them out and look at them separately.

public void handleWord(List<String> s, String domain, String title) {
    for (String a : s) {
        String x = a.split(" ")[0]; // the word
        String y = a.split(" ")[1]; // its count on this page
        Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
        if (Integer.parseInt(y) >= 10) {
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                    MDUtils.getMdContent(domain, title, s.toString()));
        }
    }
    Set<Word> xxxx = Nlp.separateWordAndReturnUnit(title);
    for (Word word : xxxx) {
        String x = word.getWord();
        Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
        if (z.contains(domain)) continue;
        z.add(domain);
        siteMaps.put(x, z);
        GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                MDUtils.getMdContent(domain, title, s.toString()));
    }
}

The storage method is this handleWord. Here, s is the word-segmentation result of one page (word offsets are not stored, so strictly speaking this is not a full inverted index), domain is the domain itself, and title is the page title. It calls GenerateFile, a custom utility class for creating files. Part of its code is shown below:

public static void createFileRecursion(String fileName, Integer height) throws IOException {
    Path path = Paths.get(fileName);
    if (Files.exists(path)) {
        // TODO: 2021/11/13 the file already exists, nothing to do
        return;
    }
    if (Files.exists(path.getParent())) {
        // TODO: 2021/11/13 the parent exists, so create the file (or directory) directly
        if (height == 0) {
            Files.createFile(path);
        } else {
            Files.createDirectory(path);
        }
    } else {
        createFileRecursion(path.getParent().toString(), height + 1);
        // TODO: 2021/11/13 at this point the parent is guaranteed to exist, so create the current level
        createFileRecursion(fileName, height);
    }
}

public static void appendFileWithRelativePath(String folder, String fileName, String value) {
    File directory = new File(".");
    try {
        fileName = directory.getCanonicalPath() + "/" + folder + "/" + fileName;
        createFileRecursion(fileName, 0);
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(new FileOutputStream(fileName, true));
        bufferedOutputStream.write(value.getBytes());
        bufferedOutputStream.flush();
        bufferedOutputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
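In the crawler, the call made for each keyword boils down to appending one entry to that keyword's file, roughly like the line below; the concrete content string depends on MDUtils.getMdContent, which is not shown in the post, so the literal here is only illustrative:

GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", "搜索.md", "#### example.com\n");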

Analysis

The analysis here mainly performs word segmentation and term-frequency counting on the cleaned page content, again using HanLP, which I have recommended before.

import com.chaojilaji.moneyframework.model.Word;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Nlp {

    // Filters out pure digits and common half-/full-width punctuation; a few characters
    // of the original character class were garbled when the post was copied.
    private static Pattern ignoreWords = Pattern.compile("[,.0-9_\\- ,、:。;;\\]\\[/?。??“”()+:|\"%~<>——]+");

    public static Set<Word> separateWordAndReturnUnit(String text) {
        Segment segment = HanLP.newSegment().enableOffset(true);
        Set<Word> detectorUnits = new HashSet<>();
        Map<Integer, Word> detectorUnitMap = new HashMap<>();
        List<Term> terms = segment.seg(text);
        for (Term term : terms) {
            Matcher matcher = ignoreWords.matcher(term.word);
            if (!matcher.find() && term.word.length() > 1 && !term.word.contains("?")) {
                Integer hashCode = term.word.hashCode();
                Word detectorUnit = detectorUnitMap.get(hashCode);
                if (Objects.nonNull(detectorUnit)) {
                    detectorUnit.setCount(detectorUnit.getCount() + 1);
                } else {
                    detectorUnit = new Word();
                    detectorUnit.setWord(term.word.trim());
                    detectorUnit.setCount(1);
                    detectorUnitMap.put(hashCode, detectorUnit);
                    detectorUnits.add(detectorUnit);
                }
            }
        }
        return detectorUnits;
    }

    public static List<String> print2List(List<Word> tmp, int cnt) {
        PriorityQueue<Word> words = new PriorityQueue<>();
        List<String> ans = new ArrayList<>();
        for (Word word : tmp) {
            words.add(word);
        }
        int count = 0;
        while (!words.isEmpty()) {
            Word word = words.poll();
            if (word.getCount() < 50) {
                ans.add(word.getWord() + " " + word.getCount());
                count++;
                if (count >= cnt) {
                    break;
                }
            }
        }
        return ans;
    }
}

Here, separateWordAndReturnUnit segments the text and counts term frequencies; the Word structure it returns looks like this:

public class Word implements Comparable {
    private String word;
    private Integer count = 0;
    ... ... // getters and setters omitted
    @Override
    public int compareTo(Object o) {
        // Higher counts sort first, so a PriorityQueue of Word behaves like a max-heap.
        if (this.count >= ((Word) o).count) {
            return -1;
        } else {
            return 1;
        }
    }
}

The print2List method sorts the list and outputs it. Using the built-in sort would work just as well; I used a priority queue because I figured a max-heap might have lower time complexity than quicksort, but with this little data that is over-optimization.
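For reference, the usual way to get only the top-k terms without sorting everything is a size-bounded min-heap, which costs O(n log k) instead of O(n log n). The sketch below is a generic alternative, not code from the project:

import com.chaojilaji.moneyframework.model.Word;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopK {

    // Keep only the k highest-count words: the heap root is always the smallest
    // of the current candidates, so each insertion costs O(log k).
    public static List<Word> topK(List<Word> words, int k) {
        PriorityQueue<Word> heap = new PriorityQueue<>(Comparator.comparingInt(Word::getCount));
        for (Word w : words) {
            heap.offer(w);
            if (heap.size() > k) {
                heap.poll(); // drop the current minimum
            }
        }
        return new ArrayList<>(heap); // the k words with the highest counts, unordered
    }
}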

Search Implementation

The search implementation is essentially two HashMaps that view the same results from two different dimensions: one from the perspective of the site domain, the other from the perspective of the term. When I build the presentation as a Chrome extension, I can therefore offer two kinds of features.
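Concretely, the controller code below reads these two views from a service object. A minimal sketch of that holder is given here; the field names match what the controller uses, while the nested Site stand-in only guesses at the shape implied by calls such as getKeywords() and getTitle():

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DemoService {

    // Minimal stand-in for the project's Site record, as implied by the controller code.
    public static class Site {
        private String title;
        private String keywords; // e.g. "[搜索 32, 引擎 27, ...]" as written by the crawler
        public String getTitle()    { return title; }
        public String getKeywords() { return keywords; }
    }

    // View 1: domain -> that site's record (title, extracted keywords, ...)
    public Map<String, Site> stringSiteMap = new ConcurrentHashMap<>();

    // View 2: keyword -> set of domains whose keyword list contains it
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>();
}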

The loading code is fairly crude, just simple file reading and string handling, so I won't paste it here. One point worth noting is that the load has to be refreshed periodically, because the content keeps changing: the crawler keeps writing data, and this side also needs to feed back to the crawler which sites can become new crawl targets.
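One hedged way to do that periodic refresh in a Spring project is a scheduled task like the sketch below. Here reloadFromDisk is a hypothetical method standing in for the file-reading code that the post does not show, and @EnableScheduling is assumed to be present on the application class:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class IndexReloader {

    private final DemoService demoService;

    public IndexReloader(DemoService demoService) {
        this.demoService = demoService;
    }

    // Re-read the keyword files every 30 minutes so the search side
    // eventually sees what the crawler has written since the last load.
    @Scheduled(fixedDelay = 30 * 60 * 1000)
    public void reload() {
        demoService.reloadFromDisk(); // hypothetical: re-parses moneyframework/domain/markdown
    }
}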

The following endpoints need to be exposed to the extension:

@GetMapping("/api/v1/keywords")@ResponseBodypublic String getKeyWords(String domain) { try { Site site = demoService.stringSiteMap.get(DomainUtils.getDomainWithCompleteDomain(domain)); if (Objects.nonNull(site)) { String keyWords = site.getKeywords(); keyWords = keyWords.replace("[", "").replace("]", ""); String[] keyWordss = keyWords.split(", "); StringBuffer ans = new StringBuffer(); for (int i = 0; i < keyWordss.length; i++) { ans.append(keyWordss[i]).append("/n"); } return ans.toString(); } } catch (Exception e) { } return "該網(wǎng)站沒有入庫";}@GetMapping("/api/v1/relations")@ResponseBodypublic String getRelationDomain(String domain) { try { Site site = demoService.stringSiteMap.get(DomainUtils.getDomainWithCompleteDomain(domain)); String keyWords = site.getKeywords(); keyWords = keyWords.replace("[", "").replace("]", ""); String[] keyWordss = keyWords.split(", "); Set<String> tmp = new HashSet<>(); int cnt = 0; for (int i = 0; i < keyWordss.length; i++) { String keyword = keyWordss[i]; String key = keyword.split(" ")[0]; if (IgnoreUtils.checkIgnore(key)) continue; cnt++; Set<String> x = demoService.siteMaps.get(key); if (Objects.nonNull(x)) { for (String y : x) { String yy = demoService.stringSiteMap.get(y).getKeywords(); int l = yy.indexOf(key); if (l != -1) { String yyy = ""; int flag = 0; for (int j = l; j < yy.length(); j++) { if (yy.charAt(j) == ',' || yy.charAt(j) == ']') { break; } if (flag == 1) { yyy = yyy + yy.charAt(j); } if (yy.charAt(j) == ' ') { flag = 1; } } if (Integer.parseInt(yyy) >= 20) { tmp.add(y + "----" + key + "----" + yyy); } } else { // Boolean titleContains = demoService.stringSiteMap.get(y).getTitle().contains(key); // if (titleContains) { // tmp.add(y + "----" + key + "----標題含有"); // } } } } if (cnt >= 4) { break; } } StringBuffer ans = new StringBuffer(); for (String s : tmp) { ans.append("<a href=/"http://" + s.split("----")[0] + "/">" + s + "</a><br>"); } return ans.toString(); } catch (Exception e) { // e.printStackTrace(); } return "該網(wǎng)站暫無相似網(wǎng)站";}@GetMapping("/api/v1/keyresult")@ResponseBodypublic String getKeyResult(String key, String key2, String key3,Integer page, Integer size) { Set<String> x = new HashSet<>(demoService.siteMaps.get(key)); if (StringUtils.hasText(key2)) { key2 = key2.trim(); if (StringUtils.hasText(key2)){ Set<String> x2 = demoService.siteMaps.get(key2); x.retainAll(x2); } } if (StringUtils.hasText(key3)) { key3 = key3.trim(); if (StringUtils.hasText(key3)){ Set<String> x3 = demoService.siteMaps.get(key3); x.retainAll(x3); } } if (Objects.nonNull(x) && x.size() > 0) { Set<String> tmp = new HashSet<>(); for (String y : x) { String yy = demoService.stringSiteMap.get(y).getKeywords(); int l = yy.indexOf(key); if (l != -1) { String yyy = ""; int flag = 0; for (int j = l; j < yy.length(); j++) { if (yy.charAt(j) == ',') { break; } if (flag == 1) { yyy = yyy + yy.charAt(j); } if (yy.charAt(j) == ' ') { flag = 1; } } tmp.add(y + "----" + demoService.stringSiteMap.get(y).getTitle() + "----" + key + "----" + yyy); } else { Boolean titleContains = demoService.stringSiteMap.get(y).getTitle().contains(key); if (titleContains) { tmp.add(y + "----" + demoService.stringSiteMap.get(y).getTitle() + "----" + key + "----標題含有"); } } } StringBuffer ans = new StringBuffer(); List<String> temp = new ArrayList<>(tmp); for (int i = (page - 1) * size; i < temp.size() && i < page * size; i++) { String s = temp.get(i); ans.append("<a href=/"http://" + s.split("----")[0] + "/" style=/"font-size: 20px/">" + s.split("----")[1] + "</a> <p 
style=/"font-size: 15px/">" + s.split("----")[0] + "&nbsp;&nbsp;&nbsp;" + s.split("----")[3] + "</p><hr color=/"silver/" size=1/>"); } return ans.toString(); } return "暫未收錄";}@GetMapping("/api/v1/demo")@ResponseBodypublic void demo(String key) { new Thread(new Runnable() { @Override public void run() { HttpClientCrawl clientCrawl = new HttpClientCrawl(key); try { clientCrawl.doTask(); } catch (Exception e) { e.printStackTrace(); } finally { clientCrawl.oldDomains.clear(); clientCrawl.siteMaps.clear(); clientCrawl.onePageMap.clear(); clientCrawl.ignoreSet.clear(); } } }).start();}這是一個非正式的項目,所以寫得比較簡陋和隨意,見諒。 ?
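Once the application is running, a quick way to smoke-test these endpoints from Java is the sketch below. It assumes the Spring application listens locally on the default port 8080, which the post does not state, and the query keyword is just an example:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApiSmokeTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // "%E6%90%9C%E7%B4%A2" is the URL-encoded keyword 搜索; page and size are required by getKeyResult.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/api/v1/keyresult?key=%E6%90%9C%E7%B4%A2&page=1&size=10"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // HTML fragment listing the matching sites
    }
}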

Presentation

The presentation part is a Chrome extension. The quickest way is to grab a ready-made extension from GitHub and patch it up until it works, and only then dig into how it really works. (link)

The implementation in between is no different from writing an ordinary web page, so I'll skip it.

The final result looks like this:

And the search part looks like this:

One favor I'd like to ask: if you have technical sites you have bookmarked for a long time, please share them in the comments. I have hit a bottleneck in finding new crawl targets, and sites from any industry are welcome.
