
Hand-Writing a Vertical-Domain Search Engine


Preface: For everyday searching, Google and Baidu are already good enough. The search engine implemented here is just a small exercise to make my own follow-up work easier.

Theoretical Architecture

To implement a search engine, the first step is to think through a complete architecture.

Page Crawling

First, for page crawling I plan to use the simplest approach: HttpClient. Some may point out that this will miss the many Web 2.0 sites that render their content with JavaScript. True, but at this stage I deliberately skip those harder cases; the goal is just to validate that the architecture is workable.
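The project's own SendReq helper used later in the post is not shown, so as a rough idea of the kind of fetch it wraps, here is a minimal sketch using the standard Apache HttpClient 4.x API (the SimpleFetcher class name is illustrative, not part of the project):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SimpleFetcher {

    // Fetch the raw HTML of a page with a plain GET; no JavaScript is executed,
    // which is exactly the limitation discussed above.
    public static String fetch(String url) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
    }
}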

Storage

Next, storage. I plan to persist everything directly on the file system and, at search time, load all results into memory. Some may say that this eats a lot of memory. True, but I can allocate a large swap space and trade performance for memory.

Analysis

For the analysis part, I plan to run a word-segmentation algorithm, compute term frequencies, and build an inverted index for each page. I will not index every term of a page, since the read/write performance of an unoptimized file system has to be taken into account. My approach is to take the terms whose frequency falls roughly in the 20 to 50 range, plus the segmented words of the page title, as the page's keywords, and build the inverted index from those. To make this less abstract, here is the final structure:

The file name is the segmented term itself, and the file stores the domains of every site whose keywords contain that term, as follows:
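The screenshot from the original post is not reproduced here. As a made-up illustration (the domains below are invented), a keyword file such as 搜索.md would contain one entry per site; the loader shown later only picks up the lines starting with "#### ", and ignores whatever else the writer puts in the file:

搜索.md
#### example-site.com
#### tech-blog-demo.cn
#### some-other-site.org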

This is somewhat similar to the underlying storage principle of Elasticsearch, except that I have not done any optimization.

Search Implementation

For the search part, I plan to load the files above into memory and keep them in a HashMap, which makes lookups straightforward.
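As a preview of how that in-memory lookup works (the real controller code appears later in the post), the sketch below shows the core idea: one HashMap from term to the set of domains containing it, with multi-keyword search done as a set intersection. The class and variable names are illustrative only.

import java.util.*;

public class IndexLookupSketch {

    // term -> set of domains whose keywords contain that term
    private final Map<String, Set<String>> siteMaps = new HashMap<>();

    // AND-search: intersect the domain sets of all query terms
    public Set<String> search(List<String> terms) {
        Set<String> hits = null;
        for (String term : terms) {
            Set<String> domains = siteMaps.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new HashSet<>(domains);
            } else {
                hits.retainAll(domains);
            }
        }
        return hits == null ? Collections.emptySet() : hits;
    }
}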

Presentation

To make it handy to use anywhere, I plan to present the results as a Chrome browser extension.

OK, the theoretical architecture is roughly in place, so let's get hands-on and implement it.

Hands-On Implementation

Page Crawling

As mentioned above, pages are fetched directly with HttpClient. Besides that, we also need to parse the outbound links on each page. Before getting to link extraction, let me explain the crawling strategy.

Imagine the entire internet as a giant web in which sites are connected to each other by links. A large number of sites are isolated islands, but that does not prevent us from reaching the vast majority. So the strategy here is a breadth-first traversal starting from multiple seed nodes: for each site, only the home page is fetched, all outbound links on it are extracted, and those become the next crawl targets.

The crawling code is as follows:

import com.chaojilaji.auto.autocode.generatecode.GenerateFile;
import com.chaojilaji.auto.autocode.standartReq.SendReq;
import com.chaojilaji.auto.autocode.utils.Json;
import com.chaojilaji.moneyframework.model.OnePage;
import com.chaojilaji.moneyframework.model.Word;
import com.chaojilaji.moneyframework.service.Nlp;
import com.chaojilaji.moneyframework.utils.DomainUtils;
import com.chaojilaji.moneyframework.utils.HtmlUtil;
import com.chaojilaji.moneyframework.utils.MDUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;

import java.io.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

public class HttpClientCrawl {

    private static Log logger = LogFactory.getLog(HttpClientCrawl.class);

    // Domains already seen, pages fetched so far, domains to skip,
    // and the keyword -> domains index that is built on the fly.
    public Set<String> oldDomains = new ConcurrentSkipListSet<>();
    public Map<String, OnePage> onePageMap = new ConcurrentHashMap<>(400000);
    public Set<String> ignoreSet = new ConcurrentSkipListSet<>();
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>(50000);

    public String domain;

    public HttpClientCrawl(String domain) {
        this.domain = DomainUtils.getDomainWithCompleteDomain(domain);
        String[] ignores = {"gov.cn", "apac.cn", "org.cn", "twitter.com"
                , "baidu.com", "google.com", "sina.com", "weibo.com"
                , "github.com", "sina.com.cn", "sina.cn", "edu.cn", "wordpress.org", "sephora.com"};
        ignoreSet.addAll(Arrays.asList(ignores));
        loadIgnore();
        loadWord();
    }

    private Map<String, String> defaultHeaders() {
        Map<String, String> ans = new HashMap<>();
        ans.put("Accept", "application/json, text/plain, */*");
        ans.put("Content-Type", "application/json");
        ans.put("Connection", "keep-alive");
        ans.put("Accept-Language", "zh-CN,zh;q=0.9");
        ans.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36");
        return ans;
    }

    public SendReq.ResBody doRequest(String url, String method, Map<String, Object> params) {
        String urlTrue = url;
        SendReq.ResBody resBody = SendReq.sendReq(urlTrue, method, params, defaultHeaders());
        return resBody;
    }

    public void loadIgnore() {
        File directory = new File(".");
        try {
            String file = directory.getCanonicalPath() + "/moneyframework/generate/ignore/demo.txt";
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(file))));
            String line = "";
            while ((line = reader.readLine()) != null) {
                String x = line.replace("[", "").replace("]", "").replace(" ", "");
                String[] y = x.split(",");
                ignoreSet.addAll(Arrays.asList(y));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void loadDomains(String file) {
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "//" + file);
            logger.info(directory.getCanonicalPath() + "//" + file);
            if (!file1.exists()) {
                file1.createNewFile();
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file1)));
            String line = "";
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                OnePage onePage = new OnePage(line);
                if (!oldDomains.contains(onePage.getDomain())) {
                    onePageMap.put(onePage.getDomain(), onePage);
                    oldDomains.add(onePage.getDomain());
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void handleWord(List<String> s, String domain, String title) {
        // Each entry of s is "word count"; index the words whose count is high enough,
        // plus every word segmented from the title.
        for (String a : s) {
            String x = a.split(" ")[0];
            String y = a.split(" ")[1];
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (Integer.parseInt(y) >= 10) {
                if (z.contains(domain)) continue;
                z.add(domain);
                siteMaps.put(x, z);
                GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                        MDUtils.getMdContent(domain, title, s.toString()));
            }
        }
        Set<Word> xxxx = Nlp.separateWordAndReturnUnit(title);
        for (Word word : xxxx) {
            String x = word.getWord();
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                    MDUtils.getMdContent(domain, title, s.toString()));
        }
    }

    public void loadWord() {
        // Rebuild the in-memory keyword index from the markdown files written earlier.
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "//moneyframework/domain/markdown");
            if (file1.isDirectory()) {
                int fileCnt = 0;
                File[] files = file1.listFiles();
                for (File file : files) {
                    fileCnt++;
                    try {
                        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
                        String line = "";
                        siteMaps.put(file.getName().replace(".md", ""), new ConcurrentSkipListSet<>());
                        while ((line = reader.readLine()) != null) {
                            line = line.trim();
                            if (line.startsWith("####")) {
                                siteMaps.get(file.getName().replace(".md", "")).add(line.replace("#### ", "").trim());
                            }
                        }
                    } catch (Exception e) {
                    }
                    if ((fileCnt % 1000) == 0) {
                        logger.info((fileCnt * 100.0) / files.length + "%");
                    }
                }
            }
            for (Map.Entry<String, Set<String>> xxx : siteMaps.entrySet()) {
                oldDomains.addAll(xxx.getValue());
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void doTask() {
        // Breadth-first crawl starting from this.domain: fetch only the home page of each site,
        // extract outbound domains, and enqueue the ones not seen or ignored yet.
        String root = "http://" + this.domain + "/";
        Queue<String> urls = new LinkedList<>();
        urls.add(root);
        Set<String> tmpDomains = new HashSet<>();
        tmpDomains.addAll(oldDomains);
        tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
        int cnt = 0;
        while (!urls.isEmpty()) {
            String url = urls.poll();
            SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
            cnt++;
            if (html.getCode().equals(0)) {
                // Unreachable site: remember it so it is not tried again.
                ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
                try {
                    GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                continue;
            }
            OnePage onePage = new OnePage();
            onePage.setUrl(url);
            onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
            onePage.setCode(html.getCode());
            String title = HtmlUtil.getTitle(html.getResponce()).trim();
            if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("?")) {
                title = "沒有"; // placeholder meaning "none" when no usable title is found
            }
            onePage.setTitle(title);
            String content = HtmlUtil.getContent(html.getResponce());
            Set<Word> words = Nlp.separateWordAndReturnUnit(content);
            List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
            handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
            onePage.setContent(wordStr.toString());
            if (html.getCode().equals(200)) {
                List<String> domains = HtmlUtil.getUrls(html.getResponce());
                for (String domain : domains) {
                    int flag = 0;
                    String[] aaa = domain.split("\\."); // split on dots; skip overly deep subdomains
                    if (aaa.length >= 4) {
                        continue;
                    }
                    for (String i : ignoreSet) {
                        if (domain.endsWith(i)) {
                            flag = 1;
                            break;
                        }
                    }
                    if (flag == 1) continue;
                    if (StringUtils.hasText(domain.trim())) {
                        if (!tmpDomains.contains(domain)) {
                            tmpDomains.add(domain);
                            urls.add("http://" + domain + "/");
                        }
                    }
                }
                logger.info(this.domain + " queue size: " + urls.size());
                if (cnt >= 2000) {
                    break;
                }
            } else {
                if (url.startsWith("http:")) {
                    // Retry over https when the plain-http fetch did not return 200.
                    urls.add(url.replace("http:", "https:"));
                }
            }
        }
    }
}

Here, SendReq.sendReq is a page-download method I implemented myself on top of HttpClient. If you want to crawl Web 2.0 sites, you could consider wrapping Playwright inside it.
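For reference, a minimal sketch of such a JavaScript-rendering fetch with the Playwright Java binding might look like the following; the RenderedFetcher class is an illustrative name and not part of the project:

import com.microsoft.playwright.Browser;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;

public class RenderedFetcher {

    // Return the DOM after JavaScript has run, for the Web 2.0 sites
    // that a plain HttpClient GET cannot handle.
    public static String fetchRendered(String url) {
        try (Playwright playwright = Playwright.create()) {
            Browser browser = playwright.chromium().launch();
            Page page = browser.newPage();
            page.navigate(url);
            String html = page.content(); // serialized DOM after rendering
            browser.close();
            return html;
        }
    }
}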

Next comes HtmlUtil, the utility class that normalizes the HTML, strips tags, and removes the garbled text caused by special characters.

import org.apache.commons.lang3.StringEscapeUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlUtil {

    public static String getContent(String html) {
        String ans = "";
        try {
            html = StringEscapeUtils.unescapeHtml4(html);
            html = delHTMLTag(html);
            html = htmlTextFormat(html);
            return html;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return ans;
    }

    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?</script>"; // regex for <script> blocks
        String regEx_style = "<style[^>]*?>[\\s\\S]*?</style>";    // regex for <style> blocks
        String regEx_html = "<[^>]+>";                             // regex for any remaining HTML tag

        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // drop script blocks

        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // drop style blocks

        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // drop the remaining tags

        return htmlStr.trim();
    }

    public static String htmlTextFormat(String htmlText) {
        // Collapse whitespace; the original also chains many more replaceAll calls for
        // non-breaking / full-width space variants and stray punctuation (not reproduced here).
        return htmlText
                .replaceAll(" +", " ")
                .replaceAll("\n", " ")
                .replaceAll("\r", " ")
                .replaceAll("\t", " ");
    }

    public static List<String> getUrls(String htmlText) {
        Pattern pattern = Pattern.compile("(http|https)://[A-Za-z0-9_\\-\\+.:?&@=/%#,;]*");
        Matcher matcher = pattern.matcher(htmlText);
        Set<String> ans = new HashSet<>();
        while (matcher.find()) {
            ans.add(DomainUtils.getDomainWithCompleteDomain(matcher.group()));
        }
        return new ArrayList<>(ans);
    }

    public static String getTitle(String htmlText) {
        Pattern pattern = Pattern.compile("(?<=title>).*(?=</title)");
        Matcher matcher = pattern.matcher(htmlText);
        while (matcher.find()) {
            return matcher.group();
        }
        return "";
    }
}

Besides stripping tags and special characters as described above, it also implements methods for extracting all URLs and the page title (some Java libraries provide the same functionality out of the box).
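jsoup is one such library. A hedged sketch of an equivalent utility built on it could look like this; JsoupHtmlUtil is illustrative and not part of the project:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupHtmlUtil {

    // Roughly the same responsibilities as HtmlUtil, delegated to jsoup's HTML parser.
    public static String getTitle(String html) {
        return Jsoup.parse(html).title();
    }

    public static String getContent(String html) {
        return Jsoup.parse(html).text(); // text with tags and scripts removed
    }

    public static List<String> getUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.attr("abs:href")); // resolve relative links against baseUri
        }
        return urls;
    }
}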

Storage

The code above already contains the calls responsible for storage and analysis; let's pull them out and look at them separately.

public void handleWord(List<String> s, String domain, String title) {
    for (String a : s) {
        String x = a.split(" ")[0]; // the word
        String y = a.split(" ")[1]; // its count on this page
        Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
        if (Integer.parseInt(y) >= 10) {
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                    MDUtils.getMdContent(domain, title, s.toString()));
        }
    }
    Set<Word> xxxx = Nlp.separateWordAndReturnUnit(title);
    for (Word word : xxxx) {
        String x = word.getWord();
        Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
        if (z.contains(domain)) continue;
        z.add(domain);
        siteMaps.put(x, z);
        GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                MDUtils.getMdContent(domain, title, s.toString()));
    }
}

The storage method is this handleWord. Here, s is the word-segmentation result of one page (word offsets are not stored, so strictly speaking this is not a full inverted index), domain is the domain itself, and title is the page title. It calls GenerateFile, a custom utility class for creating files. Part of its code is shown below:

public static void createFileRecursion(String fileName, Integer height) throws IOException {
    Path path = Paths.get(fileName);
    if (Files.exists(path)) {
        // TODO: 2021/11/13 the file already exists, nothing to do
        return;
    }
    if (Files.exists(path.getParent())) {
        // TODO: 2021/11/13 the parent exists, so create the file (or directory) directly
        if (height == 0) {
            Files.createFile(path);
        } else {
            Files.createDirectory(path);
        }
    } else {
        createFileRecursion(path.getParent().toString(), height + 1);
        // TODO: 2021/11/13 at this point the parent is guaranteed to exist, so create the current level
        createFileRecursion(fileName, height);
    }
}

public static void appendFileWithRelativePath(String folder, String fileName, String value) {
    File directory = new File(".");
    try {
        fileName = directory.getCanonicalPath() + "/" + folder + "/" + fileName;
        createFileRecursion(fileName, 0);
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(new FileOutputStream(fileName, true));
        bufferedOutputStream.write(value.getBytes());
        bufferedOutputStream.flush();
        bufferedOutputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
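In the crawler, the call made for each keyword boils down to appending one entry to that keyword's file, roughly like the line below; the concrete content string depends on MDUtils.getMdContent, which is not shown in the post, so the literal here is only illustrative:

GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", "搜索.md", "#### example.com\n");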

Analysis

The analysis here mainly performs word segmentation and term-frequency counting on the cleaned page content, again using HanLP, which I have recommended before.

import com.chaojilaji.moneyframework.model.Word;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Nlp {

    // Filters out pure digits and common half-/full-width punctuation; a few characters
    // of the original character class were garbled when the post was copied.
    private static Pattern ignoreWords = Pattern.compile("[,.0-9_\\- ,、:。;;\\]\\[/?。??“”()+:|\"%~<>——]+");

    public static Set<Word> separateWordAndReturnUnit(String text) {
        Segment segment = HanLP.newSegment().enableOffset(true);
        Set<Word> detectorUnits = new HashSet<>();
        Map<Integer, Word> detectorUnitMap = new HashMap<>();
        List<Term> terms = segment.seg(text);
        for (Term term : terms) {
            Matcher matcher = ignoreWords.matcher(term.word);
            if (!matcher.find() && term.word.length() > 1 && !term.word.contains("?")) {
                Integer hashCode = term.word.hashCode();
                Word detectorUnit = detectorUnitMap.get(hashCode);
                if (Objects.nonNull(detectorUnit)) {
                    detectorUnit.setCount(detectorUnit.getCount() + 1);
                } else {
                    detectorUnit = new Word();
                    detectorUnit.setWord(term.word.trim());
                    detectorUnit.setCount(1);
                    detectorUnitMap.put(hashCode, detectorUnit);
                    detectorUnits.add(detectorUnit);
                }
            }
        }
        return detectorUnits;
    }

    public static List<String> print2List(List<Word> tmp, int cnt) {
        PriorityQueue<Word> words = new PriorityQueue<>();
        List<String> ans = new ArrayList<>();
        for (Word word : tmp) {
            words.add(word);
        }
        int count = 0;
        while (!words.isEmpty()) {
            Word word = words.poll();
            if (word.getCount() < 50) {
                ans.add(word.getWord() + " " + word.getCount());
                count++;
                if (count >= cnt) {
                    break;
                }
            }
        }
        return ans;
    }
}

Here, separateWordAndReturnUnit segments the text and counts term frequencies; the Word structure it returns looks like this:

public class Word implements Comparable {
    private String word;
    private Integer count = 0;
    ... ... // getters and setters omitted
    @Override
    public int compareTo(Object o) {
        // Higher counts sort first, so a PriorityQueue of Word behaves like a max-heap.
        if (this.count >= ((Word) o).count) {
            return -1;
        } else {
            return 1;
        }
    }
}

The print2List method sorts the list and outputs it. Using the built-in sort would work just as well; I used a priority queue because I figured a max-heap might have lower time complexity than quicksort, but with this little data that is over-optimization.
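For reference, the usual way to get only the top-k terms without sorting everything is a size-bounded min-heap, which costs O(n log k) instead of O(n log n). The sketch below is a generic alternative, not code from the project:

import com.chaojilaji.moneyframework.model.Word;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopK {

    // Keep only the k highest-count words: the heap root is always the smallest
    // of the current candidates, so each insertion costs O(log k).
    public static List<Word> topK(List<Word> words, int k) {
        PriorityQueue<Word> heap = new PriorityQueue<>(Comparator.comparingInt(Word::getCount));
        for (Word w : words) {
            heap.offer(w);
            if (heap.size() > k) {
                heap.poll(); // drop the current minimum
            }
        }
        return new ArrayList<>(heap); // the k words with the highest counts, unordered
    }
}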

Search Implementation

The search implementation is essentially two HashMaps that view the same results from two different dimensions: one from the perspective of the site domain, the other from the perspective of the term. When I build the presentation as a Chrome extension, I can therefore offer two kinds of features.
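Concretely, the controller code below reads these two views from a service object. A minimal sketch of that holder is given here; the field names match what the controller uses, while the nested Site stand-in only guesses at the shape implied by calls such as getKeywords() and getTitle():

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DemoService {

    // Minimal stand-in for the project's Site record, as implied by the controller code.
    public static class Site {
        private String title;
        private String keywords; // e.g. "[搜索 32, 引擎 27, ...]" as written by the crawler
        public String getTitle()    { return title; }
        public String getKeywords() { return keywords; }
    }

    // View 1: domain -> that site's record (title, extracted keywords, ...)
    public Map<String, Site> stringSiteMap = new ConcurrentHashMap<>();

    // View 2: keyword -> set of domains whose keyword list contains it
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>();
}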

The loading code is fairly crude, just simple file reading and string handling, so I won't paste it here. One point worth noting is that the load has to be refreshed periodically, because the content keeps changing: the crawler keeps writing data, and this side also needs to feed back to the crawler which sites can become new crawl targets.
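One hedged way to do that periodic refresh in a Spring project is a scheduled task like the sketch below. Here reloadFromDisk is a hypothetical method standing in for the file-reading code that the post does not show, and @EnableScheduling is assumed to be present on the application class:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class IndexReloader {

    private final DemoService demoService;

    public IndexReloader(DemoService demoService) {
        this.demoService = demoService;
    }

    // Re-read the keyword files every 30 minutes so the search side
    // eventually sees what the crawler has written since the last load.
    @Scheduled(fixedDelay = 30 * 60 * 1000)
    public void reload() {
        demoService.reloadFromDisk(); // hypothetical: re-parses moneyframework/domain/markdown
    }
}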

The following endpoints need to be exposed to the extension:

@GetMapping("/api/v1/keywords")@ResponseBodypublic String getKeyWords(String domain) { try { Site site = demoService.stringSiteMap.get(DomainUtils.getDomainWithCompleteDomain(domain)); if (Objects.nonNull(site)) { String keyWords = site.getKeywords(); keyWords = keyWords.replace("[", "").replace("]", ""); String[] keyWordss = keyWords.split(", "); StringBuffer ans = new StringBuffer(); for (int i = 0; i < keyWordss.length; i++) { ans.append(keyWordss[i]).append("/n"); } return ans.toString(); } } catch (Exception e) { } return "該網(wǎng)站沒有入庫";}@GetMapping("/api/v1/relations")@ResponseBodypublic String getRelationDomain(String domain) { try { Site site = demoService.stringSiteMap.get(DomainUtils.getDomainWithCompleteDomain(domain)); String keyWords = site.getKeywords(); keyWords = keyWords.replace("[", "").replace("]", ""); String[] keyWordss = keyWords.split(", "); Set<String> tmp = new HashSet<>(); int cnt = 0; for (int i = 0; i < keyWordss.length; i++) { String keyword = keyWordss[i]; String key = keyword.split(" ")[0]; if (IgnoreUtils.checkIgnore(key)) continue; cnt++; Set<String> x = demoService.siteMaps.get(key); if (Objects.nonNull(x)) { for (String y : x) { String yy = demoService.stringSiteMap.get(y).getKeywords(); int l = yy.indexOf(key); if (l != -1) { String yyy = ""; int flag = 0; for (int j = l; j < yy.length(); j++) { if (yy.charAt(j) == ',' || yy.charAt(j) == ']') { break; } if (flag == 1) { yyy = yyy + yy.charAt(j); } if (yy.charAt(j) == ' ') { flag = 1; } } if (Integer.parseInt(yyy) >= 20) { tmp.add(y + "----" + key + "----" + yyy); } } else { // Boolean titleContains = demoService.stringSiteMap.get(y).getTitle().contains(key); // if (titleContains) { // tmp.add(y + "----" + key + "----標題含有"); // } } } } if (cnt >= 4) { break; } } StringBuffer ans = new StringBuffer(); for (String s : tmp) { ans.append("<a href=/"http://" + s.split("----")[0] + "/">" + s + "</a><br>"); } return ans.toString(); } catch (Exception e) { // e.printStackTrace(); } return "該網(wǎng)站暫無相似網(wǎng)站";}@GetMapping("/api/v1/keyresult")@ResponseBodypublic String getKeyResult(String key, String key2, String key3,Integer page, Integer size) { Set<String> x = new HashSet<>(demoService.siteMaps.get(key)); if (StringUtils.hasText(key2)) { key2 = key2.trim(); if (StringUtils.hasText(key2)){ Set<String> x2 = demoService.siteMaps.get(key2); x.retainAll(x2); } } if (StringUtils.hasText(key3)) { key3 = key3.trim(); if (StringUtils.hasText(key3)){ Set<String> x3 = demoService.siteMaps.get(key3); x.retainAll(x3); } } if (Objects.nonNull(x) && x.size() > 0) { Set<String> tmp = new HashSet<>(); for (String y : x) { String yy = demoService.stringSiteMap.get(y).getKeywords(); int l = yy.indexOf(key); if (l != -1) { String yyy = ""; int flag = 0; for (int j = l; j < yy.length(); j++) { if (yy.charAt(j) == ',') { break; } if (flag == 1) { yyy = yyy + yy.charAt(j); } if (yy.charAt(j) == ' ') { flag = 1; } } tmp.add(y + "----" + demoService.stringSiteMap.get(y).getTitle() + "----" + key + "----" + yyy); } else { Boolean titleContains = demoService.stringSiteMap.get(y).getTitle().contains(key); if (titleContains) { tmp.add(y + "----" + demoService.stringSiteMap.get(y).getTitle() + "----" + key + "----標題含有"); } } } StringBuffer ans = new StringBuffer(); List<String> temp = new ArrayList<>(tmp); for (int i = (page - 1) * size; i < temp.size() && i < page * size; i++) { String s = temp.get(i); ans.append("<a href=/"http://" + s.split("----")[0] + "/" style=/"font-size: 20px/">" + s.split("----")[1] + "</a> <p 
style=/"font-size: 15px/">" + s.split("----")[0] + "&nbsp;&nbsp;&nbsp;" + s.split("----")[3] + "</p><hr color=/"silver/" size=1/>"); } return ans.toString(); } return "暫未收錄";}@GetMapping("/api/v1/demo")@ResponseBodypublic void demo(String key) { new Thread(new Runnable() { @Override public void run() { HttpClientCrawl clientCrawl = new HttpClientCrawl(key); try { clientCrawl.doTask(); } catch (Exception e) { e.printStackTrace(); } finally { clientCrawl.oldDomains.clear(); clientCrawl.siteMaps.clear(); clientCrawl.onePageMap.clear(); clientCrawl.ignoreSet.clear(); } } }).start();}這是一個非正式的項目,所以寫得比較簡陋和隨意,見諒。 ?
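Once the application is running, a quick way to smoke-test these endpoints from Java is the sketch below. It assumes the Spring application listens locally on the default port 8080, which the post does not state, and the query keyword is just an example:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApiSmokeTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // "%E6%90%9C%E7%B4%A2" is the URL-encoded keyword 搜索; page and size are required by getKeyResult.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/api/v1/keyresult?key=%E6%90%9C%E7%B4%A2&page=1&size=10"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // HTML fragment listing the matching sites
    }
}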

Presentation

The presentation part is a Chrome extension. The quickest way is to grab a ready-made extension from GitHub and patch it up until it works, and only then dig into how it really works. (link)

The implementation in between is no different from writing an ordinary web page, so I'll skip it.

The final result looks like this:

And the search part looks like this:

One favor I'd like to ask: if you have technical sites you have bookmarked for a long time, please share them in the comments. I have hit a bottleneck in finding new crawl targets, and sites from any industry are welcome.
