What parts does a search engine consist of?
A search engine generally consists of three kinds of components. The spider is responsible for fetching information from the web. The word segmenter and the indexer are usually used together: they segment the text of the fetched pages, index them automatically, and build the index database. The searcher looks up the index database according to the user's query, combines the matching result sets by union and intersection, sorts them, extracts a short summary of each page, and feeds the results back to the user.

Google's search engine is likewise divided into three functional parts: web crawling, index storage, and user query. Web crawling is responsible for fetching pages; it consists of the URL server, the crawlers, the store server, the parser, and the URL resolver, and the crawler is the core of this part. Index storage analyzes page content, indexes the documents, and stores them in the database; it consists of the indexer and the sorter. This module handles a large volume of documents and data, and the operations on the barrels (the partitioned index files) are its core. User query parses the search expression entered by the user, matches the relevant documents, and returns the results; it consists of the searcher and the page scorer, and computing the page score is the core of this part (a query sketch is given at the end of this answer).

Take the SOPI search engine system as an example. SOPI is a small search engine with functions similar to Baidu and Google, suitable for the information search and display services of small and medium-sized websites and enterprises; all of the content on its website is obtained automatically by the system. Its performance parameters are as follows:

- Platform: 1U compatible server, dual-core Xeon 2.8 GHz, 1 GB memory
- Index library size: 5 GB
- Database: SQL Server 2005
- Running environment: Microsoft .NET Framework SDK v2.0
- Average memory usage: 600-900 MB
- CPU usage: 10%-80%
- New articles and images added per day: 65438+ million
- Search time: 0.3-1 second over the 5 GB of indexed content

SOPI consists of five parts: an information collection system, an information analysis system, an index system, a management system, and the website platform.

The main workflow of a search engine is as follows. Everything starts with the spider. At a fixed interval (for Google, typically every 28 days), the spider program starts automatically, reads the URL list on the URL server, and crawls the website designated by each URL in depth-first or breadth-first order. Each fetched page is assigned a unique document ID (DocId) and stored in the document database, usually after being compressed, and all hyperlinks on the page are sent back to the URL server (see the crawler sketch below). While crawling is under way, the word segmenter and indexer process the fetched documents, compute a weight for each word according to its position and frequency in the page, and store the segmentation results in the index database (see the indexing sketch below). Once the whole crawl and the indexing are finished, the index database and the document database are updated so that users can query the latest web information.
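To make the crawling step concrete, here is a minimal sketch of a breadth-first crawler along the lines described above, written in Python. The frontier deque stands in for the URL server's list, and the docs dictionary stands in for the document database; all names here are illustrative, not part of Google's or SOPI's actual code.

```python
import re
import zlib
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch each page, assign it a DocId, store the
    compressed body, and feed discovered hyperlinks back into the frontier."""
    frontier = deque(seed_urls)   # stands in for the URL server's URL list
    seen = set(seed_urls)
    docs = {}                     # DocId -> compressed page (document database)
    next_doc_id = 0

    while frontier and len(docs) < max_pages:
        url = frontier.popleft()  # popleft() gives breadth-first order
        try:
            html = urlopen(url, timeout=5).read()
        except OSError:
            continue
        docs[next_doc_id] = zlib.compress(html)  # compress before storing
        next_doc_id += 1
        # every hyperlink found on the page goes back to the URL server
        for href in re.findall(rb'href="([^"]+)"', html):
            link = urljoin(url, href.decode('utf-8', 'ignore'))
            if link.startswith('http') and link not in seen:
                seen.add(link)
                frontier.append(link)
    return docs
```

Popping from the same end the links are pushed to, instead of using popleft, would turn the same loop into the depth-first variant the text mentions.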
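The segmentation and indexing step can be sketched the same way. A plain whitespace split stands in for a real word segmenter, and each term's weight combines its frequency with how early it first appears in the page, which is one simple reading of "weight according to position and frequency"; the exact formula is an assumption made for illustration.

```python
import re
import zlib
from collections import defaultdict

def build_index(docs):
    """docs maps DocId -> compressed page, as produced by crawl() above.
    Returns an inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, blob in docs.items():
        html = zlib.decompress(blob).decode('utf-8', 'ignore')
        text = re.sub(r'<[^>]+>', ' ', html)   # crude tag stripping
        words = text.lower().split()           # stand-in for a real word segmenter
        if not words:
            continue
        positions = defaultdict(list)
        for pos, word in enumerate(words):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            freq = len(pos_list) / len(words)            # frequency component
            early = 1.0 / (1 + pos_list[0])              # position component: earlier is heavier
            index[word].append((doc_id, freq + early))   # illustrative combined weight
    return index
```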
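On the query side, the searcher's union-and-intersection merging and score-based sorting described above can be sketched as follows; this shows the control flow only, not Google's or SOPI's actual ranking.

```python
def search(index, query, mode='and'):
    """Look up each query term in the inverted index, combine the posting
    lists by intersection (AND) or union (OR), and sort matches by score."""
    postings = [dict(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    doc_sets = [set(p) for p in postings]
    matched = set.intersection(*doc_sets) if mode == 'and' else set.union(*doc_sets)
    scores = {doc_id: sum(p.get(doc_id, 0.0) for p in postings) for doc_id in matched}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A pipeline such as docs = crawl(['http://example.com']), index = build_index(docs), search(index, 'web crawler') then returns (DocId, score) pairs, from which snippets could be pulled out of the document database for display.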