Moving into the user sessions on the first day at MySQL Conference 2007, I attended “Building a Vertical Search Engine in a Day”.
Here are some of my notes for reference.
Web Crawling 101
- Injection list – the seed URLs you are starting from
- Fetching the pages
- Parsing the content – words and links
- Updating the crawl DB
- Whitelist
- Blacklist
- Convergence — avoiding the honey pots
- Index
- Map-reduce — split a large problem into little pieces, process in parallel, then combine results
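The crawl steps above can be sketched in a few lines of Python. This is only an illustration of the loop, not how Nutch implements it: the in-memory `FAKE_WEB` dictionary stands in for real HTTP fetching, and the names `crawl`, `allowed`, `WHITELIST` and `BLACKLIST` are my own placeholders.

```python
import re
from urllib.parse import urlparse

# Hypothetical in-memory "web" so the sketch runs without network access:
# URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://www.mysql.com/": (
        "MySQL database home",
        ["http://www.mysql.com/doc", "http://spam.example/trap"],
    ),
    "http://www.mysql.com/doc": (
        "MySQL reference manual",
        ["http://www.mysql.com/", "http://other.example/page"],
    ),
    "http://spam.example/trap": ("buy now", ["http://spam.example/trap?x=1"]),
}

WHITELIST = {"www.mysql.com"}   # hosts we are willing to fetch (assumed)
BLACKLIST = {"spam.example"}    # hosts we never fetch (assumed)


def allowed(url):
    """Apply the blacklist first, then require a whitelisted host."""
    host = urlparse(url).hostname
    if host in BLACKLIST:
        return False
    return host in WHITELIST


def crawl(seeds, max_loops=5):
    crawl_db = {url: "unfetched" for url in seeds}  # the injection list
    index = {}                                      # word -> set of URLs
    for _ in range(max_loops):
        batch = [u for u, state in crawl_db.items()
                 if state == "unfetched" and allowed(u)]
        if not batch:
            break  # nothing new we are allowed to fetch: the crawl converged
        for url in batch:
            page = FAKE_WEB.get(url)       # "fetch" the page
            if page is None:
                crawl_db[url] = "gone"
                continue
            text, links = page             # "parse" into words and links
            for word in re.findall(r"\w+", text.lower()):
                index.setdefault(word, set()).add(url)
            crawl_db[url] = "fetched"
            for link in links:             # update the crawl DB
                crawl_db.setdefault(link, "unfetched")
    return crawl_db, index


if __name__ == "__main__":
    db, idx = crawl(["http://www.mysql.com/"])
    print(db)
    print(sorted(idx))
```

URLs on hosts outside the whitelist are still recorded in the crawl DB but never fetched, which is one simple way to keep a crawl from wandering into a honey pot.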
Focused content == vertical crawl
- 20 Billion Pages out there, a lot of junk
- Breadth-first would take years and cost millions
OPIC + Term Vectors = Depth-first
- OPIC is “On-line Page Importance Computation” (see the “Fixing OPIC Scoring” paper)
- Pure OPIC means “Fetch well-linked pages first”
- We modify it to “fetch pages about MySQL first”
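To make the “fetch pages about MySQL first” idea concrete, here is a heavily simplified sketch of my own. It is not the speaker's code and not Nutch's scoring plugin: each page holds some “cash”, the page with the most cash is fetched next, and the cash it passes to its outlinks is scaled by how close its term vector is to a topic vector. The `PAGES` data, the `TOPIC` terms and the function names are all assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical pages: name -> (text, outgoing links).
PAGES = {
    "seed": ("mysql database server", ["docs", "news"]),
    "docs": ("mysql replication and mysql index tuning", ["deep"]),
    "news": ("celebrity gossip and sports", ["junk"]),
    "deep": ("mysql query optimizer", []),
    "junk": ("more gossip", []),
}

# Assumed topic description, expressed as a term vector.
TOPIC = Counter(["mysql", "database", "query", "index", "replication"])


def term_vector(text):
    """Count lowercase word occurrences in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def focused_crawl(seed, budget):
    cash = {seed: 1.0}   # pages waiting to be fetched, with their cash
    fetched = []
    while cash and len(fetched) < budget:
        url = max(cash, key=cash.get)      # richest page goes first
        amount = cash.pop(url)
        text, links = PAGES[url]
        fetched.append(url)
        # Pure OPIC would split `amount` evenly among the outlinks;
        # here it is first scaled by the page's relevance to the topic.
        relevance = cosine(term_vector(text), TOPIC)
        new_links = [l for l in links if l not in fetched]
        for link in new_links:
            share = amount * relevance / len(new_links)
            cash[link] = cash.get(link, 0.0) + share
    return fetched


if __name__ == "__main__":
    print(focused_crawl("seed", budget=4))
```

With a budget of four fetches, the off-topic “news” page passes no cash to “junk”, so the crawl spends its last fetch on “deep” instead.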
Nutch and Hadoop are the technologies used, running on a 4-server cluster. A sample crawl starting with www.mysql.com ran 23 loops, fetched 150k pages and found 2M URLs.
Serving up the results
- Generating the index
- Setting up Nutch with Tomcat (can also run with Resin). See “Introduction to Nutch, Part 2: Searching” and “Running Nutch with Mac OSX”
- Single searcher vs. multiple searchers
- Optimizing the user interface