MySQL Conference – Building a Vertical Search Engine in a Day

Moving into the user sessions on the first day at MySQL Conference 2007, I attended “Building a Vertical Search Engine in a Day”.

Some of my notes for reference.

Web Crawling 101

  • Injection list – the seed URLs you are starting from
  • Fetching the pages
  • Parsing the content – words and links
  • Updating the crawl DB
  • Whitelist
  • Blacklist
  • Convergence — avoiding the honey pots
  • Index
  • Map-reduce — split a large problem into little pieces, process in parallel, then combine results
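
To make the steps above concrete for myself, here is a minimal sketch of the inject, fetch, parse, update loop. It is not what Nutch does internally: the in-memory `DEMO_WEB`, the status strings, and the `max_per_host` cap (my stand-in for "avoiding the honey pots") are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
DEMO_WEB = {
    "http://www.mysql.com/": ("MySQL database", [
        "http://dev.mysql.com/doc/",
        "http://spam.example.com/ad",
        "http://trap.example.org/1",
    ]),
    "http://dev.mysql.com/doc/": ("MySQL manual", ["http://www.mysql.com/"]),
    "http://trap.example.org/1": ("calendar", ["http://trap.example.org/2"]),
    "http://trap.example.org/2": ("calendar", ["http://trap.example.org/3"]),
    "http://trap.example.org/3": ("calendar", ["http://trap.example.org/4"]),
}


def allowed(url, whitelist, blacklist):
    """Blacklisted hosts are always rejected; a whitelist, if given, must match."""
    host = urlparse(url).netloc
    if host in blacklist:
        return False
    return whitelist is None or host in whitelist


def crawl(seeds, web, whitelist=None, blacklist=frozenset(),
          max_per_host=2, max_loops=10):
    crawl_db = {url: "unfetched" for url in seeds}  # injection list
    index = {}                                      # word -> set of URLs
    fetched_per_host = {}
    for _ in range(max_loops):
        batch = [u for u, status in crawl_db.items() if status == "unfetched"]
        if not batch:
            break  # nothing new to fetch: the crawl has converged
        for url in batch:
            host = urlparse(url).netloc
            if fetched_per_host.get(host, 0) >= max_per_host:
                crawl_db[url] = "skipped"  # crude honey-pot guard (assumption)
                continue
            page = web.get(url)            # "fetch" from the fake web
            if page is None:
                crawl_db[url] = "gone"
                continue
            text, links = page
            fetched_per_host[host] = fetched_per_host.get(host, 0) + 1
            crawl_db[url] = "fetched"
            for word in text.lower().split():   # parse words into the index
                index.setdefault(word, set()).add(url)
            for link in links:                  # parse links, update crawl DB
                if link not in crawl_db and allowed(link, whitelist, blacklist):
                    crawl_db[link] = "unfetched"
    return crawl_db, index
```

Calling `crawl(["http://www.mysql.com/"], DEMO_WEB, blacklist={"spam.example.com"})` fetches the two MySQL pages, never adds the blacklisted host to the crawl DB, and stops following the `trap.example.org` chain once the per-host cap is reached.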

Focused content == vertical crawl

  • 20 billion pages out there, a lot of it junk
  • Breadth-first would take years and cost millions of dollars

OPIC + Term Vectors = Depth-first

  • OPIC is “On-line Page Importance Computation” (see the “Fixing OPIC Scoring” paper)
  • Pure OPIC means “Fetch well-linked pages first”
  • We modify it to “fetch pages about MySQL first”
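
A rough sketch of how I understand the idea, not the speakers' code or Nutch's scoring plugin: each page holds some "cash", the richest unfetched page is fetched next, and its cash is split among its outlinks. The focused variant scales the cash passed on by how closely the page's term vector matches the topic. The graph `DEMO_GRAPH`, the cosine weighting, and fetching each page only once are my own simplifying assumptions.

```python
import math
from collections import Counter

# Hypothetical link graph: page id -> (page text, outgoing links).
DEMO_GRAPH = {
    "seed": ("mysql database news", ["a", "b"]),
    "a": ("mysql replication mysql tuning", ["c1", "c2"]),
    "b": ("celebrity gossip", ["d"]),
    "c1": ("mysql backup", []),
    "c2": ("mysql indexes", []),
    "d": ("more gossip", []),
}


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def focused_opic_order(web, seeds, topic_terms, focused=True, max_fetches=10):
    """Return the order in which pages would be fetched."""
    topic = Counter(topic_terms)
    cash = {url: 1.0 / len(seeds) for url in seeds}  # unfetched pages only
    done = set()
    order = []
    while cash and len(order) < max_fetches:
        url = max(cash, key=cash.get)      # richest page first
        amount = cash.pop(url)
        done.add(url)
        order.append(url)
        text, links = web[url]
        # Pure OPIC passes on all the cash; the focused variant scales it
        # by topical relevance (an assumption about how to combine the two).
        relevance = cosine(Counter(text.lower().split()), topic) if focused else 1.0
        targets = [l for l in links if l in web and l not in done]
        if targets:
            share = amount * relevance / len(targets)
            for link in targets:
                cash[link] = cash.get(link, 0.0) + share
    return order
```

On this toy graph, pure OPIC (`focused=False`) fetches the gossip page `d` before `c1`, because `d` inherits all of its parent's cash. With the topic `["mysql", "database"]` the off-topic parent passes on nothing, so `c1` and `c2` are fetched before `d`.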

Nutch and Hadoop are the technologies used, running on a 4-server cluster. A sample crawl starting from www.mysql.com ran 23 loops, fetched 150k pages, and found 2M URLs.
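
Hadoop supplies the map-reduce model from the crawling notes above. As a reminder to myself of its shape, here is a single-process sketch that counts discovered links per host. Hadoop would spread the map and reduce steps across the cluster; the function names and sample data here are hypothetical and have nothing to do with Nutch's actual jobs.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_links(page_url, links):
    """Map step: emit (host, 1) for every link found on one page."""
    for link in links:
        yield urlparse(link).netloc, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(host, values):
    """Reduce step: combine all values for one host."""
    return host, sum(values)


def links_per_host(pages):
    """Run map, shuffle, and reduce over a dict of page URL -> links."""
    pairs = [pair
             for url, links in pages.items()
             for pair in map_links(url, links)]
    return dict(reduce_counts(host, values)
                for host, values in shuffle(pairs).items())
```

Each page can be mapped independently and each host reduced independently, which is what lets the work be split into little pieces and processed in parallel.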

Serving up the results