MySQL Conference – Building a Vertical Search Engine in a Day

Moving into the user sessions on the first day of the MySQL Conference 2007, I attended Building a Vertical Search Engine in a Day.

Some of my notes for reference.

Web Crawling 101

  • Injection list – the seed URLs you are starting from
  • Fetching the pages
  • Parsing the content – words and links
  • Updating the crawl DB
  • Whitelist
  • Blacklist
  • Convergence — avoiding the honey pots
  • Index
  • Map-reduce — split a large problem into little pieces, process in parallel, then combine results
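
To make the steps above concrete, here is a minimal sketch of a crawl loop in Python. It is not how Nutch implements crawling; the in-memory `FAKE_WEB`, the host names, and the helper names (`fetch`, `allowed`, `crawl`) are all assumptions made up for illustration so the example runs without network access.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://spam.example/x">spam</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": "MySQL notes, no links",
}

WHITELIST = {"example.com"}    # hosts we are allowed to crawl (assumed)
BLACKLIST = {"spam.example"}   # hosts we never crawl (assumed)

LINK_RE = re.compile(r'href="([^"]+)"')
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z]+")


def fetch(url):
    """Stand-in for an HTTP fetch; returns the page text or None."""
    return FAKE_WEB.get(url)


def allowed(url):
    """Apply the whitelist and blacklist to a URL's host."""
    host = urlparse(url).hostname or ""
    return host in WHITELIST and host not in BLACKLIST


def crawl(seeds, max_loops=5):
    # Injection list: the seed URLs go into the crawl DB as unfetched.
    crawl_db = {url: "unfetched" for url in seeds if allowed(url)}
    index = {}  # word -> set of URLs containing it
    for _ in range(max_loops):  # bounded number of loops
        batch = [u for u, state in crawl_db.items() if state == "unfetched"]
        if not batch:
            break
        for url in batch:
            page = fetch(url)                      # fetch the page
            if page is None:
                crawl_db[url] = "failed"
                continue
            crawl_db[url] = "fetched"              # update the crawl DB
            text = TAG_RE.sub(" ", page).lower()   # parse the words
            for word in WORD_RE.findall(text):
                index.setdefault(word, set()).add(url)
            for href in LINK_RE.findall(page):     # parse the links
                link = urljoin(url, href)
                if allowed(link) and link not in crawl_db:
                    crawl_db[link] = "unfetched"
    return crawl_db, index


crawl_db, index = crawl(["http://example.com/"])
print(sorted(crawl_db))
print(index["mysql"])
```

Each pass of the outer loop corresponds to one fetch/parse/update cycle. A real crawler would run those stages as separate parallel jobs, which is where map-reduce comes in.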

Focused content == vertical crawl

  • 20 Billion Pages out there, a lot of junk
  • Breadth-first would take years and cost millions

OPIC + Term Vectors = Depth-first

  • OPIC is “On-line Page Importance Computation”. See the “Fixing OPIC Scoring” paper.
  • Pure OPIC means “Fetch well-linked pages first”
  • We modify it to “fetch pages about MySQL first”
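
A rough sketch of the idea, not the presenters' code and not Nutch's scoring plugin: in OPIC each page holds some "cash", the page with the most cash is fetched next, and its cash is split among its outlinks. The focused variant below is an assumption about how term vectors could be combined with that: it scales the cash a page passes on by the cosine similarity between the page's term vector and a topic vector. The link graph, page texts, and topic terms are invented for the example, and already-fetched pages are simply skipped, which real OPIC does not do.

```python
import math
from collections import Counter

# Hypothetical toy link graph and page texts (not from the talk).
LINKS = {
    "seed": ["mysql-docs", "cat-pictures"],
    "mysql-docs": ["replication", "innodb"],
    "cat-pictures": ["more-cats"],
    "replication": [],
    "innodb": [],
    "more-cats": [],
}
TEXT = {
    "seed": "mysql database news and cat pictures",
    "mysql-docs": "mysql database replication innodb index",
    "cat-pictures": "cat pictures funny cat",
    "replication": "mysql replication master slave",
    "innodb": "innodb storage engine mysql",
    "more-cats": "cat cat cat",
}
# Assumed topic term vector describing "pages about MySQL".
TOPIC = Counter("mysql database replication innodb".split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def crawl_order(seed, focused=True):
    """Return the order in which pages would be fetched."""
    cash = {seed: 1.0}
    fetched = []
    while cash:
        url = max(cash, key=cash.get)   # fetch the richest page first
        amount = cash.pop(url)
        fetched.append(url)
        # Pure OPIC passes on all cash; the focused variant scales it
        # by how well the fetched page matches the topic.
        weight = cosine(Counter(TEXT[url].split()), TOPIC) if focused else 1.0
        outlinks = [u for u in LINKS[url] if u not in fetched]
        for out in outlinks:
            cash[out] = cash.get(out, 0.0) + amount * weight / len(outlinks)
    return fetched


print(crawl_order("seed", focused=False))
print(crawl_order("seed", focused=True))
```

With this toy graph, the pure version reaches `more-cats` before the MySQL pages because it is the only outlink of a well-funded page, while the focused version starves it of cash and fetches `replication` and `innodb` first.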

Nutch & Hadoop are the technologies used, running on a 4-server cluster. A sample crawl starting from www.mysql.com ran 23 loops, fetching 150k pages and finding 2M URLs.

Serving up the results

