MySQL Conference – Building a Vertical Search Engine in a Day

Moving into the user sessions on the first day at MySQL Conference 2007, I attended “Building a Vertical Search Engine in a Day”.

Some of my notes for reference.

Web Crawling 101

  • Injection list – the seed URLs you are starting from
  • Fetching the pages
  • Parsing the content – words and links
  • Updating the crawl DB
  • Whitelist
  • Blacklist
  • Convergence — avoiding the honey pots
  • Index
  • Map-reduce — split a large problem into little pieces, process in parallel, then combine results
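To make the steps above concrete, here is a minimal sketch of a crawl loop in Python. It is not how Nutch does it; the in-memory `FAKE_WEB`, the host lists, and all function names are my own assumptions so the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web" so the sketch runs without fetching anything real.
FAKE_WEB = {
    "http://www.mysql.com/":
        '<a href="http://dev.mysql.com/doc/">docs</a> '
        '<a href="http://spam.example/trap?page=1">x</a> MySQL database',
    "http://dev.mysql.com/doc/":
        '<a href="http://www.mysql.com/">home</a> '
        '<a href="http://other.example/">other</a> reference manual',
    "http://spam.example/trap?page=1":
        '<a href="http://spam.example/trap?page=2">next</a>',
}

WHITELIST = {"www.mysql.com", "dev.mysql.com"}  # hosts we are allowed to crawl (assumed)
BLACKLIST = {"spam.example"}                    # hosts we never crawl (assumed)

LINK_RE = re.compile(r'href="([^"]+)"')
TAG_RE = re.compile(r"<[^>]+>")


def allowed(url):
    """A URL is crawlable if its host is whitelisted and not blacklisted."""
    host = urlparse(url).netloc
    return host in WHITELIST and host not in BLACKLIST


def fetch(url):
    """Stand-in for an HTTP fetch: look the page up in the fake web."""
    return FAKE_WEB.get(url, "")


def parse(html):
    """Split a page into its words and its outgoing links."""
    links = LINK_RE.findall(html)
    words = TAG_RE.sub(" ", html).lower().split()
    return words, links


def crawl(seeds, max_pages=100):
    """Start from the injection list, then fetch, parse, and update the crawl DB."""
    crawl_db = {url: "unfetched" for url in seeds}
    index = {}
    queue = deque(seeds)
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if crawl_db.get(url) == "fetched":
            continue
        if not allowed(url):
            crawl_db[url] = "skipped"
            continue
        words, links = parse(fetch(url))
        crawl_db[url] = "fetched"
        index[url] = words
        for link in links:
            if link not in crawl_db:  # only queue URLs we have not seen yet
                crawl_db[link] = "unfetched"
                queue.append(link)
    return crawl_db, index


crawl_db, index = crawl(["http://www.mysql.com/"])
```

Calling `crawl(["http://www.mysql.com/"])` fetches the two whitelisted pages and records the other discovered URLs as skipped. The `max_pages` cap is only a crude guard; a real crawler needs much more to converge and stay out of honey pots.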

Focused content == vertical crawl

  • 20 Billion Pages out there, a lot of junk
  • Breadth-first would take years and cost millions of dollars

OPIC + Term Vectors = Depth-first

  • OPIC is “On-line Page Importance Calculation”. See the “Fixing OPIC Scoring” paper
  • Pure OPIC means “Fetch well-linked pages first”
  • We modify it to “fetch pages about MySQL first”
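As a rough illustration of the idea, the sketch below gives each page some “cash”, and when a page is fetched it hands that cash to its outlinks. To favour pages about MySQL, the amount handed on is scaled by the cosine similarity between the page's term vector and a topic vector. This is my own simplified take and not the speaker's or Nutch's actual scoring; `TOPIC`, `distribute_cash` and `next_to_fetch` are hypothetical names.

```python
import math
from collections import Counter

# Hypothetical topic term vector describing what "about MySQL" means.
TOPIC = Counter({"mysql": 3, "database": 2, "sql": 1})


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def distribute_cash(cash, url, page_terms, outlinks, topic=TOPIC):
    """After fetching url, split its cash across its outlinks.

    Pure OPIC would pass on all of the cash; here it is scaled by topical
    relevance (an assumption), so off-topic pages pass on less.
    """
    relevance = cosine(Counter(page_terms), topic)
    budget = cash.pop(url, 0.0) * relevance
    if outlinks:
        share = budget / len(outlinks)
        for link in outlinks:
            cash[link] = cash.get(link, 0.0) + share


def next_to_fetch(cash, fetched):
    """Pick the unfetched URL holding the most cash."""
    candidates = {u: c for u, c in cash.items() if u not in fetched}
    return max(candidates, key=candidates.get) if candidates else None


cash = {"on-topic": 1.0, "off-topic": 1.0}
distribute_cash(cash, "on-topic", ["mysql", "database"], ["x"])
distribute_cash(cash, "off-topic", ["cooking", "recipes"], ["y"])
best = next_to_fetch(cash, fetched={"on-topic", "off-topic"})
```

Here `x`, linked from the on-topic page, ends up with more cash than `y` and is fetched first, which is the “fetch pages about MySQL first” behaviour in miniature.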

Nutch and Hadoop are the technologies used, running on a 4-server cluster. A sample crawl starting from www.mysql.com ran 23 loops, fetched 150k pages, and found 2M URLs.
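Hadoop is where the map-reduce idea from the crawling notes comes in. The sketch below shows the shape of it in plain Python, counting inbound links per URL. It runs in a single process and does not use the Hadoop API; the function names and sample pages are assumptions of mine.

```python
from collections import defaultdict
from itertools import chain


def map_page(page):
    """Map step: emit (target_url, 1) for every outlink on one page."""
    _url, outlinks = page
    return [(target, 1) for target in outlinks]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all the values emitted for one key."""
    return key, sum(values)


def inbound_link_counts(pages):
    """Run map, shuffle, and reduce in sequence over (url, outlinks) pairs."""
    mapped = chain.from_iterable(map(map_page, pages))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input: each page is (url, list of outlinks).
pages = [("a", ["b", "c"]), ("b", ["c"])]
counts = inbound_link_counts(pages)
```

Each call to `map_page` only looks at one page, which is what lets a cluster spread the map step across servers before the results are combined.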

Serving up the results

