MySQL Conference – Building a Vertical Search Engine in a Day

Ronald Bradford
April 24, 2007

Moving into the user sessions on the first day of the MySQL Conference 2007, I attended Building a Vertical Search Engine in a Day.

Some of my notes for reference.

Web Crawling 101

Injection list – the seed URLs you are starting from
Fetching the pages
Parsing the content – words and links
Updating the crawl DB
Whitelist
Blacklist
Convergence — avoiding the honey pots
Index
Map-reduce — split a large problem into little pieces, process in parallel, then combine results
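
As a rough illustration of how the steps above fit together, here is a minimal sketch of a crawl loop followed by a map-reduce style index build. It is not the Nutch implementation. The in-memory `FAKE_WEB`, the host names, and all function names are hypothetical stand-ins so the example runs offline, and a fixed loop limit stands in for convergence handling.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse
import re

# Hypothetical in-memory "web": URL -> (text, outgoing links). A real crawler
# would fetch over HTTP; this stand-in keeps the sketch runnable offline.
FAKE_WEB = {
    "http://a.example/": ("mysql replication basics",
                          ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": ("mysql index tuning", ["http://a.example/"]),
    "http://b.example/": ("cooking recipes", ["http://trap.example/1"]),
    "http://trap.example/1": ("spam spam", ["http://trap.example/2"]),
}

WHITELIST = {"a.example", "b.example"}  # hosts we are willing to crawl
BLACKLIST = {"trap.example"}            # hosts we never crawl


def allowed(url):
    host = urlparse(url).netloc
    return host in WHITELIST and host not in BLACKLIST


def crawl(seeds, max_loops=10):
    crawl_db = {}                                      # URL -> fetched page text
    frontier = deque(u for u in seeds if allowed(u))   # injection list
    seen = set(frontier)
    for _ in range(max_loops):       # bounded loops so a trap cannot run forever
        if not frontier:
            break
        next_frontier = deque()
        for url in frontier:
            page = FAKE_WEB.get(url)                   # fetch
            if page is None:
                continue
            text, links = page                         # parse: words and links
            crawl_db[url] = text                       # update the crawl DB
            for link in links:
                if link not in seen and allowed(link):
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return crawl_db


def map_phase(url, text):
    # Emit one (word, url) pair per word; each page is processed independently.
    return [(word, url) for word in re.findall(r"\w+", text.lower())]


def reduce_phase(pairs):
    # Combine the pairs into an inverted index: word -> set of URLs.
    index = defaultdict(set)
    for word, url in pairs:
        index[word].add(url)
    return index


def build_index(crawl_db):
    pairs = []
    for url, text in crawl_db.items():  # these map calls could run in parallel
        pairs.extend(map_phase(url, text))
    return reduce_phase(pairs)
```

Calling `build_index(crawl(["http://a.example/"]))` gives an inverted index over the three pages on allowed hosts. The blacklisted host is never fetched.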

Focused content == vertical crawl

20 Billion Pages out there, a lot of junk
Breadth-first would take years and cost millions

OPIC + Term Vectors = Depth-first

OPIC is “On-line Page Importance Calculation” (see the “Fixing OPIC Scoring” paper).
Pure OPIC means “Fetch well-linked pages first”
We modify it to “fetch pages about MySQL first”
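
My notes do not record how the presenters combined OPIC with term vectors, so the following is only a sketch of one plausible reading. Each page passes its "cash" to its outlinks as in OPIC, but scaled by the cosine similarity between the page's term vector and a topic vector. The `TOPIC` weights, the function names, and the simplification of fetching each page only once are all assumptions of mine.

```python
import math
from collections import Counter

# Hypothetical topic term vector; the weights are made up for illustration.
TOPIC = Counter({"mysql": 3, "database": 2, "sql": 1})


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def relevance(text):
    # Term vector of the page (raw word counts) compared with the topic.
    return cosine(Counter(text.lower().split()), TOPIC)


def focused_opic(pages, seeds, steps):
    """pages: URL -> (text, outlinks). Returns URLs in the order fetched."""
    cash = {url: 0.0 for url in pages}
    for s in seeds:
        cash[s] = 1.0 / len(seeds)
    fetched = []
    for _ in range(steps):
        candidates = [u for u in pages if u not in fetched]
        if not candidates:
            break
        url = max(candidates, key=lambda u: cash[u])  # richest page first
        fetched.append(url)
        text, outlinks = pages[url]
        targets = [link for link in outlinks if link in pages]
        if targets:
            # Pure OPIC would pass on cash[url] unchanged; this focused
            # variant scales it by how on-topic the page is.
            share = cash[url] * relevance(text) / len(targets)
            for link in targets:
                cash[link] += share
        cash[url] = 0.0
    return fetched
```

With this weighting, pages linked from on-topic pages accumulate more cash and are fetched ahead of pages linked only from off-topic ones.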

Nutch and Hadoop are the technologies used, running on a 4-server cluster. A sample crawl starting from www.mysql.com ran 23 loops, fetched 150k pages, and found 2M URLs.

Serving up the results

Generating the index
Setting up Nutch with Tomcat (can also run with Resin). See: Introduction to Nutch, Part 2: Searching; Running Nutch with Mac OSX
Single searcher vs. multiple searchers
Optimizing the user interface
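
On single versus multiple searchers: the general idea of splitting an index across several searchers and merging their results can be sketched as below. This is not Nutch's API. The shard layout (term mapped to a list of `(score, url)` pairs) and the function names are hypothetical.

```python
import heapq


def search_shard(shard, term, k):
    """Return this shard's top-k (score, url) hits for the term."""
    return heapq.nlargest(k, shard.get(term, []))


def search_all(shards, term, k):
    """Ask every shard for its top-k hits, then keep the global top-k."""
    partial = []
    for shard in shards:  # in a real deployment these calls would run in parallel
        partial.extend(search_shard(shard, term, k))
    return heapq.nlargest(k, partial)
```

A single searcher is just the case of one shard. With several, each shard only needs to return its own top-k for the merged top-k to be correct.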

Tagged with: Databases, General, MySQL, MySQL Conference & Expo 2007
