MySQL Conference – Building a Vertical Search Engine in a Day

Ronald Bradford
April 24, 2007

Moving into the user sessions on the first day at MySQL Conference 2007, I attended “Building a Vertical Search Engine in a Day”.

Some of my notes for reference.

Web Crawling 101

Injection list – the seed URLs you are starting from
Fetching the pages
Parsing the content – words and links
Updating the crawl DB
Whitelist
Blacklist
Convergence — avoiding the honey pots
Index
Map-reduce — split a large problem into little pieces, process in parallel, then combine results
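
The crawl steps above can be sketched as a small loop. This is only an illustration and not how Nutch implements it: the in-memory FAKE_WEB dictionary, the host lists and the function names are all made up for the example, and a simple FIFO queue stands in for a real fetch schedule.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch these pages over HTTP.
FAKE_WEB = {
    "http://www.mysql.com/": ("mysql database server",
                              ["http://dev.mysql.com/doc/", "http://spam.example/trap"]),
    "http://dev.mysql.com/doc/": ("mysql reference manual",
                                  ["http://www.mysql.com/", "http://other.example/"]),
    "http://spam.example/trap": ("buy now", ["http://spam.example/trap?page=2"]),
}

WHITELIST = {"www.mysql.com", "dev.mysql.com"}  # hosts we are willing to crawl
BLACKLIST = {"spam.example"}                     # hosts we never crawl


def allowed(url):
    host = urlparse(url).hostname
    return host in WHITELIST and host not in BLACKLIST


def crawl(seeds, max_pages=100):
    crawl_db = {}   # URL -> "unfetched" or "fetched"
    index = {}      # word -> set of URLs containing that word
    frontier = deque()
    for url in seeds:                      # injection list
        if allowed(url):
            crawl_db[url] = "unfetched"
            frontier.append(url)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = FAKE_WEB.get(url)           # fetch the page
        crawl_db[url] = "fetched"
        fetched += 1
        if page is None:
            continue
        text, links = page                 # parse words and links
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:                 # update the crawl DB
            if allowed(link) and link not in crawl_db:
                crawl_db[link] = "unfetched"
                frontier.append(link)
    return crawl_db, index


crawl_db, index = crawl(["http://www.mysql.com/"])
```

The whitelist and blacklist checks keep the trap host out of the crawl DB entirely, which is one simple way to stay clear of honey pots.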

Focused content == vertical crawl

20 Billion Pages out there, a lot of junk
Breadth-first would take years and cost millions

OPIC + Term Vectors = Depth-first

OPIC is “On-line Page Importance Computation”. See: Fixing OPIC Scoring Paper
Pure OPIC means “Fetch well-linked pages first”
We modify it to “fetch pages about MySQL first”
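
To make the difference concrete, here is a rough sketch of the idea, not the scoring code from the talk. In simplified OPIC each page holds some "cash", the richest page is fetched next, and its cash is split equally among its outlinks. For the focused variant I am assuming the cash passed on is scaled by the cosine similarity between the page's term vector and a topic vector. The PAGES graph, the TOPIC vector and the function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical pages: name -> (text, outlinks).
PAGES = {
    "a": ("mysql", ["b", "c"]),
    "b": ("mysql tuning", ["d", "f"]),
    "c": ("celebrity gossip", ["e"]),
    "d": ("mysql indexes", []),
    "e": ("more gossip", []),
    "f": ("mysql backup", []),
}

TOPIC = Counter({"mysql": 1})  # assumed topic term vector


def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def crawl_order(seed, focused=True):
    cash = {seed: 1.0}
    fetched = []
    while cash:
        url = max(cash, key=cash.get)      # fetch the richest page first
        amount = cash.pop(url)
        fetched.append(url)
        text, links = PAGES[url]
        if focused:
            # Assumption: off-topic pages pass on little or no cash.
            amount *= cosine(Counter(text.split()), TOPIC)
        for link in links:
            if link not in fetched:
                cash[link] = cash.get(link, 0.0) + amount / len(links)
    return fetched


print(crawl_order("a", focused=False))  # pure OPIC order
print(crawl_order("a", focused=True))   # topic-weighted order
```

In this toy graph pure OPIC fetches the gossip page "e" ahead of the MySQL pages "d" and "f", while the weighted version pushes "e" to the end. The off-topic page "c" is still fetched in both cases, because its relevance is only known once it has been fetched.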

Nutch & Hadoop are the technologies used, running on a 4 server cluster. A sample crawl starting with www.mysql.com ran 23 loops, fetched 150k pages and found 2M URLs.

Serving up the results

Generating the index
Setting up Nutch with Tomcat (can also run with Resin). See: Introduction to Nutch, Part 2: Searching; Running Nutch with Mac OSX
Single searcher vs. multiple searchers
Optimizing the user interface
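
On the single vs. multiple searchers point, one way to picture multiple searchers is that each one holds part of the index and returns its own top hits, and a front end merges them. The sketch below is only my illustration of that merge step and does not use the Nutch API; the shard contents, scores and function names are invented.

```python
import heapq
from itertools import islice

# Hypothetical index shards: URL -> (text, score).
SHARD_1 = {"u1": ("mysql replication", 0.9), "u2": ("mysql backup", 0.4)}
SHARD_2 = {"u3": ("mysql tuning", 0.7), "u4": ("postgres tuning", 0.8)}


def search_shard(shard, term, k):
    """One searcher: return its top-k (score, url) hits, best first."""
    hits = [(score, url) for url, (text, score) in shard.items()
            if term in text.split()]
    return sorted(hits, reverse=True)[:k]


def search_all(shards, term, k=3):
    """Front end: query every searcher and merge the sorted results."""
    per_shard = [search_shard(shard, term, k) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [url for score, url in islice(merged, k)]


results = search_all([SHARD_1, SHARD_2], "mysql", k=2)
```

With a single searcher the merge step disappears, at the cost of keeping the whole index on one machine.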

Tagged with: Databases, General, MySQL, MySQL Conference & Expo 2007
