MySQL Conference – Building a Vertical Search Engine in a Day

Ronald Bradford
April 24, 2007

Moving into the user sessions on the first day at MySQL Conference 2007, I attended Building a Vertical Search Engine in a Day.

Some of my notes for reference.

Web Crawling 101

Injection list – the seed URLs you are starting from
Fetching the pages
Parsing the content – words and links
Updating the crawl DB
Whitelist
Blacklist
Convergence — avoiding the honey pots
Index
Map-reduce — split a large problem into little pieces, process in parallel, then combine results
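
As a rough illustration of the steps above, here is a minimal sketch of a crawl loop in Python. It is not how Nutch does it; the in-memory `FAKE_WEB`, the `WHITELIST` and `BLACKLIST` host sets, and the `crawl` function are all hypothetical names made up for this example, and the `max_pages` cap is only a crude stand-in for convergence.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://www.mysql.com/": '<a href="/doc/">Docs</a> <a href="http://spam.example/">x</a>',
    "http://www.mysql.com/doc/": '<a href="/">Home</a> <a href="http://dev.mysql.com/">Dev</a>',
    "http://dev.mysql.com/": "MySQL developer zone",
}

WHITELIST = {"www.mysql.com", "dev.mysql.com"}  # hosts we are willing to fetch
BLACKLIST = {"spam.example"}                    # hosts we never fetch


class LinkParser(HTMLParser):
    """Collects the words and the href links found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def allowed(url):
    host = urlparse(url).netloc
    return host in WHITELIST and host not in BLACKLIST


def crawl(seeds, max_pages=100):
    crawl_db = {}          # url -> words found on that page
    queue = deque(seeds)   # the injection list seeds the queue
    seen = set(seeds)
    while queue and len(crawl_db) < max_pages:
        url = queue.popleft()
        html = FAKE_WEB.get(url)       # "fetch" the page
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)              # parse words and links
        crawl_db[url] = parser.words   # update the crawl DB
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append(link)
    return crawl_db
```

Calling `crawl(["http://www.mysql.com/"])` visits the three whitelisted pages and never queues the blacklisted host.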

Focused content == vertical crawl

20 Billion Pages out there, a lot of junk
Breadth-first would take years and cost millions

OPIC + Term Vectors = Depth-first

OPIC is “On-line Page Importance Calculation”. See the “Fixing OPIC Scoring” paper.
Pure OPIC means “Fetch well-linked pages first”
We modify it to “fetch pages about MySQL first”
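
To make the idea concrete, here is a small sketch of one way OPIC-style "cash" could be combined with term vectors. This is my own simplification, not the presenters' scoring code: the `TOPIC` vector, `distribute_cash` and `next_url` are hypothetical names, and scaling the cash by cosine similarity is an assumption about how "fetch pages about MySQL first" might be done.

```python
import math
from collections import Counter

# Hypothetical term vector describing the topic we care about.
TOPIC = Counter({"mysql": 3, "database": 2, "sql": 2})


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def distribute_cash(cash, url, text, outlinks):
    """Share a fetched page's cash equally among its outlinks.

    Pure OPIC would pass on all of the cash; here it is scaled by how
    closely the page text matches TOPIC (an assumption for this sketch).
    """
    relevance = cosine(Counter(text.lower().split()), TOPIC)
    amount = cash.pop(url, 0.0) * relevance
    for link in outlinks:
        cash[link] = cash.get(link, 0.0) + amount / len(outlinks)
    return relevance


def next_url(cash):
    """Pick the unfetched URL holding the most cash."""
    return max(cash, key=cash.get) if cash else None
```

With this scaling, links found on off-topic pages receive little or no cash, so the crawler keeps choosing URLs that were reached through on-topic pages.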

Nutch and Hadoop are the technologies used, running on a 4 server cluster. A sample crawl starting with www.mysql.com ran 23 loops, fetching 150k pages and finding 2M URLs.
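
Hadoop provides the map-reduce part of this: split the work, process the pieces in parallel, then combine. The sketch below shows only that shape in plain Python, using threads on one machine rather than Hadoop's own API. The `map_chunk` and `count_hosts` functions, and the example task of counting found URLs per host, are assumptions chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from urllib.parse import urlparse


def map_chunk(urls):
    """Map step: count the URLs per host within one chunk."""
    return Counter(urlparse(u).netloc for u in urls)


def count_hosts(urls, chunk_size=2, workers=4):
    # Split the large problem into little pieces.
    chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
    # Process the pieces in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: combine the partial counts into one result.
    return reduce(lambda a, b: a + b, partials, Counter())
```

On a real cluster the chunks would live on different servers, but the split, map and combine steps keep the same structure.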

Serving up the results

Generating the index
Setting up Nutch with Tomcat (can also run with Resin). See Introduction to Nutch, Part 2: Searching, and Running Nutch with Mac OSX.
Single searcher vs. multiple searchers
Optimizing the user interface

Tagged with: Databases, General, MySQL, MySQL Conference & Expo 2007
