Archive for the ‘Data Stores’ Category

NoSQL from a RDBMS company

Tuesday, October 4th, 2011

Oracle has announced an open source product for the NoSQL space, the Oracle NoSQL Database. Unlike other popular products including Redis, MongoDB, Cassandra and Voldemort, Oracle has set a benchmark for the features that are truly necessary in highly available data systems.

Many products in the NoSQL space have told you that consistency is not needed, that eventual consistency is good enough, and that transactions are not performant enough to include as a feature. No standards exist: there is no common interface for communication, and no set of key features that products aim to meet or better. With this product, built-in features including transactions, replicated data and failover set a bar that other open source NoSQL products will need to match.

Oracle NoSQL Database is a key/value store, supporting a major/minor key for co-locating regularly accessed information for more consistent data retrieval. The API (written in Java) supports GET, PUT and DEL operations. The system is designed to have no single point of failure, and to survive a node failure without impact. The replication factor is reported to enable up to 7 copies of information, which would support cross data center management. The database driver is latency aware, so it can load balance operations for optimal performance.

I am excited to hear about this and am looking forward to evaluating the software. I will be watching more closely how the integration of MySQL and Oracle NoSQL can become an offering for startups and Web 2.0 companies.

MongoDB Experience: Server logging

Friday, June 11th, 2010

By default the mongod process sends all output to stdout. You can also direct the daemon to log to a file, which is necessary for any production implementation. For example:

$ mongod --logpath=`pwd`/mongo.log  &
all output going to: /home/rbradfor/projects/mongo/mongo.log
^C

As you can see there is still a message written to stdout; that should be cleaned up for a GA release. The output you will see for a clean startup/shutdown is:

Fri Jun 11 14:05:29 Mongo DB : starting : pid = 7990 port = 27017 dbpath = /home/rbradfor/projects/mongo/data/current master = 0 slave = 0  64-bit
Fri Jun 11 14:05:29 db version v1.4.3, pdfile version 4.5
Fri Jun 11 14:05:29 git version: 47ffbdfd53f46edeb6ff54bbb734783db7abc8ca
Fri Jun 11 14:05:29 sys info: Linux domU-12-31-39-06-79-A1 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Fri Jun 11 14:05:29 waiting for connections on port 27017
Fri Jun 11 14:05:29 web admin interface listening on port 28017
Fri Jun 11 14:05:31 got kill or ctrl c signal 2 (Interrupt), will terminate after current cmd ends
Fri Jun 11 14:05:31 now exiting
Fri Jun 11 14:05:31  dbexit:
Fri Jun 11 14:05:31 	 shutdown: going to close listening sockets...
Fri Jun 11 14:05:31 	 going to close listening socket: 5
Fri Jun 11 14:05:31 	 going to close listening socket: 6
Fri Jun 11 14:05:31 	 shutdown: going to flush oplog...
Fri Jun 11 14:05:31 	 shutdown: going to close sockets...
Fri Jun 11 14:05:31 	 shutdown: waiting for fs preallocator...
Fri Jun 11 14:05:31 	 shutdown: closing all files...
Fri Jun 11 14:05:31      closeAllFiles() finished
Fri Jun 11 14:05:31 	 shutdown: removing fs lock...
Fri Jun 11 14:05:31  dbexit: really exiting now

MongoDB logging does not give an option to format the date/time appropriately. The format does not match the syslog format of Ubuntu/CentOS:

Jun  9 10:05:46 barney kernel: [1025968.983209] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Jun  9 10:05:46 barney kernel: [1025968.984518] SGI XFS Quota Management subsystem
Jun  9 10:05:46 barney kernel: [1025968.990183] JFS: nTxBlock = 8192, nTxLock = 65536
Jun  9 10:05:46 barney kernel: [1025969.007624] NTFS driver 2.1.29 [Flags: R/O MODULE].
Jun  9 10:05:46 barney kernel: [1025969.020995] QNX4 filesystem 0.2.3 registered.
Jun  9 10:05:46 barney kernel: [1025969.039264] Btrfs loaded
Jun  8 00:00:00 dc1 nagios: CURRENT HOST STATE: localhost;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.01 ms
Jun  8 00:00:00 dc1 nagios: CURRENT SERVICE STATE: localhost;Current Load;OK;HARD;1;OK - load average: 0.00, 0.00, 0.00
Jun  8 00:00:00 dc1 nagios: CURRENT SERVICE STATE: localhost;Current Users;OK;HARD;1;USERS OK - 2 users currently logged in
Jun  8 00:00:00 dc1 nagios: CURRENT SERVICE STATE: localhost;HTTP;CRITICAL;HARD;4;Connection refused
Jun  8 00:00:00 dc1 nagios: CURRENT SERVICE STATE: localhost;PING;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.01 ms
Jun  8 00:00:00 dc1 nagios: CURRENT SERVICE STATE: localhost;Root Partition;OK;HARD;1;DISK OK - free space: / 107259 MB (49% inode=98%):
Jun  8 00:00:00 dc1 nagios: CURRENT SERVICE STATE: localhost;SSH;OK;HARD;1;SSH OK - OpenSSH_4.3 (protocol 2.0)
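Since the format itself is not configurable, one workaround for administrators who need syslog-style lines is to post-process the mongod output. A rough sketch only; the function name and the host argument are my own invention, and it does not address the missing year or any syslog facility/priority:

```shell
# reformat_mongo_log: rewrite mongod log lines into a syslog-like layout.
# Assumes the "Day Mon DD HH:MM:SS" prefix shown in the output above.
reformat_mongo_log() {
  local host=$1
  # Drop the leading day name, then insert "<host> mongod:" after the time.
  sed -E "s/^[A-Z][a-z]{2} ([A-Z][a-z]{2} [ 0-9][0-9] [0-9:]{8})/\1 ${host} mongod:/"
}

echo "Fri Jun 11 14:05:29 waiting for connections on port 27017" \
  | reformat_mongo_log barney
# Jun 11 14:05:29 barney mongod: waiting for connections on port 27017
```

Lines that do not match the expected prefix pass through unchanged, which keeps multi-line stack traces readable.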

And for reference here is the MySQL format, which is also not configurable.

100605 16:43:38 mysqld_safe Starting mysqld daemon with databases from /opt/mysql51/data
100605 16:43:38 [Warning] '--log_slow_queries' is deprecated and will be removed in a future release. Please use ''--slow_query_log'/'--slow_query_log_file'' instead.
100605 16:43:38 [Warning] '--log' is deprecated and will be removed in a future release. Please use ''--general_log'/'--general_log_file'' instead.
100605 16:43:38 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=dc1-bin' to avoid this problem.
/opt/mysql51/bin/mysqld: File './dc1-bin.index' not found (Errcode: 13)
100605 16:43:38 [ERROR] Aborting

However, unlike other products including MySQL, the next execution of the mongod process overwrites the log file. This will catch some administrators out. You need to remember to also add --logappend. Personally I’d prefer to see this as the default.

$ mongod --logpath=`pwd`/mongo.log --logappend
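With --logappend in place the file then grows forever, so rotation needs to be handled outside of MongoDB. A sketch of a logrotate entry, assuming the log path above; the schedule and retention are illustrative, and copytruncate is used so the running mongod does not need to be signalled:

/home/rbradfor/projects/mongo/mongo.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}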

I did observe some confusion in messaging. Using the mongo shell you get a jumble of logging messages during a shutdown.

$ mongo
MongoDB shell version: 1.4.3
url: test
connecting to: test
type "help" for help
> use admin
switched to db admin
> db.shutdownServer();
Fri Jun 11 13:54:08 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1
server should be down...
Fri Jun 11 13:54:08 trying reconnect to 127.0.0.1
Fri Jun 11 13:54:08 reconnect 127.0.0.1 ok
Fri Jun 11 13:54:08 query failed : admin.$cmd { getlasterror: 1.0 } to: 127.0.0.1
Fri Jun 11 13:54:08 JS Error: Error: error doing query: failed (anon):1284
> exit
bye

This also results in an unformatted message in the log file for some reason.

$ tail mongo.log
Fri Jun 11 13:54:08 	 shutdown: removing fs lock...
Fri Jun 11 13:54:08  dbexit: really exiting now
Fri Jun 11 13:54:08 got request after shutdown()
ERROR: Client::~Client _context should be NULL: conn

Nothing of a critical nature, however all important for system administrators that have monitoring scripts or use monitoring products.

MongoDB Experience: Key/Value Store

Friday, June 11th, 2010

MongoDB is categorized as a schema-less, schema-free or document oriented data store. Another category of NoSQL product is the key/value store. It had not dawned on me until a discussion with some of the 10gen employees that MongoDB is also a key/value store; this is just a subset of its features.

How would you consider the design of a key/value store? Using the memcached model, there are 4 primary attributes to consider:

  • The Key to store/retrieve
  • The Value for the given key
  • An auto expiry of the cached data
  • A key scope enabling multiple namespaces

There are three primary functions:

  • Put a given Key/Value pair
  • Get a given Key
  • Delete a given Key
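Before mapping this onto MongoDB, the contract can be sketched in isolation. A toy in-memory version using bash 4 associative arrays; the kv_* function names are purely illustrative:

```shell
# Toy key/value store with namespaces and expiry (requires bash 4+).
declare -A KV_VALUES KV_EXPIRES

kv_put() {  # kv_put <namespace> <key> <value> <ttl-seconds>
  KV_VALUES["$1:$2"]=$3
  KV_EXPIRES["$1:$2"]=$(( $(date +%s) + $4 ))
}

kv_get() {  # kv_get <namespace> <key>; fails if missing or expired
  local k="$1:$2"
  [ -n "${KV_EXPIRES[$k]:-}" ] || return 1
  if [ "$(date +%s)" -ge "${KV_EXPIRES[$k]}" ]; then
    kv_del "$1" "$2"
    return 1
  fi
  printf '%s\n' "${KV_VALUES[$k]}"
}

kv_del() {  # kv_del <namespace> <key>
  unset "KV_VALUES[$1:$2]" "KV_EXPIRES[$1:$2]"
}

kv_put cache key1 "Hello World" 60
kv_get cache key1   # prints: Hello World
```

The same three operations map naturally onto save/find/remove on a collection; expiry is the odd one out, since as far as I can tell MongoDB (as of 1.4) has no automatic expiry and would need an explicit reaper.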

Let’s explore the options. The first is to create a new collection for each key. That way there is only one row per key:

> use keystore
> var d = new Date();
> var id = "key1";
> var kv = { key: id,val: "Hello World",expires: d}
> db.key1.save(kv);
> db.key1.find();
{ "_id" : ObjectId("4c126095c68fcaf3b0e07a2b"), "key" : "key1", "val" : "Hello World", "expires" : "Fri Jun 11 2010 12:09:51 GMT-0400 (EDT)" }

However when we start loading we run into a problem.

> db.key99999.save({key: "key99999", val: "Hello World", expires: new Date()})
too many namespaces/collections
> show collections;
Fri Jun 11 12:49:02 JS Error: uncaught exception: error: {
	"$err" : "too much key data for sort() with no index.  add an index or specify a smaller limit"
}
> db.stats()
{
	"collections" : 13661,
	"objects" : 26118,
	"dataSize" : 2479352,
	"storageSize" : 93138688,
	"numExtents" : 13665,
	"indexes" : 13053,
	"indexSize" : 106930176,
	"ok" : 1
}

I did read there was a limit on the number of collections at Using a Large Number of Collections.
Also for reference, a look at the underlying data files shows the power-of-two growth in data file sizes.

$ ls -lh data/current
total 2.2G
-rw------- 1 rbradfor rbradfor  64M 2010-06-11 12:45 keystore.0
-rw------- 1 rbradfor rbradfor 128M 2010-06-11 12:45 keystore.1
-rw------- 1 rbradfor rbradfor 256M 2010-06-11 12:46 keystore.2
-rw------- 1 rbradfor rbradfor 512M 2010-06-11 12:48 keystore.3
-rw------- 1 rbradfor rbradfor 1.0G 2010-06-11 12:48 keystore.4
-rw------- 1 rbradfor rbradfor  16M 2010-06-11 12:48 keystore.ns
> db.dropDatabase();
{ "dropped" : "keystore.$cmd", "ok" : 1 }

In my next test I’ll repeat by adding the key as a row or document for just one collection.
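Something like the following is what I expect that to look like, with a single collection and a unique index on the key; untested as yet, and the ‘kv’ collection name is just a placeholder:

> use keystore
> db.kv.ensureIndex({key: 1}, {unique: true});
> db.kv.save({key: "key1", val: "Hello World", expires: new Date()});
> db.kv.find({key: "key1"});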

MongoDB Experience: Stats Example App

Thursday, June 10th, 2010

The best way to learn any new product is to a) read the manual, and b) start using the product.

I created a simple sample application so I could understand the various functions, including adding data, searching, as well as management functions. As with any good sample application, using a source of data that already exists makes life easier. For this example I’m going to use operating system output, so I will have an ever-increasing amount of data for no additional work.

I will be starting with a database called ‘stats’. For this database my first collection is going to be called ‘system’, and it is going to record the most basic information, including date/time, host and cpu (user, sys, idle) stats. I have a simple shell script that creates an appropriate JSON string, and I use mongoimport to load the data. Here is my Version 0.1 architectural structure.

mongo> use stats;
mongo> db.system.findOne();
{
	"_id" : ObjectId("4c11183580399ad2db4f503b"),
	"host" : "barney",
	"epoch" : 1276188725,
	"date" : "Thu Jun 10 12:52:05 EDT 2010",
	"cpu" : {
		"user" : 2,
		"sys" : 2,
		"idle" : 95
	},
	"raw" : " 11435699 1379565 9072198 423130352 2024835 238766 2938641 0 0"
}

I made some initial design decisions before understanding the full strengths/limitations of MongoDB, as well as what my actual access paths to the data will be.
While I’m using seconds since the epoch for simple range searching, I’m adding a presentation date for user readability. I’ve created a separate sub-element for cpu because a) this element has a number of individual attributes I will want to report and search on, and b) this collection should be extended to include other information like load average, running processes, memory etc.

If my shell script runs in debug mode, I also record the raw data used to determine the end result. This makes debugging easier.
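For reference, the guts of that shell script look something like this. A sketch only: the cpu_stats_json name and the percentage maths over the first four /proc/stat counters are my assumptions, not the exact script:

```shell
# Build the JSON document from a raw /proc/stat cpu line
# (user nice sys idle iowait irq softirq ...). Counters are
# cumulative totals since boot, not per-interval deltas.
cpu_stats_json() {
  local raw="$1"
  set -- $raw                       # positional: user nice sys idle ...
  local total=$(( $1 + $2 + $3 + $4 ))
  printf '{ "host": "%s", "epoch": %s, "cpu": { "user": %d, "sys": %d, "idle": %d }, "raw": "%s" }\n' \
    "$(hostname)" "$(date +%s)" \
    $(( ($1 + $2) * 100 / total )) \
    $(( $3 * 100 / total )) \
    $(( $4 * 100 / total )) \
    "$raw"
}

cpu_stats_json "11435699 1379565 9072198 423130352 2024835 238766 2938641 0 0"
```

Because the counters are cumulative since boot, every sample computes near-identical percentages, which would explain the unchanging 2/2/95 values mentioned later; diffing successive samples before computing percentages would fix that.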

Here is my first query.

Find all statistics between two dates. It took a little while to get the construct syntax correct; $le and $ge didn’t work, and RTFM highlighted the correct $lte and $gte operators. I also first included two separate epoch elements, which resulted in an OR condition; I see you can instead add multiple comparison operators to a single element to get an AND operation.

mongo> db.system.find({epoch: { $gte: 1276188725, $lte: 1276188754}});
{ "_id" : ObjectId("4c11183580399ad2db4f503b"), "host" : "barney", "epoch" : 1276188725, "date" : "Thu Jun 10 12:52:05 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 }, "raw" : " 11435699 1379565 9072198 423130352 2024835 238766 2938641 0 0" }
{ "_id" : ObjectId("4c11184c80399ad2db4f503c"), "host" : "barney", "epoch" : 1276188748, "date" : "Thu Jun 10 12:52:28 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 }, "raw" : " 11436605 1379565 9072320 423138450 2024862 238770 2938641 0 0" }
{ "_id" : ObjectId("4c11185080399ad2db4f503d"), "host" : "barney", "epoch" : 1276188752, "date" : "Thu Jun 10 12:52:32 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 }, "raw" : " 11437005 1379565 9072330 423139527 2024862 238770 2938641 0 0" }
{ "_id" : ObjectId("4c11185180399ad2db4f503e"), "host" : "barney", "epoch" : 1276188753, "date" : "Thu Jun 10 12:52:33 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 }, "raw" : " 11437130 1379565 9072334 423139862 2024862 238770 2938641 0 0" }
{ "_id" : ObjectId("4c11185280399ad2db4f503f"), "host" : "barney", "epoch" : 1276188754, "date" : "Thu Jun 10 12:52:34 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 }, "raw" : " 11437316 1379565 9072338 423140325 2024910 238770 2938641 0 0" }

Assuming I’m going to have stats from more than one server in my data, we should always filter by hostname, and then by the given period.

mongo> db.system.find({host: "barney", epoch: { $gte: 1276188725, $lte: 1276188754}});

If I only want to see the Date/Time and CPU stats, I can show a subset of the elements found.

mongo> db.system.find({epoch: { $gte: 1276188725, $lte: 1276188754}}, {date:1,cpu:1});
{ "_id" : ObjectId("4c11183580399ad2db4f503b"), "date" : "Thu Jun 10 12:52:05 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 } }
{ "_id" : ObjectId("4c11184c80399ad2db4f503c"), "date" : "Thu Jun 10 12:52:28 EDT 2010", "cpu" : { "user" : 2, "sys" : 2, "idle" : 95 } }
...

Filtering on a sub-element is also possible, however I found that the representation of strings and numbers does not get an implied conversion. In the following example “2” does not match any results, while 2 does.


mongo> db.system.findOne({host: "barney", "cpu.user": "2"})
null

mongo> db.system.findOne({host: "barney", "cpu.user": 2})
{
	"_id" : ObjectId("4c11161680399ad2db4f5033"),
	"host" : "barney",
	"epoch" : 1276188182,
	"date" : "Thu Jun 10 12:43:02 EDT 2010",
	"cpu" : {
		"user" : 2,
		"sys" : 2,
		"idle" : 95
	}
}

Given that the collection and load process works, data is being recorded, and I can perform some searching, I now have the basis for adding additional rich data elements and learning about the internal DBA operations possible, once I fix the bug that leaves all my values at 2/2/95.

MongoDB Experience: Replication 101

Thursday, June 10th, 2010

After successfully installing and testing MongoDB it’s very easy to create a replication environment.

$ mkdir -p data/{master,slave}
$ mongod --dbpath=`pwd`/data/master --master --port 28011 > master.log 2>&1 &
# Always check your log file
$ cat master.log
$ mongod --dbpath=`pwd`/data/slave --slave --source localhost:28011 --port 28022 > slave.log 2>&1 &
$ cat slave.log

The options are relatively descriptive and straightforward.

  • --dbpath – The directory for data (set because we are running master and slave on the same server)
  • --port – Likewise, because we are running multiple instances on the same machine
  • --master – I’m the master
  • --slave – I’m a slave
  • --source – For slaves, tells them where the source (i.e. the master) is

What I found under the covers was a difference from the single-instance version. There is a series of ‘local’ files for the namespace, whereas in the single-instance version there were ‘test’ files.

$ ls -ltR data
total 0
drwxr-xr-x  6 rbradfor  staff  204 Jun 10 10:24 slave
drwxr-xr-x  5 rbradfor  staff  170 Jun 10 10:22 master

data/slave:
total 163848
drwxr-xr-x  2 rbradfor  staff        68 Jun 10 10:24 _tmp
-rw-------  1 rbradfor  staff  67108864 Jun 10 10:24 local.0
-rw-------  1 rbradfor  staff  16777216 Jun 10 10:24 local.ns
-rwxr-xr-x  1 rbradfor  staff         6 Jun 10 10:24 mongod.lock

data/slave/_tmp:

data/master:
total 163848
-rw-------  1 rbradfor  staff  67108864 Jun 10 10:22 local.0
-rw-------  1 rbradfor  staff  16777216 Jun 10 10:22 local.ns
-rwxr-xr-x  1 rbradfor  staff         6 Jun 10 10:22 mongod.lock

A quick replication test.

$ mongo --port 28011
MongoDB shell version: 1.4.3
url: test
connecting to: 127.0.0.1:28011/test
type "help" for help
> db.foo.save({s:"Hello world"});
> db.foo.find();
{ "_id" : ObjectId("4c10f7904a30c35548b0af06"), "s" : "Hello world" }
> exit
bye

$ mongo --port 28022
MongoDB shell version: 1.4.3
url: test
connecting to: 127.0.0.1:28022/test
type "help" for help
> db.foo.find();
{ "_id" : ObjectId("4c10f7904a30c35548b0af06"), "s" : "Hello world" }
> exit

A look now at the underlying data shows a ‘test’ namespace, which confirms the lazy instantiation approach. The ‘local’ namespace files are obviously a reflection of the --master/--slave operation.

$ ls -ltR data
total 0
drwxr-xr-x  9 rbradfor  staff  306 Jun 10 10:32 slave
drwxr-xr-x  8 rbradfor  staff  272 Jun 10 10:32 master

data/slave:
total 589832
-rw-------  1 rbradfor  staff  134217728 Jun 10 10:33 test.1
drwxr-xr-x  2 rbradfor  staff         68 Jun 10 10:32 _tmp
-rw-------  1 rbradfor  staff   67108864 Jun 10 10:32 test.0
-rw-------  1 rbradfor  staff   16777216 Jun 10 10:32 test.ns
-rw-------  1 rbradfor  staff   67108864 Jun 10 10:24 local.0
-rw-------  1 rbradfor  staff   16777216 Jun 10 10:24 local.ns
-rwxr-xr-x  1 rbradfor  staff          6 Jun 10 10:24 mongod.lock

data/master:
total 327688
drwxr-xr-x  2 rbradfor  staff        68 Jun 10 10:32 _tmp
-rw-------  1 rbradfor  staff  67108864 Jun 10 10:32 test.0
-rw-------  1 rbradfor  staff  16777216 Jun 10 10:32 test.ns
-rw-------  1 rbradfor  staff  67108864 Jun 10 10:22 local.0
-rw-------  1 rbradfor  staff  16777216 Jun 10 10:22 local.ns
-rwxr-xr-x  1 rbradfor  staff         6 Jun 10 10:22 mongod.lock

By default there appears to be no read-only state for a slave. I was able to add new data to the slave.

$ mongo --port 28022
MongoDB shell version: 1.4.3
url: test
connecting to: 127.0.0.1:28022/test
type "help" for help
> db.foo.save({s:"Hello New York"});
> db.foo.find();
{ "_id" : ObjectId("4c10f7904a30c35548b0af06"), "s" : "Hello world" }
{ "_id" : ObjectId("4c10f864d8e80f1a1ad305cf"), "s" : "Hello New York" }
>

A closer look at this ‘local’ namespace and a check via the docs gives us details of the slave configuration.

$ mongo --port 28022
MongoDB shell version: 1.4.3
url: test
connecting to: 127.0.0.1:28022/test
type "help" for help
> show dbs;
admin
local
test
> use local;
switched to db local
> show collections;
oplog.$main
pair.sync
sources
system.indexes
> db.sources.find();
{ "_id" : ObjectId("4c10f5b633308f7c3d7afc45"), "host" : "localhost:28011", "source" : "main", "syncedTo" : { "t" : 1276180895000, "i" : 1 }, "localLogTs" : { "t" : 1276180898000, "i" : 1 } }

With the mongo client you can also connect directly to a given database via the command line.

$ mongo localhost:28022/local
MongoDB shell version: 1.4.3
url: localhost:28022/local
connecting to: localhost:28022/local
type "help" for help
> db.sources.find();
{ "_id" : ObjectId("4c10f5b633308f7c3d7afc45"), "host" : "localhost:28011", "source" : "main", "syncedTo" : { "t" : 1276180775000, "i" : 1 }, "localLogTs" : { "t" : 1276180778000, "i" : 1 } }
> exit
bye

The shell gives three convenience commands for showing replication state.

On the Slave

$ mongo --port 28022
> db.getReplicationInfo();
{
	"logSizeMB" : 50,
	"timeDiff" : 1444,
	"timeDiffHours" : 0.4,
	"tFirst" : "Thu Jun 10 2010 10:24:54 GMT-0400 (EDT)",
	"tLast" : "Thu Jun 10 2010 10:48:58 GMT-0400 (EDT)",
	"now" : "Thu Jun 10 2010 10:48:59 GMT-0400 (EDT)"
}
> db.printReplicationInfo();
configured oplog size:   50MB
log length start to end: 1444secs (0.4hrs)
oplog first event time:  Thu Jun 10 2010 10:24:54 GMT-0400 (EDT)
oplog last event time:   Thu Jun 10 2010 10:48:58 GMT-0400 (EDT)
now:                     Thu Jun 10 2010 10:49:07 GMT-0400 (EDT)
> db.printSlaveReplicationInfo();
source:   localhost:28011
syncedTo: Thu Jun 10 2010 10:49:25 GMT-0400 (EDT)
          = 1secs ago (0hrs)

On the master, the same commands are applicable, and the output is basically the same.

$ mongo --port 28011
> db.getReplicationInfo();
{
	"logSizeMB" : 50,
	"timeDiff" : 1714,
	"timeDiffHours" : 0.48,
	"tFirst" : "Thu Jun 10 2010 10:22:01 GMT-0400 (EDT)",
	"tLast" : "Thu Jun 10 2010 10:50:35 GMT-0400 (EDT)",
	"now" : "Thu Jun 10 2010 10:50:40 GMT-0400 (EDT)"
}
> db.printReplicationInfo();
configured oplog size:   50MB
log length start to end: 1714secs (0.48hrs)
oplog first event time:  Thu Jun 10 2010 10:22:01 GMT-0400 (EDT)
oplog last event time:   Thu Jun 10 2010 10:50:35 GMT-0400 (EDT)
now:                     Thu Jun 10 2010 10:50:45 GMT-0400 (EDT)
> db.printSlaveReplicationInfo();
local.sources is empty; is this db a --slave?
>

From these commands there seems to be no obvious way to easily identify whether an instance is a master or not.
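One candidate I have yet to dig into is the ismaster command, which the drivers use internally and should report the role directly:

$ mongo --port 28011
> db.runCommand({ismaster: 1});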

References

DBA operations from shell
Replication
Master/Slave Replication

MongoDB Experience: Gotcha with collection names

Wednesday, June 9th, 2010

In my earlier tests I bulk loaded data with the following command.

mongoimport -d olympics -c olympic_event -type tsv --headerline -f name,id,sport,demonstration_competitions,olympic_games_contested,competitions,contested_as_demonstration_event --drop olympic_event.tsv
connected to: 127.0.0.1
dropping: olympics.olympic_event
imported 775 objects

As you can see I imported 775 objects, however when I went to review them via the mongo interactive shell I found no data.

> use olympics;
switched to db olympics
> db.olympics.olympic_event.find();
# No results?

I was able to confirm these objects were in the namespace.

> db.system.namespaces.find();
{ "name" : "olympics.system.indexes" }
{ "name" : "olympics.demonstration_event_athlete_relationship" }
{ "name" : "olympics.demonstration_event_athlete_relationship.$_id_" }
{ "name" : "olympics.olympic_athlete" }
{ "name" : "olympics.olympic_athlete.$_id_" }
{ "name" : "olympics.olympic_athlete_affiliation" }
{ "name" : "olympics.olympic_athlete_affiliation.$_id_" }
{ "name" : "olympics.olympic_bidding_city" }
{ "name" : "olympics.olympic_bidding_city.$_id_" }
{ "name" : "olympics.olympic_city_bid" }
{ "name" : "olympics.olympic_city_bid.$_id_" }
{ "name" : "olympics.olympic_demonstration_competition" }
{ "name" : "olympics.olympic_demonstration_competition.$_id_" }
{ "name" : "olympics.olympic_demonstration_medal_honor" }
{ "name" : "olympics.olympic_demonstration_medal_honor.$_id_" }
{ "name" : "olympics.olympic_event" }
{ "name" : "olympics.olympic_event.$_id_" }
{ "name" : "olympics.olympic_event_competition" }
{ "name" : "olympics.olympic_event_competition.$_id_" }
{ "name" : "olympics.olympic_games" }
has more
> it
{ "name" : "olympics.olympic_games.$_id_" }
{ "name" : "olympics.olympic_host_city" }
{ "name" : "olympics.olympic_host_city.$_id_" }
{ "name" : "olympics.olympic_mascot" }
{ "name" : "olympics.olympic_mascot.$_id_" }
{ "name" : "olympics.olympic_medal" }
{ "name" : "olympics.olympic_medal.$_id_" }
{ "name" : "olympics.olympic_medal_demonstration" }
{ "name" : "olympics.olympic_medal_demonstration.$_id_" }
{ "name" : "olympics.olympic_medal_honor" }
{ "name" : "olympics.olympic_medal_honor.$_id_" }
{ "name" : "olympics.olympic_participating_country" }
{ "name" : "olympics.olympic_participating_country.$_id_" }
{ "name" : "olympics.olympic_sport" }
{ "name" : "olympics.olympic_sport.$_id_" }
{ "name" : "olympics.olympic_venue" }
{ "name" : "olympics.olympic_venue.$_id_" }

The problem is I was using the full namespace with db.find(), not the collection object; I am already in the database scope from the use command.

Knowing this I get what I expected with the correct collection name.

> db.olympic_event.find();
{ "_id" : ObjectId("4c0fb666a5cd86585be7c0fd"), "name" : "Men's Boxing, Super Heavyweight +91kg", "id" : "/guid/9202a8c04000641f8000000008d88df9", "sport" : "Boxing", "demonstration_competitions" : "", "olympic_games_contested" : "2008 Summer Olympics,1984 Summer Olympics,2000 Summer Olympics,2004 Summer Olympics,1988 Summer Olympics,1996 Summer Olympics,1992 Summer Olympics", "competitions" : "Boxing at the 1984 Summer Olympics - Super Heavyweight ,Boxing at the 2000 Summer Olympics - Super Heavyweight,Boxing at the 1988 Summer Olympics - Super Heavyweight ,Boxing at the 2004 Summer Olympics - Super Heavyweight,Boxing at the 1992 Summer Olympics - Super Heavyweight ,Boxing at the 2008 Summer Olympics - Super heavyweight,Boxing at the 1996 Summer Olympics - Super Heavyweight" }
{ "_id" : ObjectId("4c0fb666a5cd86585be7c0fe"), "name" : "Men's Judo, 60 - 66kg (half-lightweight)", "id" : "/guid/9202a8c04000641f8000000008d88d0e", "sport" : "Judo", "demonstration_competitions" : "", "olympic_games_contested" : "2004 Summer Olympics,2000 Summer Olympics,2008 Summer Olympics", "competitions" : "Judo at the 2008 Summer Olympics – Men's Half Lightweight (66 kg),Judo at the 2000 Summer Olympics - Men's Half Lightweight (66 kg),Judo at the 2004 Summer Olympics - Men's Half Lightweight (66 kg)" }
{ "_id" : ObjectId("4c0fb666a5cd86585be7c0ff"), "name" : "Men's Tennis, Indoor Singles", "id" : "/guid/9202a8c04000641f8000000010be70e8", "sport" : "Tennis", "demonstration_competitions" : "", "olympic_games_contested" : "1912 Summer Olympics,1908 Summer Olympics", "competitions" : "Tennis at the 1908 Summer Olympics - Men's Indoor Singles,Tennis at the 1912 Summer Olympics - Men's Indoor Singles" }
...

It’s interesting that a collection name can contain a full stop ‘.’, which is the delimiter in the command syntax. In my earlier observation I was not getting an error, only an empty response. For example you can do this.

> use test;
switched to db test
> db.foo.x.save({a: 1});
> db.foo.x.find();
{ "_id" : ObjectId("4c0fc1784ff83a6831364d57"), "a" : 1 }

MongoDB Experience: What’s running in the DB

Wednesday, June 9th, 2010

You can very easily find out the running threads in the database (e.g. like a MySQL SHOW PROCESSLIST) with db.currentOp().

> db.currentOp();
{ "inprog" : [ ] }

Not much happening, however under some load you can see:

> db.currentOp();
{
	"inprog" : [
		{
			"opid" : 27980,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 0,
			"op" : "insert",
			"ns" : "olympics.olympic_athlete",
			"client" : "127.0.0.1:63652",
			"desc" : "conn"
		}
	]
}
> db.currentOp();
{
	"inprog" : [
		{
			"opid" : 57465,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 0,
			"op" : "insert",
			"ns" : "olympics.olympic_athlete_affiliation",
			"client" : "127.0.0.1:63653",
			"desc" : "conn"
		}
	]
}

I was able to see these when I was Bulk Loading Data.

The HTTP console at http://localhost:28017/ (for a default installation) also shows you all client connections, as well as more information per thread, database uptime, replication status and a DBTOP for recent namespaces. For example:

mongodb mactazosx.local:27017

db version v1.4.3, pdfile version 4.5
git hash: 47ffbdfd53f46edeb6ff54bbb734783db7abc8ca
sys info: Darwin broadway.local 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_40

dbwritelocked:  0 (initial)
uptime:    2851 seconds

assertions:

replInfo:  

Clients:
Thread	OpId	Active	LockType	Waiting	SecsRunning	Op	NameSpace	Query	client	msg	progress
initandlisten	0		1			2004	test	{ name: /^local.temp./ }	0.0.0.0:0
snapshotthread	0		0			0			0.0.0.0:0
websvr	18		-1			2004	test._defaultCollection	{}	0.0.0.0:0
conn	83741		-1			2004	olympics.olympic_host_city	{}	127.0.0.1:63268
conn	83739		0			2004	?	{ getlasterror: 1.0 }	127.0.0.1:63756		

time to get dblock: 0ms
# databases: 3
Cursors byLoc.size(): 0

replication
master: 0
slave:  0
initialSyncCompleted: 1

DBTOP  (occurences|percent of elapsed)
NS	total	Reads	Writes	Queries	GetMores	Inserts	Updates	Removes
GLOBAL	1	0.00%	1	0.00%	0	0.00%	1	0.00%	0	0.00%	0	0.00%	0	0.00%	0	0.00%
olympics.olympic_host_city	1	0.00%	1	0.00%	0	0.00%	1	0.00%	0	0.00%	0	0.00%	0	0.00%	0	0.00%

It was interesting to see a whatsmyuri command. Will need to investigate that further.

MongoDB Experience: Bulk Loading Data

Wednesday, June 9th, 2010

MongoDB has a mongoimport command. The docs only show the usage but not any examples. Here are my first examples.

data1.csv

1
2
3
4
5
6
7
8
9
0

You need to specify your database (-d) and collection (-c) for importing. In my example, I also specified the collection fields with (-f).

The --file argument is actually optional; specifying the filename as the last argument also works.

mongoimport -d test -c foo -f a -type csv data1.csv
connected to: 127.0.0.1
imported 10 objects

NOTE: The default type is JSON, so you can get some nasty errors if you forget the csv type.

Wed Jun  9 11:18:26 Assertion: 10340:Failure parsing JSON string near: 1
0x68262 0x23968 0x250563 0x251c7b 0x24cb00 0x250280 0x1af6
 0   mongoimport                         0x00068262 _ZN5mongo11msgassertedEiPKc + 514
 1   mongoimport                         0x00023968 _ZN5mongo8fromjsonEPKc + 520
 2   mongoimport                         0x00250563 _ZN6Import9parseLineEPc + 131
 3   mongoimport                         0x00251c7b _ZN6Import3runEv + 2635
 4   mongoimport                         0x0024cb00 _ZN5mongo4Tool4mainEiPPc + 2880
 5   mongoimport                         0x00250280 main + 496
 6   mongoimport                         0x00001af6 start + 54
exception:Failure parsing JSON string near: 1

In my second example I’m adding multiple fields. This time my data file also has a header line, which you can skip with (--headerline).

data2.csv

name, age
Mickey Mouse,65
Minnie Mouse,64
Donald Duck,
Taz Devil,22
Marvin the Martian,45
$ mongoimport -d test -c foo -f name,age -type csv --headerline data2.csv
connected to: 127.0.0.1
imported 6 objects
> db.foo.find();
...
{ "_id" : ObjectId("4c0fb0dfa5cd86585be6ca63"), "a" : 0 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca64"), "name" : "Mickey Mouse", "age" : 65 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca65"), "name" : "Minnie Mouse", "age" : 64 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca66"), "name" : "Donald Duck" }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca67"), "name" : "Taz Devil", "age" : 22 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca68"), "name" : "Marvin the Martian", "age" : 45 }

You can also use the --drop argument to truncate your collection before loading.

Real Data

I’m going to use the Freebase Olympics data to perform a more robust test.

wget http://download.freebase.com/datadumps/2010-04-15/browse/olympics.tar.bz2
bunzip2 olympics.tar.bz2
tar xvf olympics.tar
cd olympics

Loading this data via the following convenience script gave me some more meaningful data.

> db.olympic_host_city.find();
{ "_id" : ObjectId("4c0fb666a5cd86585be7d9b6"), "name" : "Vancouver", "id" : "/guid/9202a8c04000641f80000000000401e2", "olympics_hosted" : "2010 Winter Olympics" }
{ "_id" : ObjectId("4c0fb666a5cd86585be7d9b7"), "name" : "Moscow", "id" : "/guid/9202a8c04000641f800000000002636c", "olympics_hosted" : "1980 Summer Olympics" }
{ "_id" : ObjectId("4c0fb666a5cd86585be7d9b8"), "name" : "St. Moritz", "id" : "/guid/9202a8c04000641f80000000001c33e8", "olympics_hosted" : "1948 Winter Olympics,1928 Winter Olympics" }
...

Here is the simple load script I used.

#!/bin/sh

load_file() {
  local INPUT_FILE=$1
  [ -z "${INPUT_FILE}" ] && echo "ERROR: File not specified" && return 1

  echo "Loading file ${INPUT_FILE}"

  COLLECTION=`echo ${INPUT_FILE} | cut -d. -f1`

  # note: the first sed substitution below contains a literal tab character
  FIELDS=`head -1 ${INPUT_FILE} | sed -e "s/	/,/g;s/ /_/g"`
  echo "mongoimport -d olympics -c ${COLLECTION} -type tsv --headerline -f $FIELDS --drop ${INPUT_FILE}"
  time mongoimport -d olympics -c ${COLLECTION} -type tsv --headerline -f $FIELDS --drop ${INPUT_FILE}
  return 0
}

process_dir() {

  echo "Processing" `pwd`
  for FILE in *.tsv
  do
    load_file ${FILE}
  done

  return 0
}

main() {
  process_dir
}

main $*
exit 0
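The two shell transforms the script relies on can be checked in isolation. A minimal sketch (the file name and header row below are made-up examples, and `tr` is used in place of the script's literal-tab `sed` purely for portability):

```shell
# Hypothetical input file name from the Olympics dump
INPUT_FILE="olympic_host_city.tsv"

# Collection name: the file name without its .tsv extension
COLLECTION=$(echo "${INPUT_FILE}" | cut -d. -f1)
echo "${COLLECTION}"      # prints olympic_host_city

# Field list: tabs in the header row become commas, spaces become underscores
printf 'name\tolympics hosted\thost city\n' | tr '\t' ',' | tr ' ' '_'
# prints name,olympics_hosted,host_city
```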

MongoDB Experience: Online Resources

Wednesday, June 9th, 2010

Following the initial Quickstart docs you will find a lot of online documentation. The following are good places to start.

There is also a Getting Started guide; however, I found this a duplication of the Quickstart. I have not found an offline version of the manual, or a single-page HTML version. This makes it a little difficult to read without internet connectivity.

You can find information at the Official Blog. While I found no Blog Roll, one developer Kyle Banker has a blog.

There are several mailing lists available; however, it seems your best help may be via IRC at irc://irc.freenode.net/#mongodb, where I found over 200 members on the channel.

There are currently no published books on MongoDB; however, there seem to be four books in the making.

MongoDB Experience: Getting Started

Wednesday, June 9th, 2010

Getting started with MongoDB is relatively straightforward; following the instructions in the Quickstart guide has you operational in a few minutes.

I like projects that provide a latest-version link for software; there is no need to update any documentation or blog posts over time. The current instructions require some additional steps when creating the initial data directory, due to the normal permissions of the root directory. This is the only prerequisite to using the software out of the box. There is no additional configuration required for a default installation.

$ sudo mkdir /data/db
$ sudo chown `id -u` /data/db

I ran a few boundary tests to verify the error handling of the initial startup process.

The following occurs when the data directory does not exist.

$ ./mongodb-osx-i386-1.4.3/bin/mongod
./mongodb-osx-i386-1.4.3/bin/mongod --help for help and startup options
Tue Jun  8 15:59:52 Mongo DB : starting : pid = 78161 port = 27017 dbpath = /data/db/ master = 0 slave = 0  32-bit 

** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
**       see http://blog.mongodb.org/post/137788967/32-bit-limitations for more

Tue Jun  8 15:59:52 Assertion: 10296:dbpath (/data/db/) does not exist
0x68572 0x247814 0x24821a 0x24a855 0x1e06
 0   mongod                              0x00068572 _ZN5mongo11msgassertedEiPKc + 514
 1   mongod                              0x00247814 _ZN5mongo14_initAndListenEiPKc + 548
 2   mongod                              0x0024821a _ZN5mongo13initAndListenEiPKc + 42
 3   mongod                              0x0024a855 main + 4917
 4   mongod                              0x00001e06 start + 54
Tue Jun  8 15:59:52   exception in initAndListen std::exception: dbpath (/data/db/) does not exist, terminating
Tue Jun  8 15:59:52  dbexit:
Tue Jun  8 15:59:52 	 shutdown: going to close listening sockets...
Tue Jun  8 15:59:52 	 shutdown: going to flush oplog...
Tue Jun  8 15:59:52 	 shutdown: going to close sockets...
Tue Jun  8 15:59:52 	 shutdown: waiting for fs preallocator...
Tue Jun  8 15:59:52 	 shutdown: closing all files...
Tue Jun  8 15:59:52      closeAllFiles() finished
Tue Jun  8 15:59:52  dbexit: really exiting now

The following error occurs when the user has insufficient permissions for the directory.

$ sudo mkdir /data/db
$ ./mongodb-osx-i386-1.4.3/bin/mongod
./mongodb-osx-i386-1.4.3/bin/mongod --help for help and startup options
Tue Jun  8 16:01:52 Mongo DB : starting : pid = 78178 port = 27017 dbpath = /data/db/ master = 0 slave = 0  32-bit 

** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
**       see http://blog.mongodb.org/post/137788967/32-bit-limitations for more

Tue Jun  8 16:01:52 User Exception 10309:Unable to create / open lock file for dbpath: /data/db/mongod.lock
Tue Jun  8 16:01:52   exception in initAndListen std::exception: Unable to create / open lock file for dbpath: /data/db/mongod.lock, terminating
Tue Jun  8 16:01:52  dbexit:
Tue Jun  8 16:01:52 	 shutdown: going to close listening sockets...
Tue Jun  8 16:01:52 	 shutdown: going to flush oplog...
Tue Jun  8 16:01:52 	 shutdown: going to close sockets...
Tue Jun  8 16:01:52 	 shutdown: waiting for fs preallocator...
Tue Jun  8 16:01:52 	 shutdown: closing all files...
Tue Jun  8 16:01:52      closeAllFiles() finished
Tue Jun  8 16:01:52 	 shutdown: removing fs lock...
Tue Jun  8 16:01:52 	 couldn't remove fs lock errno:9 Bad file descriptor
Tue Jun  8 16:01:52  dbexit: really exiting now

A missing step from the existing documentation is to set appropriate permissions on the data directory so that mongod, run as your normal user, can write to the directory.

$ sudo chown `id -u` /data/db
$ ./mongodb-osx-i386-1.4.3/bin/mongod 

Tue Jun  8 16:06:37 Mongo DB : starting : pid = 78284 port = 27017 dbpath = /data/db/ master = 0 slave = 0  32-bit 

** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
**       see http://blog.mongodb.org/post/137788967/32-bit-limitations for more

Tue Jun  8 16:06:37 db version v1.4.3, pdfile version 4.5
Tue Jun  8 16:06:37 git version: 47ffbdfd53f46edeb6ff54bbb734783db7abc8ca
Tue Jun  8 16:06:37 sys info: Darwin broadway.local 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_40
Tue Jun  8 16:06:37 waiting for connections on port 27017
Tue Jun  8 16:06:37 web admin interface listening on port 28017
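This permissions issue can be caught with a quick pre-flight check before starting mongod. A minimal sketch (it uses a path under /tmp so it needs no sudo; substitute /data/db in practice):

```shell
# Verify the dbpath exists and is writable by the current user before starting mongod
DBPATH=/tmp/data/db   # stand-in for /data/db so this sketch is self-contained
mkdir -p "${DBPATH}"

if [ -d "${DBPATH}" ] && [ -w "${DBPATH}" ]; then
  echo "ok: ${DBPATH} is writable"
else
  echo "fix with: sudo chown \$(id -u) ${DBPATH}" >&2
  exit 1
fi
```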

You are then ready to rock and roll. Use the exit command to exit the mongo interactive shell.

$ ./mongodb-osx-i386-1.4.3/bin/mongo
> db.foo.save ({a:2});
> db.foo.find();
{ "_id" : ObjectId("4c0ea308f5ea2f7148a33b9f"), "a" : 2 }
> exit
bye

The mongod output of the first save of data into the “foo” collection shows a little of how MongoDB operates. There is a lazy instantiation of the “test” database when it is first required. There is no need to create the database first, or issue a USE statement as with MySQL.

Tue Jun  8 16:07:36 allocating new datafile /data/db/test.ns, filling with zeroes...
Tue Jun  8 16:07:36 done allocating datafile /data/db/test.ns, size: 16MB, took 0.05 secs
Tue Jun  8 16:07:37 allocating new datafile /data/db/test.0, filling with zeroes...
Tue Jun  8 16:07:38 done allocating datafile /data/db/test.0, size: 64MB, took 1.282 secs
Tue Jun  8 16:07:40 building new index on { _id: 1 } for test.foo
Tue Jun  8 16:07:40 Buildindex test.foo idxNo:0 { name: "_id_", ns: "test.foo", key: { _id: 1 } }
Tue Jun  8 16:07:40 done for 0 records 0.018secs
Tue Jun  8 16:07:40 insert test.foo 3181ms
$ ls -l /data/db
total 163848
drwxr-xr-x  2 rbradfor  admin        68 Jun  8 16:07 _tmp
-rwxr-xr-x  1 rbradfor  admin         6 Jun  8 16:06 mongod.lock
-rw-------  1 rbradfor  admin  67108864 Jun  8 16:07 test.0
-rw-------  1 rbradfor  admin  16777216 Jun  8 16:07 test.ns

One observation is that the output of mongod reads more like trace output than a formatted error log. I have yet to see any information about a more appropriately structured log.

MongoDB Experience: History

Wednesday, June 9th, 2010

My first exposure to MongoDB was in July 2008, when I was a panelist on “A Panel on Cloud Computing” at the Entrepreneurs Round Table in New York. The panel included a representative from 10gen, the company behind the open source database product, and at the time Mongo was described as a full stack solution with the database being only one future component.

While I mentioned Mongo again in a blog in Nov 2008, it was not until Oct 6, 2009 at the NoSQL event in New York that I saw a more stable product and a revised focus of development on just the database component.

As the moderator for the closing keynote “SQL v NOSQL” panel at Open SQL Camp 2009 in Portland, Oregon I had the chance to discuss MongoDB with the other products in the NoSQL space. Watch Video

In just the past few weeks, three people have independently mentioned MongoDB and asked for my input. I was disappointed to just miss the MongoNYC 2010 event.

While I have evaluated various new products in the key/value store and the schemaless space, my curiosity has been initially more with Cassandra and CouchDB.

Follow my journey as I explore in more detail the usage of mongoDB {name: “mongo”, type:”db”} via the mongodb tag on my blog.

A Cassandra twitter clone

Thursday, February 25th, 2010

Following my successful Cassandra Cluster setup, and having a potential client example to work with running Ruby on Rails (RoR), I came across the following examples in Ruby.

Not being a Ruby developer, I thought it was time to investigate further. Starting first on Mac OS X 10.5, I found the one-line example of installing cassandra via gem unsuccessful.

$ gem install cassandra
Updating metadata for 1 gems from http://gems.rubyforge.org
.
complete
ERROR:  could not find cassandra locally or in a repository

Some more reading highlights: “Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.”

In case you didn’t read my earlier posts, Java 6 is installed, but not the default.

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export PATH=$JAVA_HOME/bin:$PATH

I managed to install RubyGems via Installing Ruby on Rails on Mac OS X.

$ sudo gem install rubygems-update

Updating metadata for 1 gems from http://gems.rubyforge.org
.
complete
Successfully installed rubygems-update-1.3.6
1 gem installed
Installing ri documentation for rubygems-update-1.3.6...
Installing RDoc documentation for rubygems-update-1.3.6...
Could not find main page README
Could not find main page README
Could not find main page README
Could not find main page README

$ sudo update_rubygems
RubyGems 1.3.6 installed

=== 1.3.6 / 2010-02-17

NOTE:

http://rubygems.org is now the default source for downloading gems.

You may have sources set via ~/.gemrc, so you should replace
http://gems.rubyforge.org with http://rubygems.org

http://gems.rubyforge.org will continue to work for the forseeable future.

New features:

* `gem` commands
  * Added `gem push` and `gem owner` for interacting with modern/Gemcutter
    sources
  * `gem dep` now supports --prerelease.
  * `gem fetch` now supports --prerelease.
  * `gem server` now supports --bind.  Patch #27357 by Bruno Michel.
  * `gem rdoc` no longer overwrites built documentation.  Use --overwrite
    force rebuilding.  Patch #25982 by Akinori MUSHA.
* Captial letters are now allowed in prerelease versions.

Bug fixes:

* Development deps are no longer added to rubygems-update gem so older
  versions can update sucessfully.
* Installer bugs:
  * Prerelease gems can now depend on non-prerelease gems.
  * Development dependencies are ignored unless explicitly needed.  Bug #27608
    by Roger Pack.
* `gem` commands
  * `gem which` now fails if no paths were found.  Adapted patch #27681 by
    Caio Chassot.
  * `gem server` no longer has invalid markup.  Bug #27045 by Eric Young.
  * `gem list` and friends show both prerelease and regular gems when
    --prerelease --all is given
* Gem::Format no longer crashes on empty files.  Bug #27292 by Ian Ragsdale.
* Gem::GemPathSearcher handles nil require_paths. Patch #27334 by Roger Pack.
* Gem::RemoteFetcher no longer copies the file if it is where we want it.
  Patch #27409 by Jakub Šťastný.

Deprecation Notices:

* lib/rubygems/timer.rb has been removed.
* Gem::Dependency#version_requirements is deprecated and will be removed on or
  after August 2010.
* Bulk index update is no longer supported.
* Gem::manage_gems was removed in 1.3.3.
* Time::today was removed in 1.3.3.

------------------------------------------------------------------------------

RubyGems installed the following executables:
	/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/gem

NOTE: This second command took over 60 seconds with no user feedback.

I was then able to successfully install cassandra via ruby’s gem package manager.

$ sudo gem install cassandra
Building native extensions.  This could take a while...
Successfully installed thrift-0.2.0
Successfully installed thrift_client-0.4.0
Successfully installed simple_uuid-0.1.0
Successfully installed cassandra-0.7.5
4 gems installed
Installing ri documentation for thrift-0.2.0...

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known
Installing ri documentation for thrift_client-0.4.0...
Installing ri documentation for simple_uuid-0.1.0...
Installing ri documentation for cassandra-0.7.5...
Installing RDoc documentation for thrift-0.2.0...

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known
Installing RDoc documentation for thrift_client-0.4.0...
Installing RDoc documentation for simple_uuid-0.1.0...
Installing RDoc documentation for cassandra-0.7.5...

My use of cassandra_helper provided the following expected dependency error.

$ cassandra_helper cassandra
Set the CASSANDRA_INCLUDE environment variable to use a non-default cassandra.in.sh and friends.
(in /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5)
You need to install git 1.6 or 1.7

I found instructions to install git at Installing git (OSX) and installed via GUI installer.

I had to add git to my current session’s PATH to get my Ruby Cassandra installation working.

$ export PATH=/usr/local/git/bin:$PATH

$ cassandra_helper cassandra
Set the CASSANDRA_INCLUDE environment variable to use a non-default cassandra.in.sh and friends.
(in /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5)
Checking Cassandra out from git
Initialized empty Git repository in /Users/rbradfor/cassandra/server/.git/
remote: Counting objects: 16715, done.
remote: Compressing objects: 100% (2707/2707), done.
remote: Total 16715 (delta 9946), reused 16011 (delta 9364)
Receiving objects: 100% (16715/16715), 19.22 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (9946/9946), done.
Updating Cassandra.
Buildfile: build.xml

clean:

BUILD SUCCESSFUL
Total time: 2 seconds
HEAD is now at 298a0e6 check-in debian packaging
Building Cassandra
Buildfile: build.xml

build-subprojects:

init:
    [mkdir] Created dir: /Users/rbradfor/cassandra/server/build/classes
    [mkdir] Created dir: /Users/rbradfor/cassandra/server/build/test/classes
    [mkdir] Created dir: /Users/rbradfor/cassandra/server/src/gen-java

check-gen-cli-grammar:

gen-cli-grammar:
     [echo] Building Grammar /Users/rbradfor/cassandra/server/src/java/org/apache/cassandra/cli/Cli.g  ....

build-project:
     [echo] apache-cassandra-incubating: /Users/rbradfor/cassandra/server/build.xml
    [javac] Compiling 247 source files to /Users/rbradfor/cassandra/server/build/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

build:

BUILD SUCCESSFUL
Total time: 42 seconds
CASSANDRA_HOME: /Users/rbradfor/cassandra/server
CASSANDRA_CONF: /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5/conf
Listening for transport dt_socket at address: 8888
DEBUG - Loading settings from /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5/conf/storage-conf.xml
....

I was then able to complete the example at up and running with cassandra, working via the Ruby interactive console.

I was also able to fire up the cassandra-cli and see the data added in ruby.

$ bin/cassandra-cli -host localhost
Connected to localhost/9160
cassandra> get Twitter.Statuses['1']
=> (column=user_id, value=5, timestamp=1267072406503471)
=> (column=text, value=Nom nom nom nom nom., timestamp=1267072406503471)
Returned 2 results.
cassandra> get Twitter.UserRelationships['5'];
=> (super_column=user_timeline,
     (column=???!??zvZ+?!, value=1, timestamp=1267072426991872)
     (column=??-?!???C?th?, value=2, timestamp=1267072427019091))
Returned 1 results.

Not sure about the data in the second example; the garbled column names appear to be binary UUID keys rendered as raw bytes.

Configuring a Cassandra Cluster

Wednesday, February 24th, 2010

Continuing on from Getting started with Cassandra, I’m now trying to configure two servers as a cluster. The Getting Started Step 3 was not clear the first time I read it (after writing this post it makes sense), so a Google search yielded as its second link Building a Small Cassandra Cluster for Testing and Development. I love finding reference material from people I know, Padraig being a significant contributor to Drizzle.

Here is what I did to create a running Cassandra Cluster.

  • Stop individual Cassandra instances
  • Re-created data and log directories (I did this just to ensure a clean slate)
  • I added to my local hosts file two aliases for my servers (cass01 and cass02). This helped in the following step.
  • Three changes are needed to the default conf/storage-conf.xml file on my first server.
    • Change <ListenAddress> from localhost to cass01
    • Change <ThriftAddress> from localhost to cass01
    • Change <Seed> from 127.0.0.1 to cass01
  • On my second server I changed the <ListenAddress> and <ThriftAddress> accordingly to cass02 and made <Seed> cass01
  • Started the Cassandra servers and tested successfully using the set …/get Keyspace1.Standard1['jsmith'] example. I was able to connect to both hosts via cassandra-cli and see the results created on just one node. I was also able to create data on the second node and view it on the first node.
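On the first server, the changes above amount to something like the following fragment of conf/storage-conf.xml (a sketch, not the full file: cass01/cass02 are the host aliases from my hosts file, and the surrounding elements of the default configuration are omitted):

```xml
<!-- conf/storage-conf.xml on cass01; on cass02 use cass02 for the two
     address elements and keep cass01 as the seed -->
<Seeds>
    <Seed>cass01</Seed>
</Seeds>
<ListenAddress>cass01</ListenAddress>
<ThriftAddress>cass01</ThriftAddress>
```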

A new command is available to describe your cluster.

$ bin/nodeprobe -host cass01 ring
Address       Status     Load          Range                                      Ring
                                       148029780173059661585165369000220362256
192.168.100.4 Up         0 bytes       59303445267720348277007645348152900920     |<--|
192.168.100.5 Up         0 bytes       148029780173059661585165369000220362256    |-->|

Now with my first introduction successful, time to start using and seeing the true power of using Cassandra.

Getting started with Cassandra

Tuesday, February 23rd, 2010

With the motivation of today’s public news of Twitter’s move from MySQL to Cassandra, my own desire to skill up following in-depth discussions at last November’s Open SQL Camp, and yesterday’s discussion with a new client on persistent key-value store products, today I downloaded, installed and configured Cassandra for the first time. Not that today’s news was unexpected; if you follow the Twitter Engineering Open Source projects you would have seen Cassandra, as well as other products, being used or evaluated by Twitter.

So I went from nothing to a working Cassandra node in under 5 minutes. This is what I did.

  1. While I knew this was an Apache project, a Google search yielded as its 3rd link The Apache Cassandra Project at http://incubator.apache.org/cassandra/. Congrats to Cassandra on becoming a top-level Apache project; this URL will update soon.
  2. Download Cassandra. Hard to miss with a big green button on the home page. The current version is 0.5.
  3. I read Getting Started, which is the 3rd top level link on menu after Home and Download. Step 1 is picking a version which I’ve already done, Step 2 is Running a single node.
  4. The Getting Started indicated a problem on Mac OS X with the required minimum Java version. I was installing on Mac OS X 10.5 and CentOS 5.4. I’ve experienced this Java 6 default path issue before. I set my JAVA_HOME and PATH accordingly (after I updated the wiki with the correct value).
  5. I extracted the tar file, changed to the directory and took a look at the README.txt file. Yes, I always check this first with any software, and it was relevant here because it includes valuable instructions on creating the default data and log directories.
  6. Start with bin/cassandra -f. No problems!
  7. I then followed the instructions from the link in Step 2 with the CassandraCli. This tests and confirms the installation is operational.

Ok, a working environment. I’ve now installed on a second machine and tested however I now need to configure the cluster, and the documentation is not as straightforward. Time to try out Google again.

On a side note, this is one reason why I love Open Source. I followed the instructions online and found a mistake in the Mac OS X path, I simply registered and corrected providing the benefit of my experience for the next reader(s).

You may also like to view future posts.