Aggregation & mapReduce in Mongo
February 10, 2012
In Oracle, SQL Server, or any other RDBMS, SQL makes aggregation very easy with GROUP BY.
SELECT make, COUNT(model) FROM Vehicle GROUP BY make ORDER BY make;
In Mongo, it's slightly more complex to build GROUP BY style aggregations from collections. There are several ways of getting different counts out, and some are explored below.
DISTINCT
// Distinct models of cars in the collection
db.vehicle.distinct('model');
The above is the same as doing a SELECT DISTINCT in SQL. The data returned is every distinct vehicle model in the collection.
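To make the semantics concrete, here is a plain JavaScript sketch of what distinct computes, run against a small in-memory array rather than a real collection (for example under Node.js). The `vehicles` array is made-up sample data and `distinctValues` is a hypothetical helper, not part of the MongoDB API.

```javascript
// Hypothetical sample documents standing in for the vehicle collection.
const vehicles = [
  { make: "ALFA ROMEO", model: "MITO" },
  { make: "ALFA ROMEO", model: "MITO" },
  { make: "CORVETTE", model: "C6" },
  { make: "HONDA", model: "CIVIC" },
];

// Illustrative helper: collect each value of `field` once, in first-seen order.
function distinctValues(docs, field) {
  return [...new Set(docs.map(function (doc) { return doc[field]; }))];
}

const models = distinctValues(vehicles, "model");
console.log(models); // [ 'MITO', 'C6', 'CIVIC' ]
```

The duplicate MITO entry collapses to a single value, which is the behaviour to expect from the shell call as well.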
COUNT
// Count all Ford cars
db.vehicle.count({make: "FORD"});
The above will count all Ford cars in the vehicle collection.
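The query document matches on exact equality, so the case of the string matters. A small plain JavaScript sketch of the same filter-then-count logic follows; the `fleet` sample data and the `countWhere` helper are hypothetical, and the helper only handles simple equality on top-level fields.

```javascript
// Hypothetical sample documents; note the differently cased "Ford".
const fleet = [
  { make: "FORD", model: "FOCUS" },
  { make: "FORD", model: "FIESTA" },
  { make: "Ford", model: "KA" },
  { make: "HONDA", model: "CIVIC" },
];

// Illustrative helper: count documents whose fields equal those in `query`.
function countWhere(docs, query) {
  const fields = Object.keys(query);
  return docs.filter(function (doc) {
    return fields.every(function (f) { return doc[f] === query[f]; });
  }).length;
}

const fordCount = countWhere(fleet, { make: "FORD" });
console.log(fordCount); // 2
```

The document with make "Ford" is not counted, so it pays to store makes in a consistent case.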
GROUP
Generally only applicable when the result contains up to 10,000 distinct keys; the limit applies to the number of groups returned, not to the number of documents in the collection. Errors have been reported if attempted on larger sets. Use mapReduce for larger data collections.
// Group by make
db.vehicle.group({
  key: { make: true },
  reduce: function(obj, prev) { prev.count++; },
  initial: { count: 0 }
});
The result is something a bit like this:
{
"make" : "ALFA ROMEO",
"count" : 25
},
{
"make" : "CORVETTE",
"count" : 1
},
{
"make" : "HONDA",
"count" : 2
},
The collection is being aggregated at the make level. For every document, the reduce function increments the count on the accumulator object for that document's key. This can be taken further by adding the model field to the key.
// Group by make, model
db.vehicle.group({
  key: { make: true, model: true },
  reduce: function(obj, prev) { prev.count++; },
  initial: { count: 0 }
});
The result is something like this:
{
"make" : "ALFA ROMEO",
"model" : "MITO",
"count" : 2
},
{
"make" : "ASTON MARTIN",
"model" : "VIRAGE",
"count" : 1
},
{
"make" : "CORVETTE",
"model" : "C6",
"count" : 1
},
The two examples above use a cumulative count:
prev.count++;
but the same can also be achieved by doing:
prev.count += 1
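How key, initial, and reduce fit together can be sketched in plain JavaScript. This is only an in-memory illustration of the logic (runnable under Node.js, for example), not how the server implements it; `groupBy` and the `cars` sample data are hypothetical.

```javascript
// Hypothetical sample documents.
const cars = [
  { make: "ALFA ROMEO", model: "MITO" },
  { make: "ALFA ROMEO", model: "MITO" },
  { make: "ALFA ROMEO", model: "GIULIETTA" },
  { make: "CORVETTE", model: "C6" },
];

// Illustrative stand-in for group(): one accumulator per distinct key.
function groupBy(docs, spec) {
  const keyFields = Object.keys(spec.key);
  const groups = new Map();
  docs.forEach(function (doc) {
    const id = JSON.stringify(keyFields.map(function (f) { return doc[f]; }));
    if (!groups.has(id)) {
      // Start from the key fields plus a shallow copy of `initial`.
      const acc = {};
      keyFields.forEach(function (f) { acc[f] = doc[f]; });
      groups.set(id, Object.assign(acc, spec.initial));
    }
    spec.reduce(doc, groups.get(id)); // reduce mutates the accumulator in place
  });
  return Array.from(groups.values());
}

const byMakeModel = groupBy(cars, {
  key: { make: true, model: true },
  reduce: function (obj, prev) { prev.count += 1; },
  initial: { count: 0 },
});
console.log(byMakeModel);
```

Each distinct make and model pair gets its own accumulator, which is why the result grows with the number of groups rather than the number of documents.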
MAP REDUCE
Map-reduce is made up of two operations. The map step turns each document into key-value pairs, and the reduce step combines the values emitted for each key into a single result.
For example, we want to return the number of vehicles for each make.
// Map function
var map = function() {
  emit(this.make, { model: this.model });
};
Create the map function. For each vehicle it emits the make as the key and a small document holding the model as the value, so the results are grouped by make.
// Reduce function
// Count the values emitted for each key.
var reduce = function(key, values) {
  var sum = 0;
  values.forEach(function(doc) { sum += 1; });
  return { model: sum };
};
Create the reduce function. The values emitted by map are passed into the reduce function automatically, grouped by key. Note that at this stage we haven't actually specified the collection we want to run this mapReduce on.
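The hand-off from map to reduce described above can be sketched in plain JavaScript. This is an in-memory illustration only (runnable under Node.js, for example): `runMapReduce` and the `docs` sample data are hypothetical, and the real server may also call reduce several times per key on partial results, which this sketch does not model.

```javascript
// Hypothetical sample documents.
const docs = [
  { make: "HONDA", model: "CIVIC" },
  { make: "HONDA", model: "JAZZ" },
  { make: "TESLA", model: "ROADSTER" },
];

// Illustrative in-memory stand-in for mapReduce.
function runMapReduce(collection, map, reduce) {
  const emitted = new Map(); // key -> array of emitted values
  // map() calls a global emit(), as it does in the mongo shell.
  globalThis.emit = function (key, value) {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  collection.forEach(function (doc) { map.call(doc); });

  const results = [];
  emitted.forEach(function (values, key) {
    // As in MongoDB, reduce is skipped when a key has only one value.
    const value = values.length > 1 ? reduce(key, values) : values[0];
    results.push({ _id: key, value: value });
  });
  return results;
}

const map = function () { emit(this.make, { model: this.model }); };
const reduce = function (key, values) {
  var sum = 0;
  values.forEach(function () { sum += 1; });
  return { model: sum };
};

const out = runMapReduce(docs, map, reduce);
console.log(JSON.stringify(out));
// [{"_id":"HONDA","value":{"model":2}},{"_id":"TESLA","value":{"model":"ROADSTER"}}]
```

The TESLA entry keeps its raw mapped value because reduce never runs for a key with a single emitted value.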
// Run the job and keep the result descriptor in a variable.
// **CAREFUL** - the output is saved as a new collection and
// would overwrite an existing collection of the same name.
var op = db.vehicle.mapReduce(map, reduce, { out: "myoutput" });
The above saves the results of the mapReduce into the myoutput collection.
// Return results
db[op.result].find();
By doing the above, the results are printed to the screen.
PRIMARY> db[op.result].find();
{ "_id" : "ALFA ROMEO", "value" : { "model" : 25 } }
{ "_id" : "ASTON MARTIN", "value" : { "model" : 48 } }
{ "_id" : "AUDI", "value" : { "model" : 33 } }
{ "_id" : "BENTLEY", "value" : { "model" : 8 } }
{ "_id" : "BMW", "value" : { "model" : 32 } }
{ "_id" : "CHRYSLER", "value" : { "model" : "YPSILON" } }
{ "_id" : "CITROEN", "value" : { "model" : "C4 PICASSO" } }
{ "_id" : "CORVETTE", "value" : { "model" : "C6" } }
{ "_id" : "FIAT", "value" : { "model" : 9 } }
{ "_id" : "FORD", "value" : { "model" : 14 } }
{ "_id" : "HONDA", "value" : { "model" : 2 } }
{ "_id" : "HYUNDAI", "value" : { "model" : 5 } }
{ "_id" : "JAGUAR", "value" : { "model" : 3 } }
{ "_id" : "KIA", "value" : { "model" : 2 } }
{ "_id" : "LAMBORGHINI", "value" : { "model" : 24 } }
{ "_id" : "LAND ROVER", "value" : { "model" : 8 } }
{ "_id" : "LOTUS", "value" : { "model" : 5 } }
{ "_id" : "MASERATI", "value" : { "model" : 6 } }
{ "_id" : "MAZDA", "value" : { "model" : 2 } }
{ "_id" : "MERCEDES-BENZ", "value" : { "model" : 7 } }
has more
Note that not all documents were shown: the shell limits the number to 20 when automatically iterating a cursor. To return the next set of results, use the it shortcut.
PRIMARY> it
{ "_id" : "MINI", "value" : { "model" : 11 } }
{ "_id" : "MITSUBISHI", "value" : { "model" : "COLT" } }
{ "_id" : "NISSAN", "value" : { "model" : 5 } }
{ "_id" : "PERODUA", "value" : { "model" : 51 } }
{ "_id" : "PORSCHE", "value" : { "model" : 10 } }
{ "_id" : "PROTON", "value" : { "model" : 2 } }
{ "_id" : "RENAULT", "value" : { "model" : 4 } }
{ "_id" : "ROLLS-ROYCE", "value" : { "model" : "PHANTOM" } }
{ "_id" : "SSANGYONG", "value" : { "model" : "RODIUS" } }
{ "_id" : "SUZUKI", "value" : { "model" : 2 } }
{ "_id" : "TESLA", "value" : { "model" : "ROADSTER" } }
{ "_id" : "TOYOTA", "value" : { "model" : "PRIUS" } }
{ "_id" : "VOLKSWAGEN", "value" : { "model" : 5 } }
{ "_id" : "VOLVO", "value" : { "model" : 7 } }
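Makes such as CHRYSLER, TESLA and TOYOTA show a model name instead of a number in the output above. MongoDB does not call reduce for a key that has only one emitted value, so for those makes the raw value from map is stored unchanged. One common way to get a numeric count for every make, sketched here as an assumption rather than the original author's code, is to emit a count of 1 per vehicle and have reduce sum the counts. The names `countMap` and `countReduce` are hypothetical, and the last part is only a local check in plain JavaScript with a stand-in `emit`.

```javascript
// Assumed replacement for map: emit a count of 1 per vehicle.
const countMap = function () { emit(this.make, { count: 1 }); };

// Assumed replacement for reduce: sum the counts, so partial results can be
// fed back in (re-reduced) without changing the total.
const countReduce = function (key, values) {
  var total = 0;
  values.forEach(function (doc) { total += doc.count; });
  return { count: total };
};

// Local check without a database: capture what map emits.
const captured = [];
globalThis.emit = function (key, value) { captured.push({ key: key, value: value }); };
countMap.call({ make: "TESLA", model: "ROADSTER" });
console.log(JSON.stringify(captured)); // [{"key":"TESLA","value":{"count":1}}]

// Re-reducing partial results gives the same answer as reducing all at once.
const ones = [{ count: 1 }, { count: 1 }, { count: 1 }];
const allAtOnce = countReduce("FORD", ones);
const inTwoSteps = countReduce("FORD", [countReduce("FORD", ones.slice(0, 2)), ones[2]]);
console.log(allAtOnce.count, inTwoSteps.count); // 3 3
```

In the shell these two functions would be passed to db.vehicle.mapReduce in place of map and reduce, and a make with a single vehicle would then be stored with a count of 1.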
To clean up:
// Drop the output collection
db.myoutput.drop();
See references:
Mongo Docs: http://www.mongodb.org/display/DOCS/Aggregation
MongoDB Aggregation: http://kylebanker.com/blog/2009/11/mongodb-count-group/
Map Reduce: http://www.mongodb.org/display/DOCS/MapReduce