Count Distinct Values via aggregation framework

Q: Is it possible to count distinct values of a field in mongodb?

A: Yes! This can be done via the aggregation framework in mongo. This takes two group commands; the first groups by all the distinct values, and the second does a count of them all.

pipeline = [ 
    { $group: { _id: "$myNonUniqueFieldId"}  },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

db.runCommand( 
    {
    "aggregate": "collection" , 
    "pipeline": pipeline
    }
);
Advertisements

Aggregation & mapReduce in Mongo

In Oracle, or SQL Server or any other RDBMS, SQL makes it very easy to do aggregation by using GROUP BY.

SELECT make, count(model) FROM Vehicle
GROUP by make 
order by make

In Mongo, Its slightly more complex to build group by aggregation from collections. There are many ways of getting different counts out, and some are explored below.

DISTINCT

//Distinct all Makes of cars in Collection
db.vehicle.distinct('model');

The above is the same as doing a Select Distinct in SQL. The data returned is all distinct types of vehicle models in the collection

COUNT

//Count all Ford Cards
db.vehicle.count({make:"FORD"});

The above will count all ford cars in the vehicle collection

GROUP

Generally only applicable for collections of up to 10,000 documents. Errors have been reported if attempted on larger sets. Use mapreduce for larger data collections.

// Group by make
db.vehicle.group(
           {key: { make:true},
            reduce: function(obj,prev) { prev.count++;},
            initial: { count: 0 }
            });

The result is something a bit like this:

{

        "make" : "ALFA ROMEO",

        "count" : 25

},

{

        "make" : "CORVETTE",

        "count" : 1

},

{

        "make" : "HONDA",

        "count" : 2

},

The collection is being aggregated at the make level. The reduce function is passing a aggregated count back to the object using the key criteria. This can be further enhanced by adding in the model document

// Group by make, model
db.vehicle.group(
           {key: { make:true, model:true},
            reduce: function(obj,prev) { prev.count++;},
            initial: { count: 0 }
            });

The result is something like this:

{

        "make" : "ALFA ROMEO",

        "model" : "MITO",

        "count" : 2

},

{

        "make" : "ASTON MARTIN",

        "model" : "VIRAGE",

        "count" : 1

},

{

        "make" : "CORVETTE",

        "model" : "C6",

        "count" : 1

},

The above 2 example use a cumulative count:

prev.count++;

but the same can also be achieved by doing:

prev.count += 1

MAP REDUCE

Map-reduce is made up of two operations. The map part formats the data into key-value pairs, and the reduce builds the data set.

Eg: we want to return the count of the number of values for each type of make of vehicle.

//Map Function
var map = function() {
    emit( this.make, {model: this.model} );
  }

Create the MAP function. We are essentially saying return me the make of vehicle, but for that vehicle, my aggregator is the make document.

//Reduce Function
//Add up all the votes for each key.
var reduce = function(key, values) {
  var sum = 0;
  values.forEach(function(doc) {
    sum += 1;
  });
  return {model: sum};
}

Create the REDUCE function. the MAP values are automatically passed into the reduce function. Note, that at this stage we havnt actually specified the collection we want to run this mapReduce on.

//var into output **CAREFUL** - saves it as a new collection 
//would overwrite existing collection
var op = db.vehicle.mapReduce(map,reduce, {out : "myoutput"} );

The above saves the results of the mapReduce into a myoutput collection.

//return results
db[op.result].find();

By doing the above, the results are returned to screen.

PRIMARY> db[op.result].find();

{ "_id" : "ALFA ROMEO", "value" : { "model" : 25 } }

{ "_id" : "ASTON MARTIN", "value" : { "model" : 48 } }

{ "_id" : "AUDI", "value" : { "model" : 33 } }

{ "_id" : "BENTLEY", "value" : { "model" : 8 } }

{ "_id" : "BMW", "value" : { "model" : 32 } }

{ "_id" : "CHRYSLER", "value" : { "model" : "YPSILON" } }

{ "_id" : "CITROEN", "value" : { "model" : "C4 PICASSO" } }

{ "_id" : "CORVETTE", "value" : { "model" : "C6" } }

{ "_id" : "FIAT", "value" : { "model" : 9 } }

{ "_id" : "FORD", "value" : { "model" : 14 } }

{ "_id" : "HONDA", "value" : { "model" : 2 } }

{ "_id" : "HYUNDAI", "value" : { "model" : 5 } }

{ "_id" : "JAGUAR", "value" : { "model" : 3 } }

{ "_id" : "KIA", "value" : { "model" : 2 } }

{ "_id" : "LAMBORGHINI", "value" : { "model" : 24 } }

{ "_id" : "LAND ROVER", "value" : { "model" : 8 } }

{ "_id" : "LOTUS", "value" : { "model" : 5 } }

{ "_id" : "MASERATI", "value" : { "model" : 6 } }

{ "_id" : "MAZDA", "value" : { "model" : 2 } }

{ "_id" : "MERCEDES-BENZ", "value" : { "model" : 7 } }

has more

Note that not all documents were shown – the shell limits the number to 20 when automatically iterating a cursor. If we want to return the next set of results, there’s the it shortcut

has more

PRIMARY>

PRIMARY> it

{ "_id" : "MINI", "value" : { "model" : 11 } }

{ "_id" : "MITSUBISHI", "value" : { "model" : "COLT" } }

{ "_id" : "NISSAN", "value" : { "model" : 5 } }

{ "_id" : "PERODUA", "value" : { "model" : 51 } }

{ "_id" : "PORSCHE", "value" : { "model" : 10 } }

{ "_id" : "PROTON", "value" : { "model" : 2 } }

{ "_id" : "RENAULT", "value" : { "model" : 4 } }

{ "_id" : "ROLLS-ROYCE", "value" : { "model" : "PHANTOM" } }

{ "_id" : "SSANGYONG", "value" : { "model" : "RODIUS" } }

{ "_id" : "SUZUKI", "value" : { "model" : 2 } }

{ "_id" : "TESLA", "value" : { "model" : "ROADSTER" } }

{ "_id" : "TOYOTA", "value" : { "model" : "PRIUS" } }

{ "_id" : "VOLKSWAGEN", "value" : { "model" : 5 } }

{ "_id" : "VOLVO", "value" : { "model" : 7 } }

To Clean up:

//Drop the output collection
db.myoutput.drop();

See references:

Mongo Docs: http://www.mongodb.org/display/DOCS/Aggregation

MongoDB Aggregation: http://kylebanker.com/blog/2009/11/mongodb-count-group/

Map Reduce: http://www.mongodb.org/display/DOCS/MapReduce