Provide example of computing summary data in MongoDB #5

NicMcPhee · 2018-02-26T18:23:09Z

We ask them to use MongoDB to compute some summary data, but they have no examples of that in the starter code. We should fix that.

kklamberty · 2018-12-12T05:58:26Z

We have discussed adding some kind of summary data about users.

NicMcPhee · 2018-12-12T17:32:58Z

Do you remember more specifically what we had been thinking of? Something like the number of people for each company would be a start, but that's a lot simpler than what we ask them to do. Maybe number of people for each company, split into two or three groups by age?

kklamberty · 2018-12-12T17:37:35Z

I don't think we had a specific plan, but I like your idea of sorting/grouping by age. This kind of information might be helpful for a real company to see who might need/want information on retirement, so we can make a decent use case. We could do something less interesting with a ridiculous use case like for some reason we care how many letters are in someone's name? How many start with a certain letter? Maybe we are making monogrammed towels to save on paper towel waste and want to know how many of each type to make? Or, we need to estimate the cost of personalizing the towels to be green and also stop the spread of germs?

NicMcPhee · 2019-02-20T19:06:46Z

So I'm going with a summary by company, followed by age, so something like:

[
   {
      company: String,
      ageBreakdown: {
         numUnder30: Integer,
         between30and55: Integer,
         over55: Integer
     }
  },
  ...
]

I used https://next.json-generator.com/ to generate the data using this template:

  {
    'repeat(100)': {
      _id: {
        "oid": '{{objectId()}}',
      },
      name: '{{firstName()}} {{surname()}}',
      age: '{{integer(20, 75)}}',
      company: "{{random('Velity', 'Dognost', 'Caxt', 'Eschoir', 'Blurrybus', 'Overplex')}}",
      email(tags) {
        return `${this.name.split(" ")[0]}.${this.name.split(" ")[1]}@${this.company}${tags.domainZone()}`.toLowerCase();
      },
    },
  }
]

NicMcPhee · 2019-02-22T21:28:12Z

This is essentially a problem of nested groups, as we want to group by both company and age range. That didn't turn out to be trivial. After a fair bit of flailing, I eventually came up with the following query:

db.users.aggregate( [ 
   { $bucket: { groupBy: "$age", boundaries: [ 0, 30, 56, 200 ], 
         output: { companies: { $push: "$company" } } } },
   { $unwind: "$companies" }, 
   { $group: { _id: { bkt: "$_id", company: "$companies" }, count: { $sum: 1 } } },
   { $project: { _id: "$_id.company", bucket: "$_id.bkt", count: "$count" } },
   { $sort: { _id: 1, bucket: 1 } } ] )

that returns:

{ "_id" : "Blurrybus", "bucket" : 0, "count" : 10 }
{ "_id" : "Blurrybus", "bucket" : 30, "count" : 19 }
{ "_id" : "Blurrybus", "bucket" : 56, "count" : 13 }
{ "_id" : "Caxt", "bucket" : 0, "count" : 6 }
{ "_id" : "Caxt", "bucket" : 30, "count" : 18 }
{ "_id" : "Caxt", "bucket" : 56, "count" : 24 }
{ "_id" : "Dognost", "bucket" : 0, "count" : 9 }
{ "_id" : "Dognost", "bucket" : 30, "count" : 17 }
{ "_id" : "Dognost", "bucket" : 56, "count" : 11 }
{ "_id" : "Eschoir", "bucket" : 0, "count" : 7 }
{ "_id" : "Eschoir", "bucket" : 30, "count" : 16 }
{ "_id" : "Eschoir", "bucket" : 56, "count" : 15 }
{ "_id" : "Overplex", "bucket" : 0, "count" : 4 }
{ "_id" : "Overplex", "bucket" : 30, "count" : 21 }
{ "_id" : "Overplex", "bucket" : 56, "count" : 21 }
{ "_id" : "Velity", "bucket" : 0, "count" : 8 }
{ "_id" : "Velity", "bucket" : 30, "count" : 15 }
{ "_id" : "Velity", "bucket" : 56, "count" : 16 }

which can easily be restructured into the desired form by the server.

This query is quite complicated, and I have a feeling that using the map-reduce form of aggregation might have been easier, but I'm not sure of that. Below I'll try to document what's happening in the query.

$bucket

The first step takes all the users and puts them in buckets depending on whether their age is in the range [0, 30), [30, 56), or [56, 200). The ranges are exclusive on the right, so the middle bucket is between 30 and 55, inclusive on both ends. I'm assuming here that no one is older than 200; an alternative would have been to use the default option to $bucket for the top bucket.

The $push creates an array of all the companies for each entry in each bucket, so on it's own this step yields something like:

{ "_id" : 0, "companies" : [ "Dognost", "Eschoir", "Caxt", "Velity", "Blurrybus", ...] }
{ "_id" : 30, "companies" : [ "Eschoir", "Dognost", "Dognost", "Overplex", … ] }
{ "_id" : 56, "companies" : [ "Blurrybus", "Dognost", … ]}

$unwind

This "unwinds" all those arrays of companies so we have separate entries for each bucket/company pair:

{ "_id" : 0, "companies" : "Dognost" }
{ "_id" : 0, "companies" : "Eschoir" }
{ "_id" : 0, "companies" : "Caxt" }
{ "_id" : 0, "companies" : "Velity" }
{ "_id" : 0, "companies" : "Blurrybus" }
...

Here the _id is actually the age bucket (0, 30, or 56).

$group (and $project and $sort)

The $group aggregation groups by company and age bucket, and counts how many there are, yielding something like:

{ "_id" : { "bkt" : 56, "company" : "Eschoir" }, "count" : 15 }
{ "_id" : { "bkt" : 56, "company" : "Velity" }, "count" : 16 }
{ "_id" : { "bkt" : 56, "company" : "Caxt" }, "count" : 24 }
{ "_id" : { "bkt" : 56, "company" : "Dognost" }, "count" : 11 }
{ "_id" : { "bkt" : 56, "company" : "Blurrybus" }, "count" : 13 }
{ "_id" : { "bkt" : 30, "company" : "Caxt" }, "count" : 18 }
{ "_id" : { "bkt" : 30, "company" : "Blurrybus" }, "count" : 19 }
{ "_id" : { "bkt" : 0, "company" : "Blurrybus" }, "count" : 10 }
{ "_id" : { "bkt" : 0, "company" : "Caxt" }, "count" : 6 }
{ "_id" : { "bkt" : 0, "company" : "Velity" }, "count" : 8 }
{ "_id" : { "bkt" : 0, "company" : "Eschoir" }, "count" : 7 }
{ "_id" : { "bkt" : 30, "company" : "Dognost" }, "count" : 17 }
{ "_id" : { "bkt" : 0, "company" : "Dognost" }, "count" : 9 }
{ "_id" : { "bkt" : 0, "company" : "Overplex" }, "count" : 4 }
{ "_id" : { "bkt" : 30, "company" : "Overplex" }, "count" : 21 }
{ "_id" : { "bkt" : 56, "company" : "Overplex" }, "count" : 21 }
{ "_id" : { "bkt" : 30, "company" : "Eschoir" }, "count" : 16 }
{ "_id" : { "bkt" : 30, "company" : "Velity" }, "count" : 15 }

At this point we're arguably done as this has all the data in a form we could use on the server. The following $project and $sort just reorganize and sort the data into a form that will make it slightly easier for the server to process.

NicMcPhee · 2019-02-22T23:11:04Z

I think that it is indeed easier using mapReduce(), although that's not trivial either. My mapReduce solution (that works in the Mongo Shell - I still need to convert this to Java) is:

var mapFunction = function() { 
    emit({ company: this.company, 
           ageBracket: (this.age<30?"under30":((this.age<=55)?"between30and55":"over55")) }, 
         1); 
};

var reduceFunction = function(k, vs) { return Array.sum(vs) }

db.users.mapReduce(mapFunction, reduceFunction, { out: { inline: 1 } })

which returns:

{
	"results" : [
		{
			"_id" : {
				"company" : "Blurrybus",
				"ageBracket" : "between30and55"
			},
			"value" : 19
		},
		{
			"_id" : {
				"company" : "Blurrybus",
				"ageBracket" : "over55"
			},
			"value" : 13
		},
		{
			"_id" : {
				"company" : "Blurrybus",
				"ageBracket" : "under30"
			},
			"value" : 10
		},
		{
			"_id" : {
				"company" : "Caxt",
				"ageBracket" : "between30and55"
			},
			"value" : 18
...

The output could be cleaned up some with a finalize clause, but we wouldn't be able to (easily) group companies together into a single entry without introducing a whole grouping pipeline after this. (That would work, but it's not clear if we care enough?)

I think there's probably an even cleaner mapReduce() solution, but this at least shows that it's possible and not too weird. I really don't like the use of the ?: ternary operator, and even worse it's nested! I suspect I can avoid that when I switch to Java.

This basically completes at least one version of the server-side code for issue #23, providing an example of computing summary data from the database. This uses MongoDB's map-reduce tool, and it's not _too_ bad except for all the insane JSON/BSON document manipulation that's necessary to do this kind of stuff in Java. Ugh. There's still no client-side code for this, and no write-up or documentation anywhere. There probably should be some substantial comments added to the code as well.

kklamberty · 2019-09-25T19:01:16Z

We created a related issue and closed that. I'm not sure that we should close this issue, but I am tempted to close it. For now, I will take off the high priority label since it is no longer urgent. It's still interesting, and I'm not sure what to do, explicitly, with the issue itself... nor am I sure how to incorporate this meaningfully in a lab. I personally think this would be a wonderful example to flesh out and make available for students to use as a model when they are working on the project.

kklamberty · 2020-02-19T17:33:24Z

Even if we don't require student to do anything with this, it would be really neat to link to a couple of examples of doing this. Those examples could be part of what we give the students and we could even include multiple ways of doing this work to show them the possibilities. Especially since @NicMcPhee did a lot of work to figure this out and it's all right here, it would make sense to include it in the lab for students.

kklamberty · 2020-02-19T18:19:16Z

@wallerli or @floogulinc - Do either of you have suggestions about how to handle this task? @NicMcPhee tried a couple of things (listed and described in detail above), but we thought maybe you'd have some ideas.

NicMcPhee · 2024-02-13T17:10:47Z

So I sort-of did a thing that's at least related to this in #81, which aggregates the user data by company, creating a list of companies, each of which contains a list of Users (employees) projected down to just their name and ID. I then added UI and E2E/Junit tests (not so much Karma) for the code.

In a perfect world, the lists of user names would actually be links to the user profiles, but I ran out of time, so that can be a feature for another day.

I still like the idea of summarizing companies by age of employees that is described above, so I'm leaving this open in the hopes that one day we'll get there.

I found GitHub CoPilot quite useful in creating and debugging queries. There's also a MongoDB specific AI chatbot available when you're in the MongoDB docs; I also found this somewhat useful.

kklamberty assigned NicMcPhee Dec 12, 2018

kklamberty closed this as completed Sep 25, 2019

kklamberty reopened this Sep 25, 2019

kklamberty transferred this issue from UMM-CSci-3601/3601-lab4_mongo-pre-S20 Feb 19, 2020

floogulinc transferred this issue from UMM-CSci-3601/3601-lab4-mongo-2020 Mar 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide example of computing summary data in MongoDB #5

Provide example of computing summary data in MongoDB #5

NicMcPhee commented Feb 26, 2018

kklamberty commented Dec 12, 2018

NicMcPhee commented Dec 12, 2018

kklamberty commented Dec 12, 2018

NicMcPhee commented Feb 20, 2019 •

edited

Loading

NicMcPhee commented Feb 22, 2019

NicMcPhee commented Feb 22, 2019

kklamberty commented Sep 25, 2019

kklamberty commented Feb 19, 2020

kklamberty commented Feb 19, 2020

NicMcPhee commented Feb 13, 2024

Provide example of computing summary data in MongoDB #5

Provide example of computing summary data in MongoDB #5

Comments

NicMcPhee commented Feb 26, 2018

kklamberty commented Dec 12, 2018

NicMcPhee commented Dec 12, 2018

kklamberty commented Dec 12, 2018

NicMcPhee commented Feb 20, 2019 • edited Loading

NicMcPhee commented Feb 22, 2019

NicMcPhee commented Feb 22, 2019

kklamberty commented Sep 25, 2019

kklamberty commented Feb 19, 2020

kklamberty commented Feb 19, 2020

NicMcPhee commented Feb 13, 2024

NicMcPhee commented Feb 20, 2019 •

edited

Loading