Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide example of computing summary data in MongoDB #5

Open
NicMcPhee opened this issue Feb 26, 2018 · 10 comments
Open

Provide example of computing summary data in MongoDB #5

NicMcPhee opened this issue Feb 26, 2018 · 10 comments
Assignees

Comments

@NicMcPhee
Copy link
Member

We ask them to use MongoDB to compute some summary data, but they have no examples of that in the starter code. We should fix that.

@kklamberty
Copy link
Member

We have discussed adding some kind of summary data about users.

@NicMcPhee
Copy link
Member Author

Do you remember more specifically what we had been thinking of? Something like the number of people for each company would be a start, but that's a lot simpler than what we ask them to do. Maybe number of people for each company, split into two or three groups by age?

@kklamberty
Copy link
Member

I don't think we had a specific plan, but I like your idea of sorting/grouping by age. This kind of information might be helpful for a real company to see who might need/want information on retirement, so we can make a decent use case. We could do something less interesting with a ridiculous use case like for some reason we care how many letters are in someone's name? How many start with a certain letter? Maybe we are making monogrammed towels to save on paper towel waste and want to know how many of each type to make? Or, we need to estimate the cost of personalizing the towels to be green and also stop the spread of germs?

@NicMcPhee
Copy link
Member Author

NicMcPhee commented Feb 20, 2019

So I'm going with a summary by company, followed by age, so something like:

[
   {
      company: String,
      ageBreakdown: {
         numUnder30: Integer,
         between30and55: Integer,
         over55: Integer
     }
  },
  ...
]

I used https://next.json-generator.com/ to generate the data using this template:

  {
    'repeat(100)': {
      _id: {
        "oid": '{{objectId()}}',
      },
      name: '{{firstName()}} {{surname()}}',
      age: '{{integer(20, 75)}}',
      company: "{{random('Velity', 'Dognost', 'Caxt', 'Eschoir', 'Blurrybus', 'Overplex')}}",
      email(tags) {
        return `${this.name.split(" ")[0]}.${this.name.split(" ")[1]}@${this.company}${tags.domainZone()}`.toLowerCase();
      },
    },
  }
]

@NicMcPhee
Copy link
Member Author

This is essentially a problem of nested groups, as we want to group by both company and age range. That didn't turn out to be trivial. After a fair bit of flailing, I eventually came up with the following query:

db.users.aggregate( [ 
   { $bucket: { groupBy: "$age", boundaries: [ 0, 30, 56, 200 ], 
         output: { companies: { $push: "$company" } } } },
   { $unwind: "$companies" }, 
   { $group: { _id: { bkt: "$_id", company: "$companies" }, count: { $sum: 1 } } },
   { $project: { _id: "$_id.company", bucket: "$_id.bkt", count: "$count" } },
   { $sort: { _id: 1, bucket: 1 } } ] )

that returns:

{ "_id" : "Blurrybus", "bucket" : 0, "count" : 10 }
{ "_id" : "Blurrybus", "bucket" : 30, "count" : 19 }
{ "_id" : "Blurrybus", "bucket" : 56, "count" : 13 }
{ "_id" : "Caxt", "bucket" : 0, "count" : 6 }
{ "_id" : "Caxt", "bucket" : 30, "count" : 18 }
{ "_id" : "Caxt", "bucket" : 56, "count" : 24 }
{ "_id" : "Dognost", "bucket" : 0, "count" : 9 }
{ "_id" : "Dognost", "bucket" : 30, "count" : 17 }
{ "_id" : "Dognost", "bucket" : 56, "count" : 11 }
{ "_id" : "Eschoir", "bucket" : 0, "count" : 7 }
{ "_id" : "Eschoir", "bucket" : 30, "count" : 16 }
{ "_id" : "Eschoir", "bucket" : 56, "count" : 15 }
{ "_id" : "Overplex", "bucket" : 0, "count" : 4 }
{ "_id" : "Overplex", "bucket" : 30, "count" : 21 }
{ "_id" : "Overplex", "bucket" : 56, "count" : 21 }
{ "_id" : "Velity", "bucket" : 0, "count" : 8 }
{ "_id" : "Velity", "bucket" : 30, "count" : 15 }
{ "_id" : "Velity", "bucket" : 56, "count" : 16 }

which can easily be restructured into the desired form by the server.

This query is quite complicated, and I have a feeling that using the map-reduce form of aggregation might have been easier, but I'm not sure of that. Below I'll try to document what's happening in the query.

$bucket

The first step takes all the users and puts them in buckets depending on whether their age is in the range [0, 30), [30, 56), or [56, 200). The ranges are exclusive on the right, so the middle bucket is between 30 and 55, inclusive on both ends. I'm assuming here that no one is older than 200; an alternative would have been to use the default option to $bucket for the top bucket.

The $push creates an array of all the companies for each entry in each bucket, so on it's own this step yields something like:

{ "_id" : 0, "companies" : [ "Dognost", "Eschoir", "Caxt", "Velity", "Blurrybus", ...] }
{ "_id" : 30, "companies" : [ "Eschoir", "Dognost", "Dognost", "Overplex", ] }
{ "_id" : 56, "companies" : [ "Blurrybus", "Dognost", ]}

$unwind

This "unwinds" all those arrays of companies so we have separate entries for each bucket/company pair:

{ "_id" : 0, "companies" : "Dognost" }
{ "_id" : 0, "companies" : "Eschoir" }
{ "_id" : 0, "companies" : "Caxt" }
{ "_id" : 0, "companies" : "Velity" }
{ "_id" : 0, "companies" : "Blurrybus" }
...

Here the _id is actually the age bucket (0, 30, or 56).

$group (and $project and $sort)

The $group aggregation groups by company and age bucket, and counts how many there are, yielding something like:

{ "_id" : { "bkt" : 56, "company" : "Eschoir" }, "count" : 15 }
{ "_id" : { "bkt" : 56, "company" : "Velity" }, "count" : 16 }
{ "_id" : { "bkt" : 56, "company" : "Caxt" }, "count" : 24 }
{ "_id" : { "bkt" : 56, "company" : "Dognost" }, "count" : 11 }
{ "_id" : { "bkt" : 56, "company" : "Blurrybus" }, "count" : 13 }
{ "_id" : { "bkt" : 30, "company" : "Caxt" }, "count" : 18 }
{ "_id" : { "bkt" : 30, "company" : "Blurrybus" }, "count" : 19 }
{ "_id" : { "bkt" : 0, "company" : "Blurrybus" }, "count" : 10 }
{ "_id" : { "bkt" : 0, "company" : "Caxt" }, "count" : 6 }
{ "_id" : { "bkt" : 0, "company" : "Velity" }, "count" : 8 }
{ "_id" : { "bkt" : 0, "company" : "Eschoir" }, "count" : 7 }
{ "_id" : { "bkt" : 30, "company" : "Dognost" }, "count" : 17 }
{ "_id" : { "bkt" : 0, "company" : "Dognost" }, "count" : 9 }
{ "_id" : { "bkt" : 0, "company" : "Overplex" }, "count" : 4 }
{ "_id" : { "bkt" : 30, "company" : "Overplex" }, "count" : 21 }
{ "_id" : { "bkt" : 56, "company" : "Overplex" }, "count" : 21 }
{ "_id" : { "bkt" : 30, "company" : "Eschoir" }, "count" : 16 }
{ "_id" : { "bkt" : 30, "company" : "Velity" }, "count" : 15 }

At this point we're arguably done as this has all the data in a form we could use on the server. The following $project and $sort just reorganize and sort the data into a form that will make it slightly easier for the server to process.

@NicMcPhee
Copy link
Member Author

I think that it is indeed easier using mapReduce(), although that's not trivial either. My mapReduce solution (that works in the Mongo Shell - I still need to convert this to Java) is:

var mapFunction = function() { 
    emit({ company: this.company, 
           ageBracket: (this.age<30?"under30":((this.age<=55)?"between30and55":"over55")) }, 
         1); 
};

var reduceFunction = function(k, vs) { return Array.sum(vs) }

db.users.mapReduce(mapFunction, reduceFunction, { out: { inline: 1 } })

which returns:

{
	"results" : [
		{
			"_id" : {
				"company" : "Blurrybus",
				"ageBracket" : "between30and55"
			},
			"value" : 19
		},
		{
			"_id" : {
				"company" : "Blurrybus",
				"ageBracket" : "over55"
			},
			"value" : 13
		},
		{
			"_id" : {
				"company" : "Blurrybus",
				"ageBracket" : "under30"
			},
			"value" : 10
		},
		{
			"_id" : {
				"company" : "Caxt",
				"ageBracket" : "between30and55"
			},
			"value" : 18
...

The output could be cleaned up some with a finalize clause, but we wouldn't be able to (easily) group companies together into a single entry without introducing a whole grouping pipeline after this. (That would work, but it's not clear if we care enough?)

I think there's probably an even cleaner mapReduce() solution, but this at least shows that it's possible and not too weird. I really don't like the use of the ?: ternary operator, and even worse it's nested! I suspect I can avoid that when I switch to Java.

NicMcPhee referenced this issue in UMM-CSci-3601/3601-lab4_mongo-pre-S20 Feb 24, 2019
This basically completes at least one version of the server-side code
for issue #23, providing an example of computing summary data from the
database.

This uses MongoDB's map-reduce tool, and it's not _too_ bad except for
all the insane JSON/BSON document manipulation that's necessary to do
this kind of stuff in Java. Ugh.

There's still no client-side code for this, and no write-up or
documentation anywhere. There probably should be some substantial
comments added to the code as well.
@kklamberty kklamberty reopened this Sep 25, 2019
@kklamberty
Copy link
Member

We created a related issue and closed that. I'm not sure that we should close this issue, but I am tempted to close it. For now, I will take off the high priority label since it is no longer urgent. It's still interesting, and I'm not sure what to do, explicitly, with the issue itself... nor am I sure how to incorporate this meaningfully in a lab. I personally think this would be a wonderful example to flesh out and make available for students to use as a model when they are working on the project.

@kklamberty
Copy link
Member

Even if we don't require student to do anything with this, it would be really neat to link to a couple of examples of doing this. Those examples could be part of what we give the students and we could even include multiple ways of doing this work to show them the possibilities. Especially since @NicMcPhee did a lot of work to figure this out and it's all right here, it would make sense to include it in the lab for students.

@kklamberty
Copy link
Member

@wallerli or @floogulinc - Do either of you have suggestions about how to handle this task? @NicMcPhee tried a couple of things (listed and described in detail above), but we thought maybe you'd have some ideas.

@kklamberty kklamberty transferred this issue from UMM-CSci-3601/3601-lab4_mongo-pre-S20 Feb 19, 2020
@floogulinc floogulinc transferred this issue from UMM-CSci-3601/3601-lab4-mongo-2020 Mar 1, 2021
@NicMcPhee
Copy link
Member Author

So I sort-of did a thing that's at least related to this in #81, which aggregates the user data by company, creating a list of companies, each of which contains a list of Users (employees) projected down to just their name and ID. I then added UI and E2E/Junit tests (not so much Karma) for the code.

In a perfect world, the lists of user names would actually be links to the user profiles, but I ran out of time, so that can be a feature for another day.

I still like the idea of summarizing companies by age of employees that is described above, so I'm leaving this open in the hopes that one day we'll get there.

I found GitHub CoPilot quite useful in creating and debugging queries. There's also a MongoDB specific AI chatbot available when you're in the MongoDB docs; I also found this somewhat useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants