Provide example of computing summary data in MongoDB #5
Comments
We have discussed adding some kind of summary data about users. |
Do you remember more specifically what we had been thinking of? Something like the number of people for each company would be a start, but that's a lot simpler than what we ask them to do. Maybe number of people for each company, split into two or three groups by age? |
I don't think we had a specific plan, but I like your idea of sorting/grouping by age. This kind of information might be helpful for a real company to see who might need/want information on retirement, so we can make a decent use case. We could do something less interesting with a ridiculous use case like for some reason we care how many letters are in someone's name? How many start with a certain letter? Maybe we are making monogrammed towels to save on paper towel waste and want to know how many of each type to make? Or, we need to estimate the cost of personalizing the towels to be green and also stop the spread of germs? |
So I'm going with a summary by company, followed by age, so something like:
I used https://next.json-generator.com/ to generate the data using this template:
|
This is essentially a problem of nested groups, as we want to group by both company and age range. That didn't turn out to be trivial. After a fair bit of flailing, I eventually came up with the following query:
db.users.aggregate( [
{ $bucket: { groupBy: "$age", boundaries: [ 0, 30, 56, 200 ],
output: { companies: { $push: "$company" } } } },
{ $unwind: "$companies" },
{ $group: { _id: { bkt: "$_id", company: "$companies" }, count: { $sum: 1 } } },
{ $project: { _id: "$_id.company", bucket: "$_id.bkt", count: "$count" } },
{ $sort: { _id: 1, bucket: 1 } } ] )
that returns:
{ "_id" : "Blurrybus", "bucket" : 0, "count" : 10 }
{ "_id" : "Blurrybus", "bucket" : 30, "count" : 19 }
{ "_id" : "Blurrybus", "bucket" : 56, "count" : 13 }
{ "_id" : "Caxt", "bucket" : 0, "count" : 6 }
{ "_id" : "Caxt", "bucket" : 30, "count" : 18 }
{ "_id" : "Caxt", "bucket" : 56, "count" : 24 }
{ "_id" : "Dognost", "bucket" : 0, "count" : 9 }
{ "_id" : "Dognost", "bucket" : 30, "count" : 17 }
{ "_id" : "Dognost", "bucket" : 56, "count" : 11 }
{ "_id" : "Eschoir", "bucket" : 0, "count" : 7 }
{ "_id" : "Eschoir", "bucket" : 30, "count" : 16 }
{ "_id" : "Eschoir", "bucket" : 56, "count" : 15 }
{ "_id" : "Overplex", "bucket" : 0, "count" : 4 }
{ "_id" : "Overplex", "bucket" : 30, "count" : 21 }
{ "_id" : "Overplex", "bucket" : 56, "count" : 21 }
{ "_id" : "Velity", "bucket" : 0, "count" : 8 }
{ "_id" : "Velity", "bucket" : 30, "count" : 15 }
{ "_id" : "Velity", "bucket" : 56, "count" : 16 }
which can easily be restructured into the desired form by the server. This query is quite complicated, and I have a feeling that using the map-reduce form of aggregation might have been easier, but I'm not sure of that. Below I'll try to document what's happening in the query.
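As a rough illustration of the restructuring step mentioned above, here is a minimal sketch in plain JavaScript. The nested target shape (company name, then bucket, then count) is an assumption, and `nestByCompany` is a hypothetical helper name; only the `_id`, `bucket`, and `count` field names come from the query output.

```javascript
// Hypothetical helper: regroup flat { _id, bucket, count } rows by company.
// The nested shape produced here is an assumed "desired form".
function nestByCompany(rows) {
  const byCompany = {};
  for (const { _id: company, bucket, count } of rows) {
    // Create the per-company object the first time the company appears.
    if (!(company in byCompany)) {
      byCompany[company] = {};
    }
    byCompany[company][bucket] = count;
  }
  return byCompany;
}

// A few rows copied from the query output above.
const sampleRows = [
  { _id: "Blurrybus", bucket: 0, count: 10 },
  { _id: "Blurrybus", bucket: 30, count: 19 },
  { _id: "Blurrybus", bucket: 56, count: 13 },
  { _id: "Caxt", bucket: 0, count: 6 },
];

const nested = nestByCompany(sampleRows);
console.log(JSON.stringify(nested));
```

Because the rows arrive sorted by company and bucket, a single pass is enough; the same idea should carry over to whatever language the server is written in.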
The first step takes all the users and puts them in buckets depending on whether their age is in the range [0, 30), [30, 56), or [56, 200). The ranges are exclusive on the right, so the middle bucket is between 30 and 55, inclusive on both ends. I'm assuming here that no one is 200 or older; an alternative would have been to use the The
{ "_id" : 0, "companies" : [ "Dognost", "Eschoir", "Caxt", "Velity", "Blurrybus", ...] }
{ "_id" : 30, "companies" : [ "Eschoir", "Dognost", "Dognost", "Overplex", … ] }
{ "_id" : 56, "companies" : [ "Blurrybus", "Dognost", … ]}
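To make the boundary behaviour concrete, here is a small plain-JavaScript sketch of the same bucketing rule. It only imitates what the bucketing stage does with these boundaries and is not MongoDB code; `bucketFor` is a hypothetical name introduced for illustration.

```javascript
// Return the lower bound of the half-open range [lower, upper) containing
// `age`, mirroring how the bucket _id values 0, 30, and 56 arise above.
function bucketFor(age, boundaries) {
  for (let i = 0; i < boundaries.length - 1; i++) {
    if (age >= boundaries[i] && age < boundaries[i + 1]) {
      return boundaries[i];
    }
  }
  // Values outside every range are rejected in this sketch.
  throw new RangeError(`age ${age} is outside every bucket`);
}

const boundaries = [0, 30, 56, 200];

// Ages on either side of each boundary.
const assigned = [29, 30, 55, 56].map((age) => bucketFor(age, boundaries));
console.log(assigned);
```

Ages 30 and 55 both land in the bucket labelled 30, while 56 starts the last bucket, which matches the "exclusive on the right" description.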
This "unwinds" all those arrays of companies so we have separate entries for each bucket/company pair:
{ "_id" : 0, "companies" : "Dognost" }
{ "_id" : 0, "companies" : "Eschoir" }
{ "_id" : 0, "companies" : "Caxt" }
{ "_id" : 0, "companies" : "Velity" }
{ "_id" : 0, "companies" : "Blurrybus" }
...
Here the
The
{ "_id" : { "bkt" : 56, "company" : "Eschoir" }, "count" : 15 }
{ "_id" : { "bkt" : 56, "company" : "Velity" }, "count" : 16 }
{ "_id" : { "bkt" : 56, "company" : "Caxt" }, "count" : 24 }
{ "_id" : { "bkt" : 56, "company" : "Dognost" }, "count" : 11 }
{ "_id" : { "bkt" : 56, "company" : "Blurrybus" }, "count" : 13 }
{ "_id" : { "bkt" : 30, "company" : "Caxt" }, "count" : 18 }
{ "_id" : { "bkt" : 30, "company" : "Blurrybus" }, "count" : 19 }
{ "_id" : { "bkt" : 0, "company" : "Blurrybus" }, "count" : 10 }
{ "_id" : { "bkt" : 0, "company" : "Caxt" }, "count" : 6 }
{ "_id" : { "bkt" : 0, "company" : "Velity" }, "count" : 8 }
{ "_id" : { "bkt" : 0, "company" : "Eschoir" }, "count" : 7 }
{ "_id" : { "bkt" : 30, "company" : "Dognost" }, "count" : 17 }
{ "_id" : { "bkt" : 0, "company" : "Dognost" }, "count" : 9 }
{ "_id" : { "bkt" : 0, "company" : "Overplex" }, "count" : 4 }
{ "_id" : { "bkt" : 30, "company" : "Overplex" }, "count" : 21 }
{ "_id" : { "bkt" : 56, "company" : "Overplex" }, "count" : 21 }
{ "_id" : { "bkt" : 30, "company" : "Eschoir" }, "count" : 16 }
{ "_id" : { "bkt" : 30, "company" : "Velity" }, "count" : 15 }
At this point we're arguably done, as this has all the data in a form we could use on the server. The following |
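For anyone trying to follow the unwind and group steps without a database handy, the sketch below imitates them in plain JavaScript on a tiny made-up input. The sample documents are invented for illustration and are not the real data; only the field names (`_id`, `companies`, `bkt`, `company`, `count`) are taken from the output shown above.

```javascript
// Made-up stand-in for the output of the bucketing stage.
const bucketed = [
  { _id: 0, companies: ["Dognost", "Eschoir", "Dognost"] },
  { _id: 30, companies: ["Eschoir", "Dognost"] },
];

// Unwind: emit one document per (bucket, company) pair.
const unwound = bucketed.flatMap((doc) =>
  doc.companies.map((company) => ({ _id: doc._id, companies: company }))
);

// Group: count the documents sharing the same (bucket, company) key.
// The key is serialized so it can be used as a Map key.
const counts = new Map();
for (const doc of unwound) {
  const key = JSON.stringify({ bkt: doc._id, company: doc.companies });
  counts.set(key, (counts.get(key) || 0) + 1);
}

const grouped = [...counts].map(([key, count]) => ({
  _id: JSON.parse(key),
  count,
}));
console.log(grouped);
```

The compound key is the important part: counting per bucket alone or per company alone would lose the nesting we are after.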
I think that it is indeed easier using map-reduce:
var mapFunction = function() {
emit({ company: this.company,
ageBracket: (this.age<30?"under30":((this.age<=55)?"between30and55":"over55")) },
1);
};
var reduceFunction = function(k, vs) { return Array.sum(vs) }
db.users.mapReduce(mapFunction, reduceFunction, { out: { inline: 1 } })
which returns:
{
"results" : [
{
"_id" : {
"company" : "Blurrybus",
"ageBracket" : "between30and55"
},
"value" : 19
},
{
"_id" : {
"company" : "Blurrybus",
"ageBracket" : "over55"
},
"value" : 13
},
{
"_id" : {
"company" : "Blurrybus",
"ageBracket" : "under30"
},
"value" : 10
},
{
"_id" : {
"company" : "Caxt",
"ageBracket" : "between30and55"
},
"value" : 18
...
The output could be cleaned up some with a I think there's probably an even cleaner |
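One possible simplification, offered as an untested sketch and not necessarily what anyone above had in mind: compute the age bracket directly inside the `$group` key with `$switch`, so that a single grouping stage handles both company and age range. The `pipeline` variable name and the reuse of 0/30/56 as bracket labels are my assumptions. It is also worth knowing that MongoDB 5.0 and later deprecate `mapReduce` in favor of the aggregation pipeline.

```javascript
// Assumed alternative, not taken from the thread: derive the age bracket
// inside the $group key so that no separate bucket/unwind pair is needed.
const pipeline = [
  {
    $group: {
      _id: {
        company: "$company",
        bucket: {
          $switch: {
            branches: [
              { case: { $lt: ["$age", 30] }, then: 0 },
              { case: { $lt: ["$age", 56] }, then: 30 },
            ],
            // Everything else, including ages of 200 and up, lands here.
            default: 56,
          },
        },
      },
      count: { $sum: 1 },
    },
  },
  { $sort: { "_id.company": 1, "_id.bucket": 1 } },
];

// In the mongo shell this would be run as: db.users.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

Unlike the bucketing version, this variant never rejects out-of-range ages, so it trades a safety check for brevity.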
This basically completes at least one version of the server-side code for issue #23, providing an example of computing summary data from the database. This uses MongoDB's map-reduce tool, and it's not _too_ bad except for all the insane JSON/BSON document manipulation that's necessary to do this kind of stuff in Java. Ugh. There's still no client-side code for this, and no write-up or documentation anywhere. There probably should be some substantial comments added to the code as well.
We created a related issue and closed that. I'm not sure that we should close this issue, but I am tempted to close it. For now, I will take off the high priority label since it is no longer urgent. It's still interesting, and I'm not sure what to do, explicitly, with the issue itself... nor am I sure how to incorporate this meaningfully in a lab. I personally think this would be a wonderful example to flesh out and make available for students to use as a model when they are working on the project. |
Even if we don't require students to do anything with this, it would be really neat to link to a couple of examples of doing this. Those examples could be part of what we give the students, and we could even include multiple ways of doing this work to show them the possibilities. Especially since @NicMcPhee did a lot of work to figure this out and it's all right here, it would make sense to include it in the lab for students. |
@wallerli or @floogulinc - Do either of you have suggestions about how to handle this task? @NicMcPhee tried a couple of things (listed and described in detail above), but we thought maybe you'd have some ideas. |
So I sort-of did a thing that's at least related to this in #81, which aggregates the user data by company, creating a list of companies, each of which contains a list of user names. In a perfect world, the lists of user names would actually be links to the user profiles, but I ran out of time, so that can be a feature for another day. I still like the idea of summarizing companies by age of employees that is described above, so I'm leaving this open in the hopes that one day we'll get there. I found GitHub Copilot quite useful in creating and debugging queries. There's also a MongoDB-specific AI chatbot available when you're in the MongoDB docs; I also found this somewhat useful. |
We ask them to use MongoDB to compute some summary data, but they have no examples of that in the starter code. We should fix that.