New Module Proposal - Advanced Data Structure Design #4640

Open
jamperezmondragon opened this issue Jul 24, 2024 · 1 comment
Labels: content (content-related issue)


jamperezmondragon commented Jul 24, 2024

Hello, I was talking to lunchbox and thought this could be an interesting addition to the guide; he told me to open an issue.

This proposal (and the next two) is based on classes I taught to the Mexican IOI team some months ago (I am the team leader).

This is just a draft; if you like it, I can work on it more and make a PR in the next couple of weeks.

Advanced data structure design

Inspired by this blog and its video.

What is an invariant? What is a monovariant? (Briefly: an invariant is a property that every operation of the structure preserves, while a monovariant is a quantity that only ever changes in one direction.)

An example problem to define and showcase these concepts is IMO 2009 C1.

The main idea of this module is to think about data structures as graphs where each node stores an invariant or a monovariant.

It's a good idea to start designing a data structure as a "bag" where you can add nodes. As you recognize more properties you'll need to solve the problem, use your degrees of freedom to narrow down the structure of the algorithm, connecting the nodes in some way and adding invariants or monovariants to them.

Definitions:

  • Deltas
  • Blocking
  • Lazy
  • Rollback

Example problems

Range Updates and Sums with Square Root Decomposition

The main idea is to do blocking and maintain several invariants for each block. Say there are $N/B$ blocks; we'll maintain the following variables:

  • First we'll maintain an array A, with each element corresponding to an element of the original array.
  • We will also maintain the array delta, with one element for each block.
  • An array lazyset with one element for each block.
  • An array sum with one element for each block.
  • And two arrays lsl and lsr with one element for each block.

Our invariants are:

  • At any moment, the sum of block $i$ is $sum[i]$.
  • At any moment, the value of the $i$-th element of the array is:
    • $lazyset[i/B] + delta[i/B]$ if $lsl[i/B] \le i \le lsr[i/B]$ (so our invariant is basically a range of values for which the lazyset operation applies).
    • $A[i] + delta[i/B]$ otherwise.

We will process updates in such a way that these invariants always hold. Whenever we process an update that is contained within a single block, we just rebuild that block in $O(B)$ time.

  • Consider sum updates:
    • For a block $i$ completely contained in the query, we just have to update the values of $sum[i]$ and $delta[i]$; this part has time complexity $O(N/B)$.
    • For the two partially overlapping blocks, we are only updating a prefix or a suffix, so we can easily update the corresponding ranges in the arrays lsl and lsr, or update the values of A for indices outside the range. This part has complexity $O(B)$.
  • For set updates, we proceed similarly: for blocks completely contained in the query we update the ranges, the lazy value, and the sum value; overlapping blocks are handled in the same way as above.

So each of our updates takes $O(N/B + B)$, and each of our queries also takes $O(N/B + B)$: queries are answered by just following the invariants stated above, handling the partially overlapping blocks separately from the blocks fully contained in the range.

Thus, the final complexity is $O(Q(N/B + B))$, and since $Q = O(N)$ we can take $B = \sqrt{N}$, which is optimal by AM-GM inequality.
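For completeness, the AM-GM step written out:

$$\frac{N}{B} + B \;\ge\; 2\sqrt{\frac{N}{B} \cdot B} \;=\; 2\sqrt{N},$$

with equality exactly when $N/B = B$, i.e. $B = \sqrt{N}$.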

Here's an implementation of this idea, although the comments are in Spanish haha.

You can see that the code is very simple and easy to write, because many parts have the same structure and you can just copy-paste them.
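To make the delta and sum invariants concrete, here is a minimal C++ sketch covering only range-add updates and sum queries; the lazyset / lsl / lsr machinery for set updates is omitted, and all names here are mine rather than taken from the linked implementation.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Minimal sketch of the blocking invariants, for range-add and range-sum only.
// Real value of element i is A[i] + delta[i / B]; sum of block b is sum[b].
struct SqrtDecomp {
    int n, B;
    vector<long long> A;      // one entry per element of the original array
    vector<long long> delta;  // one pending addition per block
    vector<long long> sum;    // invariant: sum[b] is the real sum of block b
    SqrtDecomp(const vector<long long>& a)
        : n(a.size()), B(max<int>(1, (int)sqrt((double)a.size()))),
          A(a), delta(n / B + 1, 0), sum(n / B + 1, 0) {
        for (int i = 0; i < n; i++) sum[i / B] += A[i];
    }
    void update(int l, int r, long long v) {  // add v to every element of [l, r]
        int bl = l / B, br = r / B;
        if (bl == br) {  // update contained in a single block: rebuild in O(B)
            for (int i = l; i <= r; i++) A[i] += v, sum[bl] += v;
            return;
        }
        for (int i = l; i < (bl + 1) * B; i++) A[i] += v, sum[bl] += v;
        for (int b = bl + 1; b < br; b++)  // fully covered blocks: O(N/B) total
            delta[b] += v, sum[b] += v * B;
        for (int i = br * B; i <= r; i++) A[i] += v, sum[br] += v;
    }
    long long query(int l, int r) {  // sum of [l, r], read off the invariants
        int bl = l / B, br = r / B;
        long long res = 0;
        if (bl == br) {
            for (int i = l; i <= r; i++) res += A[i] + delta[bl];
            return res;
        }
        for (int i = l; i < (bl + 1) * B; i++) res += A[i] + delta[bl];
        for (int b = bl + 1; b < br; b++) res += sum[b];
        for (int i = br * B; i <= r; i++) res += A[i] + delta[br];
        return res;
    }
};

int main() {
    SqrtDecomp sd({1, 2, 3, 4, 5, 6, 7, 8});
    sd.update(1, 6, 10);             // add 10 to indices 1..6
    cout << sd.query(0, 7) << "\n";  // prints 96
}
```

Extending this skeleton with the lazyset, lsl, and lsr arrays gives the full structure described above.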

Scoreboard Semanal

Translation of the problem:

There's a set of students. Each week, they solve some number of problems (which could be negative). After each week, student $i$ gets 1 stress point for each student with more solved problems than them. Given $N \le 10^5$ students and $Q \le 10^5$ weeks, where week $i$ has $U_i$ updates to the ranking (each consisting of a student and the non-zero number of problems they solved), determine the final number of stress points each student has after the $Q$ weeks. It is guaranteed that the sum of $U_i$ is at most $5\times 10^5$ and that the number of solved problems for each student never leaves the range $[-5\times 10^5, 5\times 10^5]$.

Solution: (I have yet to write a more detailed and motivated explanation)

  • Consider $10^6 + 1$ bags, kind of like a frequency array of how many problems each person has solved. Now, each update signifies that a person changes bags, right?
  • And after each week, we have to add, to (the delta of) each bag, the number of people in the bags to its right.
  • Assume we had an array $X$ where $X[i]$ = the number of stress points we added to bag $i$ last week. How does this array change after each update?
    • We only have to increase or decrease a specific range by one!
  • Thus, we can maintain a segment tree, where in each node we store the number of people in the range, and the number of stress points we have to add to every bag in the range.
  • After each week, we send a lazy operation: for each left node, we add the number of people in the range represented by its right sibling.
  • We can propagate this lazy by storing the last time we propagated it, and we also have to push it down whenever we move a person between bags. (A brute-force reference of this bags view is sketched below.)
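To make the bags view concrete before introducing the segment tree, here is a small brute-force C++ reference on a toy instance (the instance and all names are hypothetical). It recomputes the suffix counts from scratch each week, which is exactly the per-week linear cost the lazy segment tree above avoids.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Brute-force reference of the bags view: bag index = solved count + OFFSET,
// and after each week every student gains one stress point per student in a
// bag strictly to their right.
const int OFFSET = 500000, BAGS = 2 * OFFSET + 1;

int main() {
    int n = 3;
    vector<int> solved(n, 0);
    vector<long long> stress(n, 0);
    vector<vector<pair<int,int>>> weeks = {
        {{0, 5}, {1, 3}},   // week 1: student 0 solves 5, student 1 solves 3
        {{2, 7}, {0, -4}},  // week 2: student 2 solves 7, student 0 "solves" -4
    };
    for (auto& week : weeks) {
        for (auto [who, d] : week) solved[who] += d;
        vector<int> freq(BAGS, 0);  // frequency array = sizes of the bags
        for (int s : solved) freq[s + OFFSET]++;
        vector<long long> right(BAGS + 1, 0);  // people in bags >= b
        for (int b = BAGS - 1; b >= 0; b--) right[b] = right[b + 1] + freq[b];
        for (int i = 0; i < n; i++) stress[i] += right[solved[i] + OFFSET + 1];
    }
    for (int i = 0; i < n; i++)
        cout << "student " << i << ": " << stress[i] << " stress points\n";
}
```

The intended solution replaces the per-week suffix pass with the lazy segment tree described in the bullets above.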

T-shirts

We iterate over the t-shirts in decreasing order of quality, breaking ties by cost. Each t-shirt with cost $c$ will be bought by every person whose budget is at least $c$ when that t-shirt is processed. We want to decrease all budgets larger than or equal to $c$ by $c$, and add 1 to those people's answers.

  • How can we design a data structure that allows us to process said queries?
  • Observe that all the budgets only ever decrease. So if we store them in $B$ buckets covering the range $[0, 10^9)$, each budget will change buckets at most $B$ times.
  • We can also store a delta for each bucket: whenever we add a budget to a bucket we insert (its value - the bucket's delta), and when we take it out we add the delta back to its value. We can do the same with a delta for the answers.
  • What does a query look like? How can we determine which values we have to move between buckets?
  • Here's the point where we start to define what structure we will use to maintain buckets. Until now it's been just an abstract bag where we add and remove elements. Let's analyze the properties we need to determine the structure:
    • When we make a query with value $c$, $c$ will lie within a block, determined by a range $[l, r)$.
    • We have to update the values in the range $[c, r)$, and the rest of the blocks can be handled by deltas.
    • We also have to check, for each block, whether there are budgets that change blocks. This is straightforward: for block $i$, the budgets with real value in $[\max(c, l_i), \min(l_i + c, r_i))$ will change blocks.
    • Beware of the degrees of freedom we have right now: we haven't determined how we choose the values of $l_i$ and $r_i$, nor have we determined what these buckets will be.
    • We want to store the buckets in some sort of ordered structure that lets us easily iterate over the range of values we have to move (with an offset determined by the delta).
    • But we also want to iterate over the range $[c, r)$ when $l \le c < r$. What if we choose $l_i$ and $r_i$ in such a way that every element in the range $[c, r)$ will change buckets? What condition would we need?
    • We have $l \le c < r$ and we need $r - c \le l$; in the worst case $c = l$, this gives $r - l \le l$ and thus $r \le 2l$.
    • Dang! What if we choose the equality case $l_i = 2^i$ and $r_i = 2^{i + 1}$? In other words, store the budgets in logarithmically many sets, where the $i$-th set contains all the budgets whose most significant bit is currently $i$.
    • We can now use a standard set to store the buckets, and use the lower_bound operation to find the needed ranges! What is the time complexity of this algorithm?

I have an implementation of this idea but it's a bit messy.
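In the meantime, here is a rough C++ sketch of just the bucket bookkeeping, under assumptions that are mine, not the problem's: every cost is at least 1, budgets are below $2^{30}$, and a person whose budget reaches 0 is dropped. The function and variable names (put, buy, etc.) are hypothetical.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Sketch of the logarithmic buckets. Bucket i holds people whose remaining
// budget v satisfies 2^i <= v < 2^(i+1), keyed by (v - delta[i]).
const int LOG = 31;

multiset<pair<long long, int>> bucket[LOG];  // (budget - delta[i], person id)
long long delta[LOG];   // pending decrease shared by the whole bucket
long long bought[LOG];  // pending purchase count shared by the whole bucket
vector<long long> budget, cnt;  // materialized per-person state

int msb(long long v) { return 63 - __builtin_clzll(v); }

void put(int id) {  // insert a person, storing state relative to the bucket
    if (budget[id] <= 0) return;  // can't afford anything else: stop tracking
    int i = msb(budget[id]);
    cnt[id] -= bought[i];
    bucket[i].insert({budget[id] - delta[i], id});
}

void buy(long long c) {  // everyone with budget >= c buys one shirt of cost c
    int m = msb(c);
    vector<int> moved;
    // Bucket m: stored keys >= c - delta[m] have real budget >= c. They pay c
    // and, by the r <= 2l argument above, all of them leave the bucket.
    auto& s = bucket[m];
    for (auto it = s.lower_bound({c - delta[m], INT_MIN});
         it != s.end(); it = s.erase(it)) {
        budget[it->second] = it->first + delta[m] - c;  // materialize and pay
        cnt[it->second] += bought[m] + 1;
        moved.push_back(it->second);
    }
    // Buckets above m: every member pays, handled in O(1) via the deltas.
    for (int i = m + 1; i < LOG; i++) {
        delta[i] -= c;
        bought[i] += 1;
        // Re-home the prefix whose real budget dropped below 2^i.
        for (auto it = bucket[i].begin();
             it != bucket[i].end() && it->first + delta[i] < (1LL << i);
             it = bucket[i].erase(it)) {
            budget[it->second] = it->first + delta[i];
            cnt[it->second] += bought[i];
            moved.push_back(it->second);
        }
    }
    for (int id : moved) put(id);  // every move lands in a strictly lower bucket
}

int main() {
    budget = {13, 7};  // toy data: two people
    cnt = {0, 0};
    put(0); put(1);
    for (long long c : {5, 4, 6}) buy(c);  // shirt costs in processing order
    for (int i = 0; i < LOG; i++)          // final sweep: materialize the rest
        for (auto [key, id] : bucket[i]) {
            budget[id] = key + delta[i];
            cnt[id] += bought[i];
        }
    for (int id : {0, 1})
        cout << "person " << id << ": bought " << cnt[id]
             << ", budget left " << budget[id] << "\n";
}
```

Every re-homed person lands in a strictly lower bucket and budgets only decrease, which is where the amortized bound on the number of moves comes from.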

Welcome home, Chtholly

The idea is similar to the editorial's. I think it's pretty educational, and we can motivate each step of the solution after seeing the problems above.

  • Another example idea could be the $O(N)/O(1)$ algorithm for RMQ.

Problems to try


bqi343 commented Jul 25, 2024

Looks interesting, though I don't really know where it would go, considering it's a mix of some topics already in the guide and some new things. Maybe after the others in the Plat range queries section?

Questions:

> Thus, the final complexity is $O(Q(N/B + B))$, and since $Q = O(N)$ we can take $B = \sqrt{N}$, which is optimal by AM-GM inequality.

  • What does the choice of $B$ have to do with $Q$?
  • What does T-shirts have to do with logarithmic decomposition? It looks superficially similar in the sense that you have $O(\log N)$ copies of some data structure and elements are moving between data structures, but they're solving rather different tasks.
    • Btw, I think a more straightforward way to solve it is to apply https://codeforces.com/blog/entry/108601 - maintain one balanced binary search tree, instead of one per bucket. The BBST should support lazily adding to everything, and merging disjoint ranges of values.
