-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathnotes.xml
2324 lines (2217 loc) · 153 KB
/
notes.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<sheaf title="Data Mechanics"
subtitle="for Pervasive Systems and Urban Applications"
author="Andrei Lapets"
authorlink="http://lapets.io"
>
<section title="Introduction, Background, and Motivation">
<subsection title="Overview">
<text><![CDATA[
With over half of the world's population living in cities [<a href="#625662870761">#</a>], and given the possibility that cities are a more efficient [<a href="#aiid:1855401">#</a>] way to organize and distribute resources and activities, the rigorous study of cities as systems that can be modeled mathematically is a compelling proposition. With the advent of pervasive infrastructures of sensors and computational resources in urban environments (<a href="https://en.wikipedia.org/wiki/Smart_city"><i>smart cities</i></a>, <a href="https://www.sdxcentral.com/articles/news/englands-bristol-is-building-the-first-software-defined-city/2015/03/"><i>software-defined cities</i></a>, and the <a href="https://en.wikipedia.org/wiki/Internet_of_Things"><i>Internet of Things</i></a>), there is a potential to inform and validate such models using actual data, and to exploit them using the interoperability of the pervasive computational infrastructure that is available.
]]></text>
<paragraph><![CDATA[
In this course, we introduce and define the novel term <i>data mechanics</i> to refer specifically to the study of a particular aspect of large, instrumented systems such as cities: how data can flow through institutions and computational infrastructures, how these flows can interact, integrate, and cleave, and how they can inform decisions and operations (in both online and offline regimes). We choose the term <i>mechanics</i> specifically because it connotes the study of mechanics (e.g., classical mechanics) in physics, where universal laws that govern the behavior and interactions between many forms of matter are studied. We also choose this term to emphasize that, often, the data involved will be tied to the physical environment: geospatial and temporal information is a common and integral part of data sets produced and consumed by instrumented environments.
]]></paragraph>
</subsection>
<subsection title="Data Mechanics Repository and Platform">
<text hooks="math"><![CDATA[
This course is somewhat unusual in that there are specific, concrete software development goals towards which the staff and students are working. In particular, one goal of the course is to construct a new general-purpose service platform for collecting, integrating, analyzing, and employing (both in real time and offline) real data sets derived from urban sensors and services (particularly those in the city of Boston). We will consider several real data sets and data feeds, including:
<ul>
<li><a href="https://data.boston.gov/">Analyze Boston</a>;</li>
<li><a href="http://www.cambridgema.gov/departments/opendata">City of Cambridge Open Data Portal</a>;</li>
<li><a href="http://data.brooklinema.gov/">Brookline OpenData</a>;</li>
<li><a href="http://bostonopendata.boston.opendata.arcgis.com/">BostonMaps: Open Data</a>;</li>
<li><a href="https://dataverse.harvard.edu/dataverse/BARI">Boston Area Research Initiative Dataverse</a>;</li>
<li><a href="https://www.massdot.state.ma.us/DevelopersData.aspx">MassDOT Developers Resources</a>;</li>
<li><a href="http://www.mass.gov/opendata/#/">MassData</a>.</li>
</ul>
We will also secure additional data sets from the above sources as well as a few others. In particular, some nationwide data repositories may be worth considering:
<ul>
<li><a href="https://chronicdata.cdc.gov/500-Cities/500-Cities-Local-Data-for-Better-Health/6vp6-wxuq">500 Cities: Local Data for Better Health</a>.</li>
</ul>
For the purposes of student projects, there are no limits on other sources of data or computation (e.g., Twitter, Mechanical Turk, and so on) that can be employed in conjunction with some of the above.
]]></text>
<paragraph hooks="math"><![CDATA[
Because this course involves the construction of a relatively large software application infrastructure, we will have the opportunity to introduce and practice a variety of standard software development and software engineering concepts and techniques. This includes source control, collaboration, documentation, modularity and encapsulation, testing and validation, inversion of control, and others. Students will also need to become familiar with how to use web service APIs used by government organizations (e.g., <a href="https://www.socrata.com/">Socrata</a>) to make queries and retrieve data.
]]></paragraph>
<paragraph hooks="math"><![CDATA[
The overall architecture of the service platform will have at least the following:
<ul>
<li>a database/storage backend (<b>"repository"</b>) that houses:
<ul>
<li>original and derived data sets, with annotations that include:
<ul>
<li>from where, when, and by what algorithm it was retrieved</li>
<li>using what integration algorithms it was derived</li>
</ul>
</li>
<li>algorithms for data retrieval or integration, with references that include:
<ul>
<li>when it was written and by whom</li>
<li>in what data sets it is derived</li>
<li>from what component algorithms it is composed (if it is such)</li>
</ul>
</li>
</ul>
</li>
<li>a web service (<b>"platform"</b>) with an <i>application program interface</i> (API) for running analysis and optimization algorithms:
<ul>
<li>a defined language for defining analysis and optimization algorithms over the data stored in the repository</li>
<li>an interface for submitting and running algorithms</li>
</ul>
</li>
<li>other features for data and result visualization, simulation using partial data, etc.</li>
</ul>
]]></paragraph>
</subsection>
<subsection title="Mathematical Modeling, Analysis Algorithms, and Optimization Techniques">
<text hooks="math"><![CDATA[
There are a variety of problems that it may be possible to address using the data in the repository and the capabilities of the platform. This course will cover a variety of useful online and offline optimization topics in a mathematically rigorous but potentially application-specific way, including:
<ul>
<li>a defined language for defining analysis and optimization algorithms over the data stored in the repository,</li>
<li>dual decomposition,</li>
<li>online optimization.</li>
</ul>
The goal is to apply some of these techniques to the data sets(including integrated or derived data sets) and solve practical problems. Some of the problems raised in discussions with the City of Boston DoIT and MassDOT teams are:
<ul>
<li>characterizing intersections and coming up with a metric that incorporates:
<ul>
<li>intersection throughput (people per unit time),</li>
<li>modes of transportation (public transport, biking, walking),</li>
<li>intersection safety (vehicle speed, accidents, and so on),</li>
<li>intersection organization (no left turns, and so on);
</ul>
</li>
<li>characterizing streets and deriving metrics for:
<ul>
<li>number of parking spaces,</li>
<li>probability of an accident (e.g., using different modes of transportation),</li>
<li>senior and handicapped mobility;</li>
</ul>
</li>
<li>characterizing neighborhoods:
<ul>
<li>economic condition and gentrification,</li>
<li>senior and handicapped accessibility;</li>
</ul>
</li>
<li>how to allocate resources to optimize some of the metrics above:
<ul>
<li>where to perform repairs,</li>
<li>how to improve housing affordability,</li>
<li>where to place bike racks of ride sharing stations,</li>
<li>senior and handicapped accessibility;</li>
</ul>
</li>
<li>answering immediate questions relevant to an individual (e.g., in an app):
<ul>
<li>is a building accessible,</li>
<li>is there a place to park (or will there likely be at some other time),</li>
<li>is a neighborhood safe (or is it becoming less or more safe),</li>
<li>is a neighborhood affordable (or is it becoming less or more affordable),</li>
<li>are healthcare services nearby.</li>
</ul>
</li>
<li>integrating public transportation data with other data (e.g., Waze):
<ul>
<li>why are buses on a certain route late,</li>
<li>performance and problems during unexpected events (e.g., snow).</li>
</ul>
</li>
</ul>
The above list is far from complete, and we will update it as the course progresses. Students are encouraged to discuss project ideas with faculty and one another.
]]></text>
</subsection>
</section>
<section title="Modeling Data and Data Transformations">
<text><![CDATA[
To study rigorously the ways data can behave and interact within an infrastructure that generates, transforms, and consumes data (e.g., to make decisions), it is necessary to define formally what data and data transformations are. One traditional, widely used, and widely accepted model for data is the <i>relational model</i>: any data set is a relation (i.e., a subset of a product of sets), and transformations on data are functions between relations. While the relational model is sufficient to define any transformation on data sets, the MapReduce paradigm is one modern framework for defining transformations between data sets.
]]></text>
<paragraph><![CDATA[
In modern contexts and paradigms, these models can be useful when studying relatively small collections of individual, curated data sets that do not change dramatically in the short term. However, these alone are not sufficient in a context in which data sets are overwhelmingly multitudinous, varying in structure, and continuously being generated, integrated, and transformed. One complementary discipline that can provide useful tools for dealing with numerous interdependent data sets is that of <i>data provenance</i> or <i>data lineage</i>. The <i>provenance</i> of a data set (or subset) is a formal record of its origin, which can include how the data was generated, from what data or other sources it was created or derived, what process was used or was responsible for creating or deriving it, and other such information. This information can be invaluable for a variety of reasons beyond the obvious ones (i.e., the origin of the data), such as:
<ul>
<li>the same data be generated again or reproduced if an error occurs,</li>
<li>a data set can be updated from the original source if the source has been updated or changed,</li>
<li>the source of an inconsistency or aberration of data can be investigated,</li>
<li>any of the above could be applied to a subset because recomputing or investigating the entire data set would be prohibitively costly.</li>
</ul>
]]></paragraph>
<subsection title="Relational Data and the MapReduce Paradigm">
<text hooks="math"><![CDATA[
The relational model for data can be expressed in a variety of ways: a data set is a relation on sets, a logical predicate governing terms, a collection of tuples or records with fields and values, a table of rows with labelled columns, and so on. Mathematically, they are all equivalent. In this course, we will adopt a particular model because it is well-suited for the tools and paradigms we will employ, and because it allows for one fairly clean mathematical integration of the study of relational data and data provenance.
]]></text>
<definition required="true" hooks="math" id="e601deb568ed46a1a1d741907a6dcfa9">
<text><![CDATA[
A <i>data set</i> (also known as a <i>store</i> or <i>database</i>) is a multiset %R: a collection (possibly with duplicates) of tuples of the form (%x_1,...,%x_{%n}) taken from the set product %X_1 \times ... \times %X_{%n}. Typically, some distinguished set (e.g., the left-most in the set product) will be a set of <i>keys</i>, so that every tuple contains a key. Whether a set is a key or not often depends on the particular paradigm and context, however.
]]></text>
</definition>
<definition required="true" hooks="math" id="a123deb568ed46a1a1d436907a6dcfa9">
<text><![CDATA[
A <i>data transformation</i> %T: %A \rightarrow %B is a mapping from one space of data sets %A to another space of data sets %B. Notice that an individual data set %S (i.e., a relation or, equivalently, a set of tuples) is just an <i>element</i> of %A.
]]></text>
</definition>
<text hooks="math"><![CDATA[
Some of the typical building blocks for data transformations in the relational model are:
<ul>
<li>union and difference (intersection can also be defined in terms of these),</li>
<li>projection (sometimes generalized into extended projection),</li>
<li>selection (filtering),</li>
<li>renaming,</li>
<li>Cartesian product,</li>
<li>variants of join operations (many can be constructed using the above, but other variants have been added as extensions),</li>
<li>aggregation (an extension).</li>
</ul>
One common operation on relations that is not possible to express in traditional formulations but found in some relational database systems is the transitive closure of a data set. Normally, this requires an iterative process consisting of a sequence of join operations.
]]></text>
<example required="true" id="9da373c4cc654556bf2fa3fed6d56995">
<text hooks="math"><![CDATA[
We can model and illustrate the transformations that constitute the MapReduce paradigm using Python. Note that selection and projection can be implemented directly using Python comprehensions, but we define wrappers below for purposes of illustration.
]]></text>
<code class="py"><![CDATA[
def union(R, S):
return R + S
def difference(R, S):
return [t for t in R if t not in S]
def intersect(R, S):
return [t for t in R if t in S]
def project(R, p):
return [p(t) for t in R]
def select(R, s):
return [t for t in R if s(t)]
def product(R, S):
return [(t,u) for t in R for u in S]
def aggregate(R, f):
keys = {r[0] for r in R}
return [(key, f([v for (k,v) in R if k == key])) for key in keys]
]]></code>
</example>
<example required="true" id="ebd9fe9c61014bc9a2d743e069dc9d44">
<text hooks="math"><![CDATA[
We consider a few simple examples that illustrate how transformations can be constructed within the relational model. We start by showing how <code>select</code> can be used with a predicate to filter a data set.
]]></text>
<code class="py"><![CDATA[
>>> def red(t): return t == 'tomato'
>>> select(['banana', 'tomato'], red)
['tomato']
]]></code>
<text hooks="math"><![CDATA[
Suppose we have two data sets and want to join them on a common field. The below sequence illustrates how that can be accomplished by building up the necessary expression out of simple parts.
]]></text>
<code class="py"><![CDATA[
>>> X = [('Alice', 22), ('Bob', 19)]
>>> Y = [('Alice', 'F'), ('Bob', 'M')]
>>> product(X,Y)
[(('Alice', 'F'), ('Alice', 22)), (('Alice', 'F'), ('Bob', 19)), (('Bob', 'M'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]
>>> select(product(X,Y), lambda t: t[0][0] == t[1][0])
[(('Alice', 'F'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]
>>> project(select(product(X,Y), lambda t: t[0][0] == t[1][0]), lambda t: (t[0][0], t[0][1], t[1][1]))
[('Alice', 'F', 22), ('Bob', 'M', 19)]
]]></code>
<text hooks="math"><![CDATA[
Finally, the sequence below illustrates how we can compute an aggregate value for each unique key in a data set (such as computing the total age by gender).
]]></text>
<code class="py"><![CDATA[
>>> X = [('Alice', 'F', 22), ('Bob', 'M', 19), ('Carl', 'M', 25), ('Eve', 'F', 27)]
>>> project(X, lambda t: (t[1], t[2]))
[('F', 22), ('M', 19), ('M', 25), ('F', 27)]
>>> aggregate(project(X, lambda t: (t[1], t[2])), sum)
[('F', 49), ('M', 44)]
]]></code>
<text hooks="math"><![CDATA[
The following sequence explains what is happening inside the <code>aggregate</code> function for a particular key.
]]></text>
<code class="py"><![CDATA[
>>> Y = project(X, lambda t: (t[1], t[2]))
>>> keys = {t[0] for t in Y}
>>> keys
{'F', 'M'}
>>> [v for (k,v) in Y if k == 'F']
[22, 27]
>>> sum([v for (k,v) in Y if k == 'F'])
49
>>> ('F', sum([v for (k,v) in Y if k == 'F']))
('F', 49)
]]></code>
</example>
<example required="true" id="ffbbae67647e4daf838a79fb814e733a">
<text hooks="math"><![CDATA[
Consider the following line of Python code that operates on a data set <code>D</code> containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:
]]></text>
<code class="py"><![CDATA[
R = sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])
]]></code>
<text hooks="math"><![CDATA[
We can identify which building blocks for transformations available in the relational model are being used in the above code; we can also draw a corresponding flow diagram that shows how they fit together.
]]></text>
<text><![CDATA[
<div class="pql" style="border:0px solid #000000; height:80px; display:inline-block; width:100%;">
!table([
['r:select``D', 'r:project``...', 'r:agg sum``...', 'R']
])
</div>
]]></text>
<text hooks="math"><![CDATA[
We can also rewrite the above using the <a href="#9da373c4cc654556bf2fa3fed6d56995">building blocks for transformations</a> in the relational model.
]]></text>
<code class="py"><![CDATA[
R = aggregate(project(select(D, lambda psc: if psc[1] == "Massachusetts" and psc[2] == "Trump"), lambda psc: 1), sum)
]]></code>
<text hooks="math"><![CDATA[
Notice that the order of the operations is important. If we tried to do the projection before the selection, the information that we are using to perform the selection would be gone after the projection. On the other hand, two selection (one immediately followed by another) can be done in any order. This is one reason why it makes sense to call the paradigm a relational <i>algebra</i>: there are algebraic laws that govern how the operations can be arranged.
]]></text>
</example>
<paragraph hooks="math"><![CDATA[
In the MapReduce paradigm, a smaller set of building blocks inspired by the functional programming paradigm (supported by languages such as ML and Haskell) exist for defining transformations between data sets. Beyond adapting (with some modification) the map and reduce (a.k.a., "fold") functions from functional programming, the contribution of the MapReduce paradigm is the improvement in the performance of these operations on very large distributed data sets. Because of the elegance of the small set of building blocks, and because of the scalability advantages under appropriate circumstances, it is worth studying the paradigm's two building blocks for data transformations: <i>map</i> and <i>reduce</i>:
<ul>
<li>a map operation will apply some user-specified computation to every tuple in the data set, producing one or more new tuples,</li>
<li>a reduce operation will apply some user-specified aggregation computation to every set of tuples having the same key, producing a single result.</li>
</ul>
Notice that there is little restriction on the user-specified code other than the requirement that it be stateless in the sense that communication and coordination between parallel executions of the code is impossible. It is possible to express all the building blocks of the relational model using the building blocks of the MapReduce paradigm.
]]></paragraph>
<example required="true" id="ebd9fe9c61014bc9a2d743e069dc9d5b">
<text hooks="math"><![CDATA[
We can model and illustrate the two basic transformations that constitute the MapReduce paradigm in a concise way using Python.
]]></text>
<code class="py"><![CDATA[
def map(f, R):
return [t for (k,v) in R for t in f(k,v)]
def reduce(f, R):
keys = {k for (k,v) in R}
return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]
]]></code>
<text hooks="math"><![CDATA[
A <code>map</code> operation applies some function <code>f</code> to every key-value tuple and produces zero or more new key-value tuples. A <code>reduce</code> operation collects all values under the same key and performs an aggregate operation <code>f</code> on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.
]]></text>
</example>
<paragraph hooks="math"><![CDATA[
At the language design level, the relational model and the MapReduce paradigm are arguably complementary and simply represent different trade-offs; they can be used in conjunction on data sets represented as relations. Likewise, implementations of the two represent different performance trade-offs. Still, contexts in which they are used can also differ due to historical reasons or due to conventions and community standards.
]]></paragraph>
<example required="true" id="ebd9fe9c61014bc9a2d743e069dc9d5a">
<text hooks="math"><![CDATA[
We can use the Python implementations of the map and reduce operations in an <a href="#ebd9fe9c61014bc9a2d743e069dc9d5b">example above</a> to implement some common transformations in the relational model. For this example, we assume that the first field in each tuple in the data set is a unique key (in general, we assume there is a unique key field if we are working with the MapReduce paradigm). We illustrate how projection and aggregation can be implemented in the code below.
]]></text>
<code class="py"><![CDATA[
R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]
# Projection keeps only gender and age.
# The original key (the name) is discarded.
X = map(lambda k,v: [(v[0], v[1])], R)
# Aggregation by the new key (i.e., gender).
Y = reduce(lambda k,vs: (k, sum(vs)), X)
]]></code>
<text hooks="math"><![CDATA[
Selection is not straightforward to implement because this is not a common use case for MapReduce workflows. However, it is possible by treating each tuple in the input data set as its own key and then keeping on the keys at the end.
]]></text>
<code class="py"><![CDATA[
R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).
]]></code>
<text hooks="math"><![CDATA[
We can also perform a simple join operation, although we also need to "combine" the two collections of data. This particular join operation is also simple because each tuple in the input is only joined with one other tuple. Join operations were also not originally envisioned as a common use case for MapReduce workflows.
]]></text>
<code class="py"><![CDATA[
R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]
X = map(lambda k,v: [(k, ('Age', v))], R)\
+ map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
lambda k,vs:\
(k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
X\
)
]]></code>
</example>
<example required="true" id="fa12ae67647e4daf838a79fb814e733b">
<text hooks="math"><![CDATA[
Suppose you have a data set containing tuples of the form (<i>name</i>, <i>gender</i>, <i>age</i>). You want to produce a result of the form (<i>gender</i>, <i>total</i>) where <i>total</i> is the sum of the age values of all tuples with the corresponding <i>gender</i>. The code below illustrates using Python how this can be done in the MapReduce paradigm.
]]></text>
<code class="py"><![CDATA[
INPUT = [('Alice', ('F', 19)),\
('Bob', ('M', 23)),\
('Carl', ('M', 20)),\
('Eve', ('F', 27))]
TEMP = map(lambda k,v: [(v[0], v[1])], INPUT)
OUTPUT = reduce(lambda k,vs: (k, sum(vs)), TEMP)
]]></code>
<text hooks="math"><![CDATA[
We provide an equivalent MapReduce paradigm implementation of the algorithm using MongoDB.
]]></text>
<code class="js"><![CDATA[
db.INPUT.insert({_id:"Alice", gender:"F", age:19});
db.INPUT.insert({_id:"Bob", gender:"M", age:23});
db.INPUT.insert({_id:"Carl", gender:"M", age:20});
db.INPUT.insert({_id:"Eve", gender:"F", age:27});
db.INPUT.mapReduce(
function() {
emit(this.gender, {age:this.age});
},
function(k, vs) {
var total = 0;
for (var i = 0; i < vs.length; i++)
total += vs[i].age;
return {age:total};
},
{out: "OUTPUT"}
);
]]></code>
</example>
<example required="true" id="ffbbae67647e4daf838a79fb814e733b">
<text hooks="math"><![CDATA[
Suppose we have a data set containing tuples of the form (<i>name</i>, <i>city</i>, <i>age</i>). We want to produce a result of the form (<i>city</i>, <i>range</i>) where <i>range</i> is defined as the difference between the oldest and youngest person in each city.
<ul>
<li>We can construct a sequence of transformations that will yield this result using the basic building blocks available in the relational model (projections, selections, aggregations, and so on).</li>
<li>Alternatively, we can use the MapReduce paradigm to construct a single-pass (one map operation and one reduce operation) MapReduce computation that computes the desired result. <b>Hint:</b> emit two copies of each entry (one for the computation of the maximum and one for the computation of the minimum).</li>
</ul>
]]></text>
<paragraph hooks="math"><![CDATA[
One approach in the relational paradigm is to aggregate the minimum and maximum age for each city, negate the minimum ages, and aggregate once more to get the ranges.
]]></paragraph>
<code class="py"><![CDATA[
NCA = [('Alice', 'Boston', 23), ('Bob', 'Boston', 19), ('Carl', 'Seattle', 25)]
CA = [(c,a) for (n,c,a) in NCA]
MIN = aggregate(CA, min)
MIN_NEG = [(c,-1*a) for (c,a) in MIN]
MAX = aggregate(CA, max)
RESULT = aggregate(union(MIN_NEG, MAX), sum)
]]></code>
<text hooks="math"><![CDATA[
Below is a flow diagram that represents the above transformations.
]]></text>
<text><![CDATA[
<div class="pql" style="border:0px solid #000000; height:280px; display:inline-block; width:100%;">
!table([
[null , null, 'r:proj``MIN', 'rd:union``MIN_NEG', null , null],
['r:proj``NCA', 'ru:aggmin`rd:agg max``CA', null , null , 'r:agg sum``...', 'RESULT'],
[null , null, 'rru:union``MAX', null , null , null]
])
</div>
]]></text>
<text hooks="math"><![CDATA[
Using the MapReduce paradigm, we may prefer to follow the paradigm's two-stage breakdown of a computation by first converting every single entry in the input data set into an <i>approximation</i> of the range. For example, given only the record <code>('Alice', ('Boston', 23))</code>, in the mapping stage we might estimate the range as <code>('Boston', (23, 23, 0))</code> where the second and third entries are the minimum and maximum "known so far" given the limited information (a single data point). Then, in the reduce stage, we would combine these estimates.
]]></text>
<code class="py"><![CDATA[
NCA = [('Alice', ('Boston', 23)), ('Bob', ('Boston', 19)), ('Carl', ('Seattle', 25))]
I = map(lambda k, v: [(v[0], (v[1], v[1], 0))], NCA)
def reducer(k, vs):
age_lo = min([l for (l,h,r) in vs])
age_hi = max([h for (l,h,r) in vs])
age_ran = age_hi - age_lo
return (k, (age_lo, age_hi, age_ran))
RESULT = reduce(reducer, I)
]]></code>
</example>
<example required="true" id="cbb3ae67647e4daf838a79fb914e114a">
<text hooks="math"><![CDATA[
Consider the following algorithm implemented as a collection of transformations drawn from the relational paradigm. The input data set consists of tuples of the form (<i>intersection</i>, <i>date</i>, <i>time</i>, <i>cars</i>). Each tuple represents a single accident took place at the specified intersection, occurred on the specified date and time, and involved the specified number of cars.
]]></text>
<code class="py"><![CDATA[
D = [('Commonwealth Ave./Mass Ave.', '2016-11-02', '11:34:00', 3), ...]
M = project(D, lambda idtc: [(idtc[0], idtc[3])])
R = aggregate(M, max)
H = [(i,d,t) for ((i,m), (j,d,t,c)) in product(R, D) if i==j and m==c]
]]></code>
<text hooks="math"><![CDATA[
A data flow diagram representing the transformations is provided below.
]]></text>
<text><![CDATA[
<div class="pql" style="border:0px solid #000000; height:200px; display:inline-block; width:100%;">
!table([
['d:prod`r:proj``D', 'r:agg max``M', 'lld:prod``R'],
['r:select``...', 'r:proj``...', 'H']
])
</div>
]]></text>
<text hooks="math"><![CDATA[
Complete the following tasks. To simplify the exercise, you may assume that there is at most one accident involving each possible quantity of cars (e.g., there is only one accident that involved five cars).
<ul>
<li>Write a description of what the algorithm computes. What does the output data set represent?</li>
<li>Provide an implementation of this algorithm using the MapReduce paradigm. You only need one map and one reduce operation.</li>
</ul>
]]></text>
<solution>
<text hooks="math"><![CDATA[
The output represents the dates and times of the largest accidents for each intersection. Note that the transformation first computes the number of cars involved in the largest accident at each intersection (this is the data set <code>R</code>), and then <i>joins</i> that result with the original data to extract the exact dates and times of the accidents that had that number of cars. As an example, we provide an implementation of this algorithms using the MapReduce feature provided by MongoDB. We assume that each tuple is represented using a JSON document of the form <code>{'intersection':..., 'date':..., 'time':..., 'cars':...}</code>.
]]></text>
<code class="js"><![CDATA[
db.D.mapReduce(
function() {
emit(this.intersection, {'date':this.date, 'time':this.time, 'cars':this.cars});
},
function(intersection, dtcs) {
var index_of_largest = 0;
dtcs.forEach(function(dtc, i) {
if (dtc.cars > dtcs[index_of_largest].cars)
index_of_largest = i;
});
// The intersection will already be the key.
return {'date':dtcs[index_of_largest].date, 'time':dtcs[index_of_largest].time};
},
{ out: "R" }
);
]]></code>
</solution>
</example>
</subsection>
<subsection title="Composing Transformations into Algorithms">
<text hooks="math"><![CDATA[
Whether we are using the relation model or the MapReduce paradigm, the available building blocks can be used to assemble fairly complex transformations on data sets. Each transformation can be written either using the concrete syntax of a particular programming language that implements the paradigm, or as a data flow diagram that describes how starting and intermediate data sets are combined to derive new data sets over the course of the algorithm's operation.
]]></text>
<example required="true" id="cba5543907854ed28dbd3eeb874ebd54">
<text hooks="math"><![CDATA[
We can use building blocks drawn from the relational model (defined for Python in <a href="#9da373c4cc654556bf2fa3fed6d56995">an example above</a>) to construct an implementation of the %k-means clustering algorithm.
]]></text>
<code class="py"><![CDATA[
def dist(p, q):
(x1,y1) = p
(x2,y2) = q
return (x1-x2)**2 + (y1-y2)**2
def plus(args):
p = [0,0]
for (x,y) in args:
p[0] += x
p[1] += y
return tuple(p)
def scale(p, c):
(x,y) = p
return (x/c, y/c)
M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]
OLD = []
while OLD != M:
OLD = M
MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
PD = aggregate(PDs, min)
MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
MT = aggregate(MP, plus)
M1 = [(m, 1) for (m, _) in MP]
MC = aggregate(M1, sum)
M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
print(sorted(M))
]]></code>
<text hooks="math"><![CDATA[
Below is a flow diagram describing the overall organization of the computation (nodes are data sets and edges are transformations). Note that the nodes labeled "..." are intermediate results that are implicit in the comprehension notation. For example, <code>[(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]</code> first filters <code>product(MPD, PD)</code> using a selection criteria <code>if p==p2 and d==d2</code> and then performs a projection from tuples of the form <code>((m,p,d), (p2,d2))</code> to tuples of the form <code>(m, p)</code> to obtain the result.
]]></text>
<text><![CDATA[
<div class="pql" style="border:0px solid #000000; height:400px; display:inline-block; width:100%;">
!table([
['rd:prod``P', null, null, 'd:agg min``PDs'],
[null, 'r:proj``...', 'dr:prod`ru:proj``MPD', 'd:prod``PD'],
['ru:prod``M', null, null, 'd:selection``...'],
['u:proj``...', 'ld:prod``MC', 'l:agg sum``M1', 'd:proj``...'],
['u:selection``...', 'l:prod``MT', null, 'lu:proj`ll:agg plus``MP']
])
</div>
]]></text>
</example>
<example required="true" id="bca743938aa04d9ea43464f941bd70bc">
<text hooks="math"><![CDATA[
We can use building blocks drawn from the MapReduce paradigm that are available in MongoDB to construct an implementation of the %k-means clustering algorithm. This implementation illustrates many of the idiosyncrasies of the MapReduce abstraction made available by MongoDB.
]]></text>
<code class="js"><![CDATA[
db.system.js.save({ _id:"dist" , value:function(u, v) {
return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});
function flatten(A) {
db[A].find().forEach(function(a) { db[A].update({_id: a._id}, a.value); });
}
function prod(A, B, AB) {
db[AB].remove({});
db.createCollection(AB);
db[A].find().forEach(function(a) {
db[B].find().forEach(function(b) {
db[AB].insert({left:a, right:b});
});
});
}
function union(A, B, AB) {
db[AB].remove({});
db.createCollection(AB);
db[A].find().forEach(function(a) {
db[AB].insert(a);
});
db[B].find().forEach(function(b) {
db[AB].insert(b);
});
}
function hash_means(M, HASH) {
db[M].mapReduce(
function() { emit("hash", {hash: this.x + this.y}); },
function(k, vs) {
var hash = 0;
vs.forEach(function(v) {
hash += v.hash;
});
return {hash: hash};
},
{out: HASH}
);
}
// We'll only perform a single product operation. Using map-reduce, we can perform
// argmax and argmin more easily. We can also use map-reduce to compare progress.
db.M.remove({});
db.M.insert([{x:13,y:1},{x:2,y:12}]);
db.P.remove({});
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);
var iterations = 0;
do {
// Compute an initial hash of the means in order to have a baseline
// against which to compare when deciding whether to loop again.
hash_means("M", "HASHOLD");
prod("M", "P", "MP");
// At this point, entries in MP have the form
// {_id:..., left:{x:13,y:1}, right:{x:4,y:5}}.
// For each point, find the distance to the closest mean. The output after
// flattening has entries of the form {_id:{x:?, y:?}, m:{x:?, y:?}, d:?}
// where the identifier is the point.
db.MPs.remove({});
db.MP.mapReduce(
function() {
var point = {x:this.right.x, y:this.right.y};
var mean = {x:this.left.x, y:this.left.y};
emit(point, {m:mean, d:dist(point, mean)});
},
function(point, vs) {
// Each entry in vs is of the form {m:{x:?, y:?}, d:?}.
// We return the one that is closest to point.
var j = 0;
vs.forEach(function(v, i) {
if (v.d < vs[j].d)
j = i;
});
return vs[j]; // Has form {m:{x:?, y:?}, d:?}.
},
{out: "MPs"}
);
// At this point, entries in MPs have the form
// {_id:{x:2, y:3}, value:{m:{x:4, y:5}, d:1}}.
flatten("MPs");
// At this point, entries in MPs have the form
// {_id:{x:2, y:3}, m:{x:4, y:5}, d:1}.
// For each mean (i.e., key), compute the average of all the points that were
// "assigned" to that mean (because it was the closest mean to that point).
db.MPs.mapReduce(
function() {
// The key is the mean and the value is the point together with its counter.
var point = this._id;
var point_with_count = {x:point.x, y:point.y, c:1};
var mean = this.m;
emit(mean, point_with_count);
},
function(key, vs) {
// Remember that the reduce operations will be applied to the values for each key
// in some arbitrary order, so our aggregation operation must be commutative (in
// this case, it is vector addition).
var x = 0, y = 0, c = 0;
vs.forEach(function(v, i) {
x += v.x;
y += v.y;
c += v.c;
});
return {x:x, y:y, c:c};
},
{ finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
out: "M"
}
);
// At this point, entries in M have the form
// {_id:{x:2, y:3}, value:{x:4, y:5}}.
flatten("M");
// At this point, entries in MPs have the form
// {_id:{x:2, y:3}, x:4, y:5}. The identifier
// value does not matter as long as it is unique.
// Compute the hash of the new set of means.
hash_means("M", "HASHNEW");
// Extract the two hashes in order to compare them in the loop condition.
var hashold = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
var hashnew = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
print(hashold);
print(hashnew);
print(iterations);
iterations++;
} while (hashold != hashnew);
]]></code>
</example>
<example required="true" id="bfa741938aa04d9ea43464f951bd72bc">
<text hooks="math"><![CDATA[
If we are certain that our %k-means algorithm will have a relatively small (e.g., constant) number of means, we can take advantage of this by only tracking the means in a local variable and using <code>.updateMany()</code> to distribute the means to all the points at the beginning of each iteration. This leads to a much more concise (and, for a small number of means, efficient) implementation of the algorithm than what is presented in a <a href="#bca743938aa04d9ea43464f941bd70bc">previous example</a>. In particular, it is no longer necessary to encode a production operation within MongoDB.
]]></text>
<code class="js"><![CDATA[
db.system.js.save({ _id:"dist" , value:function(u, v) {
return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);
var means = [{x:13,y:1}, {x:2,y:12}];
do {
db.P.updateMany({}, {$set: {means: means}}); // Add a field to every object.
db.P.mapReduce(
function() {
var closest = this.means[0];
for (var i = 0; i < this.means.length; i++)
if (dist(this.means[i], this) < dist(closest, this))
closest = this.means[i];
emit(closest, {x:this.x, y:this.y, c:1});
},
function(key, vs) {
var x = 0, y = 0, c = 0;
vs.forEach(function(v, i) {
x += v.x;
y += v.y;
c += v.c;
});
return {x:x, y:y, c:c};
},
{ finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
out: "M"
}
);
means = db.M.find().toArray().map(function(r) { return {x:r.value.x, y:r.value.y}; });
printjson(means);
} while (true);
]]></code>
<text hooks="math"><![CDATA[
We do not deal with the issue of convergence in the above example; an equality function on JSON/BSON objects (i.e., the list of means) would need to be defined to implement the loop termination condition. Below, we illustrate how the above implementation can be written within Python using PyMongo.
]]></text>
<code class="py"><![CDATA[
import pymongo
import bson.code
client = pymongo.MongoClient()
db = client.local
db.system.js.save({'_id':'dist', 'value': bson.code.Code("""
function(u, v) {
return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}
""")})
db.P.insert_many([{'x':1,'y':2},{'x':4,'y':5},{'x':1,'y':3},{'x':10,'y':12},\
{'x':13,'y':14},{'x':13,'y':9},{'x':11,'y':11}])
means = [{'x':13,'y':1}, {'x':2,'y':12}]
while True:
db.P.update_many({}, {'$set': {'means': means}})
mapper = bson.code.Code("""
function() {
var closest = this.means[0];
for (var i = 0; i < this.means.length; i++)
if (dist(this.means[i], this) < dist(closest, this))
closest = this.means[i];
emit(closest, {x:this.x, y:this.y, c:1});
}
""")
reducer = bson.code.Code("""
function(key, vs) {
var x = 0, y = 0, c = 0;
vs.forEach(function(v, i) {
x += v.x;
y += v.y;
c += v.c;
});
return {x:x, y:y, c:c};
}
""")
finalizer = bson.code.Code("""
function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; }
""")
db.P.map_reduce(mapper, reducer, "M", finalize = finalizer)
means = [{'x':t['value']['x'], 'y':t['value']['y']} for t in db.M.find()]
print(means)
]]></code>
</example>
<example required="true" id="7eee633a65814aacb951b667e38092ec">
<text hooks="math"><![CDATA[
We can also use building blocks drawn from the relational model (defined for Python in <a href="#9da373c4cc654556bf2fa3fed6d56995">an example above</a>) to construct an implementation of the Floyd-Warshall shortest path algorithm.
]]></text>
<code class="py"><![CDATA[
N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]
oo = float('inf') # This represents infinite distance.
P = product(N,N)
I = [((x,y),oo if x != y else 0) for (x,y) in P] # Zero distance to self, infinite distance to others.
D = [((x,y),1) for (x,y) in E] # Edge-connected nodes are one apart.
OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
STEP = OUTPUT
P = product(STEP, STEP) # All pairs of edges.
NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y]) # Add distances of connected edge pairs.
OUTPUT = aggregate(NEW, min) # Keep only shortest node-node distance entries.
SHORTEST = OUTPUT
]]></code>
</example>
</subsection>
<subsection title="Data Provenance">
<text hooks="math"><![CDATA[
<i>Data provenance</i> is an overloaded term that refers, in various contexts and communities, to the source, origin, or lifecycle of a particular unit of data (which could be an individual data point, a subset of a data set, or an entire data set). In this course, we will use the term primarily to refer to dependency relationships between data sets (or relationships between individual entries in those data sets) that may be derived from one another (usually over time). As an area of study within computer science, data provenance (also called <i>data lineage</i>) is arguably still being delineated and developed by the community. However, some community standards for general-purpose representations of data provenance have been established (e.g., PROV [<a href="#PROV-Primer">#</a>]).
]]></text>
<paragraph hooks="math"><![CDATA[
While the research literature explores various ways of categorizing approaches to data provenance, there are two dimensions that can be used to classify provenance techniques (surveyed in more detail in the literature [<a href="#ilprints918">#</a>]):
<ul>
<li>from <i>where</i> the data was generated (e.g., what data sets or individual data entries) and <i>how</i> it was generated (e.g., what algorithms were used);</li>
<li>whether the lineage is tracked at a fine granularity (e.g., per data entry such as a row in a data set) or a coarse granularity (e.g., per data set).</li>
</ul>
]]></paragraph>
<table id="5a30782285aa11e4af63feff819cdc9f"><![CDATA[
<table class="fig_table">
<tr>
<td></td>
<td><b>coarse</b></td>
<td><b>fine</b></td>
</tr>
<tr>
<td><b>how</b></td>
<td>transformation</td>
<td>execution path through transformation algorithm</td>
</tr>
<tr>
<td><b>from where</b></td>
<td>source data sets</td>
<td>specific entries in source data sets</td>
</tr>
</table>
]]></table>
<paragraph hooks="math"><![CDATA[
When data lineage is being tracked at a fine granularity, there are at least two approaches one can use to determine provenance of a single entry within a data set produced using a specific transformation. One approach is to track the provenance of every entry within the transformation itself (sometimes called <i>tracing</i>); another approach is to determine the provenance of an entry after the transformation has already been performed (e.g., by another party and/or at some point in the past). The latter scenario is more likely if storage of per-entry provenance meta-data is an issue, or if the transformations are black boxes that cannot be inspected, modified, or extended before they are applied to input data sets.
]]></paragraph>
<paragraph><![CDATA[
For a transformation that may combine input data set entries in some way, a large number of entries in the input data set can influence an individual entry in the output data set. For such transformations, finding the per-entry provenance for an entry in the output data set can be non-trivial. Without additional information about the transformation, the conservative assumption to make is that all entries in the input data set may contribute to every entry in the output data set.
]]></paragraph>
<paragraph hooks="math"><![CDATA[
In some cases, we may know more about a transformation. Perhaps its definition is broken down into <a href="#2.1">building blocks found in the relational model</a> (e.g., selections and projections), or it can be defined using map and reduce operations in the MapReduce paradigm. In such cases, studying the data lineage of individual data items it may be possible to derive standard techniques [<a href="#Cui:2003:LTG:775452.775456">#</a>] for determining data lineage given the components that make up the transformation.
]]></paragraph>
<fact required="true" hooks="math" id="f3951db0b6c94dd4b409e9ebb28bd2cd">
<text><![CDATA[
For relational transformations that simply add or remove data set entries without changing their internal structure (i.e., schema) or data content (i.e., values), computing the per-entry provenance is relatively straightforward. In particular: for union, difference, intersection, selection, and product transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set (i.e., the fine-grained "where" provenance of an individual entry in the output) can be determined using a linear search through the input data set (or sets) for that entry.
]]></text>
</fact>
<fact required="true" hooks="math" id="a1151db0b6c94dd4b409e9ebb28bd2cd">
<text><![CDATA[
For projection transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set can be determined by applying the transformation to each entry in the input data set. Whenever this yields an output equivalent to the target, we know that the provenance for that output entry includes that input entry.
]]></text>
</fact>
<paragraph hooks="math"><![CDATA[
The above facts imply that for several relational transformation building blocks, per-entry provenance of output data set entries can be reconstructed efficiently. Note that by induction this also implies that any transformation composed of these building blocks is also amenable to efficient reconstruction of the provenance of any output data set entry.
]]></paragraph>
<paragraph hooks="math"><![CDATA[
If a transformation might <i>combine</i> subsets of the input entries to compute individual output entries (but we have no additional information about the transformation), then with regard to per-entry provenance a worst case scenario may apply.
]]></paragraph>
<fact required="true" hooks="math" id="f3951db0b6c94dd4b409e9ebb28bd2ca">
<text><![CDATA[
Suppose we know nothing about the internal structure of a transformation, but we want to determine what entries in the input data set influence a particular entry in the output data set. In this case, even running the entire transformation a second time (knowing the target entry) may provide no additional information about which subset of input data set entries produced that output data set entry. In this worst-case scenario, it would be necessary to run the transformation on all 2^{%n} subcollections of the input data set (for an input data set of size %n), and to check for each one of those 2^{%n} outputs whether the particular entry of interest from the output data set was generated.
]]></text>
</fact>
<paragraph hooks="math"><![CDATA[
However, it is possible that a transformation that combines input entries into output entries falls into one of a small number of categories that make it possible to compute per-entry provenance more efficiently.
]]></paragraph>
<fact required="true" hooks="math" id="f3951db0b6c94dd4b409e9ebb28bd2cc">
<text><![CDATA[
A transformation is <i>context-free</i> if any two entries in the input data set that contribute to the same output entry always do so (regardless of what other input entries are present in the input data set). If a transformation is context-free, then the provenance information for a given output data set entry can be computed in quadratic time. First, we can partition the input data set using the entries of the output data set (i.e., create one bucket of input data set entries for each output data set entry). Then, we can determine which partition contributed to the output data set entry of interest using linear search.
]]></text>
</fact>
<fact required="true" hooks="math" id="f3951db0b6c94dd4b409e9ebb28bd2cb">
<text><![CDATA[
A transformation is <i>key-preserving</i> if any two entries under the same keys in the input data set always contribute to an entry with the same output key in the output data set. If a transformation is key-preserving, then the provenance information for a given entry is easy to determine by building up a map from input data set keys to output data set keys, and then performing a linear search over the input data set.
]]></text>
</fact>
<example required="true" hooks="math" id="f3951db0b6c23dd4b409e9ebb28bd2fd">
<text><![CDATA[
Any <i>relational aggregation</i> (that is, a key-based aggregation operation as defined in the <a href="#2.1">relational model</a> that produces an output data set in which all keys are unique) is always key-preserving and context-free.
]]></text>
</example>
<paragraph hooks="math"><![CDATA[
The above example completes our understanding of the complexity of reconstructing per-entry provenance for all the <a href="#9da373c4cc654556bf2fa3fed6d56995">building blocks in the relational model</a>.
]]></paragraph>
<table hooks="math" id="5a307822b5aa11e4af63feff819cdc9a"><![CDATA[
<table class="fig_table">
<tr>
<td><b>transformation %f</b></td>
<td><b>per-entry provenance<br/>reconstruction approach</b></td>
<td><b>complexity for input data<br/>set with %n entries</b></td>
</tr>
<tr>
<td>union<br/>intersection<br/>difference<br/>selection<br/>product</td>
<td>linear search over input data set<br/>to find entry</td>
<td>O(%n) entry equality checks</td>
</tr>
<tr>
<td>projection</td>
<td>application of transformation once<br/>to each input data set entry</td>
<td>O(%n) executions of %f and<br/>O(%n) entry equality checks</td>
</tr>
<tr>
<td>aggregation</td>
<td>application of transformation once<br/>to each input data set entry and<br/>construction of key-to-key map</td>
<td>O(%n) executions of %f and<br/>O(%n \log %n) to build and use<br/>key-to-key map</td>
</tr>
</table>
]]></table>
<example required="true" hooks="math" id="f3951db0b5c23dd4b109e9ebb28bd2ac">
<text><![CDATA[
A transformation %f that takes a data set of points as its input, runs the %k-means algorithm on those points, and returns a data set of means is <i>not</i> context-free. To see why, consider its behavior on following inputs for %k = 2:
\begin{eqnarray}
%f([(0,0), (2,2)]) & = & [(0,0), (2,2)] \\
%f([(0,0), (2,2), (100,100)]) & = & [(1,1), (100,100)]
\end{eqnarray}
Notice that in the first case, the input entries (0,0) and (2,2) each produce their own output entry. However, the introduction of a new point (100,100) results in (0,0) and (2,2) both contributing to the same output entry (1,1).
]]></text>
</example>
<example required="true" hooks="math" id="f3951db0b5c23dd4b109e9ebb28bd2ab">
<text><![CDATA[
A transformation can be context-free but not key-preserving. For example, suppose a transformation aggregates some vectors by key but then discards the key via a projection:
\begin{eqnarray}
[(%j, (0,2)), (%j, (2,0)), (%k, (0,3)), (%k, (3,0))] & \mapsto & [(%j, (2,2)), (%k, (3,3))] & \mapsto & [(2,2), (3,3)]
\end{eqnarray}
The above is context-free because there is no way that other input entries can have any influence over the way existing entries aggregate by key. However, they key %j might map to any numeric value key in the output (depending on what the input entries are):
\begin{eqnarray}
[(%j, (0,0))] & \mapsto & [(0,0)] \\
[(%j, (0,0)), (%j, (1,1))] & \mapsto & [(1,1)]
\end{eqnarray}
]]></text>
</example>
<theorem required="false" hooks="math" id="f3951db0b1c94dd4bbbbe9ebb28bd2cb">
<text><![CDATA[
Any key-preserving transformation is context-free. To see this, suppose that there is some transformation %f that is key-preserving but not context-free. This would imply that there is some set of inputs (%i, %a), (%j, %b), and (%k, %c) for which the definition of context-free is not satisfied, such as a case in which two input entries each map to their own distinct output entries when on their own but map to the same output entry when the third input entry (%k, %c) is introduced:
\begin{eqnarray}
%f([(%i, %a), (%j, %b)]) & = & [(%l, %d)] \\
%f([(%i, %a), (%j, %b), (%k, %c)]) & = & [(%l, %d), (%m, %e), (%n, %f)]
\end{eqnarray}
However, the above would imply that under different conditions, the keys %i and %j might either both map to a key %l or might map to two different keys %l and %m. But this would imply that entries with the key %j map to entries with either key %l or key %m. This is not key-preserving and contradicts our assumptions, so our premise must have been impossible.
]]></text>
</theorem>
<text hooks="math"><![CDATA[
The diagram below illustrates the relationships between the properties discussed above. Notice that whether a transformation is context-free and key-preserving is used to determine whether a transformation that combines input entries might have properties that allow us to compute per-entry provenance more efficiently. There is no need to check whether other transformations (such as projections and products) have these properties because per-entry provenance is <a href="#f3951db0b6c94dd4b409e9ebb28bd2cd">already efficiently computable in those cases</a>.
]]></text>
<diagram hooks="math" id="123bf7b584394a8bb3a62e9be3fae8dc"><![CDATA[
<table class="container">
<tr>
<td class="box" style="background-color:lightgrey;">
data set transformations
<table class="container">
<tr>
<td class="box" style="background-color:powderblue;">
transformations that combine multiple<br/>input entries into output entries
<br/><span style="font-weight:normal; font-style:italic;">(example: %k-means)</span>
<table class="container">
<tr>
<td class="box" style="background-color:lightgreen;">
context-free transformations
<br/><span style="font-weight:normal; font-style:italic;">(example: <a href="#f3951db0b5c23dd4b109e9ebb28bd2ab">aggregation by key that drops the key</a>)</span>
<table class="container">
<tr>
<td class="box" style="background-color:lightyellow; padding-bottom:4px;">
key-preserving transformations
<br/><span style="font-weight:normal; font-style:italic;">(example: relational aggregation by key)</span>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="box" style="background-color:powderblue;">
other transformations
<br/><span style="font-weight:normal; font-style:italic;">(examples: selections, projections)</span>
</td>
</tr>
</table>
</td>
</tr>
</table>
]]></diagram>
<paragraph hooks="math"><![CDATA[
We provide explicit algorithms for computing per-entry provenance in three of the cases outlined above.
]]></paragraph>
<example required="true" id="f3951db0b6c94dd4b409e9ebb28bddaa">
<text hooks="math"><![CDATA[
If we introduce a function for computing powersets of entries in a data set (e.g., using a <a href="https://docs.python.org/3/library/itertools.html#itertools-recipes">Python recipe</a> for such a function), we can define an inefficient Python algorithm that can correctly determine per-entry provenance in the <a href="#f3951db0b6c94dd4b409e9ebb28bd2ca">general case</a> (i.e., not knowing any additional information about a transformation <code>f</code>).
]]></text>
<code class="py"><![CDATA[
from itertools import combinations, chain
def powerset(iterable):
"powerset([1,2,3]) -> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
def general_prov(f, X, y):
for Y in reversed(powerset(X)): # From largest to smallest subset.
if list(f(list(Y))) == [y]:
return Y
]]></code>
</example>
<example required="true" id="f3951db0b6c94dd4b409e9ebb28bd216">
<text hooks="math"><![CDATA[
The function below determines the per-entry provenance for a <a href="#f3951db0b6c94dd4b409e9ebb28bd2cc">context-free transformation</a>. For this example, we assume that all data set entries are of the form (<i>key</i>, <i>value</i>).
]]></text>
<code class="py"><![CDATA[
def context_free_prov(f, X, y):
# Build the partitions of the input data set.