MADlib Release Notes
--------------------
These release notes contain the significant changes in each MADlib release,
with the most recent versions listed at the top.
A complete list of changes for each release can be obtained by viewing the git
commit history located at https://github.com/apache/madlib/commits/master.
The current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB.
---------------------------------------------------------------------------
MADlib v1.21.0:
Release Date: 2023-March-01
New features:
- Graph: Add warm start for weakly connected components
- Graph: Add multicolumn identifier support for SSSP and APSP
- Build: Add support for Photon3 OS
Improvements:
- XGBoost: Add support for bigint and varchar columns
- XGBoost: Enable eval_metrics parameter
Bug fixes:
- XGBoost: Fix class label verification
- Graph: Fix SSSP negative cycle check
- Build: Disable TestIfNoUTF8BOM.py
---------------------------------------------------------------------------
MADlib v1.20.0:
Release Date: 2022-August-05
New features:
- XGBoost: Python based XGBoost with single and grid search executions (MADLIB-1425, MADLIB-1490)
- Graph: Add multicolumn support for WCC and Pagerank (MADLIB-1502, MADLIB-1503)
Improvements:
- Utilities: Reuse update plan in GroupIterationController
- Documentation: Update online examples for various modules
Bug fixes:
- Elastic Net - GLM - SVM: Adjust ORCA to reduce planning time
---------------------------------------------------------------------------
MADlib v1.19.0:
Release Date: 2022-March-09
NOTE: This release changes the MADlib library location and updates UDF definitions.
To ensure this change is reflected in the database, pg_ctl (for Postgres) or gpstop (for Greenplum) must be in the PATH at the time of madpack installation.
Views that depend directly on MADlib-defined function(s) must be dropped
before upgrading.
New features:
- DBSCAN: Fast parallel-optimized DBSCAN (MADLIB-1017, MADLIB-867)
- MLP: Add rmsprop and Adam optimization techniques (MADLIB-1434, MADLIB-1435)
Improvements:
- Graph: Improve WCC subtx count and catalog entry frequency (MADLIB-1492)
- MLP: Set lambda value for minibatch
Bug fixes:
- PMML: Rename builder.py to circumvent ignore scripts
- DL: Fix object table schema error message
- DL: Fix metrics error message
- DL: Remove AOcontrol from load_top_k_accuracy fn
- GLM-multinom: Use non-temp tables in GroupIterationController
Other:
- DEBUG: Add WithTracebackForwarding() macro and report_segment_tracebacks param
- Jenkins: Add new dockerfile for PG11 (MADLIB-1469)
- Build: Use dynamic_library_path for module pathname
- Mac OS support is dropped until further notice
---------------------------------------------------------------------------
MADlib v1.18.0:
Release Date: 2021-Mar-16
New features:
- DL: setup methods for grid search and random search (MADLIB-1439)
- DL: Add support for custom loss functions (MADLIB-1441)
- DL: Hyperband phase 1 - print run schedule (MADLIB-1445)
- DL: Hyperband phase 2 - generate MST table (MADLIB-1446)
- DL: Hyperband phase 3 - logic for diagonal runs (MADLIB-1447)
- DL: Hyperband phase 4 - implement full logic with default params (MADLIB-1448)
- DL: Hyperband phase 5 - implement full logic with optional params (MADLIB-1449)
- AutoML: add Hyperopt for deep learning (MADLIB-1453)
- DL: Add Multiple input/output support to load, fit, and evaluate (MADLIB-1457)
- DL: Add multiple input/output support on advanced features (MADLIB-1458)
- DL: add caching param to autoML interface (MADLIB-1461)
- DL: Add support for TensorBoard (MADLIB-1474)
- DBSCAN clustering algo - phase 1 (MADLIB-1017)
Improvements:
- DL: cache data to speed training (MADLIB-1427)
- DL: reduce GPU idle time between hops (MADLIB-1428)
- DL: utility to load and delete custom Python functions (MADLIB-1429)
- DL: support custom loss functions (MADLIB-1432)
- DL: support custom metrics (MADLIB-1433)
- DL: Fit multiple does not print timing for validation evaluate (MADLIB-1462)
- DL: Fix gpu_memory_fraction for distribution_policy != 'all_segments' (MADLIB-1463)
- DL: add object table info in load MST table utility function (MADLIB-1430)
- DL: improve speed of evaluate for multiple model training (MADLIB-1431)
- DL: improve existing grid search method (MADLIB-1440)
- DL: Remove dependency on keras (MADLIB-1450)
- DL: Improve output of predict (MADLIB-1451)
- DL: Add top n to evaluate() (MADLIB-1452)
- DL: Write best so far to console for autoML methods (MADLIB-1454)
- Do not try to drop output tables (MADLIB-1442)
- Prevent an "integer out of range" exception in linear regression train (MADLIB-1460)
Bug fixes:
- DL: Fix fit_multiple when output_table or mst_table is passed as NULL (MADLIB-1464)
- DL: Iris predict accuracy has regressed (MADLIB-1465)
- DL: madlib_keras_fit_multiple_model fails with an IndexError: tuple index out of range (MADLIB-1467)
- DL: Crash in fit_multiple when any model reaches loss=nan (MADLIB-1443)
- DL: BYOM fails at get_num_classes (MADLIB-1472)
- DL: Hyperband cumulative output time is not correct (MADLIB-1456)
- Graph: Check bigint support for all graph methods (MADLIB-1444)
- MLP: weights param not working (MADLIB-1471)
Other:
- Create build trigger jobs on cloudbees (MADLIB-1466)
---------------------------------------------------------------------------
MADlib v1.17.0:
Release Date: 2020-Mar-31
New features:
- DL: Add optional params to madlib_keras_fit_multiple_model (MADLIB-1397)
- DL: Fit and evaluate changes for asymmetric cluster config (MADLIB-1393)
- DL: Make param search fit() function work with existing evaluate and predict (MADLIB-1387)
- DL: ParamSearch: Add utility function for generating model selection table (MADLIB-1375)
- DL: Predict changes for asymmetric cluster config (MADLIB-1394)
- DL: Preprocessor should evenly distribute data on an arbitrary number of segments (MADLIB-1378)
- DL: Preprocessor support for asymmetric segment distribution (MADLIB-1392)
- DL: Remove model_arch_table column from the output of load_model_selection_table (MADLIB-1381)
- DL: Support DL predict without training on MADlib (MADLIB-1359)
- DL: Transfer learning for multi-model (MADLIB-1389)
- Kmeans: Add simple silhouette score for every point (MADLIB-1382)
- Kmeans: Select number of centroids in k-means (MADLIB-1380)
- PostgreSQL 12 support (MADLIB-1391)
Improvements:
- Assoc rules: Add option to set number of posterior in association rules (MADLIB-1327)
- Correlation: Improve correlation and covariance memory usage with large number of groups (MADLIB-1301)
- DL: helper function for asymmetric cluster config (MADLIB-1390)
- DL: Mini-batch preprocessor for images - performance issue (MADLIB-1342)
- DL: Modify warm start logic for DL to handle case of missing weight (MADLIB-1400)
- DL: Param search for multiple models on MPP architecture (MADLIB-1386)
- DL: performance improvements to fit transition function (MADLIB-1418)
- Docs: Enhance Installation Guides (MADLIB-1399)
- Graph: SSSP should not show vertices in output table that are unreachable (MADLIB-1415)
- KNN: Add zero check and output distance array (MADLIB-1370)
- LDA: Add stopping criteria on perplexity to LDA (MADLIB-1351)
- Summary: Last optional param in summary errors when NULL (MADLIB-1413)
- Summary: Summary function has dups for MFV for approximate results (MADLIB-1412)
- SVM: Change default num_components for SVM to max(100, 2*num_features) (MADLIB-1384)
Bug fixes:
- DL: Deep Learning module does not work with tables in non-public schemas (MADLIB-1388)
- DL: Exception during madlib_keras_fit when model_arch_id is passed as NULL (MADLIB-1371)
- DL: fit and fit multiple fail with memory exception in gpdb6 (MADLIB-1405)
- DL: fit multiple takes up unnecessary disk space (MADLIB-1406)
- DL: Intermediate tables are not dropped (MADLIB-1404)
- DL: MADlib Keras operations create too many threads (MADLIB-1372)
- DL: metrics_elapsed_time for fit multi_model not captured correctly (MADLIB-1403)
- DL: predict fails with OOM in gpdb6 (MADLIB-1414)
- DL: Remove final function for fit multiple (MADLIB-1416)
- DL: Support schema qualified output tables for fit and fit_multiple (MADLIB-1417)
- Graph: APSP fails if both vertex id column and edge src column have the same name (MADLIB-1407)
- Graph: APSP Path Function fails if src or dest column type is bigint (MADLIB-1408)
- Graph: Graph/wcc fails if the user specifies a schema for the output table (MADLIB-1411)
- Kmeans: k-means related functions must use same default distance function (MADLIB-1383)
- LDA: Term frequency and LDA - turn off notices (MADLIB-1395)
- MADlib cannot be built on PowerPC machines with Linux (MADLIB-1410)
- Pivot: Pivot documentation should say "out_table" instead of "output_table" (MADLIB-1376)
Other:
- DL: Support up to Keras version 2.2.4, Tensorflow version 1.14
- DL: If 'madlib_keras_fit_multiple_model()' is running on GPDB 5 and some versions of GPDB 6, the database will keep adding to the disk space (in proportion to model size) and will only release the disk space once the fit multiple query has completed execution. This is not the case for GPDB 6.5.0+ where disk space is released during the fit multiple query.
- DL: CUDA GPU memory cannot be released until the process holding it is terminated. This process holds the GPU memory until one of the following two things happens: query finishes and user logs out of the Postgres client/session; or, query finishes and user waits for the timeout set by `gp_vmem_idle_resource_timeout`. The default value for this timeout in Greenplum is 18 sec, but it can be changed.
- DL: pg_temp is not allowed as an output table schema for madlib_keras_fit_multiple_model().
- Build: Enable current versions of bison
- Build: Add cmake variable for gppkg filename
- Build: Add pull request template
---------------------------------------------------------------------------
MADlib v1.16:
Release Date: 2019-Jul-02
New features:
- Deep learning: support for Keras with TensorFlow backend with GPU acceleration (MADLIB-1268, MADLIB-1304, MADLIB-1305, MADLIB-1307, MADLIB-1308, MADLIB-1309, MADLIB-1310, MADLIB-1311, MADLIB-1313, MADLIB-1314, MADLIB-1315, MADLIB-1316, MADLIB-1319, MADLIB-1321, MADLIB-1326, MADLIB-1330, MADLIB-1335, MADLIB-1336, MADLIB-1338, MADLIB-1343, MADLIB-1348, MADLIB-1349, MADLIB-1350, MADLIB-1356, MADLIB-1357, MADLIB-1358, MADLIB-1324, MADLIB-1337, MADLIB-1347, MADLIB-1363)
- Deep learning: utility to load model architectures and weights (MADLIB-1306)
- Deep learning: preprocess images for gradient descent optimization algorithms (MADLIB-1290, MADLIB-1332, MADLIB-1334, MADLIB-1300, MADLIB-1303)
- kd-tree method for k-nearest neighbors for faster approximate solution (MADLIB-1061, MADLIB-1293)
- Support for Greenplum 6 (MADLIB-1298)
- Support for PostgreSQL 11 (MADLIB-1283)
Bug fixes:
- Jaccard distance not releasing memory (MADLIB-1291)
- MLP with minibatching fails on postgres (MADLIB-1302)
- MLP does not stop even after tolerance reached (MADLIB-1325)
- MLP warm start not working (MADLIB-1329)
- MLP with minibatch fails for integer dependent variable on PostgreSQL (MADLIB-1322)
- MLP fix column name in output table (MADLIB-1323)
- Pivot: Fix array_agg + distinct scaling issue on gpdb (MADLIB-1361)
- linregr_train fails when dependent variable is a JSONB element (MADLIB-1284)
- MADlib 1.15 does not recognize Postgres 10 declarative partitioned table (MADLIB-1287)
- Encoding module is not handling bigint properly (MADLIB-1295)
- SVM class_weight param not working properly (MADLIB-1346)
Other:
- Simplify maintenance via removing online examples from sql functions (MADLIB-1260)
- Improve performance for weakly connected components (MADLIB-1320)
- SVD minor messaging improvements (MADLIB-983)
- Create SQL scripts to get lists of changed UDOs and UDOCs (MADLIB-1281)
- Set max itemset size to 10 by default in assoc rules (MADLIB-1288)
- Misc messages for 1.16 release (MADLIB-1364)
- Madlib 1.16 release tasks (MADLIB-1362)
---------------------------------------------------------------------------
MADlib v1.15.1:
Release Date: 2018-Oct-15
New features:
- Add ubuntu support for MADlib (MADLIB-1256).
- Elastic Net: Add grouping by non-numeric column support (MADLIB-1262).
- KNN: Accept expressions for point_column_name and test_column_name (MADLIB-1060).
- Vec2Cols: Allow arrays of different lengths (MADLIB-1270).
- Madpack: Add a script for automating changelist creation.
Bug fixes:
- Allocator: Remove 16-byte alignment in GPDB 6.
- Build: Download compatible Boost if version >= 1.65 (MADLIB-1235).
- Build: Remove primary key constraint in IC/DC.
- CMake: Fix false positive for Postgres 10+ check.
- Graph: Add id of nodes with 0 in-degree (MADLIB-1279).
- Margins: Copy summary table instead of renaming (MADLIB-1276).
- MLP: Simplify momentum and Nesterov updates (MADLIB-1272).
- Upgrade: Fix issue with upgrading RPM to 1.15.1 (MADLIB-1278).
- Utilities: Use plpy.quote_ident if available.
Others:
- Simplify maintenance via removing online examples from sql functions (MADLIB-1260).
- Re-enable PCA and PageRank tests (MADLIB-1264).
- Build: Disable AppendOnly if available (MADLIB-1273).
- Improve documentation of various modules.
---------------------------------------------------------------------------
MADlib v1.15:
Release Date: 2018-Aug-15
New features:
* MLP: Added momentum and Nesterov's accelerated gradient methods to gradient
updates (MADLIB-1210).
* New modules:
- drop_cols: Create new table from an existing table (CTAS) using an
expression of column names (MADLIB-1241).
- cols2vec: Create an array from multiple columns (similar to ARRAY[...]
with columns obtained using an expression) (MADLIB-1239).
- vec2cols: Create multiple columns from an existing array (MADLIB-1240).
* Statistics: Added grouping support to correlation and covariance
functions (MADLIB-1128).
* DT/RF:
- Added impurity importance values in DT and RF (MADLIB-1205, 1246, 1249).
- Added a new function (get_var_importance) to report importance values
in a cleaner interface (MADLIB-925).
* Madpack:
- Refactored and updated the installation scripts to ensure install,
reinstall, install-check are all run from a single SQL file as an atomic
operation (MADLIB-1242).
- Moved most of install-check operations to a new "dev-check", making
install-check smaller and faster to run.
- Added new option to run unit-tests (MADLIB-1251, 1252).
Bug fixes:
- Fixed an ABI issue that prevented compiling MADlib on GCC 5+
(MADLIB-1025).
- Decision trees:
- Fixed a minor bug that prevented conversion of sparse vector to float8[]
(MADLIB-1234).
- Fixed a bug that led to dependent type being obtained from a NULL
value (MADLIB-1233).
- Summary table has been updated to ensure correct feature names are
populated (MADLIB-1236).
- Fixed incorrect indexing of trueChild and falseChild in surrogate
agreement calculation.
- Removed categorical variable elimination to avoid issues with varying
categorical variables for different groups (MADLIB-1258, 1254).
- Logregr: Fixed issue where an output table could be empty for grouping
(MADLIB-1172).
- Added special characters support for multiple modules
(MADLIB-1237, 1238, 1243).
- Build: Removed invalid symlinks left behind after an uninstall
(MADLIB-1175).
- Updated SVM to correctly report loss per row instead of total loss.
- Refactored internal CV function to fix multiple issues with cross
validation on SVM (MADLIB-1250).
- Worked around a "cache lookup" issue that prevented dropping of
install-check user (MADLIB-1014).
- Pagerank: Removed duplicate entries from grouping output
(MADLIB-1229, 1253).
- Madpack: Install-check user is dropped even after an IC failure
(MADLIB-1182).
Others:
- Removed HAWQ support from all modules
---------------------------------------------------------------------------
MADlib v1.14:
Release Date: 2018-April-28
New features:
* New module - Balanced datasets: A sampling module to balance classification
datasets by resampling using various techniques including undersampling,
oversampling, uniform sampling or user-defined proportion sampling
(MADLIB-1168)
* Mini-batch: Added a mini-batch optimizer for MLP and a preprocessor function
necessary to create batches from the data (MADLIB-1200, MADLIB-1206,
MADLIB-1220, MADLIB-1224, MADLIB-1226, MADLIB-1227)
* k-NN: Added weighted averaging/voting by distance (MADLIB-1181)
* Summary: Added additional stats: number of positive, negative, zero values and
95% confidence intervals for the mean (MADLIB-1167)
* Encode categorical: Updated to produce lower-case column names when possible
(MADLIB-1202)
* MLP: Added support for already one-hot encoded categorical dependent variable
in a classification task (MADLIB-1222)
* Pagerank: Added option for personalized vertices, which assigns a higher
jump probability to a chosen subset of vertices so that a random surfer is
more likely to jump to these personalization vertices than to other
vertices (MADLIB-1084)
Bug fixes:
- Fixed issue with invalid calls of construct_array that led to problems
in PostgreSQL 10 (MADLIB-1185)
- Added newline between file concatenation during PGXN install (MADLIB-1194)
- Fixed upgrade issues in knn (MADLIB-1197)
- Added fix to ensure RF variable importance values are always non-negative
- Fixed inconsistency in LDA output and improved usability
(MADLIB-1160, MADLIB-1201)
- Fixed MLP and RF predict for models trained in earlier versions to
ensure missing optional parameters are given appropriate default values
(MADLIB-1207)
- Fixed a database crash in DT that occurred when no features remained
because categorical columns with a single level were dropped
- Fixed step size initialization in MLP based on learning rate policy
(MADLIB-1212)
- Fixed PCA issue that leads to failure when grouping column is a TEXT type
(MADLIB-1215)
- Fixed cat levels output in DT when grouping is enabled (MADLIB-1218)
- Fixed and simplified initialization of model coefficients in MLP
- Removed source table dependency for predicting regression models in MLP
(MADLIB-1223)
- Print loss of first iteration in MLP (MADLIB-1228)
- Fixed MLP failure on GPDB 4.3 when verbose=True (MADLIB-1209)
- Fixed RF issue that showed up when var_importance=True with no continuous
features (MADLIB-1219)
- Fixed DT/RF issue for null_as_category=True and grouping enabled
(MADLIB-1217)
Other:
- Reduced install-check runtime for PCA, DT, RF, elastic net (MADLIB-1216)
- Added CentOS 7 PostgreSQL 9.6/10 docker files
---------------------------------------------------------------------------
MADlib v1.13:
Release Date: 2017-December-22
New features:
* New module: Graph - HITS (MADLIB-1124, MADLIB-1151)
* k-NN:
- Added additional distance metrics (MADLIB-1059)
- Added list of neighbors in output table (MADLIB-1129)
* MLP: Added grouping support (MADLIB-1149)
* Cross Validation: Improved the stats reporting in output table (MADLIB-1169)
* Correlation: Improved quality of results by ignoring only a NULL value and
not the whole row containing the NULL (MADLIB-1166)
Bug fixes:
- Fixed issue with Decision Trees (DT) trained in older versions not
being usable in predict of v1.12 (MADLIB-1161)
- Fixed invalid assert statement in DT (MADLIB-1164)
- Improved feature array handling in DT (MADLIB-1173)
- Fixed install-check failures on non-default schema installation (MADLIB-1177, 1184)
Other:
- Updated PyXB from 1.2.4 to 1.2.6. (MADLIB-1103)
This change eliminates the need to remove part of PyXB code base as a
GPL-workaround.
- Updated the naming for gppkg (MADLIB-1183)
---------------------------------------------------------------------------
MADlib v1.12:
Release Date: 2017-August-18
New features:
* New module: Graph - All Pairs Shortest Path (MADLIB-1072, MADLIB-1099, MADLIB-1106)
* New module: Graph - Weakly Connected Components (MADLIB-1071, MADLIB-1083, MADLIB-1101)
* New module: Graph - Breadth First Search (MADLIB-1102)
* New module: Graph - Measures (MADLIB-1073)
* New module: Sample - Stratified Sampling (MADLIB-986)
* New module: Sample - Train-test split (MADLIB-1119)
* New module: Multilayer Perceptron (MADLIB-413, MADLIB-1134)
* DT and RF:
- Allow expressions in feature list (MADLIB-1087)
- Allow array input for features (MADLIB-965)
- Filter NULL dependent values in OOB (MADLIB-1097)
- Add option to treat NULL as category
* Summary:
- Allow user to determine the number of columns per run (MADLIB-1117)
- Reduce computation time by ~35% (MADLIB-1104)
* Sketch:
- Promote cardinality estimators to top level module from early stage (MADLIB-1120)
* Add basic code coverage support (MADLIB-1138)
* Updates for Apache Top Level Project readiness (MADLIB-1112, MADLIB-1130, MADLIB-1133, MADLIB-1142)
Bug fixes:
- DT and RF:
- Fix array to string conversion with CV
- Include NULL rows in count for termination check
- Sketch:
- Remove per-tuple checks for better performance
- PageRank:
- Fix multiple bugs and perf issue in grouping (MADLIB-1100, MADLIB-1107)
- Kmeans:
- Fix IC drop table statements
- Graph:
- Fix quoted output table name bug (MADLIB-1137)
- Fix empty string arguments bug
- Elastic Net:
- Fix the data scaling bug with normalization (MADLIB-1094)
- Reduce the tolerance for a faster IC test (MADLIB-1118)
- Control:
- Update 'optimizer' GUC only if editable (MADLIB-1109)
Other:
- Build: Add CDATA block to avoid invalid xml
- Multiple user documentation improvements
---------------------------------------------------------------------------
MADlib v1.11:
Release Date: 2017-May-05
New features:
* New module: Graph - PageRank
- Implements the original PageRank algorithm that assumes a random surfer model
(https://en.wikipedia.org/wiki/PageRank#Damping_factor) (MADLIB-1069)
- Grouping support is included for PageRank (MADLIB-1082)
* Graph - Single Source Shortest Path (SSSP): Add grouping support (MADLIB-1081)
* Pivot: Add support for array and svec output types (MADLIB-1066)
* DT and RF:
- Change default values for 2 parameters (max_depth and num_splits)
- Reduce memory footprint: Assign memory only for reachable nodes (MADLIB-1057)
- Include rows with NULL features in training (MADLIB-1095)
- Update error message for invalid parameter specification (num_splits)
* Array Operations: Add function to unnest 2-D arrays by one level into rows
of 1-D arrays (MADLIB-1086)
* Build process on Apache infrastructure (MADLIB-920, MADLIB-1080)
* Updates for Apache Top Level Project readiness (MADLIB-1022, MADLIB-1076,
MADLIB-1077, MADLIB-1090)
* Support for GPDB 5.0
Bug fixes:
- DT and RF:
- Fix accuracy issues related to integer categorical variables and tree depth
- Improve visualization of tree(s)
- Elastic Net:
- Fix install check on GPDB 5.0 and HAWQ 2.2 (MADLIB-1088)
- Fix inconsistent results with grouping (MADLIB-1092)
- PCA: Fix install check
Other:
- PMML: Skip install check when run without the ‘-t’ option (MADLIB-1078)
- Multiple user documentation improvements
---------------------------------------------------------------------------
MADlib v1.10.0:
Release Date: 2017-February-17
New features:
* New module: Graph - Single Source Shortest Path (SSSP) (MADLIB-992)
- Calculate the shortest path from a given vertex to every vertex in the graph.
* New module: Encode categorical variables (MADLIB-1038)
- Completely new version for dummy/one-hot encoding of categorical variables
with new name and different arguments.
- Previous version has been deprecated.
* New module (early stage): K-Nearest Neighbors (KNN) (MADLIB-927)
- Find the k nearest neighbors based on the squared_dist_norm2 metric.
* Elastic Net: Add grouping support (MADLIB-950)
- Elastic net train for both Gaussian and Binomial models, with FISTA
and IGD optimizations support grouping.
- Use active sets for FISTA, but active sets are used only after the
log-likelihood of all the groups becomes 0.
* Elastic Net: Add cross validation (MADLIB-996)
* PCA: Add grouping support (MADLIB-947)
* PCA: Removed column id restriction.
* Kmeans: Cluster variance for PivotalR support.
* Kmeans: Support for array input. (MADLIB-1018)
* DT and RF: Verbose option for the dot output format. (MADLIB-1051)
* Association Rules: Add rule counts and limit itemset size feature
(MADLIB-1044, MADLIB-1031)
* Boost library has been upgraded from 1.47 to 1.61
* Multiple improvements to the build system (madpack, cmake etc.) to support
Semantic versioning and various versions of GPDB and HAWQ.
Bug fixes:
- Pivot: Adjust the warning level to remove redundant messages.
- RF: Fix the online help and examples.
- Utilities: Fix incorrect flag for distribution.
- Install check: Update date format and remove hardcoded schema names.
- Multiple user documentation improvements.
---------------------------------------------------------------------------
MADlib v1.9.1:
Release Date: 2016-August-25
New features:
* New function: One class SVM (MADLIB-990)
- Added a one-class SVM that classifies new data as similar or different to
the training set.
- This method is an unsupervised method that builds a decision boundary
between the data and origin in kernel space and can be used as a novelty
detector.
* SVM: Added functionality to assign weights to each class, simplifying
classification of unbalanced data. (MADLIB-998)
* New function: Prediction metrics (MADLIB-907)
Added a collection of summary statistics to gauge model accuracy based on
predicted values vs. ground-truth values.
* New function: Sessionization (MADLIB-909, MADLIB-1001)
Added a sessionize function to perform session reconstruction on a data
set so it can be prepared for input into other algorithms such as
path functions or predictive analytics algorithms.
* New function: Pivot (MADLIB-908, MADLIB-1004)
Added a function that can do basic OLAP type operations on data stored
in one table and output the summarized data to a second table.
* Path: Major performance improvement (MADLIB-984)
* Path: Add support for overlapping patterns (MADLIB-995)
* Build: Add support for PG 9.5 and 9.6 (MADLIB-944)
* PGXN: Update PostgreSQL Extension Network to latest release (MADLIB-959)
Bug fixes:
- Random Forest: Fix filtered feature related bug (MADLIB-928)
- Elastic Net: Skip arrays with NULL values in train (MADLIB-978)
- Matrix: Fix starting index in extract functions (MADLIB-1006)
- Path: Allow multiple expressions in partition expression (MADLIB-1003)
- DT: Fix bin computation for boolean features (MADLIB-1011)
- Multiple user documentation improvements (MADLIB-1001)
---------------------------------------------------------------------------
MADlib v1.9:
Release Date: 2016-April-04
New features:
* New module: Path
- Perform pattern matching over a sequence of rows and extract useful
information about the pattern matches.
- Useful in a wide variety of use cases: on-line shopping, predictive
maintenance, cyber security, IoT, customer churn, etc.
- Define arbitrarily complex symbols to identify rows of interest.
- Perform regular pattern matching of symbols over a sequence of ordered partitions.
- Extract useful information about the pattern matches (counts,
aggregations, window functions).
* New module: Support Vector Machines (SVM)
- Complete rewrite of SVM algorithm to improve accuracy and performance.
- Support for classification and regression.
- Support for non-linear kernels (Gaussian and Polynomial).
- Cross validation support on parameters: lambda, epsilon, initial step size,
maximum iterations, and decay factor.
* New module: Stemmer function
- Compute the root of any English text input using Porter2 stemming algorithm.
* New matrix operations (Phase 2)
- Added following operations/functions for dense and sparse matrices:
- Representation: get matrix dimensions
- Extraction/visitor methods: extract diagonal elements
- Reduction operations: compute matrix norm
- Creation methods: initialize with ones, initialize with zeros,
square identity matrix, diagonal matrix, sample from distribution
(Normal, Uniform, Bernoulli)
- Decomposition operations: inverse, generic inverse, eigen extraction,
Cholesky decomposition, QR decomposition, LU decomposition, nuclear norm, rank
* Pearson's correlation module: added option to return the covariance matrix
* PCA: added option to use proportion of variance to determine number of
principal components to return (MADLIB-948)
* PivotalR support for Latent Dirichlet Allocation (LDA)
* Quotation and international character support (Phase 2)
- All modules now support table and column names that are quoted and
contain international characters. This release adds support for:
- Cross Validation
- Dense Linear Systems
- Sparse Linear Systems
- Low-rank Matrix Factorization
- Conditional Random Field
- Hypothesis Tests
- Support Modules/Data Preparation
- Support Modules/PMML Export
- ARIMA
* New platform:
- Added support for HAWQ 2.0
* Miscellaneous:
- Updated documentation and more examples
- Term frequency: added support for custom column names
- Updated licensing files and headers to comply with ASF regulations
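The Path module listed above works by labeling rows of interest with symbols and then matching a regular pattern over the ordered symbols within each partition. The following is a minimal Python sketch of that idea only; it is not MADlib's SQL interface. The function name match_paths, the single-character symbols, and the sample rows are all hypothetical.

```python
import re
from itertools import groupby

def match_paths(rows, partition_key, order_key, symbols, pattern):
    """Return {partition: [list of matched rows, ...]} for a regex over row symbols.

    symbols maps a single-character symbol to a predicate on a row.
    Rows that satisfy no predicate are encoded as '_'.
    """
    results = {}
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for part, group in groupby(ordered, key=lambda r: r[partition_key]):
        group = list(group)
        # Encode each row as the first symbol whose predicate holds.
        encoded = "".join(
            next((s for s, pred in symbols.items() if pred(r)), "_")
            for r in group
        )
        # Each regex match over the encoded string maps back to a run of rows.
        matches = [group[m.start():m.end()] for m in re.finditer(pattern, encoded)]
        if matches:
            results[part] = matches
    return results

rows = [
    {"user": 1, "t": 1, "page": "home"},
    {"user": 1, "t": 2, "page": "cart"},
    {"user": 1, "t": 3, "page": "checkout"},
    {"user": 2, "t": 1, "page": "cart"},
    {"user": 2, "t": 2, "page": "home"},
]
symbols = {
    "C": lambda r: r["page"] == "cart",
    "B": lambda r: r["page"] == "checkout",
}
# One or more cart views immediately followed by a checkout.
found = match_paths(rows, "user", "t", symbols, r"C+B")
```

Counts or aggregates over the matches can then be computed from the returned row lists, for example len(found[1]) for the number of matches in partition 1.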
Bug fixes:
- Elastic Net: Skips arrays with NULL values in predict (MADLIB-919)
- Hello World example: Fixed 'this' pointer errors (MADLIB-967)
- Hypothesis tests: Fixed docs and examples (MADLIB-895)
- Matrix: Fixed inconsistent type in drop statements
- Decision Tree: Fixed format specifier in online help (MADLIB-968)
- Minor: Updated volatile install-check
- LDA: Fixed the padding for LDA model
- Decision tree: Fixed to cast count(*) output to long (MADLIB-917)
- Validation: Fixed varchar array error in install-check
- Matrix: Fixed multiple input/output issues (MADLIB-932)
- Matrix: Fixed minor issue with sparse LU output
- Summary: Fixed the case for unquoted table names by moving the compare to
SQL (MADLIB-954)
- Correlation: Fixed to return columns sorted in ordinal position. (MADLIB-941)
- Elastic Net: Removed the enforcement of same numeric type while keeping the
error for non-numeric types. (MADLIB-952)
- K-means: Fixed the error caused by a null value in the matrix or vector.
(MADLIB-946)
--------------------------------------------------------------------------------
MADlib v1.8
Release Date: 2015-July-17
New features:
* Improved Latent Dirichlet Allocation (LDA) Performance
- Function lda_train() is about twice as fast.
- Improved the scalability of the function
(vocabulary size x number of topics can be up to 250 million).
* New module: Matrix operations
Added the following operations/functions for dense and sparse matrices:
- Mathematical operations: addition, subtraction, multiplication,
element-wise multiplication, scalar and vector multiplication.
- Aggregation operations: apply various operations including
max, min, sum, mean along a specified dimension.
- Visitor methods: extract row/column from matrix.
- Representation: convert a matrix to either dense or sparse representation.
* Quotation and International Character Support
- Most modules now support table and column names that are quoted and
contain international characters, including:
- Regression models (GLMs, linear regression, elastic net, etc.)
- Decision trees and random forests
- Unsupervised learning models (association rules, k-means, LDA, etc.)
- Summary, Pearson's correlation, and PCA
* Array Norms and Distances
- Generic p-norm distance
- Jaccard distance
- Cosine similarity
* Text Analysis:
- Text utility for term frequency and vocabulary construction (prepares
documents for input to LDA).
* Miscellaneous
- Improved organization of User and Developer guide at doc.madlib.net/latest.
- Low-rank matrix factorization: added 32-bit integer support (MADLIB-903).
- Cross-validation: added classification support (MADLIB-908).
- Added a new clean-up function for removing MADlib temporary tables.
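For readers unfamiliar with the array norms and distances listed above, here is a small Python sketch of the standard textbook definitions. It illustrates the math only; the function names are hypothetical, and MADlib's SQL functions may handle edge cases (empty inputs, zero vectors) differently.

```python
import math

def pnorm_distance(x, y, p):
    """Generic p-norm (Minkowski) distance between equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return 1.0 - len(sa & sb) / len(union)

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```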
Note:
- LDA models that are trained using MADlib v1.7.1 or earlier need to be
re-trained to be used in MADlib v1.8.
Known issues:
- Performance for decision tree with cross-validation is poor on a HAWQ
multi-node system.
--------------------------------------------------------------------------------
MADlib v1.7.1
Release Date: 2015-March-18
New features:
* Random Forest Performance Improvement
- Function forest_train() is 1.5X ~ 4X faster without variable importance,
and up to 100X faster with variable importance
- Function forest_predict() is up to 10X faster when type='response'
- Allow user-specified sample ratio to train with a small subsample
* Gaussian Naive Bayes: allow continuous variables
* K-Means: Allow user-specified sample ratio for K-means++ seeding
* Miscellaneous
- Array functions: array_square() for element-wise square, madlib.sum()
for array element-wise aggregation
- Madpack does not require password when not necessary (MADLIB-357)
- Platform support of PostgreSQL 9.4 and HAWQ 1.3
- Allow views and materialized views for training functions
- Support quantile computation in summary functions for HAWQ and PG 9.4
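The two array helpers mentioned above can be pictured with the following Python sketch. It only mirrors the described behavior (element-wise square, and an aggregate that sums arrays position by position); the Python function elementwise_sum is a hypothetical stand-in for the SQL aggregate, and the equal-length check is an assumption of this sketch.

```python
def array_square(arr):
    """Element-wise square of a numeric array."""
    return [v * v for v in arr]

def elementwise_sum(arrays):
    """Sum a collection of equal-length arrays position by position."""
    arrays = list(arrays)
    if not arrays:
        return []
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must have the same length")
    return [sum(column) for column in zip(*arrays)]
```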
Bug fixes:
- Fixed the support of multiple parameter values and NULL in general
cross-validation (MADLIB-898, MADLIB-896)
- Fixed infinite loop when detecting recursive view-to-view dependencies for
upgrading (MADLIB-901)
- Allow user-specified column names in PCA and multinom_predict()
Known issues:
- Performance for decision tree with cross-validation is poor on a HAWQ
multi-node system.
--------------------------------------------------------------------------------
MADlib v1.7
Release Date: 2014-December-31
New features:
* Generalized Linear Model:
- Added a new generic module for GLM functions that allow for response
variables that have arbitrary distributions (rather than simply
Gaussian distributions), and for an arbitrary function of the response
variable (the link function) to vary linearly with the predicted values
(rather than assuming that the response itself must vary linearly).
- Available distribution families: gaussian (link functions: identity,
inverse and log), binomial (link functions: probit and logit),
poisson (link functions: log, identity and square-root), gamma (link
functions: inverse, identity and log) and inverse gaussian (link functions:
square-inverse, inverse, identity and log).
- Deprecated 'mlogregr_train' in favor of 'multinom' available as part of
the new GLM functionality.
- Added a new 'ordinal' function for ordered logit and probit regression.
* Decision Tree: Reimplemented the decision tree module, which includes the
following changes:
- Improved usability due to a new interface.
- Performance enhancements: up to 40 times faster than the old interface.
- Additional features like pruning methods, surrogate variables for
NULL handling, cross validation, and various new tree tuning parameters.
- Addition of a new display function to visualize the trained tree and new
prediction function for scoring of new datasets.
* Random Forest: Reimplemented the random forest module, which includes the
following changes:
- New random forest module based on the new decision tree module.
- Better variable importance metrics and ability to explore each tree
in the forest independently.
- Ability to get class probabilities of all classes and not just the max
class during prediction.
- Improved visualization with export capabilities using Graphviz dot format.
* PMML:
- Upgraded compatible PMML version to 4.1.
- Moved PMML export out of early stage development with new functionality
available to export GLM, decision tree, and random forest models.
* Updated Eigen from 3.1.2 to 3.2.2.
* Updated PyXB from 1.2.3 to 1.2.4.
* Added finer granularity control for running specific install-check tests.
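To make the GLM description above concrete: a link function maps the mean of the response to the linear predictor, and its inverse maps the linear predictor back to a predicted mean. The Python sketch below uses the standard definitions of the links named above. The dictionary keys and the function predict_mean are hypothetical labels for this illustration, not MADlib parameter values.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# name -> (link: mean to linear predictor, inverse link: linear predictor to mean)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "sqrt": (math.sqrt, lambda eta: eta * eta),
    "sqr_inverse": (lambda mu: 1.0 / (mu * mu), lambda eta: 1.0 / math.sqrt(eta)),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}

def predict_mean(link, coef, x):
    """Predicted mean for features x: inverse link applied to the dot product."""
    eta = sum(c * v for c, v in zip(coef, x))
    _, inverse_link = LINKS[link]
    return inverse_link(eta)
```

With the identity link this reduces to ordinary linear prediction; with the logit link it gives the familiar logistic regression probability.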
Bug fixes:
- Fixed bug in K-means allowing use of user-defined metric functions
(MADLIB-874, MADLIB-875).
- Fixed issues related to header files included in the build system
(MADLIB-855, MADLIB-879, MADLIB-884).
Known issues:
- Performance for decision tree with cross-validation is poor on a HAWQ
multi-node system.
--------------------------------------------------------------------------------
MADlib v1.6
Release Date: 2014-June-30
New features:
- Added a new unified 'margins' function that computes marginal effects for
linear, logistic, multilogistic, and Cox proportional hazards regression. The
new function also introduces support for interaction terms in the independent
array.
- Updated convergence for 'elastic_net_train' by checking the change in the
loglikelihood instead of the l2-norm of the change in coefficients. This allows
for faster convergence in problems with multiple optimal solutions.
The default threshold for convergence has been reduced from 1e-4 to 1e-6.
- Added a new helper function to convert categorical variables to indicator
variables which can be used directly in regression methods. The function
currently only supports dummy encoding.
- Improved performance for Cox proportional hazards: average improvement of
20-fold on GPDB and 2.5-fold on HAWQ.
- Improved performance on ARIMA by 30%.
- Added new functionality to export linear and logistic regression models as a
PMML object. The new module relies on PyXB to create PMML elements.
- Added a function ('array_scalar_add') to 'add' a scalar to an array.
- Added 'numeric' type support for all functions that take 'anyarray' as
argument.
- Made usability and aesthetic enhancements to documentation.
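As an illustration of the categorical-to-indicator conversion mentioned above, the Python sketch below builds one 0/1 column per level of a categorical column. The function name, the column-naming scheme, and the drop_reference option are assumptions of this sketch; the notes do not say whether MADlib's helper drops a reference level.

```python
def dummy_encode(values, prefix, drop_reference=False):
    """Return (column_names, rows) of 0/1 indicators for a categorical column.

    drop_reference is a hypothetical option: when True, the first level
    (in sorted order) is omitted and acts as the baseline.
    """
    levels = sorted(set(values))
    if drop_reference:
        levels = levels[1:]
    names = [f"{prefix}_{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return names, rows
```

For example, dummy_encode(["red", "blue", "red"], "color") yields the columns color_blue and color_red, which can be used directly as independent variables in a regression.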
Bug Fixes:
- Prepended python module name to sys.path before executing madlib function
to avoid conflicts with user-defined modules.
- Added a check in K-Means to ensure the dimensionality of all data points is
the same and also equals the dimensionality of any provided initial centroids
(MADLIB-713, MADLIB-789).
- Added a check in multinomial regression to quit early and cleanly if model
size is greater than the maximum permissible memory (MADLIB-667).
- Fixed a minor bug with incorrect column names in the decision trees module
(MADLIB-763).
- Fixed a bug in K-means that resulted in incorrect number of centroids for
particular datasets (MADLIB-857).
- Fixed bug when grouping columns have same name as one of the output table
column names (MADLIB-833).
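The K-Means dimensionality check described above amounts to the validation sketched below in Python. This is an illustration of the check only, with a hypothetical function name and error messages, not the code used in MADlib.

```python
def check_dimensions(points, centroids=None):
    """Return the common dimensionality, or raise ValueError on any mismatch."""
    points = list(points)
    if not points:
        raise ValueError("no data points provided")
    dim = len(points[0])
    for index, point in enumerate(points):
        if len(point) != dim:
            raise ValueError(f"point {index} has dimension {len(point)}, expected {dim}")
    # Initial centroids are optional, but must match the data when supplied.
    for index, centroid in enumerate(centroids or []):
        if len(centroid) != dim:
            raise ValueError(f"centroid {index} has dimension {len(centroid)}, expected {dim}")
    return dim
```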
Deprecated Functions:
- Modules profile and quantile have been deprecated in favor of the 'summary'
function.
- Module 'svd_mf' has been deprecated in favor of the improved 'svd' function.
- Functions 'margins_logregr' and 'margins_mlogregr' have been deprecated in
favor of the 'margins' function.
--------------------------------------------------------------------------------
MADlib v1.5
Release Date: 2014-Mar-05
New features:
- Added a new port 'HAWQ'. MADlib can now be used with the Pivotal
Distribution of Hadoop (PHD) through HAWQ
(see http://www.gopivotal.com/big-data/pivotal-hd for more details).
- Implemented performance improvements for linear and logistic predict functions.
- Moved Conditional Random Fields (CRFs) out of early stage development, and
updated the design and APIs to enable ease of use and better functionality.
API changes include lincrf replaced by lincrf_train, crf_train_fgen and
crf_test_fgen with updated arguments, and format of segment tables.
- Improved linear support vector machines (SVMs) by enabling iterations, and
removed lsvm_predict and svm_predict, which are not useful in GPDB and HAWQ.
- Added new functions, with improved performance compared to svec_sfv, for
document vectorization into sparse vectors.
- Removed the bool-to-text cast and updated all functions depending on it to
explicitly convert variable to text.
- Added function properties for all SQL functions to allow the database optimizer
to make better plans.
Bug Fixes:
- Set client_min_messages to 'notice' during database installation to ensure
that log messages don't get logged to STDERR.
- Fixed elastic net prediction to predict using all features instead of just
the selected features to avoid an error when no feature is selected as relevant
in the trained model.
- Fixed quantile values for the corner probabilities p=0 and p=1 in Bernoulli
and binomial distributions to be 0 and num_of_trials (=1 in the case of
Bernoulli) respectively, independent of the probability of success.
- Changed install script to explicitly use /bin/bash instead of /bin/sh to avoid
problems in Ubuntu where /bin/sh is linked to 'dash'.
- Fixed issue in Elastic Net to take any array expression as input instead of
specifically expecting the expression 'ARRAY[...]'.
- Fixed wrong output in percentile of count-min (CM) sketches.
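The corner-case rule for Bernoulli and binomial quantiles noted in the list above can be sketched in Python as follows. The function names are hypothetical, and the interior case uses the usual definition (smallest k whose cumulative probability reaches p), which is an assumption about the intended behavior rather than MADlib's implementation.

```python
import math

def binomial_quantile(p, num_of_trials, prob_success):
    """Smallest k with CDF(k) >= p; p=0 and p=1 are handled up front."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p == 0.0:
        return 0  # independent of prob_success
    if p == 1.0:
        return num_of_trials  # independent of prob_success
    cdf = 0.0
    for k in range(num_of_trials + 1):
        cdf += (
            math.comb(num_of_trials, k)
            * prob_success ** k
            * (1.0 - prob_success) ** (num_of_trials - k)
        )
        if cdf >= p:
            return k
    return num_of_trials  # guards against floating-point shortfall in the sum

def bernoulli_quantile(p, prob_success):
    """A Bernoulli variable is a binomial with a single trial."""
    return binomial_quantile(p, 1, prob_success)
```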
Known issues:
- Elastic net prediction wrapper function elastic_net_prediction is not
available in HAWQ. Instead, prediction functionality is available for both
families via elastic_net_gaussian_predict and elastic_net_binomial_predict.
- Distance metrics functions in K-Means for the HAWQ port are restricted to the
in-built functions, specifically squaredDistNorm2, distNorm2, distNorm1,
distAngle, and distTanimoto.
- Functions in Quantile and Profile modules of Early Stage Development are not
available in HAWQ. Replacement of these functions is available as built-in
functions (percentile_cont) in HAWQ and Summary module in MADlib, respectively.
--------------------------------------------------------------------------------
MADlib v1.4.1
Release Date: 2013-Dec-13
Bug Fixes:
- Fixed problem in Elastic Net for 'binomial' family if an 'integer' column was
passed for dependent variable instead of a 'boolean' column.
- '*' support in Elastic Net lacked checks for the columns being combined. Now
we check if the column for '*' is already an array, in which case we don't wrap
it with an 'array' modifier. If there are multiple columns we check that they
are of the same numeric type before building an array.
- Fixed a software regression in Robust Variance, Clustered Variance and
Marginal Effects for multinomial regression introduced in v1.4 when
output table name is schema-qualified.
- We now also support schema-qualified output table prefixes for SVD and PCA.
- Added warning message when deprecated functions are run. Also added a list of
deprecated functions in the ReadMe.
- Added a Markdown Readme along with the text version for better rendering on
Github.
--------------------------------------------------------------------------------
MADlib v1.4
Release Date: 2013-Nov-25
New Features:
* Improved interface for Multinomial logistic regression:
- Added a new interface that accepts an 'output_table' parameter and
stores the model details in the output table instead of returning as a struct
data type. The updated function also builds a summary table that includes
all parameters and meta-parameters used during model training.
- The output table has been reformatted to present the model coefficients
and related metrics for each category in a separate row. This replaces the
old output format of model stats for all categories combined in a
single array.
* Variance Estimators
- Added Robust Variance estimator for Cox PH models (Lin and Wei, 1989).
It is useful in calculating variances in a dataset with potentially
noisy outliers. Namely, the standard errors are asymptotically normal even
if the model is wrong due to outliers.
- Added Clustered Variance estimator for Cox PH models. It is used
when the data contains extra clustering information besides covariates, and
its estimates are asymptotically normal.
* NULL Handling:
- Modified behavior of regression modules to 'omit' rows containing NULL
values for any of the dependent and independent variables. The number of
rows skipped is provided as part of the output table.
This release includes NULL handling for the following modules:
- Linear, Logistic, and Multinomial logistic regression, as well as
Cox Proportional Hazards
- Huber-White sandwich estimators for linear, logistic, and multinomial
logistic regression as well as Cox Proportional Hazards
- Clustered variance estimators for linear, logistic, and multinomial
logistic regression as well as Cox Proportional Hazards
- Marginal effects for logistic and multinomial logistic regression
Deprecated functions:
- Multinomial logistic regression function has been renamed to
'mlogregr_train'. Old function ('mlogregr') has been deprecated,
and will be removed in the next major version update.
- For all multinomial regression estimator functions (list given below),
changes in the argument list were made to collate all optimizer specific
arguments in a single string. An example of the new optimizer parameter is
'max_iter=20, optimizer=irls, precision=0.0001'.
This is in contrast to the original argument list that contained 3 arguments:
'max_iter', 'optimizer', and 'precision'. This change allows adding new
optimizer-specific parameters without changing the argument list.
Affected functions:
- robust_variance_mlogregr
- clustered_variance_mlogregr
- margins_mlogregr
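The single-string optimizer argument described above is a comma-separated list of name=value pairs. A minimal Python sketch of how such a string could be split into typed values is shown here; the parser and its converter table are hypothetical and only cover the three parameter names that appear in the example string.

```python
# Hypothetical converter table covering only the names in the example above.
CONVERTERS = {"max_iter": int, "optimizer": str, "precision": float}

def parse_optimizer_params(param_str):
    """Parse 'name=value, name=value' into a dict with typed values."""
    params = {}
    for item in param_str.split(","):
        item = item.strip()
        if not item:
            continue
        name, separator, value = item.partition("=")
        name, value = name.strip(), value.strip()
        if not separator or name not in CONVERTERS:
            raise ValueError(f"unrecognized optimizer parameter: {item!r}")
        params[name] = CONVERTERS[name](value)
    return params
```

Because new names only require a new entry in the converter table, this shape shows why a single string lets optimizer-specific parameters be added without changing a function's argument list.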
Bug Fixes:
- Fixed an overflow problem in LDA by using INT64 instead of INT32.
- Fixed integer to boolean cast bug in clustered variance for logistic
regression. After this fix, integer columns are accepted for binary
dependent variable using the 'integer to bool' cast rules.
- Fixed two bugs in SVD:
- The 'example' option for online help has been fixed
- Column names for sparse input tables in the 'svd_sparse' and
'svd_sparse_native' functions are no longer restricted to 'row_id',
'col_id' and 'value'.
--------------------------------------------------------------------------------
MADlib v1.3
Release Date: 2013-October-03
New Features:
* Cox Proportional Hazards:
- Added stratification support for Cox PH models. Stratification is used as
shorthand for building a Cox model that allows for more than one stratum,
and hence, allows for more than one baseline hazard function.
Stratification provides two pieces of key, flexible functionality for the
end user of Cox models:
-- Allows a categorical variable Z to be appropriately accounted for in
the model without estimating its predictive impact on the response
variable.
-- Categorical variable Z is predictive/associated with the response
variable, but Z may not satisfy the proportional hazards assumption
- Added a new function (cox_zph) that tests the proportional hazards
assumption of a Cox model. This allows the user to build Cox models and then
verify the relevance of the model.
* NULL Handling:
- Modified behavior of linear and logistic regression to 'omit' rows
containing NULL values for any of the dependent and independent variables.
The number of rows skipped is provided as part of the output table.
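The 'omit' behavior described above (drop any row with a NULL in a dependent or independent variable, and report how many rows were dropped) can be sketched in Python as follows, with None standing in for SQL NULL. The function name and the dictionary-based row format are assumptions of this sketch.

```python
def omit_null_rows(rows, columns):
    """Return (kept_rows, num_rows_skipped) after dropping rows with a None
    in any of the named columns."""
    kept = [row for row in rows if all(row.get(col) is not None for col in columns)]
    return kept, len(rows) - len(kept)
```

For example, with rows [{"y": 1.0, "x": 2.0}, {"y": None, "x": 3.0}] and columns ["y", "x"], one row is kept and the skipped count is 1.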
Deprecated functions:
- Cox Proportional Hazard function has been renamed to 'coxph_train'.
Old function names ('cox_prop_hazards' and 'cox_prop_hazards_regr')
have been deprecated, and will be removed in the next major version update.
- The aggregate form of linear regression ('linregr') has been deprecated.
The stored-procedure form ('linregr_train') should be used instead.
Bug Fixes:
- Fixed a memory leak in the Apriori algorithm.
--------------------------------------------------------------------------------
MADlib v1.2
Release Date: 2013-September-06