MLDB-1927 base version of union dataset #682

FinchPowers · 2016-09-22T19:55:38Z

No description provided.

FinchPowers · 2016-09-22T19:58:21Z

testing/union_dataset_test.py

+            ['_rowName', 'colA', 'colB'],
+            ['[]-[row1]', None, 'B'],
+            ['[row1]-[]', 'A', None]
+        ])


@simlmx this is what it produces

jeremybarnes · 2016-09-22T21:42:51Z

I think that the row names should be much simpler. Just a two element path with the dataset number and row name from the dataset. The addition of brackets etc makes it hard and expensive to process them.

jeremybarnes · 2016-09-22T21:54:06Z

builtin/union_dataset.cc

+        throw ML::Exception("Row not known");
+    }
+
+    virtual MatrixNamedRow getRow(const RowName & rowName) const


You should implement getRowExpr instead. getRow is deprecated.

If I mark getRowExpr as override the compiler says it doesn't override anything.

You need to implement it in the dataset, not in the matrix view

jeremybarnes · 2016-09-22T22:09:10Z

builtin/union_dataset.cc

+    }
+
+    virtual vector<RowHash>
+    getRowHashes(ssize_t start = 0, ssize_t limit = -1) const


This method is no longer needed

I have a failing test case if I don't implement it properly.

SELECT * FROM merge(union(ds1, ds2), ds3) ORDER BY rowName() LIMIT 1

jeremybarnes · 2016-09-22T22:10:27Z

builtin/union_dataset.cc

+        vector<std::tuple<RowName, CellValue> > res;
+        for (int i = 0; i < datasets.size(); ++i) {
+            const auto & d = datasets[i];
+            string prefix;


This is going to be terribly slow unless we avoid the need to copy and manipulate strings

mailletf · 2016-09-23T12:30:53Z

I agree that it should be much simpler. I thought it was going to be datasetName.rowName

François Maillet
Head of Machine Learning | Chef d'équipe apprentissage machine
MLDB.ai - Datacratic.com

guyd · 2016-09-23T12:17:52Z

builtin/union_dataset.cc

+/**                                                                 -*- C++ -*-
+ * union_dataset.cc
+ * Mich, 2016-09-14
+ * This file is part of MLDB. Copyright 2015 Datacratic. All rights reserved.


2015 -> 2016

guyd · 2016-09-23T12:21:00Z

builtin/union_dataset.cc

+    struct UnionRowStream : public RowStream {
+
+        UnionRowStream(const UnionDataset::Itl* source) : source(source)
+        {


it is not initialized...

guyd · 2016-09-23T12:21:15Z

builtin/union_dataset.cc

+
+        virtual RowName next()
+        {
+            uint64_t hash = (*it).first;


guyd · 2016-09-23T12:26:53Z

builtin/union_dataset.h

+/**                                                                 -*- C++ -*-
+ * union_dataset.h
+ * Mich, 2016-09-14
+ * This file is part of MLDB. Copyright 2015 Datacratic. All rights reserved.


2015 -> 2016

guyd · 2016-09-23T12:30:00Z

container_files/public_html/doc/builtin/datasets/UnionDataset.md

+Creating a union dataset is equivalent to the following SQL:
+
+```sql
+SELECT * FROM (SELECT * FROM ds1 ) AS s1 OUTER JOIN (SELECT * FROM ds2) AS s2


Really? Don't you miss ON False?
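To illustrate the question above, here is a minimal sketch of a full outer join over toy in-memory datasets. It is not MLDB code: `full_outer_join`, the dict-based datasets, and the `on` predicate are all hypothetical names chosen for illustration. With a predicate that is always false, no pair of rows ever matches, so every input row comes out exactly once, which is the behaviour a union needs.

```python
def full_outer_join(left, right, on):
    """Toy full outer join. Datasets are dicts of row name -> {column: value}.

    Returns a list of (left_row_name, right_row_name, merged_columns).
    A missing side is reported as None, and its columns are simply absent.
    """
    rows = []
    matched_right = set()
    for left_name, left_row in left.items():
        matched = False
        for right_name, right_row in right.items():
            if on(left_row, right_row):
                matched = True
                matched_right.add(right_name)
                rows.append((left_name, right_name, {**left_row, **right_row}))
        if not matched:
            # Unmatched left row: kept once, with no right-hand side.
            rows.append((left_name, None, dict(left_row)))
    for right_name, right_row in right.items():
        if right_name not in matched_right:
            # Unmatched right row: kept once, with no left-hand side.
            rows.append((None, right_name, dict(right_row)))
    return rows


# Hypothetical data mirroring the test in this PR: one row per dataset.
ds1 = {"row1": {"colA": "A"}}
ds2 = {"row1": {"colB": "B"}}
joined = full_outer_join(ds1, ds2, on=lambda l, r: False)
```

With these inputs the join yields one row per source row, each carrying only the columns of the dataset it came from.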

guyd · 2016-09-23T12:30:40Z

container_files/public_html/doc/builtin/datasets/UnionDataset.md

@@ -0,0 +1,29 @@
+# Union Dataset
+
+The union dataset allows for rows from multiple datasets to be concatenated


I think appended is better here than concatenated.

guyd · 2016-09-23T12:34:12Z

testing/union_dataset_test.py

+
+    @classmethod
+    def setUpClass(cls):
+        ds = mldb.create_dataset({'id' : 'ds1', 'type' : 'sparse.mutable'})


I would test with at least two rows, and with datasets that have matching column names. Also, if we expect the result to be equivalent to a JOIN, then I would test that as well.

guyd · 2016-09-23T12:58:17Z

Regarding the names, if we use the dataset names then what would we do with datasets that have no clear names (e.g. sub-select, merge, etc)? I personally like the positional naming.

FinchPowers · 2016-09-23T13:22:53Z

builtin/union_dataset.cc

+    }
+
+    virtual vector<RowHash>
+    getRowHashes(ssize_t start = 0, ssize_t limit = -1) const


If I remove it, the compiler complains about the "pure virtual". Should I remove it from the code base?

strange... leave it in then, we can address that elsewhere

jeremybarnes · 2016-09-23T15:40:48Z

@guyd we should use positional naming IMO. In general, a dataset could even possibly have multiple names, so we can't use that.

Algorithm for taking a row name from sub-dataset n and converting to rowname in union dataset:

Prepend n to the path. So if it was already x.y, it becomes n.x.y.

Algorithm for finding which dataset and rowname a row in the union dataset belongs to:

Pop the first element off the front; that gives the dataset number. The rest of the path gives the rowname in that dataset.
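The two algorithms above can be sketched as follows. This is a minimal illustration that models a path as a plain Python tuple; `to_union_row_name` and `from_union_row_name` are hypothetical names, not functions from the MLDB code base.

```python
from typing import Tuple

# Assumption: a path is modelled as a tuple of elements, e.g. ("x", "y") for x.y.
Path = Tuple[object, ...]


def to_union_row_name(dataset_index: int, row_name: Path) -> Path:
    """Prepend the dataset index: x.y from sub-dataset n becomes n.x.y."""
    return (dataset_index,) + tuple(row_name)


def from_union_row_name(union_row_name: Path) -> Tuple[int, Path]:
    """Pop the first element to get the dataset number; the rest is the row name."""
    if len(union_row_name) < 2:
        raise ValueError("a union row name needs an index and a row name")
    index = union_row_name[0]
    if isinstance(index, bool) or not isinstance(index, int) or index < 0:
        raise ValueError("the first path element must be a non-negative index")
    return index, tuple(union_row_name[1:])


union_name = to_union_row_name(2, ("x", "y"))
index, original = from_union_row_name(union_name)
```

Both directions only touch the first path element, so no string has to be built or parsed to find the source dataset.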

guyd · 2016-09-23T17:01:49Z

I agree that having the row name with the dataset index as a prefix leads to easier algorithms than the row name with the brackets.

FinchPowers · 2016-09-23T18:12:45Z

I updated the rowNames. Note that the positional naming was coherent with the joins.

FinchPowers · 2016-09-23T20:56:00Z

Can someone give me an example where the following Itl functions are called:

knownRowHash
getRow
rowName

I ran my tests with throw/cerr and they never seem to be called in my use cases. I would like to add tests to cover them.

Also, getRowStream is called, but the result never seems to be used; otherwise it would crash, since it is not implemented properly.

Are we due to refactor the dataset interface?

FinchPowers · 2016-09-27T13:31:51Z

builtin/union_dataset.cc

+struct UnionDataset::Itl
+    : public MatrixView, public ColumnIndex {
+
+    map<RowHash, pair<int, RowHash> > rowIndex;


Very likely slow. Is it acceptable for the current version?

You should at least use a ML::Lightweight_Hash; the interface is basically the same.

jeremybarnes · 2016-09-27T14:21:14Z

knownRowHash may not be used at all.
getRowStream is called when running SQL queries over the dataset. @Steadtler has more details.
Yes, we are due to refactor the Dataset interface, but that's not really user visible, so it's hard to justify spending the time until we know it's blocking something that we need to do.

jeremybarnes · 2016-09-28T17:40:27Z

builtin/union_dataset.cc

+
+        UnionRowStream(const UnionDataset::Itl* source) : source(source)
+        {
+            cerr << "UNIMPLEMENTED " << __FILE__ << ":" << __LINE__ << endl;


This should be implemented and tested.

Or at the very least, it should throw to say that it's unimplemented, not silently do the wrong thing.

It is called, but the result is unused. So I need to implement it, but it has no effect.

jeremybarnes · 2016-09-28T17:41:47Z

builtin/union_dataset.cc

+        return false;
+    }
+
+    virtual ColumnName getColumnName(ColumnHash columnHash) const


Do we actually need to implement that?

Yes, this is a pure virtual in the base class.

I can at least enhance that implementation.

jeremybarnes · 2016-09-28T17:45:13Z

builtin/union_dataset.cc

+        if (rowName.size() < 2) {
+            return false;
+        }
+        string idxStr = (*(rowName.begin())).toUtf8String().rawString();


NOOOO! it's horrible to allocate strings for that! Use rowName.at(0).requireIndex() instead.

jeremybarnes · 2016-09-28T17:45:40Z

builtin/union_dataset.cc

+            auto columnNames = d->getColumnNames();
+            preResult.insert(columnNames.begin(), columnNames.end());
+        }
+        return vector<ColumnName>(preResult.begin(), preResult.end());


No, they need to be de-duplicated.

That's why I store them in a std::set before copying them to a std::vector.

Ah, OK, true
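The de-duplication discussed here can be sketched as below, assuming each dataset is represented only by its list of column names. `union_column_names` is a hypothetical helper; sorting the result mirrors the ordered iteration of a std::set.

```python
def union_column_names(datasets_columns):
    """Collect column names from every dataset, keeping each name once."""
    seen = set()
    for columns in datasets_columns:
        seen.update(columns)
    # A std::set iterates in sorted order, so sort to mimic that here.
    return sorted(seen)


names = union_column_names([["colA", "colB"], ["colB", "colC"]])
```

A column shared by several datasets, such as colB above, appears a single time in the result.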

jeremybarnes · 2016-09-28T17:47:11Z

builtin/union_dataset.cc

+            return false;
+        }
+        return datasets[idx]->getMatrixView()->knownRow(
+            Path(rowName.begin() + 1, rowName.end()));


Use rowName.tail() instead.

jeremybarnes

I would also have expected that ds1 UNION ds2 would work, not just union(ds1, ds2). The first is standard SQL, the second is not.

FinchPowers · 2016-09-28T18:02:10Z

The "standard SQL UNION" you refer to is in fact our "merge" operator. I agree that this is misleading and we should find a new name for the "union" operator of the current PR.

FinchPowers · 2016-09-28T18:08:08Z

butting - verb (used with object)

to place or join the ends (of two things) together; set end-to-end.

FinchPowers · 2016-09-28T18:25:01Z

OK, we'll stick with union. I'll drop the "union()" operator and create the "UNION" keyword.

jeremybarnes · 2016-09-28T19:07:41Z

SQL UNION is the same as our UNION. If you take a union of 5 datasets of 1000 rows, you have 5,000 rows. That's true for UNION but not for MERGED.
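The row-count argument can be checked with a toy model. The sketch below assumes, for illustration only, that a merge combines rows sharing a row name into one row, while a union keeps every input row under an (index, name) key. `union_rows` and `merge_rows` are hypothetical helpers, not MLDB functions, and real merge semantics may differ in how conflicts are handled.

```python
def union_rows(datasets):
    """Append rows: each input row becomes its own output row."""
    result = {}
    for index, dataset in enumerate(datasets):
        for row_name, row in dataset.items():
            result[(index, row_name)] = dict(row)
    return result


def merge_rows(datasets):
    """Toy merge: rows with the same name are combined into a single row."""
    result = {}
    for dataset in datasets:
        for row_name, row in dataset.items():
            result.setdefault(row_name, {}).update(row)
    return result


# Five datasets of 1000 rows each, all using the same row names.
datasets = [
    {f"row{r}": {f"col{d}": r} for r in range(1000)}
    for d in range(5)
]
union_count = len(union_rows(datasets))
merge_count = len(merge_rows(datasets))
```

Under these assumptions the union has 5,000 rows, while the merge has 1,000 rows with five columns each.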

FinchPowers · 2016-09-29T15:28:26Z

@jeremybarnes changes applied.

No more "union()" function.
UNION keyword supported.

jeremybarnes · 2016-10-17T17:11:40Z

The unionIndex parameter is just wrong. It leaks the detail of an internal dataset to the main index. Why oh why do we need to do this?

jeremybarnes · 2016-10-17T17:19:21Z

builtin/union_dataset.cc

+    int getIdxFromRowName(const RowName & rowName) const {
+        // Returns idx > -1 if the index is valid, -1 otherwise
+        if (rowName.size() < 2) {
+            return false;


false = 0? Don't you mean -1? I would probably throw an exception

jeremybarnes · 2016-10-17T17:20:23Z

builtin/union_dataset.cc

+            return false;
+        }
+        int idx = static_cast<int>(rowName.at(0).toIndex());
+        if (idx > datasets.size()) {


And if idx < -1?

Then -1 is returned.
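A sketch of the validation being discussed in the last two comments, assuming a row name is a tuple whose first element should be the dataset index. `get_idx_from_row_name` here is a Python stand-in written for illustration; it keeps the "return -1 when invalid" contract from the code comment, although raising an exception, as suggested above, would be a reasonable alternative.

```python
def get_idx_from_row_name(row_name, dataset_count):
    """Return the dataset index encoded in row_name, or -1 if it is invalid."""
    if len(row_name) < 2:
        # Too short to hold both an index and a row name: -1, not False (0).
        return -1
    idx = row_name[0]
    if isinstance(idx, bool) or not isinstance(idx, int):
        return -1
    # Valid indexes are 0 .. dataset_count - 1, so both bounds are checked.
    if idx < 0 or idx >= dataset_count:
        return -1
    return idx


valid = get_idx_from_row_name((1, "row1"), dataset_count=2)
too_big = get_idx_from_row_name((2, "row1"), dataset_count=2)
negative = get_idx_from_row_name((-5, "row1"), dataset_count=2)
```

Note that the sketch rejects an index equal to the number of datasets, since that position is one past the last valid element.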

jeremybarnes · 2016-10-17T17:21:38Z

builtin/union_dataset.cc

+
+    struct UnionRowStream : public RowStream {
+
+        UnionRowStream(const UnionDataset::Itl* source) : source(source)


This will be called as soon as someone uses it in the wild, eg to train a random forest classifier.

jeremybarnes · 2016-10-17T17:22:52Z

builtin/union_dataset.cc

+    }
+
+    // DEPRECATED
+    virtual MatrixNamedRow getRow(const RowName & rowName) const


You should simply not implement this method. Then where it's used the Dataset will take care of emulating it. Otherwise we will get crashes still.

It's a pure virtual in the base class. I have to implement it.

jeremybarnes · 2016-10-17T17:26:32Z

builtin/union_dataset.cc

+            throw ML::Exception("Row not known");
+        }
+        const auto & idxAndHash = it->second;
+        return datasets[idxAndHash.first]->getMatrixView()->getRowName(idxAndHash.second);


@FinchPowers this is where the problem is. getRowName() should return the row name including the index. So this should be something more like

    PathBuilder builder;
    builder.add(idxAndHash.second);
    Path subRowName = datasets[idxAndHash.first]->getMatrixView()->getRowName(idxAndHash.second);
    builder.add(subRowName, 0, subRowName.size());
    return builder.extract();

jeremybarnes · 2016-10-17T17:27:27Z

builtin/union_dataset.cc

+            try {
+                return d->getMatrixView()->getColumnName(columnHash);
+            }
+            catch (const ML::Exception & exc) {


!!! We should never have an unconditional try/catch like that. Instead, we should use knownColumn.
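The pattern suggested here, asking each dataset whether it knows the column before fetching the name, might look like the sketch below. `ToyDataset` and its methods are hypothetical stand-ins for the matrix view interface, written only to show the control flow.

```python
class ToyDataset:
    """Hypothetical stand-in holding a mapping of column hash -> column name."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def known_column(self, column_hash):
        return column_hash in self._columns

    def get_column_name(self, column_hash):
        return self._columns[column_hash]


def get_column_name(datasets, column_hash):
    """Return the name from the first dataset that knows the column."""
    for dataset in datasets:
        # Check first instead of catching whatever the lookup might throw.
        if dataset.known_column(column_hash):
            return dataset.get_column_name(column_hash)
    raise KeyError(f"Column not known: {column_hash}")


toy_datasets = [ToyDataset({1: "colA"}), ToyDataset({2: "colB"})]
found = get_column_name(toy_datasets, 2)
```

Checking up front keeps genuine errors visible, because nothing is swallowed by a catch-all handler.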

MLDB-1927 base version of union dataset

fe40df7

FinchPowers added the feedback label Sep 22, 2016

Enh test

62312a6

FinchPowers commented Sep 22, 2016

François-Michel L'Heureux added 4 commits September 22, 2016 19:58

Missing files

3b3ca27

builtin.mk update, minor enh, + tests

2fd424a

MLDB-1945 union operator

9b0a204

union dataset doc

b8a16eb

jeremybarnes reviewed Sep 22, 2016

guyd reviewed Sep 23, 2016

FinchPowers added Not for general review and removed feedback labels Sep 23, 2016

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

b301bce

…dataset

FinchPowers commented Sep 23, 2016

François-Michel L'Heureux added 2 commits September 23, 2016 13:36

more tests

bfbd142

added test, code notes

65c970b

rowName idx dot rowName vs square brackets

8459f83

François-Michel L'Heureux added 3 commits September 26, 2016 14:56

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

2c47387

…dataset Conflicts: testing/testing.mk

wip, rowIndex supported

7ef5eca

Union works with merge

bae35b4

FinchPowers commented Sep 27, 2016

FinchPowers removed the Not for general review label Sep 27, 2016

François-Michel L'Heureux added 5 commits September 27, 2016 15:37

Using lightweight hash

6d65755

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

673c7f8

…dataset

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

a5476a7

…dataset

upd test following merge, upd doc

6f58dfe

more tests

2f6103a

jeremybarnes reviewed Sep 28, 2016

jeremybarnes requested changes Sep 28, 2016

François-Michel L'Heureux added 2 commits September 29, 2016 15:25

sql UNION keyword

02b128e

Rm createUnionDatasetFn

aa18c9d

François-Michel L'Heureux added 2 commits October 3, 2016 14:43

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

5e52317

…dataset Conflicts: testing/testing.mk

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

46dcc9b

…dataset

jeremybarnes reviewed Oct 17, 2016

Merge branch 'master' of github.com:mldbai/mldb into MLDB-1927_union_…

96cf9b7

…dataset Conflicts: sql/sql_expression.cc sql/sql_expression.h
