-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MLDB-1927 base version of union dataset #682
base: master
Are you sure you want to change the base?
Conversation
['_rowName', 'colA', 'colB'], | ||
['[]-[row1]', None, 'B'], | ||
['[row1]-[]', 'A', None] | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simlmx c'est ce que ça donne
I think that the row names should be much simpler. Just a two element path with the dataset number and row name from the dataset. The addition of brackets etc makes it hard and expensive to process them. |
builtin/union_dataset.cc
Outdated
throw ML::Exception("Row not known"); | ||
} | ||
|
||
virtual MatrixNamedRow getRow(const RowName & rowName) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should implement getRoqExpr instead. Get row is deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I mark getRowExpr as override the compiler says it doesn't override anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to implement it in the dataset, not in the matrix view
} | ||
|
||
virtual vector<RowHash> | ||
getRowHashes(ssize_t start = 0, ssize_t limit = -1) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is no longer needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a failing test case if I don't implement it properly.
SELECT * FROM merge(union(ds1, ds2), ds3)
ORDER BY rowName() LIMIT 1
builtin/union_dataset.cc
Outdated
vector<std::tuple<RowName, CellValue> > res; | ||
for (int i = 0; i < datasets.size(); ++i) { | ||
const auto & d = datasets[i]; | ||
string prefix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be terribly slow unless we avoid the need to copy and manipulate strings
I'm agree for much simpler. I though it was going to be datasetName.rowName Le jeudi 22 septembre 2016, Jeremy Barnes [email protected] a
François Maillet |
builtin/union_dataset.cc
Outdated
/** -*- C++ -*- | ||
* union_dataset.cc | ||
* Mich, 2016-09-14 | ||
* This file is part of MLDB. Copyright 2015 Datacratic. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2015 -> 2016
struct UnionRowStream : public RowStream { | ||
|
||
UnionRowStream(const UnionDataset::Itl* source) : source(source) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it
is not initialized...
|
||
virtual RowName next() | ||
{ | ||
uint64_t hash = (*it).first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... boom
builtin/union_dataset.h
Outdated
/** -*- C++ -*- | ||
* union_dataset.h | ||
* Mich, 2016-09-14 | ||
* This file is part of MLDB. Copyright 2015 Datacratic. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2015 -> 2016
Creating a union dataset is equivalent to the following SQL: | ||
|
||
```sql | ||
SELECT * FROM (SELECT * FROM ds1 ) AS s1 OUTER JOIN (SELECT * FROM ds2) AS s2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? Don't you miss ON False
?
@@ -0,0 +1,29 @@ | |||
# Union Dataset | |||
|
|||
The union dataset allows for rows from multiple datasets to be concatenated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think appended is better here than concatenated.
|
||
@classmethod | ||
def setUpClass(cls): | ||
ds = mldb.create_dataset({'id' : 'ds1', 'type' : 'sparse.mutable'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would test with at least two rows and test with dataset with matching column names. Also, if we expect the result to be equivalent to a JOIN than I would also test that.
Regarding the names, if we use the dataset names then what would we do with datasets that have no clear names (e.g. sub-select, merge, etc)? I personally like the positional naming. |
} | ||
|
||
virtual vector<RowHash> | ||
getRowHashes(ssize_t start = 0, ssize_t limit = -1) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remove it it complains about the "pure virtual". Should I remove from the code base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange... leave it in then, we can address that elsewhere
@guyd we should use positional naming IMO. In general, a dataset could even possibly have multiple names, so we can't use that. Algorithm for taking a row name from sub-dataset n and converting to rowname in union dataset:
Algorithm for finding which dataset and rowname a row in the union dataset belongs to:
|
I agree that having the row name with the dataset index as a prefix leads to easier algorithms than the row name with the brackets. |
I updated the rowNames. Note that the positional naming was coherent with the joins. |
Can someone give me an example where the following Itl functions are called:
I ran my tests with throw/cerr and they never seem to be called in my use cases. I would like to add test to cover them. Also, Are we due to refactor the dataset interface? |
…dataset Conflicts: testing/testing.mk
builtin/union_dataset.cc
Outdated
struct UnionDataset::Itl | ||
: public MatrixView, public ColumnIndex { | ||
|
||
map<RowHash, pair<int, RowHash> > rowIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very likely slow. Is it acceptable for the current version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should at least use a ML::Lightweight_Hash; the interface is basically the same.
knownRowHash may not be used at all. |
|
||
UnionRowStream(const UnionDataset::Itl* source) : source(source) | ||
{ | ||
cerr << "UNIMPLEMENTED " << __FILE__ << ":" << __LINE__ << endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be implemented and tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or at the very least, it should throw that it's unimplemneted not silently do the wrong thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called, but the result is unused. So I need to implement it, but it has no effect.
builtin/union_dataset.cc
Outdated
return false; | ||
} | ||
|
||
virtual ColumnName getColumnName(ColumnHash columnHash) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need to implement that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a pure virtual in the base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can at least enhance that implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
builtin/union_dataset.cc
Outdated
if (rowName.size() < 2) { | ||
return false; | ||
} | ||
string idxStr = (*(rowName.begin())).toUtf8String().rawString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOOOO! it's horrible to allocate strings for that! Use rowName.at(0).requireIndex() instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
builtin/union_dataset.cc
Outdated
auto columnNames = d->getColumnNames(); | ||
preResult.insert(columnNames.begin(), columnNames.end()); | ||
} | ||
return vector<ColumnName>(preResult.begin(), preResult.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they need to be de-duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I store them in a std::set before copying them to a std::vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, true
builtin/union_dataset.cc
Outdated
return false; | ||
} | ||
return datasets[idx]->getMatrixView()->knownRow( | ||
Path(rowName.begin() + 1, rowName.end())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use rowName.tail() instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also have expected that ds1 UNION ds2 would work, not just union(ds1, ds2). The first is standard SQL, the second is not.
the "standard SQL UNION" you refer to is our "merge" operator in fact. I agree that this is misleading and we should find a new name for the "union" operator of the current PR. |
butting - verb (used with object)to place or join the ends (of two things) together; set end-to-end. |
Ok, we stick with union. I'll drop the "union()" operator and create the "UNION" key word. |
SQL UNION is the same as our UNION. If you take a union of 5 datasets of 1000 rows, you have 5,000 rows. That's true for UNION but not for MERGED. |
@jeremybarnes changes applied.
|
…dataset Conflicts: testing/testing.mk
The |
builtin/union_dataset.cc
Outdated
int getIdxFromRowName(const RowName & rowName) const { | ||
// Returns idx > -1 if the index is valid, -1 otherwise | ||
if (rowName.size() < 2) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
false = 0? Don't you mean -1? I would probably throw an exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
return false; | ||
} | ||
int idx = static_cast<int>(rowName.at(0).toIndex()); | ||
if (idx > datasets.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if idx < -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then -1 is returned.
|
||
struct UnionRowStream : public RowStream { | ||
|
||
UnionRowStream(const UnionDataset::Itl* source) : source(source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be called as soon as someone uses it in the wild, eg to train a random forest classifier.
builtin/union_dataset.cc
Outdated
} | ||
|
||
// DEPRECATED | ||
virtual MatrixNamedRow getRow(const RowName & rowName) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should simply not implement this method. Then where it's used the Dataset will take care of emulating it. Otherwise we will get crashes still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a pure virtual in the base class. I have to implement it.
builtin/union_dataset.cc
Outdated
throw ML::Exception("Row not known"); | ||
} | ||
const auto & idxAndHash = it->second; | ||
return datasets[idxAndHash.first]->getMatrixView()->getRowName(idxAndHash.second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FinchPowers this is where the problem is. getRowName() should return the row name including the index. So this should be something more like
PathBuilder builder;
builder.add(idxAndHash.second);
Path subRowName = datasets[idxAndHash.first]->getMatrixView()->getRowName(idxAndHash.second);
builder.add(subRowName, 0, subRowName.size());
return builder.extract();
builtin/union_dataset.cc
Outdated
try { | ||
return d->getMatrixView()->getColumnName(columnHash); | ||
} | ||
catch (const ML::Exception & exc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!!! we shouldn't ever have a try/catch unconditional like that. Instead, we should use knownColumn
…dataset Conflicts: sql/sql_expression.cc sql/sql_expression.h
No description provided.