MLDB-1927 base version of union dataset #682

Open
wants to merge 23 commits into
base: master
from

FinchPowers
Member

No description provided.

['_rowName', 'colA', 'colB'],
['[]-[row1]', None, 'B'],
['[row1]-[]', 'A', None]
])
Member Author

@simlmx this is what it produces

@jeremybarnes
Contributor

I think that the row names should be much simpler. Just a two element path with the dataset number and row name from the dataset. The addition of brackets etc makes it hard and expensive to process them.

throw ML::Exception("Row not known");
}

virtual MatrixNamedRow getRow(const RowName & rowName) const
Contributor

You should implement getRowExpr instead. getRow is deprecated.

Member Author

If I mark getRowExpr as override the compiler says it doesn't override anything.

Contributor

You need to implement it in the dataset, not in the matrix view

}

virtual vector<RowHash>
getRowHashes(ssize_t start = 0, ssize_t limit = -1) const
Contributor

This method is no longer needed

Member Author

I have a failing test case if I don't implement it properly.

 SELECT * FROM merge(union(ds1, ds2), ds3)
 ORDER BY rowName() LIMIT 1

vector<std::tuple<RowName, CellValue> > res;
for (int i = 0; i < datasets.size(); ++i) {
const auto & d = datasets[i];
string prefix;
Contributor

This is going to be terribly slow unless we avoid the need to copy and manipulate strings

@mailletf
Member

I agree that it should be much simpler. I thought it was going to be datasetName.rowName

On Thursday, September 22, 2016, Jeremy Barnes [email protected] wrote:

I think that the row names should be much simpler. Just a two element path
with the dataset number and row name from the dataset. The addition of
brackets etc makes it hard and expensive to process them.


François Maillet
Head of Machine Learning | Chef d'équipe apprentissage machine
MLDB.ai - Datacratic.com

/** -*- C++ -*-
* union_dataset.cc
* Mich, 2016-09-14
* This file is part of MLDB. Copyright 2015 Datacratic. All rights reserved.
Contributor

2015 -> 2016

struct UnionRowStream : public RowStream {

UnionRowStream(const UnionDataset::Itl* source) : source(source)
{
Contributor

@guyd guyd Sep 23, 2016

it is not initialized...


virtual RowName next()
{
uint64_t hash = (*it).first;
Contributor

@guyd guyd Sep 23, 2016

... boom

/** -*- C++ -*-
* union_dataset.h
* Mich, 2016-09-14
* This file is part of MLDB. Copyright 2015 Datacratic. All rights reserved.
Contributor

2015 -> 2016

Creating a union dataset is equivalent to the following SQL:

```sql
SELECT * FROM (SELECT * FROM ds1 ) AS s1 OUTER JOIN (SELECT * FROM ds2) AS s2
Contributor

Really? Aren't you missing ON false?

@@ -0,0 +1,29 @@
# Union Dataset

The union dataset allows for rows from multiple datasets to be concatenated
Contributor

@guyd guyd Sep 23, 2016

I think appended is better here than concatenated.


@classmethod
def setUpClass(cls):
ds = mldb.create_dataset({'id' : 'ds1', 'type' : 'sparse.mutable'})
Contributor

@guyd guyd Sep 23, 2016

I would test with at least two rows, and test with datasets that have matching column names. Also, if we expect the result to be equivalent to a JOIN, then I would also test that.

@guyd
Contributor

guyd commented Sep 23, 2016

Regarding the names, if we use the dataset names then what would we do with datasets that have no clear names (e.g. sub-select, merge, etc)? I personally like the positional naming.

}

virtual vector<RowHash>
getRowHashes(ssize_t start = 0, ssize_t limit = -1) const
Member Author

If I remove it, the compiler complains about the "pure virtual". Should I remove it from the code base?

Contributor

strange... leave it in then, we can address that elsewhere

@jeremybarnes
Contributor

@guyd we should use positional naming IMO. In general, a dataset could even possibly have multiple names, so we can't use that.

Algorithm for taking a row name from sub-dataset n and converting to rowname in union dataset:

  • prepend n to the path. So if it was already x.y, it becomes n.x.y

Algorithm for finding which dataset and rowname a row in the union dataset belongs to:

  • pop the first element off the front, that gives the dataset number. The rest of the path gives the rowname in that dataset
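
A minimal sketch of the two algorithms above, for illustration only. It assumes a row name can be modelled as a plain list of path elements; `SimplePath`, `toUnionRowName` and `fromUnionRowName` are hypothetical names, not MLDB's `Path` API. It also stores the dataset number as a string for simplicity, which a real implementation would want to avoid.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row name: a plain list of path elements.
using SimplePath = std::vector<std::string>;

// Sub-dataset row name -> union row name: prepend the dataset number n,
// so x.y in dataset n becomes n.x.y.
SimplePath toUnionRowName(std::size_t datasetIndex, const SimplePath & rowName)
{
    SimplePath result;
    result.reserve(rowName.size() + 1);
    result.push_back(std::to_string(datasetIndex));
    result.insert(result.end(), rowName.begin(), rowName.end());
    return result;
}

// Union row name -> (dataset number, row name inside that dataset):
// pop the first element, the rest of the path is the original row name.
std::pair<std::size_t, SimplePath>
fromUnionRowName(const SimplePath & unionRowName)
{
    if (unionRowName.size() < 2)
        throw std::invalid_argument("expected a dataset index and a row name");
    // std::stoul throws if the first element is not a number.
    std::size_t datasetIndex = std::stoul(unionRowName.front());
    return { datasetIndex,
             SimplePath(unionRowName.begin() + 1, unionRowName.end()) };
}
```

With this shape, mapping a row to its dataset needs no bracket parsing: the first element alone identifies the dataset.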

@guyd
Contributor

guyd commented Sep 23, 2016

I agree that having the row name with the dataset index as a prefix leads to easier algorithms than the row name with the brackets.

@FinchPowers
Member Author

I updated the rowNames. Note that the positional naming was coherent with the joins.

@FinchPowers
Member Author

FinchPowers commented Sep 23, 2016

Can someone give me an example where the following Itl functions are called:

  • knownRowHash
  • getRow
  • rowName

I ran my tests with throw/cerr and they never seem to be called in my use cases. I would like to add tests to cover them.

Also, getRowStream is called, but the result never seems to be used; if it were, it would crash, since it is not implemented properly.

Are we due to refactor the dataset interface?

struct UnionDataset::Itl
: public MatrixView, public ColumnIndex {

map<RowHash, pair<int, RowHash> > rowIndex;
Member Author

Very likely slow. Is it acceptable for the current version?

Contributor

You should at least use a ML::Lightweight_Hash; the interface is basically the same.

@jeremybarnes
Contributor

knownRowHash may not be used at all.
getRowStream is called when running SQL queries over the dataset. @Steadtler has more details.
Yes, we are due to refactor the Dataset interface, but that's not really user visible, so it's hard to justify spending the time until we know it's blocking something that we need to do.


UnionRowStream(const UnionDataset::Itl* source) : source(source)
{
cerr << "UNIMPLEMENTED " << __FILE__ << ":" << __LINE__ << endl;
Contributor

This should be implemented and tested.

Contributor

Or at the very least, it should throw to say that it's unimplemented, not silently do the wrong thing.

Member Author

It is called, but the result is unused. So I need to implement it, but it has no effect.

return false;
}

virtual ColumnName getColumnName(ColumnHash columnHash) const
Contributor

Do we actually need to implement that?

Member Author

Yes, this is a pure virtual in the base class.

Member Author

I can at least enhance that implementation.

Member Author

Done.

if (rowName.size() < 2) {
return false;
}
string idxStr = (*(rowName.begin())).toUtf8String().rawString();
Contributor

NOOOO! it's horrible to allocate strings for that! Use rowName.at(0).requireIndex() instead.

Member Author

Done.

auto columnNames = d->getColumnNames();
preResult.insert(columnNames.begin(), columnNames.end());
}
return vector<ColumnName>(preResult.begin(), preResult.end());
Contributor

No, they need to be de-duplicated.

Member Author

That's why I store them in a std::set before copying them to a std::vector.

Contributor

Ah, OK, true
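
For readers following the exchange above, here is a small self-contained sketch of the set-then-vector approach. It assumes column names can be modelled as `std::string`; `unionColumnNames` is a hypothetical helper, not the function in the PR.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: each inner vector holds one dataset's column names.
std::vector<std::string>
unionColumnNames(const std::vector<std::vector<std::string>> & perDataset)
{
    // std::set keeps a single copy of each name, so inserting every
    // dataset's names de-duplicates them as a side effect.
    std::set<std::string> unique;
    for (const auto & names : perDataset)
        unique.insert(names.begin(), names.end());

    // Copy out in the set's sorted order.
    return std::vector<std::string>(unique.begin(), unique.end());
}
```

One consequence of this design is that the result comes back in sorted order rather than in the order the datasets were listed.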

return false;
}
return datasets[idx]->getMatrixView()->knownRow(
Path(rowName.begin() + 1, rowName.end()));
Contributor

Use rowName.tail() instead.

Member Author

Done.

Contributor

@jeremybarnes jeremybarnes left a comment

I would also have expected that ds1 UNION ds2 would work, not just union(ds1, ds2). The first is standard SQL, the second is not.

@FinchPowers
Member Author

The "standard SQL UNION" you refer to is in fact our "merge" operator. I agree that this is misleading and we should find a new name for the "union" operator of the current PR.

@FinchPowers
Member Author

butting - verb (used with object)

to place or join the ends (of two things) together; set end-to-end.

@FinchPowers
Member Author

OK, we stick with union. I'll drop the "union()" operator and create the "UNION" keyword.

@jeremybarnes
Contributor

SQL UNION is the same as our UNION. If you take a union of 5 datasets of 1000 rows, you have 5,000 rows. That's true for UNION but not for MERGED.
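
To make the row-count distinction concrete, here is a toy sketch. It assumes, as a simplification, that a merge combines rows sharing the same row name into one row, while a union keeps every row of every dataset. `RowNames`, `unionRowCount` and `mergedRowCount` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: a dataset is reduced to the list of its row names.
using RowNames = std::vector<std::string>;

// Union: every row of every dataset is kept, so the counts add up.
std::size_t unionRowCount(const std::vector<RowNames> & datasets)
{
    std::size_t total = 0;
    for (const auto & d : datasets)
        total += d.size();
    return total;
}

// Merge (assumed semantics): rows with the same name collapse into one,
// so the count is the number of distinct row names.
std::size_t mergedRowCount(const std::vector<RowNames> & datasets)
{
    std::set<std::string> names;
    for (const auto & d : datasets)
        names.insert(d.begin(), d.end());
    return names.size();
}
```

Under these assumptions, two datasets that both contain a row named row1 give two rows under union and one row under merge.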

@FinchPowers
Member Author

@jeremybarnes changes applied.

  • No more "union()" function.
  • UNION keyword supported.

@jeremybarnes
Contributor

The unionIndex parameter is just wrong. It leaks the detail of an internal dataset to the main index. Why oh why do we need to do this?

int getIdxFromRowName(const RowName & rowName) const {
// Returns idx > -1 if the index is valid, -1 otherwise
if (rowName.size() < 2) {
return false;
Contributor

false = 0? Don't you mean -1? I would probably throw an exception

Member Author

Correct.

return false;
}
int idx = static_cast<int>(rowName.at(0).toIndex());
if (idx > datasets.size()) {
Contributor

And if idx < -1?

Member Author

Then -1 is returned.
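
A hedged sketch of an explicit bounds check for this helper. `checkedDatasetIndex` is a hypothetical name and takes the dataset count directly, so it stands alone. Two points it tries to make visible: comparing a negative `int` against an unsigned `size()` converts the `int` to a large unsigned value on common platforms, which is presumably why negative indices already end up in the -1 branch of the diff, and `idx > size` would still let `idx == size` through.

```cpp
#include <cstddef>

// Hypothetical helper: returns idx when 0 <= idx < numDatasets,
// and -1 otherwise.
int checkedDatasetIndex(long idx, std::size_t numDatasets)
{
    // Check the sign explicitly instead of relying on the implicit
    // signed-to-unsigned conversion, and use >= so that
    // idx == numDatasets is rejected as well.
    if (idx < 0 || static_cast<std::size_t>(idx) >= numDatasets)
        return -1;
    return static_cast<int>(idx);
}
```

Throwing an exception instead of returning -1, as suggested earlier in this thread, would fit the same two conditions.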


struct UnionRowStream : public RowStream {

UnionRowStream(const UnionDataset::Itl* source) : source(source)
Contributor

This will be called as soon as someone uses it in the wild, eg to train a random forest classifier.

}

// DEPRECATED
virtual MatrixNamedRow getRow(const RowName & rowName) const
Contributor

You should simply not implement this method. Then where it's used the Dataset will take care of emulating it. Otherwise we will get crashes still.

Member Author

It's a pure virtual in the base class. I have to implement it.

throw ML::Exception("Row not known");
}
const auto & idxAndHash = it->second;
return datasets[idxAndHash.first]->getMatrixView()->getRowName(idxAndHash.second);
Contributor

@FinchPowers this is where the problem is. getRowName() should return the row name including the index. So this should be something more like

PathBuilder builder;
builder.add(idxAndHash.first);  // the dataset index goes first
Path subRowName = datasets[idxAndHash.first]->getMatrixView()->getRowName(idxAndHash.second); 
builder.add(subRowName, 0, subRowName.size());
return builder.extract();

try {
return d->getMatrixView()->getColumnName(columnHash);
}
catch (const ML::Exception & exc) {
Contributor

!!! we shouldn't ever have a try/catch unconditional like that. Instead, we should use knownColumn
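
A minimal sketch of the check-first pattern suggested here, using a toy view type. `ToyView` and its members are hypothetical stand-ins for illustration; they are not MLDB's `MatrixView` interface, and the real signature of `knownColumn` may differ.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, minimal stand-in for a per-dataset view.
struct ToyView {
    std::map<std::uint64_t, std::string> columns;

    bool knownColumn(std::uint64_t hash) const
    {
        return columns.count(hash) != 0;
    }

    const std::string & getColumnName(std::uint64_t hash) const
    {
        return columns.at(hash);
    }
};

// Ask each view whether it knows the column before fetching the name,
// instead of calling getColumnName and catching whatever it throws.
std::string getColumnName(const std::vector<ToyView> & views,
                          std::uint64_t hash)
{
    for (const auto & view : views)
        if (view.knownColumn(hash))
            return view.getColumnName(hash);
    throw std::out_of_range("Column not known");
}
```

The only exception left is the one for a column that no dataset knows, so unrelated errors inside a sub-dataset are no longer swallowed.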

…dataset

Conflicts:
	sql/sql_expression.cc
	sql/sql_expression.h
4 participants