Add column API mvp #100
Merged
* added some docstrings * re-organized a little
Add initial column API functions
* Add ->general-types function * Add a general type :logical * Use type hierarchy in tablecloth.api.utils for `typeof` functions * Add column dev branch to pr workflow * Add tests for typeof * Fix tests for typeof * Return the concrete type from `typeof` * Simplify `concrete-types` fn * Optimize ->general-types by using static lookup * Adjust fns listing types * We decided that the default meaning of type points to the "concrete" type, and not the general type. * So `types` now returns the set of concrete types and `general-types` returns the general types. * Revert "Adjust fns listing types" This reverts commit d93e34f. * Fix `typeof` test to test for concerete types * Reorganize `typeof?` tests * Reword docstring for `typeof?` slightly * Update column api template and add missing `typeof?` * Add commment to `general-types-lookup` * Improve `->general-types` docstring * Add `general-types` fn that returns sets of general types * Adjust util `types` fn to return concrete types
* Add ->general-types function * Add a general type :logical * Use type hierarchy in tablecloth.api.utils for `typeof` functions * Add column dev branch to pr workflow * Add tests for typeof * Fix tests for typeof * Return the concrete type from `typeof` * Simplify `concrete-types` fn * Optimize ->general-types by using static lookup * Adjust fns listing types * We decided that the default meaning of type points to the "concrete" type, and not the general type. * So `types` now returns the set of concrete types and `general-types` returns the general types. * Revert "Adjust fns listing types" This reverts commit d93e34f. * Fix `typeof` test to test for concerete types * Reorganize `typeof?` tests * Reword docstring for `typeof?` slightly * Update column api template and add missing `typeof?` * Add commment to `general-types-lookup` * Improve `->general-types` docstring * Add `general-types` fn that returns sets of general types * Adjust util `types` fn to return concrete types * Save changes to column api.clj * Save ongoing experiments with lifting * Save ongoing work on lifting * Adjust lift-ops-1 to handle any number of args with rest arg * Working `rearrange-args` fn * Save work actually writing lifted fns * Saving first attempt to writer operators * Add `percentiiles test * Adjust `rearrange-args to take new-args in option map * Unify two lift functions * Add in docstrings when present * Move lift utils into utils ns * Rename lifting namespaces * Lift some more fns * Make exclusions for ns header helper an arg * Add new operators and tests * Add ops with lhs rhs arg pattern * Lift '* * Add require to operators ns for utils * Update test to make it more complete * Lift `equals * Make test more accurate * Reorganize tests * Fix grammar * Lift 'shift * Uncomment 'or test * Lift 'normalize op * Life 'magnitude * Lifting bit manipulation ops * lift ieee-remainder * Lifting more functions * Add excludes * Lift a bunch of new functions * Alphebetize some lists * More 
alphebitization * Clean up * Instead of using `col` as arg conform to using `x & and `y * Temporarily disable failing test fix in 7.000-beta23 * Disable the correct test * Just some minor cleanup in op tests * Some more cleanup/reorg in op tests * Update generated operators namespace with switch from col -> x etc * Lift 'descriptive-statistics * Fix messed up test layout * Lift 'quartiles * Lift 'fill-range and a bunch of reduce operations * Lift 'mean-fast 'sum-fast 'magnitude-squared * Lift correlation fns kendalls, pearsons, and spearmans * Lift cumulative ops * cleanup
* Upgrade to latest clay version * Show using tablecloth.column.api.operators ns * Cleanup whitespace
* Fix indentation * Save rough working example Not fully tested * Fix tests for new aggregator form of ops that return scalar
* Add a sample notebook file * Save draft work on column api doc * Add doc entry for tcc/select boolean select This appears to be broken now, but ti shouldn't be. * Export column api operators in column api ns * Add in some documentation of operations * Hide namespace expression from generated doc * Fix circular dependency * Update generated docs * Update text in colum operations section * More updates to the docs * Remove "Functionality" header in TOC This way Dataset is an entry, and I can add Column after that. * Add Column API documentation * Add an indication of column op signature to docs * Export lifted column operators in dataset api template * Add documentation for column operations on datasets * Some minor changes * Rename the two headers for Dataset and Column, adding API onto the end. * A few small fixes. * Remove the `Functions` section This is essentially replaced by the Column API that lifts these functions into Tablecloth * Try to remove cyclical dependency * Revert "Try to remove cyclical dependency" This reverts commit fcb16c4. * Fix circular dependency * Actually fix cyclical dependency * Undo added line
ezmiller commented Feb 5, 2024
.github/workflows/prs.yml (Outdated)
@@ -4,6 +4,7 @@ on:
   pull_request:
     branches:
       - master
+      - ethan/column-api-dev-branch-1
Need to remove before we merge this branch.
closing to reopen for testing doc preview
Default was gh-pages, we use master.
Goal
This PR adds a new `column` API to tablecloth.

Overview
This PR adds a new `column` API to tablecloth. It also lifts a new set of functions into the existing `dataset` API that make it possible to run the new column operations on columns in a dataset. There are a few core concepts to the new column API:

* There is now such a thing as a `column` alongside the `dataset` in Tablecloth. This `column` is just the tech.ml.dataset `Column` that constitutes a dataset, but the new Column API makes it a basic primitive in Tablecloth.
* While there are some special functions within this new API that help with questions about the identity of a column and the typing of its elements, the bulk of the new code here is generated code that wraps operation functions already present in dtype-next's `tech.v3.datatype.functional` namespace. The lifted functions have two main characteristics: 1) they always take a `column` as the first argument, just as the functions in Tablecloth's dataset API take a dataset first, and 2) they always return a new column.
* In addition to adding the new column API, this PR also adds a new set of operators to Tablecloth's Dataset API: `tablecloth.api.operators`. These functions allow the column operations to be easily performed on datasets. These may end up being the most commonly used parts of this PR simply because they are convenient, adding expressiveness to dataset manipulations. Here's a screenshot from @kiramclean's excellent 2023 Clojure Conj talk on "Clojure for Data Science in the Real World" that sums it up well:

[Screenshot from the talk]

Details of the implementation
There are a great many lines of code in this PR, but the bulk of the changes are actually made via code generation tools that were built to "lift" these functions. The utilities for this process are located in `src/tablecloth/utils/codegen.clj`. Then, for each of the APIs where we are doing lifting, the utilities are used to generate the two operators namespaces in `src/tablecloth/api/lift_operators.clj` and `src/tablecloth/column/api/lift_operators.clj`. Those two namespaces contain the functions that actually describe the functions to be generated.

Right now, to regenerate these namespaces, we need to manually run chunks of code that are commented out at the bottom of those two lift namespaces. Going forward, we should consider automating these processes in GitHub Actions.
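
To make the idea of "lifting" concrete, here is a minimal sketch in plain Clojure. It is not the code in `src/tablecloth/utils/codegen.clj`, whose details may differ. The names `elementwise-add`, `->column`, `column?`, and `lift-op-form` are hypothetical stand-ins invented for this illustration, and a metadata-tagged vector plays the role of a column so that the example runs without any dependencies.

```clojure
;; Hypothetical stand-in for a function from tech.v3.datatype.functional:
;; element-wise addition over two plain sequences.
(defn elementwise-add [xs ys]
  (mapv + xs ys))

;; Hypothetical stand-in for a column: a vector tagged with metadata.
(defn ->column [xs]
  (with-meta (vec xs) {:column? true}))

(defn column? [x]
  (boolean (:column? (meta x))))

;; Builds, as plain data, a `defn` form that calls `op-sym` on `args`
;; and wraps the result in a new column.
(defn lift-op-form [new-name op-sym args]
  `(defn ~new-name ~args
     (->column (~op-sym ~@args))))

;; A real generator would write such forms into a source file;
;; for this sketch we simply evaluate one.
(eval (lift-op-form 'add `elementwise-add '[x y]))

(add (->column [1 2 3]) (->column [10 20 30]))
;; => [11 22 33]
```

Producing the `defn` as data, rather than defining the function directly, mirrors the description above in spirit: the same form could be printed into a generated namespace instead of being evaluated.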
Open Questions