Merge pull request #282 from queryverse/naoperators
Add @dropna, @replacena and @dissallowna
davidanthoff authored Dec 23, 2019
2 parents 198c050 + 642d1e9 commit a4c0ea1
Showing 7 changed files with 244 additions and 6 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/jlpkgbutler-ci-master-workflow.yml
@@ -10,7 +10,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
-julia-version: [1.0.5, 1.1.1, 1.2.0, 1.3.0]
+julia-version: [1.3.0]
julia-arch: [x64, x86]
os: [ubuntu-latest, windows-latest, macOS-latest]
exclude:
2 changes: 1 addition & 1 deletion .github/workflows/jlpkgbutler-ci-pr-workflow.yml
@@ -9,7 +9,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
-julia-version: [1.0.5, 1.1.1, 1.2.0, 1.3.0]
+julia-version: [1.3.0]
julia-arch: [x64, x86]
os: [ubuntu-latest, windows-latest, macOS-latest]
exclude:
6 changes: 3 additions & 3 deletions Project.toml
@@ -16,10 +16,10 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"

[compat]
IterableTables = "0.8.2, 0.9, 0.10, 0.11, 1"
-julia = "1"
+julia = "1.3"
QueryOperators = "0.9.1"
-DataValues = "0.4.4"
-MacroTools = "0.4.4"
+DataValues = "0.4.4"
+MacroTools = "0.4.4, 0.5"

[targets]
test = ["Statistics", "Test", "DataFrames"]
165 changes: 165 additions & 0 deletions docs/src/standalonequerycommands.md
@@ -365,3 +365,168 @@ println(q)
│ 2 │ Banana │ 6 │ 10.0 │ false │
│ 3 │ Cherry │ 1000 │ 1000.8 │ false │
```

## The `@dropna` command

The `@dropna` command has the form `source |> @dropna(columns...)`. `source` can be any source that can be queried and that has a table structure. If `@dropna()` is called without any arguments, it drops every row from `source` that has a missing (`NA`) value in _any_ of its columns. Alternatively, one can pass a list of column names to `@dropna`, in which case it only drops rows that have an `NA` value in one of those columns. In both cases the columns that were checked are also converted to an element type that cannot hold missing values.

Our first example uses the simple version of `@dropna()` that drops rows that have a missing value in any column:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,2,3], b=[4,missing,5])
q = df |> @dropna() |> DataFrame
println(q)
# output
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 5 │
```

The next example only drops rows that have a missing value in the `b` column:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,2,3], b=[4,missing,5])
q = df |> @dropna(:b) |> DataFrame
println(q)
# output
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 5 │
```

We can specify as many columns as we want:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,2,3], b=[4,missing,5])
q = df |> @dropna(:b, :a) |> DataFrame
println(q)
# output
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 5 │
```
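
For intuition, the argument-free form behaves like a missing-value filter followed by `@dissallowna()` (described in the next section). The sketch below is an illustration rather than part of the doctests; it assumes the same `Query` and `DataFrames` setup as the examples above and builds both pipelines so that they can be compared:

```julia
using Query, DataFrames

df = DataFrame(a=[1,2,3], b=[4,missing,5])

# Short form: drop every row that has a missing value in any column.
q1 = df |> @dropna() |> DataFrame

# Spelled-out form: filter the rows first, then convert the columns to
# element types that cannot hold missing values.
q2 = df |> @filter(!any(isna, _)) |> @dissallowna() |> DataFrame

# Compare the two results.
println(isequal(q1, q2))
```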

## The `@dissallowna` command

The `@dissallowna` command has the form `source |> @dissallowna(columns...)`. `source` can be any source that can be queried and that has a table structure. If `@dissallowna()` is called without any arguments, it checks that there are no missing (`NA`) values in any column in any row of the input table and converts the element type of each column to one that cannot hold missing values. If a missing value is encountered, an error is raised. Alternatively, one can pass a list of column names to `@dissallowna`, in which case it only checks for `NA` values in those columns, and only converts those columns to a type that cannot hold missing values.

Our first example uses the simple version of `@dissallowna()` that makes sure there are no missing values anywhere in the table. Note how the column type for column `a` is changed to `Int64` in this example, i.e. an element type that does not support missing values:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,missing,3], b=[4,5,6])
q = df |> @filter(!isna(_.a)) |> @dissallowna() |> DataFrame
println(q)
# output
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 6 │
```

The next example only checks the `b` column for missing values:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,2,missing], b=[4,missing,5])
q = df |> @filter(!isna(_.b)) |> @dissallowna(:b) |> DataFrame
println(q)
# output
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64⍰ │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
```
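
Both examples above filter out the missing rows before calling `@dissallowna`. The following sketch illustrates what happens without that step: iterating the query raises an error (the tests in this commit expect a `DataValueException`). The helper `throws_on_missing` is a hypothetical name introduced only for this illustration and is not part of Query.jl:

```julia
using Query, DataFrames

# Hypothetical helper: returns true if materializing the query raises an
# error, which is the expected outcome when a missing value is present.
function throws_on_missing(source)
    try
        source |> @dissallowna() |> collect
        return false
    catch err
        return true
    end
end

df = DataFrame(a=[1,missing,3], b=[4,5,6])

println(throws_on_missing(df))
```

In practice one would typically precede `@dissallowna` with `@filter` or use `@dropna`, as in the earlier examples, rather than catch the error.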

## The `@replacena` command

The `@replacena` command has a simple and a full version.

The simple form is `source |> @replacena(replacement_value)`. `source` can be any source that can be queried and that has a table structure. In this case all missing (`NA`) values in the source table are replaced with `replacement_value`. Note that this version only works properly if all columns that contain missing values have the same element type.

The full version has the form `source |> @replacena(replacement_specifier...)`. `source` can again be any source that can be queried and that has a table structure. Each `replacement_specifier` should be a `Pair` of the form `column_name => replacement_value`. For example, `:b => 3` means that all missing values in column `b` should be replaced with the value 3. One can specify as many `replacement_specifier`s as one wishes.

The first example uses the simple form:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,missing,3], b=[4,5,6])
q = df |> @replacena(0) |> DataFrame
println(q)
# output
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 0 │ 5 │
│ 3 │ 3 │ 6 │
```

The next example uses different replacement values for columns `a` and `b`:

```jldoctest
using Query, DataFrames
df = DataFrame(a=[1,2,missing], b=["One",missing,"Three"])
q = df |> @replacena(:b=>"Unknown", :a=>0) |> DataFrame
println(q)
# output
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼─────────┤
│ 1 │ 1 │ One │
│ 2 │ 2 │ Unknown │
│ 3 │ 0 │ Three │
```
2 changes: 1 addition & 1 deletion src/Query.jl
@@ -10,7 +10,7 @@ export @from, @query, @count, Grouping, key
export @map, @filter, @groupby, @orderby, @orderby_descending, @unique,
@thenby, @thenby_descending, @groupjoin, @join, @mapmany, @take, @drop

-export @select, @rename, @mutate
+export @select, @rename, @mutate, @dissallowna, @dropna, @replacena

export isna, NA

37 changes: 37 additions & 0 deletions src/table_query_macros.jl
@@ -186,3 +186,40 @@ macro mutate(args...)

return :( Query.@map( $prev ) ) |> esc
end

our_get(x) = x
our_get(x::DataValue) = get(x)

our_get(x, y) = x
our_get(x::DataValue, y) = get(x, y)

macro dissallowna()
    return :( Query.@map(map(our_get, _)) )
end

macro dissallowna(columns...)
    return :( Query.@mutate( $( ( :( $(columns[i].value) = our_get(_.$(columns[i].value)) ) for i=1:length(columns) )... ) ) )
end

macro dropna()
    return :( i-> i |> Query.@filter(!any(isna, _)) |> Query.@dissallowna() )
end

macro dropna(columns...)
    return :( i-> i |> Query.@filter(!any(($((:(isna(_.$(columns[i].value))) for i in 1:length(columns) )...),))) |> Query.@dissallowna($(columns...)) )
end

macro replacena(arg, args...)
    if length(args)==0 && !(arg isa Expr && arg.head==:call && length(arg.args)==3 && arg.args[1]==:(=>))
        return :( Query.@map(map(i->our_get(i, $arg), _)) )
    else
        args = [arg; args...]

        all(i isa Expr && i.head==:call && length(i.args)==3 && i.args[1]==:(=>) for i in args) || error("Invalid syntax.")

        columns = map(i->i.args[2].value, args)
        replacement_values = map(i->i.args[3], args)

        return :( Query.@mutate( $( ( :( $(columns[i]) = our_get(_.$(columns[i]), $(replacement_values[i])) ) for i=1:length(columns) )... ) ) )
    end
end
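
The choice between the simple and the full form of `@replacena` rests on whether the macro argument is a `=>` call expression. The sketch below reproduces that check in isolation using only Base Julia. `is_pair_expr`, `simple` and `full` are hypothetical names introduced for this illustration; they do not appear in the commit:

```julia
# Hypothetical helper mirroring the test that `@replacena` applies to each
# argument: a call expression with three parts whose callee is `=>`.
is_pair_expr(ex) = ex isa Expr && ex.head == :call &&
                   length(ex.args) == 3 && ex.args[1] == :(=>)

simple = :(0)                 # a plain literal, as in `@replacena(0)`
full = :(:b => "Unknown")     # a pair, as in `@replacena(:b => "Unknown")`

println(is_pair_expr(simple)) # false: handled by the simple branch
println(is_pair_expr(full))   # true: handled by the full branch

# For a pair expression, the column name is wrapped in a QuoteNode, which is
# why the macro reads `.value`; the replacement is the third argument.
column = full.args[2].value
replacement = full.args[3]
println(column, " => ", replacement)
```

One consequence of this design is that the column must be written as a literal symbol such as `:b`: any other expression on the left-hand side has no `.value` field to read.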
36 changes: 36 additions & 0 deletions test/test_macros.jl
@@ -49,3 +49,39 @@ end
closure_val = 1
@test DataFrame(df |> @mutate(foo = closure_val)) == DataFrame(foo=[1,1,1], bar=[3.,2.,1.], bat=["a","b","c"])
end

@testset "@dropna" begin

df = DataFrame(a=[1,missing,3], b=[1.,2.,3.])

@test df |> @dropna() |> collect == [(a=1,b=1.), (a=3, b=3.)]
@test df |> @filter(!any(isna, _)) |> @dropna() |> collect == [(a=1,b=1.), (a=3, b=3.)]
@test df |> @select(:b) |> @dropna() |> collect == [(b=1.,),(b=2.,),(b=3.,)]

@test df |> @dropna(:a) |> collect == [(a=1,b=1.), (a=3, b=3.)]
@test df |> @dropna(:b) |> collect == [(a=DataValue(1),b=1.), (a=DataValue{Int}(),b=2.),(a=DataValue(3), b=3.)]
@test df |> @dropna(:a, :b) |> collect == [(a=1,b=1.), (a=3, b=3.)]
end

@testset "@replacena" begin

df = DataFrame(a=[1,missing,3], b=[1.,2.,3.])

@test df |> @replacena(2) |> collect == [(a=1,b=1.), (a=2, b=2.), (a=3, b=3.)]
@test df |> @dropna() |> @replacena(2) |> collect == [(a=1,b=1.), (a=3, b=3.)]
@test df |> @select(:b) |> @replacena(2) |> collect == [(b=1.,),(b=2.,),(b=3.,)]

@test df |> @replacena(:a=>2) |> collect == [(a=1,b=1.), (a=2, b=2.), (a=3, b=3.)]
@test df |> @replacena(:b=>2) |> collect == [(a=DataValue(1),b=1.), (a=DataValue{Int}(),b=2.),(a=DataValue(3), b=3.)]
@test df |> @replacena(:a=>2, :b=>8) |> collect == [(a=1,b=1.), (a=2, b=2.), (a=3, b=3.)]
end

@testset "@dissallowna" begin

df = DataFrame(a=[1,missing,3], b=[1.,2.,3.])

@test_throws DataValueException df |> @dissallowna() |> collect
@test df |> @filter(!any(isna, _)) |> @dissallowna() |> collect == [(a=1,b=1.), (a=3, b=3.)]
@test_throws DataValueException df |> @dissallowna(:a) |> collect
@test df |> @dissallowna(:b) |> collect == [(a=DataValue(1),b=1.), (a=DataValue{Int}(),b=2.),(a=DataValue(3), b=3.)]
end
