Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a ptype argument to between() since we can't make it cast to the type of x at this time #6906

Closed
Generalized opened this issue Aug 7, 2023 · 5 comments · Fixed by #7073
Labels
feature a feature request or enhancement tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day vctrs ↗️

Comments

@Generalized
Copy link

Generalized commented Aug 7, 2023

The result of using between() in filtering is not equivalent to evaluating a conditional expression when working with ordered factors.

> (x <- data.frame(Visit = c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20")))
      Visit
1 Discharge
2   Month 3
3   Month 6
4  Month 12
5  Month 20

> class(x$Visit)
[1] "character"

Now let's make it an ordered factor with assumed levels (used later for other operations):

> (x$Visit <- factor(x$Visit, levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE))
[1] Discharge Month 3   Month 6   Month 12  Month 20 
Levels: Baseline < Discharge < Month 3 < Month 6 < Month 12 < Month 20 < Premature

When filtering using the conjunction of conditions it works

> x %>% filter(Visit >= "Discharge" & Visit <= "Month 20")
      Visit
1 Discharge
2   Month 3
3   Month 6
4  Month 12
5  Month 20

When using the convenient between(), it doesn't

> x %>% filter(between(Visit, "Discharge", "Month 20"))
      Visit
1 Discharge
2  Month 12
3  Month 20

2 items are missing in the output.


R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

packageVersion("dplyr")
[1] ‘1.1.1’

@psychelzh
Copy link

psychelzh commented Aug 10, 2023

I think this is by design. Look at the documentation:

x, left, and right are all cast to their common type before the comparison is made.

But if you want the behavior, it could be tedious, see the following:

library(dplyr)
#> 
#> 载入程辑包:'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
x <- data.frame(
  Visit = factor(
    c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20"),
    levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), 
    ordered = TRUE
  )
)
x |> 
  filter(
    between(
      Visit, 
      factor("Discharge", levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE), 
      factor("Month 20", levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE)
    )
  )
#>       Visit
#> 1 Discharge
#> 2   Month 3
#> 3   Month 6
#> 4  Month 12
#> 5  Month 20

Created on 2023-08-10 with reprex v2.0.2

The logic behind this is just type casting used by vctrs::vec_cast_common(). You could use data.table::between() instead:

library(dplyr)
#> 
#> 载入程辑包:'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
x <- data.frame(
  Visit = factor(
    c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20"),
    levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), 
    ordered = TRUE
  )
)
x |> 
  filter(data.table::between(Visit, "Discharge", "Month 20"))
#>       Visit
#> 1 Discharge
#> 2   Month 3
#> 3   Month 6
#> 4  Month 12
#> 5  Month 20

Created on 2023-08-10 with reprex v2.0.2

@Generalized
Copy link
Author

Generalized commented Aug 13, 2023

Thank you very much for the quick explanation, @psychelzh!
This explains everything.

But, at the end of the day, although I know the source of this behaviour, this is even more frustrating.

Because when someone knows how the BETWEEN operator works in, say, SQL, where it's equivalent to the >= A AND <= B condition and expects such equivalence elsewhere (because why not?), the result I showed is confusing incredibly.

If only a warning or error, like "unequal types of both arguments/operands/variables/columns" were printed (the same way the binding functions or joins already do), it could save a lot of time wasted on investigating weird results (provided someone even noticed the result is incomplete! This case was trivial, but I can imagine a definitely more complex one with multiple conditions).

Please, @hadley, consider at least adding a warning, if between() cannot be adjusted for cases like this. In many (if not most) places tidyverse is very strict, to prevent confusing results, yet here, where I wouldn't expect any problem, it's far too liberal, giving unexpected results.

Without a warning or error, the explanation in the manual may not suffice - I guess some of people will be surprised even after reading it, because when we read about "common types" we usually think: "OK, number & number, character & character, factor & factor", not necessarily "factor_#0493" and "factor_#0493". And in case like that, the behaviour is really surprising at first sight.

@psychelzh
Copy link

psychelzh commented Aug 13, 2023

I do agree with this! In fact, I think {vctrs} sometimes really gets too strict or restricted. More frustration also comes from tidyverse/tidyr#1483. I think loose behavior like base R, such as casting number to character, is sometimes absolutely acceptable.

@DavisVaughan
Copy link
Member

This ultimately comes down to the fact that between() takes the common type of x, left, and right rather than doing a directional cast of left and right to the type of x.

In vctrs, the common type of factor and character is character (the more general type), so all inputs end up getting cast to character before the comparison.

I tried making between() stricter by making it cast left and right to the type of x, but it ended up breaking too much in the revdeps so we rolled it back:
#6478
#6482

You could make your own wrapper around between() that helps with this:

library(dplyr, warn.conflicts = FALSE)

x <- tibble(Visit = c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20"))
x$Visit <- factor(x$Visit, levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE)
x
#> # A tibble: 5 × 1
#>   Visit    
#>   <ord>    
#> 1 Discharge
#> 2 Month 3  
#> 3 Month 6  
#> 4 Month 12 
#> 5 Month 20

between_cast_to_x <- function(x, left, right) {
  args <- vctrs::vec_cast_common(left = left, right = right, .to = x)
  left <- args$left
  right <- args$right
  between(x, left, right)
}

between(x$Visit, "Discharge", "Month 12")
#> [1]  TRUE FALSE FALSE  TRUE FALSE

between_cast_to_x(x$Visit, "Discharge", "Month 12")
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE

I think the only thing we can do is consider exposing a ptype argument to between() to make doing a directional cast a little simpler. We already do this in other places where we take the common type, like coalesce(), so I think adding that argument here makes sense if we don't feel like we can make between() cast to the type of x (its the next best solution).

@DavisVaughan DavisVaughan changed the title Filtering ordered factor via between() does not agree with filtering using a conditional expression Add a ptype argument to between() since we can't make it cast to the type of x at this time Nov 6, 2023
@DavisVaughan DavisVaughan added feature a feature request or enhancement vctrs ↗️ labels Nov 6, 2023
@DavisVaughan DavisVaughan added the tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day label Jul 22, 2024
JamesHWade added a commit to JamesHWade/dplyr that referenced this issue Aug 15, 2024
DavisVaughan pushed a commit to JamesHWade/dplyr that referenced this issue Aug 27, 2024
DavisVaughan pushed a commit to JamesHWade/dplyr that referenced this issue Aug 27, 2024
@philibe
Copy link

philibe commented Sep 5, 2024

@Generalized

In many (if not most) places tidyverse is very strict, to prevent confusing results, yet here, where I wouldn't expect any problem, it's far too liberal, giving unexpected results.

IMHO Since 1.1.x dplyr is becoming too strict and there are too much warnings, but I have the surprise than some users want that. The solution for the users is to verify and its codes and datas with many scenarios.

The result of this goes to lack of flexibility and blackbox for the data transformation.

I use lot of dplyr since 0.7 and again now, but since 1.1 when some functions are becoming too strict I bypass it back to base R or I make invisible some of warnings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day vctrs ↗️
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants