Flexible joins #5910
This looks like exactly what I was imagining for extending the `by` specification 😄
With this interface in place there are a couple of other join helpers I can imagine:
- `join_any()`, which would accept any list of predicates, first performing a cartesian join (possibly in batches) and then applying the predicates. The performance wouldn't be any better than a cartesian join + a filter, but since we could do it iteratively, we could reduce peak memory.
- `join_at()`, which would use tidyselect to make it easier to join by many common variables.
I'm not sure we would actually implement these, but I like having the option.
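As a rough illustration of the `join_any()` idea, here is a minimal base R sketch of a batched cartesian join followed by a filter. The function name `join_any_sketch()`, its arguments, and the example data are all hypothetical; none of this is part of dplyr or of this PR.

```r
# Hypothetical helper (not a dplyr function): cross join x and y in batches of
# x rows, applying `predicate` to each batch so the full cartesian product is
# never held in memory at once. Assumes x has at least one row.
join_any_sketch <- function(x, y, predicate, batch_size = 1000L) {
  # Split the row indices of x into consecutive batches
  idx <- seq_len(nrow(x))
  batches <- split(idx, ceiling(idx / batch_size))

  pieces <- lapply(batches, function(rows) {
    # merge(by = NULL) returns the cartesian product of its inputs
    candidates <- merge(x[rows, , drop = FALSE], y, by = NULL)
    # Keep only the candidate pairs that satisfy the predicate
    candidates[predicate(candidates), , drop = FALSE]
  })

  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

x <- data.frame(id = 1:3, lo = c(1, 5, 10))
y <- data.frame(val = c(2, 6, 11, 12))

# Arbitrary predicate: val is at most 2 above lo
near <- function(d) d$val >= d$lo & d$val - d$lo <= 2

res <- join_any_sketch(x, y, near, batch_size = 2L)
```

Peak memory here scales with `batch_size * nrow(y)` rather than `nrow(x) * nrow(y)`, which is the only advantage over a plain cross join + filter.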
2087f24 to 217eef1
Note: Go back and answer a few popular community questions, like:
13d9f5a to 535c8d5
Made it up to the end of `join-cols.R`. I'll tackle the rest later.
Overall, looks great and I think this is good to merge once you've read through my smaller suggestions, and we're ready to start on dplyr 1.1.
This defaults to `keep = FALSE` on equi conditions, but `keep = TRUE` on non-equi conditions, as this is generally what you want. The information in columns involved in a non-equi condition rarely overlaps, so you almost never want to drop the keys of the RHS.
Merge branch 'main' into feature/vec-matches # Conflicts: # R/join-cols.R # R/join-rows.R # R/join.r # tests/testthat/_snaps/bind.md # tests/testthat/_snaps/join-cols.md # tests/testthat/_snaps/join-rows.md # tests/testthat/test-join-cols.R
I don't think these can error, as the common type is handled earlier, but it doesn't hurt and is good to be consistent
Closes #5914
Closes #5661
Closes #5413
Closes #2240 (a whopping 5 years old! with 138 thumbs up!)
Reading order
- NEWS
- `join-by.R`
  - `join_by()` docs / examples and to get a general idea of how it works
- `join.R`
  - `join_rows()` now
- `join-rows.R`
  - `vec_matches()` arguments and calls `vec_matches()`
- `join-cols.R`
  - `keep = NULL` is handled
  - `standardise_join_by()` was replaced by `as_join_by()` and `join_by_common()` in `join-by.R`
- Tests
  - `join_by()` parsing checks
  - `dplyr_matches()`
  - `multiple = "all"` now to silence the new warning
Summary
This PR has two main purposes:
1. It adds `join_by()` for creating a join specification which can be supplied as the `by` argument to any join. This allows:
   - Equi join specification with unnamed left and right hand sides, i.e. `join_by(date1 == date2)`
   - Specification of non-equi joins like `join_by(id, date1 >= date2)`
   - Specification of rolling joins with `join_by(id, preceding(date1, date2))` and `join_by(id, following(date1, date2))`
   - Shortcuts for some complex non-equi joins: `between()`, `overlaps()`, `within()`.
   There is a restriction that the left and right hand sides of the join conditions have to be symbols or strings, i.e. you can't do `join_by(x + 1 > y)`. It is unclear what the resulting column name should be if you did this, and you can get into order of operations issues, i.e. `!x > y`, where `!` has a lower precedence than `>` in R, so the expression parses as `!(x > y)`.
   To support non-equi joins, `keep` has gained a completely new default value of `NULL`. In non-equi joins like `join_by(sale_date >= commercial_date)`, since the information in the two columns isn't exactly the same, you almost always want to keep both columns. So `keep = NULL` implies `keep = FALSE` for equi conditions and `keep = TRUE` for non-equi conditions. This should be fully backwards compatible.
2. It adds two new "quality control" arguments to most of the join functions that have been requested over the years.
   - `multiple` is a new argument for controlling what happens when a row in `x` matches multiple rows in `y`. It allows `c("all", "first", "last", "warning", "error")`. The default is `NULL`, which uses `"warning"` for equi and rolling joins, where multiple matches are surprising, and `"all"` for cross joins and when there is at least 1 non-equi join condition, where multiple matches are expected. This is a change from the current CRAN dplyr behavior, which never warns.
   - `unmatched` is a new argument for controlling what happens when a row would be dropped because it doesn't have a match. It allows `c("drop", "error")`.
   Combined with the proposed `enforce()`, this allows for a number of useful checks to ensure that you are doing a 1:1, 1:m, m:1, m:m style join.
Not implemented
Future enhancements for other PRs (as suggested from comments):
- `join_any()` that would accept arbitrary predicates and would perform a much slower cartesian join + filter using the predicates. It could do it in batches to reduce peak memory. This would require a completely separate engine path for each type of join.
- `join_at()`, a tidy select interface for equi joins if you have many columns of the same name to join by. This would have an `as_join_by()` method that translated the result to a `dplyr_join_by` object.
Examples
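A minimal sketch of the kind of call this PR enables, assuming an installed dplyr that includes `join_by()` (1.1.0 or later). The `sales` and `ads` data frames are made up for illustration, and the rolling helpers are left out because their names depend on the exact build.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which provides join_by()

# Made-up example data
sales <- tibble(
  id = c(1, 1, 2),
  sale_date = as.Date(c("2021-01-10", "2021-03-01", "2021-02-15"))
)
ads <- tibble(
  id = c(1, 1, 2),
  commercial_date = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01"))
)

# Equi condition on `id` plus a non-equi condition on the dates.
# With the default `keep = NULL`, `id` is merged into one column while
# both date columns are kept.
out <- left_join(sales, ads, by = join_by(id, sale_date >= commercial_date))

# `multiple = "first"` keeps only the first match in `y` for each row of `x`
first_only <- left_join(
  sales, ads,
  by = join_by(id, sale_date >= commercial_date),
  multiple = "first"
)

# `unmatched = "error"` turns silently dropped rows into an error;
# here the sale for id 2 has no earlier commercial, so the join fails
err <- tryCatch(
  inner_join(
    sales, ads,
    by = join_by(id, sale_date >= commercial_date),
    unmatched = "error"
  ),
  error = function(e) e
)
```

In this sketch `out` has one row per matching sale/commercial pair plus one `NA` row for the unmatched sale, and `err` holds the condition raised by the failed inner join.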