Fix Index.difference to avoid collect 'other' to driver side #2173

Merged
merged 1 commit into from
Jun 15, 2021

Conversation

itholic

@itholic itholic commented Jun 15, 2021

This PR originates from SPARK-35683.

It fixes the incorrect behavior of Index.difference in Koalas, based on the comments #1325 (comment) and #1325 (comment):

  • It could not handle the case where self is an Index and other is a MultiIndex, or vice versa.
>>> import databricks.koalas as ks
>>> idx1 = ks.Index([1, 2, 3])
>>> midx1 = ks.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
databricks.koalas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
  • It collected all of the data in other to the driver side when other was a list-like object. This is very dangerous when other is a distributed object such as a Series.

This PR also adds the related test cases.
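
To illustrate why avoiding a driver-side collect matters, here is a toy model in plain Python. It is not the code in this PR and it does not use Koalas or Spark. The names `hash_partition` and `distributed_difference` are hypothetical, and the lists of buckets only stand in for partitions of a distributed dataset. The idea sketched is a hash-partitioned difference: both sides are bucketed by key, so each bucket can be processed independently and no single place needs to hold all of other.

```python
from typing import Hashable, List, Sequence


def hash_partition(values: Sequence[Hashable], num_partitions: int) -> List[list]:
    """Route each value to a bucket by its hash (hypothetical helper).

    Equal values always land in the same bucket index, on either side.
    """
    buckets: List[list] = [[] for _ in range(num_partitions)]
    for value in values:
        buckets[hash(value) % num_partitions].append(value)
    return buckets


def distributed_difference(left: Sequence, right: Sequence, num_partitions: int = 4) -> list:
    """Toy set difference computed bucket by bucket (hypothetical helper).

    Each loop iteration only looks at one pair of buckets, which models
    work that separate executors could do without a global collect.
    """
    left_buckets = hash_partition(left, num_partitions)
    right_buckets = hash_partition(right, num_partitions)

    result = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        exclude = set(right_bucket)  # only this bucket's share of `right`
        result.extend(v for v in set(left_bucket) if v not in exclude)
    # Sort the final result so the output order is deterministic.
    return sorted(result)


print(distributed_difference([1, 2, 3, 4], [3, 4, 5]))  # [1, 2]
```

In this model the largest in-memory set is one bucket of other, whereas iterating over a distributed other on the driver would materialize all of it at once.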

@HyukjinKwon HyukjinKwon changed the title Fix Index.different to work properly Fix Index.difference to avoid collect 'other' to driver side Jun 15, 2021
@codecov-commenter

codecov-commenter commented Jun 15, 2021

Codecov Report

Merging #2173 (5f133a5) into master (f971143) will decrease coverage by 0.00%.
The diff coverage is 95.65%.


@@            Coverage Diff             @@
##           master    #2173      +/-   ##
==========================================
- Coverage   95.34%   95.34%   -0.01%     
==========================================
  Files          60       60              
  Lines       13723    13737      +14     
==========================================
+ Hits        13084    13097      +13     
- Misses        639      640       +1     
Impacted Files | Coverage Δ
databricks/koalas/indexes/base.py | 97.17% <91.66%> (-0.15%) ⬇️
databricks/koalas/tests/indexes/test_base.py | 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f971143...5f133a5. Read the comment docs.

@HyukjinKwon HyukjinKwon merged commit 1fa3c11 into databricks:master Jun 15, 2021

3 participants