
Implement the missing Series.swaplevel #1919

Merged: 13 commits into databricks:master on Nov 23, 2020

Conversation

xinrong-meng (Contributor) commented Nov 18, 2020

No description provided.
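For context, pandas' Series.swaplevel swaps two levels of a MultiIndex; a minimal pandas session sketching the behavior this PR mirrors in Koalas (the variable names here are illustrative, not from the PR):

>>> import pandas as pd
>>> midx = pd.MultiIndex.from_arrays([['a', 'b'], [1, 2]], names=['word', 'number'])
>>> s = pd.Series(['x', 'y'], index=midx, name='b')
>>> s.swaplevel()  # defaults i=-2, j=-1: swap the two innermost levels
number  word
1       a       x
2       b       y
Name: b, dtype: object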

ueshin (Collaborator) commented Nov 18, 2020

FYI: You have to remove the API from "missing" files and add a link in the doc.

itholic (Contributor) commented Nov 19, 2020

You can find the missing files in databricks/koalas/missing/series.py and the doc link in docs/source/reference/series.rst :)
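For readers unfamiliar with the repo layout, the change they describe roughly amounts to the following sketch; the exact stub and section names are assumptions based on the usual Koalas conventions, not taken from the diff:

# databricks/koalas/missing/series.py -- remove the "unsupported" stub so the
# new implementation is no longer reported as missing (assumed stub shown):
class MissingPandasLikeSeries(object):
    # swaplevel = _unsupported_function("swaplevel")   # <- delete this entry
    ...

# docs/source/reference/series.rst -- add Series.swaplevel to the relevant
# autosummary list so the API reference links to the new method.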

codecov-io commented Nov 20, 2020

Codecov Report

Merging #1919 (ced185b) into master (7877587) will increase coverage by 0.08%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1919      +/-   ##
==========================================
+ Coverage   94.08%   94.16%   +0.08%     
==========================================
  Files          41       41              
  Lines       10019    10022       +3     
==========================================
+ Hits         9426     9437      +11     
+ Misses        593      585       -8     
Impacted Files                        Coverage Δ
databricks/koalas/missing/series.py   100.00% <ø> (ø)
databricks/koalas/series.py           97.07% <100.00%> (+0.04%) ⬆️
databricks/koalas/namespace.py        84.23% <0.00%> (ø)
databricks/koalas/indexes.py          96.85% <0.00%> (+0.01%) ⬆️
databricks/koalas/groupby.py          91.42% <0.00%> (+0.98%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7877587...ced185b.

xinrong-meng marked this pull request as ready for review on November 21, 2020 01:02
index_map = list(zip(self._internal.index_spark_column_names, self._internal.index_names))
index_map[i], index_map[j] = index_map[j], index_map[i]
index_spark_column_names, index_names = zip(*index_map)
internal = self._kdf._internal.copy(
ueshin (Collaborator):
self._internal instead of self._kdf._internal?

xinrong-meng (Contributor, Author):

May I ask what the difference is in usage?

I see that self._internal returns self._kdf._internal.select_column(self._column_label).

ueshin (Collaborator):

self._kdf._internal contains other columns from the anchor DataFrame.

E.g.:

>>> midx = pd.MultiIndex.from_arrays([['a', 'b'], [1, 2]], names = ['word', 'number'])
>>> kdf = ks.DataFrame({'a': [1, 2], 'b': ['x', 'y']}, index=midx)
>>> kdf
             a  b
word number
a    1       1  x
b    2       2  y
>>> kdf.b.swaplevel()
number  word
1       a       1
2       b       2
Name: a, dtype: int64

should be:

>>> kdf.b.swaplevel()
number  word
1       a       x
2       b       y
Name: b, dtype: object

xinrong-meng (Contributor, Author):

Thanks for explaining!

I'm a little confused by "contains". Their spark_frame seems to be identical.

>>> kdf.b._internal.spark_frame.show()
+-----------------+-----------------+---+---+-----------------+
|__index_level_0__|__index_level_1__|  a|  b|__natural_order__|
+-----------------+-----------------+---+---+-----------------+
|                a|                1|  1|  x|      42949672960|
|                b|                2|  2|  y|      94489280512|
+-----------------+-----------------+---+---+-----------------+

>>> kdf.b._kdf._internal.spark_frame.show()
+-----------------+-----------------+---+---+-----------------+
|__index_level_0__|__index_level_1__|  a|  b|__natural_order__|
+-----------------+-----------------+---+---+-----------------+
|                a|                1|  1|  x|      42949672960|
|                b|                2|  2|  y|      94489280512|
+-----------------+-----------------+---+---+-----------------+

ueshin (Collaborator) commented Nov 23, 2020:

I mean their metadata:

>>> kdf.b._internal.column_labels
[('b',)]
>>> kdf.b._kdf._internal.column_labels
[('a',), ('b',)]

or

>>> kdf.b._internal.data_spark_columns
[Column<b'b'>]
>>> kdf.b._kdf._internal.data_spark_columns
[Column<b'a'>, Column<b'b'>]

The column_labels or data_spark_columns in the InternalFrame for a Series contains only the single entry for that Series.

xinrong-meng (Contributor, Author):

I see, that's clear to me now. Thank you!
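To summarize the thread, the level swap itself is just a reordering of the index metadata; below is a standalone sketch of that bookkeeping, with plain lists standing in for the InternalFrame fields (names taken from the diff above) and the fix being to copy self._internal, the per-Series view, rather than self._kdf._internal:

# Hypothetical stand-ins for self._internal.index_spark_column_names / index_names
index_spark_column_names = ["__index_level_0__", "__index_level_1__"]
index_names = [("word",), ("number",)]

i, j = -2, -1  # pandas' swaplevel defaults: the two innermost levels
index_map = list(zip(index_spark_column_names, index_names))
index_map[i], index_map[j] = index_map[j], index_map[i]
index_spark_column_names, index_names = zip(*index_map)

print(index_spark_column_names)  # ('__index_level_1__', '__index_level_0__')
print(index_names)               # (('number',), ('word',))

# In the PR, the swapped values are then passed to self._internal.copy(...)
# (not self._kdf._internal.copy(...)), so the resulting frame carries only
# this Series' own column label and data column.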

xinrong-meng requested a review from ueshin on November 23, 2020 17:25
ueshin (Collaborator) left a comment:

Otherwise, LGTM.

xinrong-meng merged commit 95ec75e into databricks:master on Nov 23, 2020