-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added droplevel function to dataframe #1622
Conversation
@itholic Can you please review and let me know your feedback? |
@jijosg Yup, I'll take a look at this soon. Thanks for the work on this :D |
You can reformat the code style by using If use it, the code style will be changed like the below and can be passed the build test. - pdf = pd.DataFrame([
- [1, 2, 3, 4],
- [5, 6, 7, 8],
- [9, 10, 11, 12]
- ]).set_index([0, 1]).rename_axis(['a', 'b'])
-
- pdf.columns = pd.MultiIndex.from_tuples([('c', 'e'), ('d', 'f')],
- names=['level_1', 'level_2'])
- kdf = ks.from_pandas(pdf)
- self.assert_eq(pdf.droplevel('a'), kdf.droplevel('a'))
- self.assert_eq(pdf.droplevel('level_1', axis=1), kdf.droplevel('level_1', axis=1))
- self.assertRaises(ValueError, lambda: kdf.droplevel(['a', 'b']))
- self.assertRaises(ValueError, lambda: kdf.droplevel(['level_1', 'level_2'], axis=1))
+ pdf = (
+ pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
+ .set_index([0, 1])
+ .rename_axis(["a", "b"])
+ )
+
+ pdf.columns = pd.MultiIndex.from_tuples(
+ [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
+ )
+ kdf = ks.from_pandas(pdf)
+ self.assert_eq(pdf.droplevel("a"), kdf.droplevel("a"))
+ self.assert_eq(pdf.droplevel("level_1", axis=1), kdf.droplevel("level_1", axis=1))
+ self.assertRaises(ValueError, lambda: kdf.droplevel(["a", "b"]))
+ self.assertRaises(ValueError, lambda: kdf.droplevel(["level_1", "level_2"], axis=1)) |
I am using Windows for building the code hence was not able to run dev/reformat before committing...thanks for reviewing (y) |
databricks/koalas/frame.py
Outdated
... [1, 2, 3, 4], | ||
... [5, 6, 7, 8], | ||
... [9, 10, 11, 12] | ||
... ]).set_index([0, 1]).rename_axis(['a', 'b']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we have a different behaviour from pandas here.
When create a DataFrame
using list of list, the type of default columns is going to be a object
unlike pandas like the below.
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> pdf = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> kdf.columns
Index(['0', '1', '2', '3'], dtype='object') # the type of default columns for Koalas is `object`
>>> pdf.columns
RangeIndex(start=0, stop=4, step=1) # whereas for pandas is `int64`
>>> pdf.columns.dtype
dtype('int64')
cc @ueshin @HyukjinKwon . Should we fix this? or Is there a special reason why Koalas is using such behaviour ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far column names accept only string or tuple of string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jijosg , Could you fix this test like the below for now because Koalas has not rename_axis
yet ?
>>> df = ks.DataFrame(
... [[3, 4], [7, 8], [11, 12]],
... index=pd.MultiIndex.from_tuples([(1, 2), (5, 6), (9, 10)], names=["a", "b"]),
... )
>>> df.columns = pd.MultiIndex.from_tuples([
... ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure i will make the changes , thanks for looking into this
databricks/koalas/frame.py
Outdated
9 10 11 12 | ||
""" | ||
axis = validate_axis(axis) | ||
internal = self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but In Koalas, internal
is usually used as an instance for InternalFrame
.
Can we use another name like just kdf
or something ?
databricks/koalas/frame.py
Outdated
6 7 8 | ||
10 11 12 | ||
|
||
>>> df.droplevel('level2', axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
level2
-> level_2
databricks/koalas/frame.py
Outdated
... ('c', 'e'), ('d', 'f') | ||
... ], names=['level_1', 'level_2']) | ||
|
||
>>> df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need # doctest: +NORMALIZE_WHITESPACE
here and belows.
you can
>>> df # doctest: +NORMALIZE_WHITESPACE
level_1 c d
level_2 e f
a b
1 2 3 4
5 6 7 8
9 10 11 12
>>> df.droplevel('a') # doctest: +NORMALIZE_WHITESPACE
level_1 c d
level_2 e f
b
2 3 4
6 7 8
10 11 12
>>> df.droplevel('level_2', axis=1) # doctest: +NORMALIZE_WHITESPACE
level_1 c d
a b
1 2 3 4
5 6 7 8
9 10 11 12
databricks/koalas/frame.py
Outdated
"levels: at least one level must be " | ||
"left.".format(len(level), nlevels) | ||
) | ||
internal = internal.reset_index(level).drop(level) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use Spark drop
function here via internal Spark DataFrame to reduce the number of running Spark job.
For example, you can do here like as below.
drop_spark_index_columns = list()
index_spark_column_names = internal._internal.index_spark_column_names
for n in level:
if isinstance(n, int):
index_order = n
elif isinstance(n, (str, tuple)):
index_order = internal.index.names.index(n)
drop_spark_index_columns.append(index_spark_column_names[index_order])
sdf = internal._internal.spark_frame
sdf = sdf.drop(*drop_spark_index_columns)
index_map = internal._internal.index_map.copy()
for drop_spark_index_column in drop_spark_index_columns:
index_map.pop(drop_spark_index_column)
internal_frame = internal._internal.copy(spark_frame=sdf, index_map=index_map)
internal = DataFrame(internal_frame)
Could you add this to the docs also ?? It is placed at |
Could you resolve the test failure? cc @ueshin @HyukjinKwon , can we just merge this after resolving test failure, and I'll integrate this and #1630 in separated PR after this and #1630 merged ? |
@@ -553,6 +553,22 @@ def test_dot_in_column_name(self): | |||
ks.Series([1]), | |||
) | |||
|
|||
def test_droplevel(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was first introduced in pandas 0.24.0.
I think we can test this method only in pandas >= 0.24.0 like the below.
def test_droplevel(self):
# droplevel is new in pandas 0.24.0
if LooseVersion(pd.__version__) >= LooseVersion("0.24.0"):
pdf = (
pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
.set_index([0, 1])
.rename_axis(["a", "b"])
)
pdf.columns = pd.MultiIndex.from_tuples(
[("c", "e"), ("d", "f")], names=["level_1", "level_2"]
)
kdf = ks.from_pandas(pdf)
self.assert_eq(pdf.droplevel("a"), kdf.droplevel("a"))
self.assert_eq(pdf.droplevel("level_1", axis=1), kdf.droplevel("level_1", axis=1))
self.assertRaises(ValueError, lambda: kdf.droplevel(["a", "b"]))
self.assertRaises(ValueError, lambda: kdf.droplevel(["level_1", "level_2"], axis=1))
Sure, let's do that. |
Thank you @jijosg . |
Resolves #1614
This pull request implements droplevel functionality for koalas dataframe.