Added droplevel function to dataframe #1622

jijosg · 2020-07-01T04:31:13Z

Resolves #1614

This pull request implements droplevel for the Koalas DataFrame.

>>> df
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12

>>> df.droplevel('a')
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12

>>> df.droplevel('level_2', axis=1)
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12
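For comparison, the same frame can be built and checked in plain pandas (0.24.0 or later), whose `droplevel` the Koalas method is meant to mirror. This is a minimal sketch using only pandas; the Koalas calls themselves are not exercised here.

```python
import pandas as pd

# Build the example frame: a two-level row index (a, b) and
# two-level columns (level_1, level_2).
df = pd.DataFrame(
    [[3, 4], [7, 8], [11, 12]],
    index=pd.MultiIndex.from_tuples([(1, 2), (5, 6), (9, 10)], names=["a", "b"]),
    columns=pd.MultiIndex.from_tuples(
        [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
    ),
)

# Drop one level from the row index (axis=0 is the default).
rows_dropped = df.droplevel("a")

# Drop one level from the columns.
cols_dropped = df.droplevel("level_2", axis=1)

print(list(rows_dropped.index.names))    # ['b']
print(list(cols_dropped.columns.names))  # ['level_1']
```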

jijosg · 2020-07-01T05:41:12Z

@itholic Can you please review and let me know your feedback?

itholic · 2020-07-01T07:49:45Z

@jijosg Yup, I'll take a look at this soon. Thanks for the work on this :D

itholic · 2020-07-01T12:03:02Z

You can reformat the code with dev/reformat.

Running it changes the code style as shown below, which should let the build checks pass.

-        pdf = pd.DataFrame([
-            [1, 2, 3, 4],
-            [5, 6, 7, 8],
-            [9, 10, 11, 12]
-        ]).set_index([0, 1]).rename_axis(['a', 'b'])
-
-        pdf.columns = pd.MultiIndex.from_tuples([('c', 'e'), ('d', 'f')],
-                                                names=['level_1', 'level_2'])
-        kdf = ks.from_pandas(pdf)
-        self.assert_eq(pdf.droplevel('a'), kdf.droplevel('a'))
-        self.assert_eq(pdf.droplevel('level_1', axis=1), kdf.droplevel('level_1', axis=1))
-        self.assertRaises(ValueError, lambda: kdf.droplevel(['a', 'b']))
-        self.assertRaises(ValueError, lambda: kdf.droplevel(['level_1', 'level_2'], axis=1))
+        pdf = (
+            pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
+            .set_index([0, 1])
+            .rename_axis(["a", "b"])
+        )
+
+        pdf.columns = pd.MultiIndex.from_tuples(
+            [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
+        )
+        kdf = ks.from_pandas(pdf)
+        self.assert_eq(pdf.droplevel("a"), kdf.droplevel("a"))
+        self.assert_eq(pdf.droplevel("level_1", axis=1), kdf.droplevel("level_1", axis=1))
+        self.assertRaises(ValueError, lambda: kdf.droplevel(["a", "b"]))
+        self.assertRaises(ValueError, lambda: kdf.droplevel(["level_1", "level_2"], axis=1))

jijosg · 2020-07-01T12:18:27Z

I'm building on Windows, so I wasn't able to run dev/reformat before committing. Thanks for reviewing (y)

itholic · 2020-07-02T06:43:03Z

databricks/koalas/frame.py

+        ...    [1, 2, 3, 4],
+        ...    [5, 6, 7, 8],
+        ...    [9, 10, 11, 12]
+        ... ]).set_index([0, 1]).rename_axis(['a', 'b'])


Seems like we have a different behaviour from pandas here.

When a DataFrame is created from a list of lists, the default column labels in Koalas are strings (dtype `object`), unlike pandas, where they are integers, as shown below.

>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> pdf = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> kdf.columns
Index(['0', '1', '2', '3'], dtype='object')  # the type of default columns for Koalas is `object`
>>> pdf.columns
RangeIndex(start=0, stop=4, step=1)  # whereas for pandas it is `int64`
>>> pdf.columns.dtype
dtype('int64')

cc @ueshin @HyukjinKwon. Should we fix this, or is there a particular reason Koalas behaves this way?
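On the pandas side, the default labels can be inspected directly, and casting them to strings gives labels comparable to the Koalas output shown above. The cast is only an illustration for comparing the two libraries, not something this PR does.

```python
import pandas as pd

pdf = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# pandas generates integer labels (a RangeIndex) when none are given.
default_labels = list(pdf.columns)

# Illustration only: cast the labels to strings to match the
# string labels shown in the Koalas output.
pdf_str = pdf.copy()
pdf_str.columns = pdf_str.columns.astype(str)
string_labels = list(pdf_str.columns)

print(default_labels)  # [0, 1, 2, 3]
print(string_labels)   # ['0', '1', '2', '3']
```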

ueshin · 2020-07-02

So far, column names accept only a string or a tuple of strings.

itholic · 2020-07-02

@jijosg, could you change this test as shown below for now, since Koalas does not have rename_axis yet?

>>> df = ks.DataFrame(
...     [[3, 4], [7, 8], [11, 12]],
...     index=pd.MultiIndex.from_tuples([(1, 2), (5, 6), (9, 10)], names=["a", "b"]),
... )
>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])

itholic · 2020-07-02

@ueshin Thanks!

jijosg · 2020-07-02

Sure, I will make the changes. Thanks for looking into this.
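The construction suggested above avoids rename_axis by passing a named MultiIndex up front. A small pandas-only sketch, assuming nothing beyond the values already used in this thread, checks that it produces the same frame as the original set_index/rename_axis chain:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
)

# Original approach: move two columns into the index, then name the levels.
with_rename = (
    pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
    .set_index([0, 1])
    .rename_axis(["a", "b"])
)
with_rename.columns = columns

# Suggested approach: pass a named MultiIndex directly, no rename_axis needed.
without_rename = pd.DataFrame(
    [[3, 4], [7, 8], [11, 12]],
    index=pd.MultiIndex.from_tuples([(1, 2), (5, 6), (9, 10)], names=["a", "b"]),
)
without_rename.columns = columns

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(with_rename, without_rename)
```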

itholic · 2020-07-02T07:37:09Z

databricks/koalas/frame.py

+        9 10    11  12
+        """
+        axis = validate_axis(axis)
+        internal = self.copy()


Not a big deal, but in Koalas, `internal` is usually used for an InternalFrame instance.
Can we use another name, such as kdf?

itholic · 2020-07-02T08:54:13Z

databricks/koalas/frame.py

+        6        7   8
+        10      11  12
+
+        >>> df.droplevel('level2', axis=1)


level2 -> level_2

itholic · 2020-07-02T08:55:19Z

databricks/koalas/frame.py

+        ...   ('c', 'e'), ('d', 'f')
+        ... ], names=['level_1', 'level_2'])
+
+        >>> df


I think we need # doctest: +NORMALIZE_WHITESPACE here and in the examples below.

For example:

>>> df  # doctest: +NORMALIZE_WHITESPACE
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12

>>> df.droplevel('a')  # doctest: +NORMALIZE_WHITESPACE
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12

>>> df.droplevel('level_2', axis=1)  # doctest: +NORMALIZE_WHITESPACE
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12
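To see what the directive changes, the sketch below runs the same example twice with the standard-library doctest module, once without and once with NORMALIZE_WHITESPACE. The `count_failures` helper is a name made up for this illustration.

```python
import doctest

# Same statement, same expected output; only the directive differs.
STRICT = """
>>> print("1 2      3   4")
1 2 3 4
"""

RELAXED = """
>>> print("1 2      3   4")  # doctest: +NORMALIZE_WHITESPACE
1 2 3 4
"""


def count_failures(text):
    """Run the doctest examples in ``text`` and return how many failed."""
    test = doctest.DocTestParser().get_doctest(
        text, globs={}, name="example", filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report; only the counts matter here.
    results = runner.run(test, out=lambda message: None)
    return results.failed


print(count_failures(STRICT))   # 1
print(count_failures(RELAXED))  # 0
```

With the directive, runs of whitespace in the expected and actual output compare as equal, so column alignment in printed frames no longer has to match exactly.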

itholic · 2020-07-02T08:56:48Z

databricks/koalas/frame.py

+                    "levels: at least one level must be "
+                    "left.".format(len(level), nlevels)
+                )
+            internal = internal.reset_index(level).drop(level)


I think we should use Spark's drop function on the internal Spark DataFrame here, to reduce the number of Spark jobs that run.

For example, you could do something like this:

drop_spark_index_columns = list()
index_spark_column_names = internal._internal.index_spark_column_names
for n in level:
    if isinstance(n, int):
        index_order = n
    elif isinstance(n, (str, tuple)):
        index_order = internal.index.names.index(n)
    drop_spark_index_columns.append(index_spark_column_names[index_order])
sdf = internal._internal.spark_frame
sdf = sdf.drop(*drop_spark_index_columns)
index_map = internal._internal.index_map.copy()
for drop_spark_index_column in drop_spark_index_columns:
    index_map.pop(drop_spark_index_column)
internal_frame = internal._internal.copy(spark_frame=sdf, index_map=index_map)
internal = DataFrame(internal_frame)
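The core of this suggestion is translating each requested level (a position or a name) into the backing column to drop, so that all of them can be dropped in one call. Below is a minimal pure-Python sketch of that lookup with no Spark involved. `resolve_drop_columns` and the `__index_level_N__` column names are hypothetical stand-ins for the Koalas internals, and the error wording is modeled on the pandas message rather than taken from this PR.

```python
def resolve_drop_columns(level, index_names, spark_column_names):
    """Map requested index levels to the backing column names to drop."""
    # Accept a single level as well as a list of levels.
    if not isinstance(level, list):
        level = [level]

    # At least one index level must remain after dropping.
    if len(level) >= len(index_names):
        raise ValueError(
            "Cannot remove {} levels from an index with {} levels: "
            "at least one level must be left.".format(len(level), len(index_names))
        )

    to_drop = []
    for n in level:
        if isinstance(n, int):
            position = n                     # level given by position
        else:
            position = index_names.index(n)  # level given by name
        to_drop.append(spark_column_names[position])
    return to_drop


names = ["a", "b"]
columns = ["__index_level_0__", "__index_level_1__"]
print(resolve_drop_columns("a", names, columns))  # ['__index_level_0__']
print(resolve_drop_columns(1, names, columns))    # ['__index_level_1__']
```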

itholic · 2020-07-02T09:39:46Z

Could you also add this to the docs?

The reference page is at docs/source/reference/frame.rst.

itholic · 2020-07-05T06:57:05Z

Could you resolve the test failure?

cc @ueshin @HyukjinKwon, can we merge this once the test failure is resolved? I'll then integrate this and #1630 in a separate PR after both are merged.

itholic · 2020-07-05T21:29:02Z

databricks/koalas/tests/test_dataframe.py

@@ -553,6 +553,22 @@ def test_dot_in_column_name(self):
            ks.Series([1]),
        )

+    def test_droplevel(self):


This method was first introduced in pandas 0.24.0.

I think we can run this test only for pandas >= 0.24.0, like below.

def test_droplevel(self):
    # droplevel is new in pandas 0.24.0
    if LooseVersion(pd.__version__) >= LooseVersion("0.24.0"):
        pdf = (
            pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
            .set_index([0, 1])
            .rename_axis(["a", "b"])
        )
        pdf.columns = pd.MultiIndex.from_tuples(
            [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
        )
        kdf = ks.from_pandas(pdf)
        self.assert_eq(pdf.droplevel("a"), kdf.droplevel("a"))
        self.assert_eq(pdf.droplevel("level_1", axis=1), kdf.droplevel("level_1", axis=1))
        self.assertRaises(ValueError, lambda: kdf.droplevel(["a", "b"]))
        self.assertRaises(ValueError, lambda: kdf.droplevel(["level_1", "level_2"], axis=1))
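An alternative way to express the same guard is `unittest.skipIf`, which reports the test as skipped on old pandas rather than letting it pass silently. This is a sketch, not what the PR merged: `version_tuple` is a hypothetical helper standing in for `LooseVersion`, and the test body exercises pandas only.

```python
import re
import unittest

import pandas as pd


def version_tuple(version):
    """Return the leading (major, minor) numbers of a version string."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


# DataFrame.droplevel is new in pandas 0.24.0.
HAS_DROPLEVEL = version_tuple(pd.__version__) >= (0, 24)


class DropLevelTest(unittest.TestCase):
    @unittest.skipIf(not HAS_DROPLEVEL, "droplevel requires pandas >= 0.24.0")
    def test_droplevel(self):
        pdf = pd.DataFrame(
            [[3, 4], [7, 8]],
            index=pd.MultiIndex.from_tuples([(1, 2), (5, 6)], names=["a", "b"]),
        )
        self.assertEqual(list(pdf.droplevel("a").index.names), ["b"])
        # Dropping every level is rejected.
        self.assertRaises(ValueError, lambda: pdf.droplevel(["a", "b"]))
```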

HyukjinKwon · 2020-07-06T02:13:50Z

> Could you resolve the test failure?
>
> cc @ueshin @HyukjinKwon, can we merge this once the test failure is resolved? I'll then integrate this and #1630 in a separate PR after both are merged.

Sure, let's do that.

HyukjinKwon · 2020-07-09T12:14:40Z

Thank you @jijosg .

jijosg added 2 commits July 1, 2020 09:53

Added droplevel function to dataframe

e747c55

Reducing line length for lint tests

f97960c

Fixed reformat issue in test_dataframe

7656351

itholic reviewed Jul 1, 2020

itholic reviewed Jul 1, 2020

Removed commented out code

28f302b

itholic reviewed Jul 2, 2020

itholic reviewed Jul 2, 2020

jijosg added 2 commits July 3, 2020 16:23

Added reviewed changes and updated docs/frame.rst

46e48e3

Merge remote-tracking branch 'upstream/master' into droplevel

3f2926a

itholic reviewed Jul 5, 2020

skip test for pandas < 0.24.0

ea08d25

HyukjinKwon approved these changes Jul 9, 2020

HyukjinKwon merged commit fd047b5 into databricks:master Jul 9, 2020

jijosg deleted the droplevel branch July 9, 2020 12:31
