Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added droplevel function to dataframe #1622

Merged
merged 7 commits into from
Jul 9, 2020

Conversation

jijosg
Copy link
Contributor

@jijosg jijosg commented Jul 1, 2020

Resolves #1614

This pull request implements droplevel functionality for koalas dataframe.

>>> df
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12

>>> df.droplevel('a')
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12

>>> df.droplevel('level_2', axis=1)
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12

@jijosg
Copy link
Contributor Author

jijosg commented Jul 1, 2020

@itholic Can you please review and let me know your feedback?

@itholic
Copy link
Contributor

itholic commented Jul 1, 2020

@jijosg Yup, I'll take a look at this soon. Thanks for the work on this :D

@itholic
Copy link
Contributor

itholic commented Jul 1, 2020

You can reformat the code style by using dev/reformat.

If use it, the code style will be changed like the below and can be passed the build test.

-        pdf = pd.DataFrame([
-            [1, 2, 3, 4],
-            [5, 6, 7, 8],
-            [9, 10, 11, 12]
-        ]).set_index([0, 1]).rename_axis(['a', 'b'])
-
-        pdf.columns = pd.MultiIndex.from_tuples([('c', 'e'), ('d', 'f')],
-                                                names=['level_1', 'level_2'])
-        kdf = ks.from_pandas(pdf)
-        self.assert_eq(pdf.droplevel('a'), kdf.droplevel('a'))
-        self.assert_eq(pdf.droplevel('level_1', axis=1), kdf.droplevel('level_1', axis=1))
-        self.assertRaises(ValueError, lambda: kdf.droplevel(['a', 'b']))
-        self.assertRaises(ValueError, lambda: kdf.droplevel(['level_1', 'level_2'], axis=1))
+        pdf = (
+            pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
+            .set_index([0, 1])
+            .rename_axis(["a", "b"])
+        )
+
+        pdf.columns = pd.MultiIndex.from_tuples(
+            [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
+        )
+        kdf = ks.from_pandas(pdf)
+        self.assert_eq(pdf.droplevel("a"), kdf.droplevel("a"))
+        self.assert_eq(pdf.droplevel("level_1", axis=1), kdf.droplevel("level_1", axis=1))
+        self.assertRaises(ValueError, lambda: kdf.droplevel(["a", "b"]))
+        self.assertRaises(ValueError, lambda: kdf.droplevel(["level_1", "level_2"], axis=1))

@jijosg
Copy link
Contributor Author

jijosg commented Jul 1, 2020

I am using Windows for building the code hence was not able to run dev/reformat before committing...thanks for reviewing (y)

... [1, 2, 3, 4],
... [5, 6, 7, 8],
... [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])
Copy link
Contributor

@itholic itholic Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we have a different behaviour from pandas here.

When create a DataFrame using list of list, the type of default columns is going to be a object unlike pandas like the below.

>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> pdf = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> kdf.columns
Index(['0', '1', '2', '3'], dtype='object')  # the type of default columns for Koalas is `object`
>>> pdf.columns
RangeIndex(start=0, stop=4, step=1)  # whereas for pandas is `int64`
>>> pdf.columns.dtype
dtype('int64')

cc @ueshin @HyukjinKwon . Should we fix this? or Is there a special reason why Koalas is using such behaviour ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far column names accept only string or tuple of string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jijosg , Could you fix this test like the below for now because Koalas has not rename_axis yet ?

>>> df = ks.DataFrame(
...     [[3, 4], [7, 8], [11, 12]],
...     index=pd.MultiIndex.from_tuples([(1, 2), (5, 6), (9, 10)], names=["a", "b"]),
... )

>>> df.columns = pd.MultiIndex.from_tuples([
...   ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure i will make the changes , thanks for looking into this

9 10 11 12
"""
axis = validate_axis(axis)
internal = self.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but In Koalas, internal is usually used as an instance for InternalFrame.
Can we use another name like just kdf or something ?

6 7 8
10 11 12

>>> df.droplevel('level2', axis=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

level2 -> level_2

... ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])

>>> df
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need # doctest: +NORMALIZE_WHITESPACE here and belows.

you can

        >>> df  # doctest: +NORMALIZE_WHITESPACE
        level_1   c   d
        level_2   e   f
        a b
        1 2      3   4
        5 6      7   8
        9 10    11  12
        >>> df.droplevel('a')  # doctest: +NORMALIZE_WHITESPACE
        level_1   c   d
        level_2   e   f
        b
        2        3   4
        6        7   8
        10      11  12

        >>> df.droplevel('level_2', axis=1)  # doctest: +NORMALIZE_WHITESPACE
        level_1   c   d
        a b
        1 2      3   4
        5 6      7   8
        9 10    11  12

"levels: at least one level must be "
"left.".format(len(level), nlevels)
)
internal = internal.reset_index(level).drop(level)
Copy link
Contributor

@itholic itholic Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use Spark drop function here via internal Spark DataFrame to reduce the number of running Spark job.

For example, you can do here like as below.

            drop_spark_index_columns = list()
            index_spark_column_names = internal._internal.index_spark_column_names
            for n in level:
                if isinstance(n, int):
                    index_order = n
                elif isinstance(n, (str, tuple)):
                    index_order = internal.index.names.index(n)
                drop_spark_index_columns.append(index_spark_column_names[index_order])
            sdf = internal._internal.spark_frame
            sdf = sdf.drop(*drop_spark_index_columns)
            index_map = internal._internal.index_map.copy()
            for drop_spark_index_column in drop_spark_index_columns:
                index_map.pop(drop_spark_index_column)
            internal_frame = internal._internal.copy(spark_frame=sdf, index_map=index_map)
            internal = DataFrame(internal_frame)

@itholic
Copy link
Contributor

itholic commented Jul 2, 2020

Could you add this to the docs also ??

It is placed at docs/source/reference/frame.rst

@itholic
Copy link
Contributor

itholic commented Jul 5, 2020

Could you resolve the test failure?

cc @ueshin @HyukjinKwon , can we just merge this after resolving test failure, and I'll integrate this and #1630 in separated PR after this and #1630 merged ?

@@ -553,6 +553,22 @@ def test_dot_in_column_name(self):
ks.Series([1]),
)

def test_droplevel(self):
Copy link
Contributor

@itholic itholic Jul 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was first introduced in pandas 0.24.0.

I think we can test this method only in pandas >= 0.24.0 like the below.

def test_droplevel(self):
    # droplevel is new in pandas 0.24.0
    if LooseVersion(pd.__version__) >= LooseVersion("0.24.0"):
        pdf = (
            pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
            .set_index([0, 1])
            .rename_axis(["a", "b"])
        )

        pdf.columns = pd.MultiIndex.from_tuples(
            [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
        )
        kdf = ks.from_pandas(pdf)
        self.assert_eq(pdf.droplevel("a"), kdf.droplevel("a"))
        self.assert_eq(pdf.droplevel("level_1", axis=1), kdf.droplevel("level_1", axis=1))
        self.assertRaises(ValueError, lambda: kdf.droplevel(["a", "b"]))
        self.assertRaises(ValueError, lambda: kdf.droplevel(["level_1", "level_2"], axis=1))

@HyukjinKwon
Copy link
Member

Could you resolve the test failure?

cc @ueshin @HyukjinKwon , can we just merge this after resolving test failure, and I'll integrate this and #1630 in separated PR after this and #1630 merged ?

Sure, let's do that.

@HyukjinKwon HyukjinKwon merged commit fd047b5 into databricks:master Jul 9, 2020
@HyukjinKwon
Copy link
Member

Thank you @jijosg .

@jijosg jijosg deleted the droplevel branch July 9, 2020 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.droplevel
4 participants