
Five methods for Filtering data with multiple conditions in Python - min2bro
https://kanoki.org/2020/01/21/pandas-dataframe-filter-with-multiple-conditions/
======
danpalmer
Title should probably clarify that this is with Pandas, that's much more
specific and less generally useful than "in Python".

Original title: "Pandas dataframe filter with Multiple conditions"

------
RobinL
To help readability I tend to do something like this:

f1 = (df["col1"] == condition1)

f2 = (df["col2"] == condition2)

df[f1 & f2]

This is equivalent to the 'pandas boolean indexing multiple conditions'
method.
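As a runnable sketch of this named-mask style (the data and column values here are invented for illustration):

```python
import pandas as pd

# Tiny frame standing in for a real dataset
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c"],
    "col2": [1, 2, 1, 1],
})

f1 = df["col1"] == "a"   # first condition as its own boolean Series
f2 = df["col2"] == 1     # second condition

result = df[f1 & f2]     # rows where both conditions hold
```

Naming the masks also makes it easy to reuse or negate them later, e.g. `df[f1 & ~f2]`.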

~~~
abricot
I do the same without the parentheses.

------
TrackerFF
This is probably gonna be sacrilege to the Pythonians, but I often wish there
was support for some SQL-like syntax when working with (pandas) data frames.
It certainly would make the process a lot smoother for some tasks.
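Pandas does ship one SQL-ish escape hatch, `.query()`, which accepts a string that reads close to a `WHERE` clause (the column names and data below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [120, 80, 150],
    "age": [45, 62, 30],
})

# Reads much like: SELECT * FROM df WHERE salary >= 100 AND age < 60
subset = df.query("salary >= 100 and age < 60")
```

It only covers row filtering, though, not joins or aggregation, so it's a partial answer at best.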

~~~
thenipper
I'd love it if someone took the next step from Pandas being influenced by R
and ported dplyr-esque syntax to it.

~~~
closed
I've been working on a library over the past year that does exactly that,
including generating dbplyr style SQL queries!

Would love your feedback :)

[https://github.com/machow/siuba](https://github.com/machow/siuba)


~~~
bkfunk
This looks interesting! Can I ask, what's the origin of the name? Both "siu"
and "siuba"? Would be a useful addition to the docs, IMO.

~~~
closed
Ah, thanks for pointing out the lack of explanation. I'll add one to the
readme.

It's a transliteration of the cantonese word for minibus, 小巴 :).

edit: siu (小) means little!

------
SiempreViernes
This seems to be about doing filtering with Pandas, not pure python. The title
should probably be changed to reflect this.

------
brian_herman__
Yeah, it looks like the code this person uploaded isn't escaping the HTML, or
it's being unescaped when it should be escaped.

df.loc[(df['Salary_in_1000']>=100) &amp; (df['Age']&lt; 60) &amp;
(df['FT_Team'].str.startswith('S')),['Name','FT_Team']]
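With the entities decoded back to `&` and `<`, that filter runs as expected; a minimal sketch using the article's column names but invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sam", "Alex", "Kim"],
    "Salary_in_1000": [120, 90, 150],
    "Age": [45, 30, 65],
    "FT_Team": ["Spurs", "Saints", "Sharks"],
})

# Well-paid, under 60, on a team starting with 'S'
out = df.loc[
    (df["Salary_in_1000"] >= 100)
    & (df["Age"] < 60)
    & (df["FT_Team"].str.startswith("S")),
    ["Name", "FT_Team"],
]
```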

------
lordgrenville
Would have been nice to see a comparison of performance, or at least which
style is suggested.

~~~
cyorir
My guess is that eval and query have the worst performance, since they need to
be interpreted first. As for the other methods, I'm not too sure how they
rank. Personally, for style I agree with the first comment on the linked page,
which is to split up the conditions for readability:

    
    f1 = (df["col1"] == condition1)
    
    f2 = (df["col2"] == condition2)
    
    df[f1 & f2]

------
closed
One thing that really surprises me: NONE of these methods work with grouped
DataFrames.

But grouping data is extremely common in data analysis.

Basically, the strategy with grouped data is to take the .loc approach and
sprinkle in a bunch of additional .transform calls. :/

~~~
devxpy
I wonder if this could be improved with SQL

------
data_ders
I strongly prefer .query() for legibility and because it can be used in a
pipe. My only problem is that often flake8 will not detect the use of a
variable inside of the query string. Has anyone else come across this before?
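Presumably this is the `@variable` syntax, where the only use of the Python variable sits inside the query string, so the linter reports it as unused (example data invented):

```python
import pandas as pd

threshold = 100  # flake8 can't see the use of this inside the string below
df = pd.DataFrame({"salary": [120, 80, 150]})

# @threshold tells .query() to look up the local Python variable
subset = df.query("salary >= @threshold")
```

A common workaround is suppressing that specific warning with a `# noqa` comment on the assignment, at the cost of losing the check entirely.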

~~~
truculent
If you supply a function to `.loc`, you can use it in pipes pretty easily.
E.g.

``` my_dataframe.loc[lambda df: df['col'] > 0.8] ```

Would this help work around your issue?
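Sketching out what that looks like in a longer chain (the columns and threshold are made up): because `.loc` accepts a callable, the filter sees the intermediate frame rather than the original variable, which is what makes it pipe-friendly.

```python
import pandas as pd

df = pd.DataFrame({"col": [0.1, 0.9, 0.85]})

result = (
    df
    .assign(doubled=lambda d: d["col"] * 2)   # intermediate step in the pipe
    .loc[lambda d: d["col"] > 0.8]            # filter on the piped frame, not `df`
)
```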

------
antman
Some speed comparison on a larger dataset would be interesting
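A rough harness for that comparison might look like the following; the absolute numbers depend entirely on hardware, data size, and pandas version, so treat it as a template rather than a result:

```python
import timeit

import numpy as np
import pandas as pd

# Larger synthetic frame; 100k rows keeps the run quick
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, 100_000),
    "b": rng.integers(0, 100, 100_000),
})

t_bool = timeit.timeit(lambda: df[(df["a"] > 50) & (df["b"] < 25)], number=10)
t_loc = timeit.timeit(
    lambda: df.loc[(df["a"] > 50) & (df["b"] < 25)], number=10
)
t_query = timeit.timeit(lambda: df.query("a > 50 and b < 25"), number=10)

print(f"boolean: {t_bool:.3f}s  loc: {t_loc:.3f}s  query: {t_query:.3f}s")
```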

