Breeding: python - Filtering top-level categories in a hierarchical Pandas DataFrame using lower-level data

Monday, 15 June 2015

I have a pandas DataFrame that contains a large number of categories, each of which has features, and each feature has its own subfeatures grouped in pairs. A simple version looks like the following:

                                            0         1    ...
    categories features subfeatures
    cat1       feature1 subfeature1 -0.224487 -0.227524
                        subfeature2 -0.591399 -0.799228
               feature2 subfeature1  1.190110 -1.365895    ...
                        subfeature2  0.720956 -1.325562
    cat2       feature1 subfeature1  1.856932       NaN
                        subfeature2 -1.354258 -0.740473
               feature2 subfeature1  0.234075 -1.362235    ...
                        subfeature2  0.013875  1.309564
    cat3       feature1 subfeature1       NaN       NaN
                        subfeature2 -1.260408  1.559721    ...
               feature2 subfeature1  0.419246  0.084386
                        subfeature2  0.969270  1.493417
    ...                    ...               ...

It can be generated using the following code:

    import pandas as pd
    import numpy as np

    np.random.seed(seed=90)
    # 3 categories x 2 features x 2 subfeatures x 2 value columns
    results = np.random.randn(3, 2, 2, 2)
    results[2, 0, 0, :] = np.nan  # cat3, feature1, subfeature1: both columns missing
    results[1, 0, 0, 1] = np.nan  # cat2, feature1, subfeature1: column 1 missing
    results = results.reshape((-1, 2))
    index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                        ["feature1", "feature2"],
                                        ["subfeature1", "subfeature2"]],
                                       names=["categories", "features", "subfeatures"])
    df = pd.DataFrame(results, index=index)

Now I would like to retrieve the top-level categories (cat1 etc.) for which the difference between subfeature1 and subfeature2 in the same column (0 or 1) is above a threshold.

For example: if the threshold is 1, I expect cat2 and cat3 to be returned. For feature1 in cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119 > threshold = 1. Similarly, for feature2 in cat3, the absolute difference between subfeature1 and subfeature2 in column 1 is 1.493417 - 0.084386 = 1.409031 > 1. On the other hand, cat1 is not returned because none of the differences between its subfeature pairs is greater than 1. NaN values invalidate a pair, which is then ignored.

What I have tried

I have managed to implement an iterative approach, but I sense that it is not taking advantage of pandas' full capabilities, and its performance is lacking:

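    threshold = 1
    results = []  # names of the categories that pass the threshold
    for cat in df.index.levels[0]:
        for feature in df.index.levels[1]:
            df2 = df.xs((cat, feature))
            diffs = abs(df2.loc['subfeature1'] - df2.loc['subfeature2'])
            # Series.max() skips NaN, so invalid pairs are ignored
            if diffs.max() > threshold and cat not in results:
                results.append(cat)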

This yields:

['cat2', 'cat3']

How would I go about implementing this using pandas' built-in vectorized abilities?

Edit: using Jeff's reply below, I noticed something funky:

    def f(x):
        a = max(abs(x.xs('subfeature1', level='subfeatures') - x.xs('subfeature2', level='subfeatures')))
        print(a)
        return a > 1

    result = df.groupby(level=['categories', 'features']).filter(f)
    print(result)

This gives:

    0.366912262765
    0.571703714569
    1
    0.469153603312
    0.0403331129905
    3.2111900125 <------------------------------------------------
    nan
    0.220200012413
    2.67179897269  <---------------------------------------------------
    nan
    nan
    0.550023734074
    1.40903094796  <-----------------------------------------------------!!!!!!!!!!!
                                            0         1
    categories features subfeatures
    cat2       feature1 subfeature1  1.856932       NaN
                        subfeature2 -1.354258 -0.740473

I've highlighted the places where the algorithm should include the category based on its score. Yet it doesn't include cat3. Do the NaNs have something to do with it?
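One thing worth checking with respect to the NaN question: Python's built-in max() does not skip NaN, and its result depends on where the NaN sits in the sequence, whereas pandas' Series.max() and NumPy's nanmax() ignore NaN. The small sketch below only illustrates that difference, using two of the values printed above. It is not a confirmed diagnosis of the output shown, and the variable names are made up for the illustration.

```python
import math

import numpy as np
import pandas as pd

# Two of the differences printed above, each paired with a NaN
# (the pairings are illustrative, not taken from the output).
nan_first = [float("nan"), 2.67179897269]
nan_last = [3.2111900125, float("nan")]

# Built-in max() keeps the first item unless a later one compares greater;
# every comparison against NaN is False, so the position of the NaN matters.
builtin_nan_first = max(nan_first)  # NaN stays the "maximum"
builtin_nan_last = max(nan_last)    # the NaN never replaces 3.2111900125

# pandas and NumPy offer NaN-aware maxima.
series_max = pd.Series(nan_first).max()  # skipna=True by default
numpy_max = np.nanmax(nan_first)

print(builtin_nan_first, builtin_nan_last, series_max, numpy_max)
print(math.isnan(builtin_nan_first) and not (builtin_nan_first > 1))
```

Because any comparison with NaN is False, a group whose built-in max() comes out as NaN will never pass a `> threshold` test, even if it contains a valid difference above the threshold.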

Group by the top two levels. Use filter to return the groups whose maximum difference between the subfeatures exceeds the threshold you want (the threshold here is 0).

    In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max()>0)
    Out[41]:
                                            0         1
    categories features subfeatures
    cat1       feature1 subfeature1 -0.224487 -0.227524
                        subfeature2 -0.591399 -0.799228
               feature2 subfeature1  1.190110 -1.365895
                        subfeature2  0.720956 -1.325562
    cat2       feature1 subfeature1  1.856932       NaN
                        subfeature2 -1.354258 -0.740473
               feature2 subfeature1  0.234075 -1.362235
                        subfeature2  0.013875  1.309564

A useful debugging aid is to do this:

    def f(x):
        print(x)
        return (x.xs(......))  # e.g. the filter above

    df.groupby(.....).filter(f)
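As a complement to the filter-based approach, here is a minimal sketch of one way to get just the category names the question asks for (threshold of 1, absolute differences, NaN pairs ignored) without a per-group Python function. It is an alternative technique, not the answer's method. The data is typed in from the table printed in the question rather than regenerated from the random seed, and the names `sub1`, `sub2`, `exceeds`, `selected` and `categories` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question (NaN where the table shows NaN).
values = np.array([
    [-0.224487, -0.227524], [-0.591399, -0.799228],   # cat1 / feature1
    [1.190110, -1.365895], [0.720956, -1.325562],     # cat1 / feature2
    [1.856932, np.nan], [-1.354258, -0.740473],       # cat2 / feature1
    [0.234075, -1.362235], [0.013875, 1.309564],      # cat2 / feature2
    [np.nan, np.nan], [-1.260408, 1.559721],          # cat3 / feature1
    [0.419246, 0.084386], [0.969270, 1.493417],       # cat3 / feature2
])
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(values, index=index)

threshold = 1

# One row per (category, feature) for each subfeature; the two frames
# share the same index, so subtraction aligns the pairs.
sub1 = df.xs("subfeature1", level="subfeatures")
sub2 = df.xs("subfeature2", level="subfeatures")
diffs = (sub1 - sub2).abs()  # NaN wherever either member of a pair is NaN

# NaN > threshold evaluates to False, so invalid pairs never qualify.
exceeds = diffs.gt(threshold).any(axis=1)

# A category qualifies if any of its features has a qualifying pair.
selected = exceeds.groupby(level="categories").any()
categories = selected[selected].index.tolist()

print(categories)
```

With the table's values this selects cat2 and cat3, matching the result of the iterative version in the question.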

python pandas hierarchical-data
