python - Filtering top-level categories in a hierarchical pandas DataFrame using lower-level data
I have a pandas DataFrame that contains a large number of categories. Each category has features, and each feature has its own subfeatures, grouped in pairs. A simplified version looks like the following:
                                            0         1  ...
    categories features subfeatures
    cat1       feature1 subfeature1 -0.224487 -0.227524
                        subfeature2 -0.591399 -0.799228
               feature2 subfeature1  1.190110 -1.365895  ...
                        subfeature2  0.720956 -1.325562
    cat2       feature1 subfeature1  1.856932       NaN
                        subfeature2 -1.354258 -0.740473
               feature2 subfeature1  0.234075 -1.362235  ...
                        subfeature2  0.013875  1.309564
    cat3       feature1 subfeature1       NaN       NaN
                        subfeature2 -1.260408  1.559721  ...
               feature2 subfeature1  0.419246  0.084386
                        subfeature2  0.969270  1.493417
    ...                                   ...       ...
It can be generated using the following code:
    import pandas as pd
    import numpy as np

    np.random.seed(seed=90)
    results = np.random.randn(3, 2, 2, 2)
    # Inject missing values into two of the subfeature rows
    results[2, 0, 0, :] = np.nan
    results[1, 0, 0, 1] = np.nan
    results = results.reshape((-1, 2))
    index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                        ["feature1", "feature2"],
                                        ["subfeature1", "subfeature2"]],
                                       names=["categories", "features", "subfeatures"])
    df = pd.DataFrame(results, index=index)
Now I would like to retrieve the top-level categories (cat1, etc.) for which the difference between subfeature1 and subfeature2 in the same column (0 or 1) is above a threshold.
For example: if the threshold is 1, I expect cat2 and cat3 to be returned. For feature1 in cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119 > threshold = 1. Similarly, for feature2 in cat3, the absolute difference between subfeature1 and subfeature2 in column 1 is 1.493417 - 0.084386 = 1.409031 > 1. On the other hand, cat1 is not returned, because none of the differences between its subfeature pairs is greater than 1. NaN values invalidate a pair, so such pairs are ignored.
I have managed to implement an iterative approach, but I sense that it does not take advantage of pandas' full capabilities, and its performance is lacking:
    threshold = 1
    results = []
    for cat in df.index.levels[0]:
        for feature in df.index.levels[1]:
            # Rows for one (category, feature) pair, indexed by subfeature
            df2 = df.xs((cat, feature))
            diffs = abs(df2.loc['subfeature1'] - df2.loc['subfeature2'])
            if max(diffs) > threshold and cat not in results:
                results.append(cat)
yielding:
['cat2', 'cat3']
How would I go about implementing this using pandas' built-in vectorized abilities?
Edit: using Jeff's reply below, I noticed something funky:
    def f(x):
        a = max(abs(x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')))
        print(a)
        return a > 1

    result = df.groupby(level=['categories','features']).filter(f)
    print(result)
gives:
    0.366912262765
    0.571703714569
    1
    0.469153603312
    0.0403331129905
    3.2111900125 <------------------------------------------------
    nan
    0.220200012413
    2.67179897269 <---------------------------------------------------
    nan
    nan
    0.550023734074
    1.40903094796 <-----------------------------------------------------!!!!!!!!!!!
                                            0         1
    categories features subfeatures
    cat2       feature1 subfeature1  1.856932       NaN
                        subfeature2 -1.354258 -0.740473
I've highlighted the places where the algorithm should include the category based on the score. Yet it doesn't for cat3. Do the NaNs have something to do with it?
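On the NaN question, one thing that may be worth checking is how Python's built-in max differs from pandas' own .max(). The sketch below uses made-up values (nothing here comes from the DataFrame above) to show three behaviours: built-in max is order-dependent when NaN is present, pandas' .max() skips NaN by default, and built-in max applied to a whole DataFrame iterates over column labels rather than data. Whether any of these explains the output above is an assumption, since the pandas version in use is not stated.

```python
import math
import pandas as pd

nan = float("nan")

# Built-in max keeps the first item unless a later item compares greater.
# Every comparison with NaN is False, so the result depends on the order.
builtin_nan_first = max([nan, 2.67])   # NaN is never replaced
builtin_nan_last = max([2.67, nan])    # NaN never replaces 2.67

# pandas' reduction skips NaN by default (skipna=True).
pandas_max = pd.Series([nan, 2.67]).max()

# Iterating a DataFrame yields its column labels, so built-in max over a
# frame with columns 0 and 1 returns the label 1, not a data value.
frame = pd.DataFrame([[0.36, 0.57]], columns=[0, 1])
label_max = max(frame)

print(math.isnan(builtin_nan_first), builtin_nan_last, pandas_max, label_max)
```

If the goal is to ignore NaN pairs, calling .max() on the pandas object (rather than wrapping it in built-in max) is usually the safer choice.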
Group by the top two levels. Use filter to return the max difference of the features you want (the threshold here is 0):
    In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max()>0)
    Out[41]:
                                            0         1
    categories features subfeatures
    cat1       feature1 subfeature1 -0.224487 -0.227524
                        subfeature2 -0.591399 -0.799228
               feature2 subfeature1  1.190110 -1.365895
                        subfeature2  0.720956 -1.325562
    cat2       feature1 subfeature1  1.856932       NaN
                        subfeature2 -1.354258 -0.740473
               feature2 subfeature1  0.234075 -1.362235
                        subfeature2  0.013875  1.309564
A useful debugging aid is to do this:
    def f(x):
        print(x)
        return (x.xs(......))  # e.g. the filter above

    df.groupby(.....).filter(f)
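For comparison, here is a minimal sketch of how the selection described in the question could be done without groupby/filter, by taking two cross-sections and subtracting them. It is not the approach from the reply above. The DataFrame is rebuilt from the rounded values printed in the question so that the example is self-contained, and the names sub1, sub2, diffs, exceeds and selected are illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Values copied from the table in the question (rounded as printed there).
values = [
    [-0.224487, -0.227524], [-0.591399, -0.799228],   # cat1 / feature1
    [1.190110, -1.365895], [0.720956, -1.325562],     # cat1 / feature2
    [1.856932, nan], [-1.354258, -0.740473],          # cat2 / feature1
    [0.234075, -1.362235], [0.013875, 1.309564],      # cat2 / feature2
    [nan, nan], [-1.260408, 1.559721],                # cat3 / feature1
    [0.419246, 0.084386], [0.969270, 1.493417],       # cat3 / feature2
]
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"], ["feature1", "feature2"], ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(values, index=index)

threshold = 1

# Each cross-section drops the subfeatures level, leaving rows indexed by
# (categories, features), so the subtraction aligns pair by pair.
sub1 = df.xs("subfeature1", level="subfeatures")
sub2 = df.xs("subfeature2", level="subfeatures")
diffs = (sub1 - sub2).abs()   # NaN wherever either member of a pair is NaN

# NaN > threshold evaluates to False, so invalid pairs are ignored.
exceeds = (diffs > threshold).any(axis=1)

# A category qualifies if any of its features has a qualifying pair.
per_category = exceeds.groupby(level="categories").any()
selected = per_category[per_category].index.tolist()
print(selected)
```

With the values from the question and a threshold of 1, this selects cat2 and cat3, matching the result of the iterative version.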
python pandas hierarchical-data