Wednesday, 15 February 2012

python - How to sum a field across two DataFrames when the indexes don't line up?
I am brand new to complex data analysis in general, and to pandas in particular. I have a feeling pandas should be able to handle this task easily, but my newbieness prevents me from seeing the path to a solution. I want to sum one column across two files at a given time each day, 3pm in this case. If a file doesn't have a record at 3pm that day, I want to use the previous record.

Let me give a concrete example. I have data in two CSV files. Here are a couple of small samples:

datetime             value
2013-02-28 09:30:00  0.565019720442
2013-03-01 09:30:00  0.549536266504
2013-03-04 09:30:00  0.5023031467
2013-03-05 09:30:00  0.698370467751
2013-03-06 09:30:00  0.75834927162
2013-03-07 09:30:00  0.783620442226
2013-03-11 09:30:00  0.777265379462
2013-03-12 09:30:00  0.785787872851
2013-03-13 09:30:00  0.784873183044
2013-03-14 10:15:00  0.802959366653
2013-03-15 10:15:00  0.802959366653
2013-03-18 10:15:00  0.805413095911
2013-03-19 09:30:00  0.80816233134
2013-03-20 10:15:00  0.878912249996
2013-03-21 10:15:00  0.986393922571

And the other:

datetime             value
2013-02-28 05:00:00  0.0373634672097
2013-03-01 05:00:00  -0.24700085273
2013-03-04 05:00:00  -0.452964976056
2013-03-05 05:00:00  -0.2479288295
2013-03-06 05:00:00  -0.0326855588777
2013-03-07 05:00:00  0.0780461766619
2013-03-08 05:00:00  0.306247682656
2013-03-11 06:00:00  0.0194146154407
2013-03-12 05:30:00  0.0103653153719
2013-03-13 05:30:00  0.0350377752558
2013-03-14 05:30:00  0.0110884755383
2013-03-15 05:30:00  -0.173216846788
2013-03-19 05:30:00  -0.211785013352
2013-03-20 05:30:00  -0.891054563968
2013-03-21 05:30:00  -1.27207563599
2013-03-22 05:30:00  -1.28648629004
2013-03-25 05:30:00  -1.5459897419

Note that a) neither file has a 3pm record, and b) the two files don't both have records for every day. (2013-03-08 is missing from the first file, while 2013-03-18 is missing from the second, and the first file ends before the second.) As output, I envision a DataFrame like this (perhaps with just the date, without the time):

datetime             value
2013-Feb-28 15:00:00  0.6023831876517
2013-Mar-01 15:00:00  0.302535413774
2013-Mar-04 15:00:00  0.049338170644
2013-Mar-05 15:00:00  0.450441638251
2013-Mar-06 15:00:00  0.7256637127423
2013-Mar-07 15:00:00  0.8616666188879
2013-Mar-08 15:00:00  0.306247682656
2013-Mar-11 15:00:00  0.7966799949027
2013-Mar-12 15:00:00  0.7961531882229
2013-Mar-13 15:00:00  0.8199109582998
2013-Mar-14 15:00:00  0.8140478421913
2013-Mar-15 15:00:00  0.629742519865
2013-Mar-18 15:00:00  0.805413095911
2013-Mar-19 15:00:00  0.596377317988
2013-Mar-20 15:00:00  -0.012142313972
2013-Mar-21 15:00:00  -0.285681713419
2013-Mar-22 15:00:00  -1.28648629004
2013-Mar-25 15:00:00  -1.5459897419

I have a feeling this is perhaps a three-liner in pandas, but it's not at all clear to me how to do it. Further complicating my thinking about the problem, more complex CSV files might have multiple records for a single day (same date, different times). It seems I need to either generate a new pair of input DataFrames with times at 15:00 and sum across their value columns, keying on the date, or, during the sum operation, select the record with the greatest time on a given day such that time <= 15:00:00. Since I'm not sure how to compare times for magnitude, I suspect I might have to group rows having the same date and, within each group, select the row nearest to (but not greater than) 3pm. That is about the point at which my brain explodes.
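The group-and-select idea just described can be sketched as follows. This is a sketch, not a definitive implementation: `value_asof_3pm` and `sum_at_3pm` are hypothetical names, and the columns are assumed to be called `datetime` (already parsed to Timestamps) and `value`:

```python
import pandas as pd

def value_asof_3pm(df):
    # For each calendar day, keep the last 'value' at or before 15:00
    # ("use the previous record" when there is no 3pm row that day).
    s = df.set_index("datetime")["value"].sort_index()
    s = s.iloc[s.index.indexer_between_time("00:00", "15:00")]
    return s.groupby(s.index.normalize()).last()

def sum_at_3pm(df1, df2):
    # Align the two daily series on date and sum; a day missing from
    # one file contributes 0, matching the desired output above.
    a = value_asof_3pm(df1)
    b = value_asof_3pm(df2)
    return a.add(b, fill_value=0)
```

`indexer_between_time` handles the time-of-day comparison that seemed hard above, and `fill_value=0` covers days that appear in only one file.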

I got as far as looking at the documentation, but I don't yet understand the database-like operations pandas supports. Pointers to relevant documentation (especially tutorials) would be much appreciated.

First combine the DataFrames:

df3 = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0

So they're in one table. Next, use groupby to sum across matching timestamps:

df4 = df3.groupby('datetime', as_index=False).sum()  # keep 'datetime' as a column
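On a toy pair of frames (made-up data), you can see that rows combine only when the timestamps match exactly, which is why this step alone doesn't solve the 3pm alignment:

```python
import pandas as pd

df1 = pd.DataFrame({"datetime": ["2013-02-28 09:30:00"], "value": [0.5]})
df2 = pd.DataFrame({"datetime": ["2013-02-28 09:30:00",
                                 "2013-03-01 05:00:00"], "value": [0.1, 0.2]})

df3 = pd.concat([df1, df2], ignore_index=True)
df4 = df3.groupby("datetime", as_index=False)["value"].sum()
# df4 has two rows: the matching 09:30 timestamps collapse to 0.6,
# while the unmatched 05:00 row keeps its own value of 0.2
```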

Now df4 has a value column that is the sum over matching datetime rows. Assuming you have the timestamps as datetime objects, you can do whatever filtering you like at this stage:

filtered = df4[df4['datetime'] < datetime.datetime(year, month, day, hour, minute, second)]

I'm not sure exactly what you're trying to do, so you may need to parse the timestamp columns before filtering.
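A minimal sketch of that parsing-then-filtering step, using made-up data and a 3pm cutoff on one day:

```python
import datetime
import pandas as pd

df = pd.DataFrame({"datetime": ["2013-03-01 09:30:00", "2013-03-01 16:00:00"],
                   "value": [1.0, 2.0]})
df["datetime"] = pd.to_datetime(df["datetime"])  # strings -> Timestamps
cutoff = datetime.datetime(2013, 3, 1, 15, 0, 0)
filtered = df[df["datetime"] <= cutoff]
# only the 09:30 row survives the 3pm cutoff
```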

python pandas
