Maximum of Series' means with pandas - python

I've been searching for a while now and I can't figure out how to do this in any way other than hardcoding it.
From a csv file, I have to find the maximum of the means of three series and then return the NAME of that series, which is the part giving me trouble.
comp = max(DataTD['Cycle (seconds)'].mean(), DataTD['Run (seconds)'].mean(), DataTD['Swim (seconds)'].mean())
if comp == DataTD['Cycle (seconds)'].mean():
    print(DataTD['Cycle (seconds)'].name)
elif comp == DataTD['Run (seconds)'].mean():
    print(DataTD['Run (seconds)'].name)
elif comp == DataTD['Swim (seconds)'].mean():
    print(DataTD['Swim (seconds)'].name)

Something like this should work (I haven't tested it):
import numpy as np
datas = [DataTD['Cycle (seconds)'],
         DataTD['Run (seconds)'],
         DataTD['Swim (seconds)']]
means = [data.mean() for data in datas]
# Position of the largest mean in the list
max_mean_idx = np.argmax(means)
print(datas[max_mean_idx].name)

You can get the maximum column name, then based on that grab the mean.
cols = ['Cycle (seconds)', 'Run (seconds)', 'Swim (seconds)']
max_col = max(cols, key=lambda col: DataTD[col].mean())
print('Column name: ' + max_col)
print('Mean: ' + str(DataTD[max_col].mean()))

Using the data from Pir:
DataTD.loc[DataTD[cols].mean(1).sort_values().iloc[[-1]].index]
Out[625]:
   Cycle (seconds)  Run (seconds)  Swim (seconds)
J                7              5               9

Consider the sample data
cols = ['Cycle (seconds)', 'Run (seconds)', 'Swim (seconds)']
np.random.seed([3, 1415])
DataTD = pd.DataFrame(
    np.random.randint(10, size=(10, 3)),
    list('ABCDEFGHIJ'), cols
)
   Cycle (seconds)  Run (seconds)  Swim (seconds)
A                0              2               7
B                3              8               7
C                0              6               8
D                6              0               2
E                0              4               9
F                7              3               2
G                4              3               3
H                6              7               7
I                4              5               3
J                7              5               9
IIUC:
Use mean with the argument axis=1, followed by idxmax, to find the index label of the row with the largest mean.
DataTD.loc[[DataTD[cols].mean(1).idxmax()]]
   Cycle (seconds)  Run (seconds)  Swim (seconds)
J                7              5               9
Otherwise, to find the column with the largest mean, try
means = DataTD.mean()
means[[means.idxmax()]]
Swim (seconds)    5.7
dtype: float64
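If only the column name is needed, the two steps can be collapsed into a single idxmax call on the column means. A minimal sketch, assuming the sample data is typed in by hand from the printed table above (it is not regenerated from the seed):

```python
import pandas as pd

cols = ['Cycle (seconds)', 'Run (seconds)', 'Swim (seconds)']

# Values copied from the printed sample table
DataTD = pd.DataFrame(
    {
        'Cycle (seconds)': [0, 3, 0, 6, 0, 7, 4, 6, 4, 7],
        'Run (seconds)': [2, 8, 6, 0, 4, 3, 3, 7, 5, 5],
        'Swim (seconds)': [7, 7, 8, 2, 9, 2, 3, 7, 3, 9],
    },
    index=list('ABCDEFGHIJ'),
)

means = DataTD[cols].mean()   # one mean per column
max_col = means.idxmax()      # label of the column with the largest mean
print(max_col)
print(means[max_col])
```

Because idxmax returns the label rather than the position, no comparison chain or separate lookup of the name is needed.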

Related

Calculate win rates depending on game duration with Pandas

I have two DataFrames that look like this:
   match_id hero_name    won
0         0    Rubick   True
1         9    Rubick  False
2        16    Rubick  False
3        28    Rubick   True
4        37    Rubick   True
   match_id  duration
0         0      2375
1         1      2582
2         2      2716
3         3      3085
4         4      1887
And I'm trying to calculate the win percentage by game length of each hero. So far I calculated the percentage for buckets of 5 minutes each, but I can't find a way to plot a curve that fits the points I calculated.
Here's what I've tried:
matches = pd.merge(a, b, on='match_id')
matches['lost'] = ~matches['won']
# There's a single match with over 16000s, the others are all less than 6673
matches = matches[matches.duration < 7000]
# From 50s to 7000s in 5-minute steps
ranges = np.arange(50, 7000, 300)
# Group by hero_name and duration range and count won and lost games
g = matches.groupby(['hero_name', pd.cut(matches.duration, ranges)]).sum()
# Win percentage
g['win_p'] = g['won'] / (g['won'] + g['lost'])
# Drop other columns and fills some NaN
g = g[['win_p']]
g = g.fillna(0)
g = g.reset_index()
g = g.groupby('hero_name')
n = g.ngroups
But I don't know how to plot a curve. I've tried kde(), but it only counts victories rather than a win rate, and if I use it on the already calculated percentages I get their sum. How can I interpolate a curve and plot the win rate depending on the duration for each hero?
You need unstack here:
g.win_p.unstack(level=0).plot( subplots=True)
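To make the reshaping step concrete, here is a minimal sketch of the whole pipeline on made-up data. The frames a and b, the hero name 'Lina', and the bin edges are hypothetical stand-ins, not values from the question. It relies on the fact that the mean of a boolean column is the fraction of True values, so a separate 'lost' column is not required.

```python
import pandas as pd

# Hypothetical stand-ins for the two DataFrames in the question
a = pd.DataFrame({
    'match_id': [0, 1, 2, 3],
    'hero_name': ['Rubick', 'Rubick', 'Lina', 'Lina'],
    'won': [True, False, True, True],
})
b = pd.DataFrame({'match_id': [0, 1, 2, 3], 'duration': [100, 200, 400, 500]})

matches = a.merge(b, on='match_id')

# Assign each match to a duration bucket (illustrative bin edges)
buckets = pd.cut(matches['duration'], [50, 350, 650])

# Mean of a boolean column = win rate; unstack moves heroes into columns
win_rate = (
    matches.groupby(['hero_name', buckets], observed=False)['won']
    .mean()
    .unstack(level=0)
)
print(win_rate)
```

With one column per hero and one row per duration bucket, win_rate.plot(subplots=True) should then draw one line per hero, as in the answer above.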

Apply formula across pandas rows/ regression line

I'm trying to apply a formula across the rows of a data frame to get the trend of the numbers in the rows.
The below example works until the part where .apply is used.
import numpy as np
import pandas as pd
import scipy.stats
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
axisvalues = list(range(1, len(df.columns) + 1))
def calc_slope(row):
    return scipy.stats.linregress(df.iloc[row, :], y=axisvalues)
calc_slope(1) # this works
df["New"] = df.apply(calc_slope, axis=1) # this fails: "too many values to unpack"
Thank you for any help
I think you need this if you want a single attribute:
def calc_slope(row):
    # apply(axis=1) passes each row as a Series, so use it directly
    a = scipy.stats.linregress(row, y=axisvalues)
    return a.slope
df["slope"] = df.apply(calc_slope, axis=1)
print(df)
A B C D slope
0 0.444640 0.024624 -0.016216 0.228935 -2.553465
1 1.226611 1.962481 1.103834 0.645562 -1.455239
2 -0.259415 0.971097 0.124538 -0.704115 -0.718621
3 1.938422 1.787310 -0.619745 -2.560187 -0.575519
4 -0.986231 -1.942930 2.677379 -1.813071 0.075679
5 0.611214 -0.258453 0.053452 1.223544 0.841865
6 0.685435 0.962880 -1.517077 -0.101108 -0.652503
7 0.368278 1.314202 0.748189 2.116189 1.350132
8 -0.322053 -1.135443 -0.161071 -1.836761 -0.987341
9 0.798461 0.461736 -0.665127 -0.247887 -1.610447
And for all attributes, convert the named tuple to a dict and then to a Series. The output is a new DataFrame, so join it to the original if necessary:
np.random.seed(1997)
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
axisvalues = list(range(1, len(df.columns) + 1))
def calc_slope(row):
    a = scipy.stats.linregress(row, y=axisvalues)
    # One Series entry per field of the result (slope, intercept, ...)
    return pd.Series(a._asdict())
print(df.apply(calc_slope, axis=1))
slope intercept rvalue pvalue stderr
0 -2.553465 2.935355 -0.419126 0.580874 3.911302
1 -1.455239 4.296670 -0.615324 0.384676 1.318236
2 -0.718621 2.523733 -0.395862 0.604138 1.178774
3 -0.575519 2.578530 -0.956682 0.043318 0.123843
4 0.075679 2.539066 0.127254 0.872746 0.417101
5 0.841865 2.156991 0.425333 0.574667 1.266674
6 -0.652503 2.504915 -0.561947 0.438053 0.679154
7 1.350132 0.965285 0.794704 0.205296 0.729193
8 -0.987341 1.647104 -0.593680 0.406320 0.946311
9 -1.610447 2.639780 -0.828856 0.171144 0.768641
df = df.join(df.apply(calc_slope,axis=1))
print (df)
A B C D slope intercept rvalue \
0 0.444640 0.024624 -0.016216 0.228935 -2.553465 2.935355 -0.419126
1 1.226611 1.962481 1.103834 0.645562 -1.455239 4.296670 -0.615324
2 -0.259415 0.971097 0.124538 -0.704115 -0.718621 2.523733 -0.395862
3 1.938422 1.787310 -0.619745 -2.560187 -0.575519 2.578530 -0.956682
4 -0.986231 -1.942930 2.677379 -1.813071 0.075679 2.539066 0.127254
5 0.611214 -0.258453 0.053452 1.223544 0.841865 2.156991 0.425333
6 0.685435 0.962880 -1.517077 -0.101108 -0.652503 2.504915 -0.561947
7 0.368278 1.314202 0.748189 2.116189 1.350132 0.965285 0.794704
8 -0.322053 -1.135443 -0.161071 -1.836761 -0.987341 1.647104 -0.593680
9 0.798461 0.461736 -0.665127 -0.247887 -1.610447 2.639780 -0.828856
pvalue stderr
0 0.580874 3.911302
1 0.384676 1.318236
2 0.604138 1.178774
3 0.043318 0.123843
4 0.872746 0.417101
5 0.574667 1.266674
6 0.438053 0.679154
7 0.205296 0.729193
8 0.406320 0.946311
9 0.171144 0.768641
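If only the slope is needed, a NumPy-only variant avoids the SciPy dependency. This is a sketch that swaps in np.polyfit for scipy.stats.linregress; it keeps the same regression direction as the answer (row values as x, axisvalues as y), and the two-row frame is made up so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Made-up rows with an obvious linear relationship to 1, 2, 3, 4
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]],
                  columns=list('ABCD'))
axisvalues = list(range(1, len(df.columns) + 1))

def calc_slope(row):
    # np.polyfit(x, y, 1) returns [slope, intercept] of the least-squares line
    slope, _intercept = np.polyfit(row.to_numpy(), axisvalues, 1)
    return slope

slopes = df.apply(calc_slope, axis=1)
print(slopes)
```

If the goal is the trend of the row values over their position, the arguments would be swapped (axisvalues as x, row as y); which direction is wanted depends on the question's intent.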

Multiplying data within columns python

I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1 + r2) * ... * (1 + rt) - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1 - 0.006840)), 2003-02-28 to be 100.478 (99.316 * (1 + 0.011701)), and so on.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add sum of returns to new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
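The base-100 series asked for in the question follows from the same cumprod idea applied to the portfolio column alone. A minimal sketch, assuming the 'Risk Parity Portfolio' returns are typed in from the question's table:

```python
import pandas as pd

# Portfolio returns copied from the question's table
returns = pd.Series(
    [0.0, -0.006840, 0.011701, 0.003556, 0.036997],
    index=pd.to_datetime(['2002-12-31', '2003-01-31', '2003-02-28',
                          '2003-03-31', '2003-04-30']),
    name='Risk Parity Portfolio',
)

# (1 + r1) * (1 + r2) * ... compounded from a base of 100
index_level = 100 * returns.add(1).cumprod()
print(index_level.round(3))
```

Subtracting 1 from returns.add(1).cumprod() instead of scaling by 100 gives the cumulative return from the formula in the question.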

Pandas: cumulative functions application

Consider the simple dataframe example using pandas:
df = pd.DataFrame({'x' : [10, 20, 30, 40]}, index = ['0','1','2', '3'])
This gives the following:
index x
0 10
1 20
2 30
3 40
I'm trying to take values of x and for each row produce a result (via a lambda) that also utilises the previous row calculation. That is, I'd like to calculate y[i+1] as a function of x[i+1] and y[i]. So for example:
y[i+1] = sin(x[i+1]) + (15 * y[i])
So this would give the following DataFrame:
index x y
0 10 -0.54
1 20 -7.2
2 30 -109.7
3 40 -1644.7
For the first row, this is presumably a special case (as there is no y[-1])? So I'd like to give this a specific number.
I have been trying to solve this with expanding_apply, but with no joy. Thanks.
UPDATE
So I answered my question, in a way I understand, with the help below (thank you):
df.loc[:, 'y'] = 0
initial_y_val = 10
for i in range(df.shape[0]):
    if i == 0:
        df.iloc[0, 1] = initial_y_val + df.iloc[0, 0]
    else:
        df.iloc[i, 1] = df.iloc[i, 0] + df.iloc[i - 1, 1]
print(df)
This gives:
x y
0 10 20
1 20 40
2 30 70
3 40 110
So my question is, is there a more idiomatic (and faster) way of achieving the same outcome?
pandas provides cumsum, which solves your problem:
df['y'] = df.x.cumsum()
In [171]: df
Out[171]:
x y
0 10 10
1 20 30
2 30 60
3 40 100
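The plain cumsum starts from 0, whereas the loop in the question starts from initial_y_val = 10. A small sketch of how the starting value could be folded in so that the result matches the question's output:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['0', '1', '2', '3'])
initial_y_val = 10  # starting value taken from the question's loop

# Running total of x, shifted by the starting value
df['y'] = initial_y_val + df['x'].cumsum()
print(df)
```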
Edit:
Very nice question indeed. By expanding y1, y2, ..., yn you can see that each value is a growing sum of sin(x) terms whose coefficients are powers of 15. I would opt for this solution, iterating over the DataFrame index:
import math
import numpy as np
z = df.x.map(math.sin)
df['y']=[sum(z[:i]*15**np.arange(int(i)+1)[::-1]) for i,r in df.iterrows()]
In [258]: df
Out[258]:
x y
0 10 -0.544021
1 20 -7.247371
2 30 -109.698603
3 40 -1644.733929
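The same values can also be produced by carrying the previous result forward in a plain loop, which follows the recurrence y[i+1] = sin(x[i+1]) + 15 * y[i] literally. A minimal sketch, assuming a starting value of 0 for the y before the first row (that choice is an assumption; the question leaves it open):

```python
import math
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['0', '1', '2', '3'])

y_prev = 0.0  # assumed value of y before the first row
ys = []
for x in df['x']:
    # Each step uses the current x and the previous y
    y_prev = math.sin(x) + 15 * y_prev
    ys.append(y_prev)

df['y'] = ys
print(df)
```

This does one pass over the data instead of recomputing a full sum per row, at the cost of an explicit Python loop.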

Python and Pandas - column that “count direction” and show “average until now”

I have a DataFrame that contains the price (of a stock) at the end of a specific minute.
DF columns are:
minute_id: 0-1440, 0 for midnight, 480 for 8:00 AM (60*8)
price: stock price at the end of the minute
change: price change from the previous minute
direction: direction of the change
import numpy.random as nprnd
from pandas import DataFrame
n = 10 # Number of samples
# Starting at 8:00 AM, set some (n) random prices between 4-5
df = DataFrame({'minute_id': range(480,480+n), 'price':(5-4) * nprnd.random(n) + 4 })
df['change'] = df.price - df.price.shift(1)
df['direction'] = df.change.map(lambda x: 0 if x == 0 else x/abs(x))
df = df.dropna()
df
I want to add a few columns to this DF.
1. Average price until now
   for the first row, it will have the price.
   for the 2nd row, it will have the average price of the first 2 rows
   for the n-th row, it will have the average price of the first n rows
2. Sum of the 'change' column while in the current direction
   (Will be zeroed every time 'direction' switches)
3. Count in the current direction, until now
   For every row, what is the number of this row in the current direction run.
4. Average price of the last 4 rows
I can create all of those columns by iterating through the DF one row at a time.
But I'm sure there is a more (pythonic|pandastic) way of doing it.
I'm also not sure how to handle missing data (if I have gaps within the minute_id)
EDIT:
Out of the 4 columns I wanted to add, 1 and 4 are easy...
C4: this is just a rolling mean with a period of 4
C1: rolling mean can take another parameter for the minimum period.
Setting it to 1 and setting the window size to the length of the df will give a running mean for every row in the set.
df['rolling_avg'] = pd.rolling_mean(df.price, n, 1)
For the other 2 columns, I'm still trying to find the best way to get it.
OK, after a lot of "playing around" I've got something that works for me.
It might be done in a slightly more "Pandastic" way, but this is a reasonable way to get it done.
I want to thank Andy Hayden, Jeff and Phillip Cloud for pointing me to "10 minutes to pandas".
It didn't contain the direct answers, but was very helpful.
Also, Andy Hayden pointed me to the rolling mean, which helped me a lot as a direction.
So let's do it column by column
Adding col 1: Average price until now
# Rolling avg, window size is the size of the entire DataFrame, with a minimum of 1
df['rolling_avg'] = pd.rolling_mean(df.price, n, 1)
Adding col 4: Average price of the last 4 rows
df['RA_wnd_4'] = pd.rolling_mean(df.price, 4, 1)
Adding col 2: CumSum() of the 'change' column while in the current "block" (direction)
# Adding a helper column that shows when the direction has changed
df['dir_change'] = (df.direction.shift(1) != df.direction).astype(int)
# Identify the DF "blocks" for every direction change
df['block'] = df.dir_change.cumsum()
# Split the DF based on those blocks
grouped = df.groupby('block')
# Add a function that will cumsum() for a block, and call it
def f1(group):
    return DataFrame({'rolling_count' : group.cumsum()})
df['rolling_count'] = grouped.change.apply(f1)
Adding col 3: Row number in the current "block" (Direction)
df['one'] = 1
# Note: this reuses the name 'rolling_count', so it overwrites the col 2 result above
df['rolling_count'] = grouped.one.apply(f1)
df = df.drop('one', axis=1)
df = df.drop('one', axis=1)
The full code:
import numpy.random as nprnd
from pandas import DataFrame
import pandas as pd
n = 10 # Number of samples
# Starting at 8:00 AM, set some (n) random prices between 4-5
df = DataFrame({'minute_id': range(480,480+n), 'price':(5-4) * nprnd.random(n) + 4 })
df['change'] = df.price - df.price.shift(1)
df['direction'] = df.change.map(lambda x: 0 if x == 0 else x/abs(x))
df = df.dropna()
#------------------------------------------
# Col 1, rolling Avg over the entire DF
df['rolling_avg'] = pd.rolling_mean(df.price, n, 1)
#------------------------------------------
# Col 4, rolling Avg with a window size of 4
df['RA_wnd_4'] = pd.rolling_mean(df.price, 4, 1)
#------------------------------------------
# Helper code for cols 2, 3
# Adding a helper column that shows when the direction has changed
df['dir_change'] = (df.direction.shift(1) != df.direction).astype(int)
# Identify the DF "blocks" for every direction change
df['block'] = df.dir_change.cumsum()
# Split the DF based on those blocks
grouped = df.groupby('block')
# Add a function that will cumsum() for a block, and call it
def f1(group):
    return DataFrame({'rolling_count' : group.cumsum()})
df['one'] = 1
#------------------------------------------
# Col 2, CumSum() of the 'change' column while in the current "block" (direction)
df['rolling_count'] = grouped.change.apply(f1)
#------------------------------------------
# Col 3, Count in the current "block" (Direction)
# Note: this reuses the name 'rolling_count', so it overwrites the col 2 result
df['rolling_count'] = grouped.one.apply(f1)
df = df.drop('one', axis=1)
print(df)
Output:
minute_id price change direction rolling_avg RA_wnd_4 dir_change block rolling_count
1 481 4.771701 0.474349 1 4.771701 4.771701 1 1 1
2 482 4.300078 -0.471623 -1 4.535889 4.535889 1 2 1
3 483 4.946744 0.646666 1 4.672841 4.672841 1 3 1
4 484 4.529403 -0.417340 -1 4.636981 4.636981 1 4 1
5 485 4.434598 -0.094805 -1 4.596505 4.552706 0 4 2
6 486 4.171169 -0.263429 -1 4.525616 4.520479 0 4 3
7 487 4.416980 0.245810 1 4.510096 4.388038 1 5 1
8 488 4.727078 0.310098 1 4.537219 4.437456 0 5 2
9 489 4.049097 -0.677981 -1 4.482983 4.341081 1 6 1
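In recent pandas versions pd.rolling_mean is no longer available, and the four columns can be written with the expanding, rolling and groupby methods instead. The sketch below is one possible modern equivalent, not the author's code: the prices and changes are copied from the output table above, and the column names avg_so_far, avg_last_4, run_sum and run_count are hypothetical names chosen so that the running sum and the running count do not overwrite each other.

```python
import numpy as np
import pandas as pd

# Prices and changes copied from the output table above
df = pd.DataFrame({
    'minute_id': range(481, 490),
    'price': [4.771701, 4.300078, 4.946744, 4.529403, 4.434598,
              4.171169, 4.416980, 4.727078, 4.049097],
    'change': [0.474349, -0.471623, 0.646666, -0.417340, -0.094805,
               -0.263429, 0.245810, 0.310098, -0.677981],
})
df['direction'] = np.sign(df['change'])

# Col 1: average of all prices up to and including the current row
df['avg_so_far'] = df['price'].expanding().mean()

# Col 4: average of the last 4 prices (fewer at the start)
df['avg_last_4'] = df['price'].rolling(4, min_periods=1).mean()

# A new block starts whenever the direction differs from the previous row
block = (df['direction'] != df['direction'].shift()).cumsum()

# Col 2: running sum of 'change' within the current block
df['run_sum'] = df['change'].groupby(block).cumsum()

# Col 3: 1-based row number within the current block
df['run_count'] = df.groupby(block).cumcount() + 1
print(df)
```

The block labelling is the same shift-and-cumsum trick used above; only the per-block helper function is replaced by the built-in cumsum and cumcount.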

Resources