How to save python pandas column de-duped to csv? - python

Working with a simple dataframe df:
ID | other columns
123
123
343
345
234
234
I want to save the first column to a csv but de-duped.
df['ID'].to_csv('file.csv')
How can I de-dupe before the save? Thank You

Use DataFrame.drop_duplicates if ID is a column:
df.drop_duplicates(subset=['ID']).to_csv('file.csv')
If ID is the index, use Index.duplicated instead (do not assign the result back to df, since to_csv returns None when given a path):
df[~df.index.duplicated()].to_csv('file.csv')

You can also call drop_duplicates on the ID column itself to get the unique IDs (unique would return a NumPy array, which has no to_csv method):
df['ID'].drop_duplicates().to_csv('file.csv', index=False)
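A minimal, self-contained sketch of the approach above. The sample values come from the question; the second column name `other` is a made-up placeholder. Calling `to_csv` without a path returns the CSV text, which makes the result easy to inspect before writing it to disk.

```python
import pandas as pd

# Sample data from the question; 'other' is a hypothetical extra column.
df = pd.DataFrame({
    'ID': [123, 123, 343, 345, 234, 234],
    'other': list('abcdef'),
})

# Keep only the first occurrence of each ID, in original order.
unique_ids = df['ID'].drop_duplicates()

# With no path, to_csv returns the CSV as a string.
# index=False leaves out the row labels.
csv_text = unique_ids.to_csv(index=False)
print(csv_text)

# To write a file instead: unique_ids.to_csv('file.csv', index=False)
```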

Related

Python Pandas pull list of rows where blank in one column

I am working with a larger dataframe df.
The first column of the df is Customer_ID. It is then followed by about 50 columns.
When I pivot this dataframe I get something like this:
pv1 = pd.pivot_table(df,index=["Customer_ID","Category"],values=["Sales"],aggfunc=np.sum)
pv1.head()
Customer ID | Category | Sales
123 A 345345
B 6736
345 A 982764
It seems that there are blanks in the Customer ID column (see row 2).
I tried to pull a list of those rows, with this statement:
df.loc[df['Customer_ID'] == ' ']
But I get no results.

Concatenating all Possible Column Values of Other Unique Column

Problem Setting
Suppose I am given the following data frame.
ID category
223 MMO
223 Game
444 Finance
360 Reading
360 Book
This data frame has an ID column and its associated category. Notice that the same ID can have multiple categories.
My goal is to create a new column, which contains the concatenation of all the possible categories for a given ID. This means:
Removing the old category column
Removing duplicate ID rows
The output would look like this.
ID category
223 MMO_Game
444 Finance
360 Reading_Book
Attempted Solution
My thought process was to first create a groupby variable that would group category by ID.
groupby_ID = df['category'].groupby(df['ID'])
Now I can try and iterate through the grouped data and concatenate the strings.
for ID, category in groupby_ID:
I don't know how to go on at this point. Some pointers would be greatly appreciated!
You can groupby on ID and then apply a join with your desired separator:
In [142]:
df.groupby('ID')['category'].apply('_'.join)
Out[142]:
ID
223 MMO_Game
360 Reading_Book
444 Finance
Name: category, dtype: object
To get the exact desired output you can call reset_index with name param:
In [145]:
df.groupby('ID')['category'].apply('_'.join).reset_index(name='category')
Out[145]:
ID category
0 223 MMO_Game
1 360 Reading_Book
2 444 Finance
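Note that groupby sorts the group keys by default, which is why 360 comes before 444 in the output above. If the original order of first appearance matters (223, 444, 360, as in the desired output), passing sort=False should preserve it. A small sketch using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [223, 223, 444, 360, 360],
    'category': ['MMO', 'Game', 'Finance', 'Reading', 'Book'],
})

# sort=False keeps groups in order of first appearance instead of sorting by ID.
result = (
    df.groupby('ID', sort=False)['category']
      .agg('_'.join)      # concatenate each ID's categories with '_'
      .reset_index()      # turn ID back into a regular column
)
print(result)
```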

unstack a multiindex dataframe

I want to unstack a multi-index dataframe, which looks like this:
into another dataframe whose index is 'Worker_id', column names are 'Task_id' and values are 'Date_cnt'.
Could someone help?
I've tried df.unstack, but it automatically puts 'Date_cnt', rather than 'Task_id', as the column names.
Thanks!
I think this is what you want:
import pandas as pd
df = pd.DataFrame([[4529,338,6],[4529,340,4],[4529,346,4],[4529,388,4],[4529,824,1]], columns = ['Worker_id','Task_id','Date_cnt'])
df = df.set_index(['Worker_id','Task_id']).unstack()
df.columns = df.columns.droplevel()
print(df)
Task_id 338 340 346 388 824
Worker_id
4529 6 4 4 4 1
Because there is only one value column, Date_cnt ends up as the top level of the columns MultiIndex; if you had several value columns before unstacking, they would all appear at that top level. Since you don't want to keep it, you can just drop that level.
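As an alternative sketch (not part of the answer above), the extra column level can be avoided altogether: either select the Date_cnt Series before unstacking, or use DataFrame.pivot. Both should give the same wide table for the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    [[4529, 338, 6], [4529, 340, 4], [4529, 346, 4], [4529, 388, 4], [4529, 824, 1]],
    columns=['Worker_id', 'Task_id', 'Date_cnt'],
)

# Option 1: unstack a Series, so the columns are Task_id values only.
wide_unstack = df.set_index(['Worker_id', 'Task_id'])['Date_cnt'].unstack()

# Option 2: pivot directly from the long-format frame.
wide_pivot = df.pivot(index='Worker_id', columns='Task_id', values='Date_cnt')

print(wide_pivot)
```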

replace column values in one dataframe by values of another dataframe

I have two dataframes; the first one has 1000 rows and looks like:
Date Group Family Bonus
2011-06-09 tri23_1 Laavin 456
2011-07-09 hsgç_T2 Grendy 679
2011-09-10 bbbj-1Y_jn Fantol 431
2011-11-02 hsgç_T2 Gondow 569
The column Group has different values, sometimes repeated, but in general about 50 unique values.
The second dataframe contains all these 50 unique values (50 rows) and also the hotels, that are associated to these values:
Group Hotel
tri23_1 Jamel
hsgç_T2 Frank
bbbj-1Y_jn Luxy
mlkl_781 Grand Hotel
vchs_94 Vancouver
My goal is to replace the values in the column Group of the first dataframe with the corresponding values of the column Hotel of the second dataframe, or to create a column Hotel with the corresponding values. When I try to do it by simple assignment like
df1.loc[(df1.Group==df2.Group), 'Hotel']=df2.Hotel
I get an error saying that the dataframes are not of equal size, so the comparison is not possible.
If you set the index of the second dataframe to its 'Group' column, you can replace the values using map on the 'Group' column of the first dataframe:
In [36]:
df1['Group'] = df1['Group'].map(df2.set_index('Group')['Hotel'])
df1
Out[36]:
Date Group Family Bonus
0 2011-06-09 Jamel Laavin 456
1 2011-07-09 Frank Grendy 679
2 2011-09-10 Luxy Fantol 431
3 2011-11-02 Frank Gondow 569
You could also create a dictionary and use apply:
hotel_dict = df2.set_index('Group')['Hotel'].to_dict()
df1['Group'] = df1['Group'].apply(lambda x: hotel_dict[x])
Alternatively, use pandas join with the second dataframe indexed by Group; for details see http://pandas.pydata.org/pandas-docs/stable/merging.html
df1.join(df2.set_index('Group'), on='Group')
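To compare the approaches, here is a small runnable sketch built from the sample rows in the question. It keeps Group and adds a Hotel column, once with map and once with a left merge; a left merge is used here so that every row of df1 is kept in its original order, even if a group had no matching hotel.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2011-06-09', '2011-07-09', '2011-09-10', '2011-11-02'],
    'Group': ['tri23_1', 'hsgç_T2', 'bbbj-1Y_jn', 'hsgç_T2'],
    'Family': ['Laavin', 'Grendy', 'Fantol', 'Gondow'],
    'Bonus': [456, 679, 431, 569],
})
df2 = pd.DataFrame({
    'Group': ['tri23_1', 'hsgç_T2', 'bbbj-1Y_jn', 'mlkl_781', 'vchs_94'],
    'Hotel': ['Jamel', 'Frank', 'Luxy', 'Grand Hotel', 'Vancouver'],
})

# Lookup Series: index is Group, values are Hotel.
lookup = df2.set_index('Group')['Hotel']

# Option 1: map each Group value through the lookup (unmatched groups become NaN).
hotel_via_map = df1['Group'].map(lookup)

# Option 2: left merge adds the Hotel column and keeps all rows of df1.
with_hotel = df1.merge(df2, on='Group', how='left')

print(with_hotel)
```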

Outputting transposed grouped pandas dataframe to CSV

Relatively new to pandas - I have a dataframe with movie ID, user ID, rating, and date. I've sorted by user ID and date and have the dataframe below.
https://i.stack.imgur.com/fqSZ6.png
My desired output is a csv that has one row per user, with all the movies that user has rated sorted chronologically left to right. For example:
452 4 33 6581
56
121 69 98 802 555
.
.
master_sample.sort_values(['User ID','Date']).groupby('User ID')
However, after grouping by user ID I get a groupby object, which I'm unsure how to iterate over and output to csv. I've tried pivot on the original df / iterating over the grouped df using get_group.
Any pointers would be appreciated!
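Try this (the IDs are converted to strings first, because str.join raises a TypeError on integers):
master_sample.sort_values('Date') \
.groupby('User ID', as_index=False)['Movie ID'] \
.apply(lambda ids: ' '.join(map(str, ids)))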
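The answer stops short of the CSV step, so here is a hedged end-to-end sketch. The sample rows are invented for illustration (the real data is only shown as an image), and it assumes the column names 'User ID', 'Movie ID', and 'Date' from the question. Each user's movies are joined with spaces, and the result is emitted with one line per user.

```python
import pandas as pd

# Hypothetical sample data; the real frame has the same three columns plus a rating.
master_sample = pd.DataFrame({
    'User ID': [2, 1, 1, 2, 1],
    'Movie ID': [56, 4, 452, 98, 33],
    'Date': pd.to_datetime(
        ['2020-01-05', '2020-01-02', '2020-01-01', '2020-01-06', '2020-01-03']
    ),
})

# Sort so each user's movies are in chronological order, then join per user.
rows = (
    master_sample.sort_values(['User ID', 'Date'])
    .groupby('User ID')['Movie ID']
    .apply(lambda ids: ' '.join(map(str, ids)))
)

# One line per user, without the user ID or a header row.
# Pass a path (e.g. a file name of your choice) to write to disk instead.
csv_text = rows.to_csv(header=False, index=False)
print(csv_text)
```

If the movies should land in separate CSV cells rather than one space-separated field, joining with ',' instead of ' ' and writing the lines manually is one option.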
