Sequence printing of dictionary key/value pairs into an Excel file using a pandas DataFrame (Python)

I have a dictionary like this:
final_dict[key] = {'key1':value1,'key2':value2,'key3':value3,'key4':value4,'key5':value5}
But it is written to the Excel file like this (output):
key3 key5 key1 key4 key2
key value3 value5 value1 value4 value2
key value3 value5 value1 value4 value2
I need the inner key/value pairs to appear in the same order in which I stored them in the dictionary.
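A minimal sketch of one fix, assuming an older Python where plain dicts do not preserve insertion order (the contents of final_dict below are made up): pass an explicit column order when building the frame instead of relying on dict order.
import pandas as pd

final_dict = {
    'row1': {'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5},
    'row2': {'key1': 6, 'key2': 7, 'key3': 8, 'key4': 9, 'key5': 10},
}

cols = ['key1', 'key2', 'key3', 'key4', 'key5']  # the desired sequence
df = pd.DataFrame.from_dict(final_dict, orient='index')[cols]
df.to_excel('output.xlsx')  # columns now come out as key1..key5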

Related

generate a unique integer ID from multiple columns of a SQL table

I need to generate a new column in a table in a SQL database.
The given table is:
id1 value1 value2 value3 value4
9465 387 801 1990 20
All columns are integers. value1 and value2 are always 3 digits, value3 is a year, and value4 has at most 3 digits.
I need to generate a value by combining value1 through value4; call it "value_combine". The "value_combine" should be unique: different combinations of value1 to value4 should produce different values of "value_combine".
Then "value_combine" is combined with id1 so that the new value (call it final_id) is also unique: different combinations of id1 and value_combine should produce different final_id values.
The final_id can be used to identify each unique combination of id1 and value1-4.
The final_id MUST be an integer, and all values should have the same number of digits, such as 6, 7 or 8 digits.
Any help would be appreciated.
Perhaps I'm missing something, but it sounds like DENSE_RANK() would do:
SELECT id1, value1, value2, value3, value4,
       DENSE_RANK() OVER (ORDER BY id1, value1, value2, value3, value4) + 100000 AS final_ID
FROM YourTable
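For readers following along in pandas (the theme of the rest of this page), a hedged sketch of the same dense-rank idea using groupby().ngroup(), which assigns one integer per unique combination; the sample frame below is made up:
import pandas as pd

# One dense integer per unique (id1, value1..value4) combination,
# offset by 100001 so every ID has the same six-digit width.
df = pd.DataFrame({'id1': [9465, 9465], 'value1': [387, 387],
                   'value2': [801, 801], 'value3': [1990, 1991],
                   'value4': [20, 20]})
cols = ['id1', 'value1', 'value2', 'value3', 'value4']
df['final_ID'] = df.groupby(cols).ngroup() + 100001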

Split DataFrame by date and recombine by appending new records and overwriting existing

Edit: I updated the description below to try to make it more clear what I am trying to accomplish.
I am a fairly new Python user (I usually use R but I am trying to learn Python). I am trying to use pandas to accomplish the following.
I have a DataFrame (df) similar to the one below (my real dataset has many more columns):
PROG.ID TITLE STATUS DataDate
--------- ------- -------- --------------
KEY1 Key 1 A 2007-01-01
KEY2 Key 2 A 2007-01-01
KEY3 Key 3 A 2008-07-01
KEY2 Key 2 I 2009-07-01
KEY4 Key 4 A 2010-01-01
I am trying to output multiple files (one per date) based on the DataDate field, like the following:
In File_2007-01-01.csv:
PROG.ID TITLE STATUS DataDate
--------- ------- -------- --------------
KEY1 Key 1 A 2007-01-01
KEY2 Key 2 A 2007-01-01
Both KEY1 and KEY2 added as these are the only records with this date.
In File_2008-07-01.csv:
PROG.ID TITLE STATUS DataDate
--------- ------- -------- --------------
KEY1 Key 1 A 2007-01-01
KEY2 Key 2 A 2007-01-01
KEY3 Key 3 A 2008-07-01
KEY3 was added since it was not there.
In File_2009-07-01.csv:
PROG.ID TITLE STATUS DataDate
--------- ------- -------- --------------
KEY1 Key 1 A 2007-01-01
KEY2 Key 2 I 2009-07-01
KEY3 Key 3 A 2008-07-01
Notice KEY2 has been replaced with the most recent record. The other records remain unchanged.
And in File_2010-01-01.csv:
PROG.ID TITLE STATUS DataDate
--------- ------- -------- --------------
KEY1 Key 1 A 2007-01-01
KEY2 Key 2 I 2009-07-01
KEY3 Key 3 A 2008-07-01
KEY4 Key 4 A 2010-01-01
KEY4 was added. Other records remain unchanged.
I have tried using code like the following (but this does not work):
df = df.set_index('PROG.ID')
result = pd.DataFrame()
for key, data in df.groupby('DataDate'):
    if result.empty:
        result = data.copy()
    else:
        result.combine_first(data)
        result.update(data)
    result.to_csv('./File_{dt}.csv'.format(dt=key))
The first file gets written correctly, but all subsequent files have the same data as the first.
It is my understanding that combine_first() will keep all of result and add the rows from data that are not already in result, while update() will overwrite the values in result by the values in data where the keys already exist in result. Just for completeness, I tried update() before combine_first() as well.
Unfortunately, this does not work as expected. I have looked at other questions that have been answered in the past, but none that I found answers how to update existing records while appending new ones.
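As an aside, the symptom described (later files identical to the first) is consistent with combine_first() returning a new DataFrame rather than modifying result in place; a minimal illustrative sketch:
import pandas as pd

# combine_first returns a new frame; calling it without assignment is a no-op.
a = pd.DataFrame({'x': [1]}, index=['KEY1'])
b = pd.DataFrame({'x': [2]}, index=['KEY2'])
a.combine_first(b)      # return value discarded; a is unchanged
a = a.combine_first(b)  # assignment is needed to keep the combined rows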
And to answer the likely question: we have an existing workflow that takes data formatted like the output above and processes it, and I need this data to flow through that same workflow.
Any insight would be greatly appreciated.
IIUC:
df = df.sort_values('DataDate')
for d in df['DataDate'].dt.strftime('%Y-%m-%d').unique():
    df.loc[df['DataDate'] <= d] \
      .groupby('PROG.ID', as_index=False).last() \
      .to_csv(r'd:/temp/File_{}.csv'.format(d), index=False)
Results:
File_2007-01-01.csv
PROG.ID,TITLE,STATUS,DataDate
KEY1,Key 1,A,2007-01-01
KEY2,Key 2,A,2007-01-01
File_2008-07-01.csv
PROG.ID,TITLE,STATUS,DataDate
KEY1,Key 1,A,2007-01-01
KEY2,Key 2,A,2007-01-01
KEY3,Key 3,A,2008-07-01
File_2009-07-01.csv
PROG.ID,TITLE,STATUS,DataDate
KEY1,Key 1,A,2007-01-01
KEY2,Key 2,I,2009-07-01
KEY3,Key 3,A,2008-07-01
File_2010-01-01.csv
PROG.ID,TITLE,STATUS,DataDate
KEY1,Key 1,A,2007-01-01
KEY2,Key 2,I,2009-07-01
KEY3,Key 3,A,2008-07-01
KEY4,Key 4,A,2010-01-01
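Note that the .dt accessor above assumes DataDate is a datetime column; if the frame was read from a CSV it may hold plain strings, in which case a conversion step (a hedged guess at the needed prep) comes first:
# Hypothetical prep: make DataDate a real datetime so .dt and the <= filter work.
df['DataDate'] = pd.to_datetime(df['DataDate'])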

How to make a Spark RDD distinct by key?

Now I have an RDD whose records are as follows:
key1 value1
key1 value2
key2 value3
key3 value4
key3 value5
I want to get the RDD records with distinct keys, as follows:
key1 value1
key2 value3
key3 value4
I can only use the Spark Core APIs, and I don't want to aggregate the values of the same key.
You could do this with PairRDDFunctions.reduceByKey. Assuming you have an RDD[(K, V)]:
rdd.reduceByKey((a, b) => if (someCondition) a else b)
With DataFrames and collect_set:
sqlContext.createDataFrame(rdd).toDF("k", "v")
  .groupBy("k")
  .agg(collect_set(col("v")))
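A hedged PySpark sketch of the reduceByKey approach, sticking to Spark Core (the sample pairs mirror the question; which value survives per key depends on partition order):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([
    ('key1', 'value1'), ('key1', 'value2'),
    ('key2', 'value3'),
    ('key3', 'value4'), ('key3', 'value5'),
])

# Keep a single value per key without collecting the rest.
distinct_by_key = rdd.reduceByKey(lambda a, b: a)
print(sorted(distinct_by_key.collect()))
# e.g. [('key1', 'value1'), ('key2', 'value3'), ('key3', 'value4')]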

Python Pandas Create Records from Complex Dictionary

I have processed some very complex nested JSON objects to get the following general dictionary format:
{'key1':'value1',
'key2':'value2',
'key3':'value3',
'key4':'value4',
'key5':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
'key6':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}
In the list of lists, each inner list represents what should be an "individual transaction". Each transaction shares the key1, key2, key3, key4 pairs, and there can be an arbitrary number of inner lists. I am trying to efficiently turn these into records in a pandas DataFrame like the following:
key1_field, key2_field, key3_field, key4_field, key5_or_key6_field_1, key5_or_key6_field_2, key5_or_key6_field_3, key5_or_key6_indicator
value1, value2, value3, value4, value5, value6, value7, key5
value1, value2, value3, value4, value5, value6, value7, key6
value1, value2, value3, value4, value8, value9, value10, key5
value1, value2, value3, value4, value8, value9, value10, key6
Any assistance would be sincerely appreciated! It has been a challenge getting this far. Thanks!
EDIT:
As requested, here is how I have been approaching this:
import pandas as pd
import numpy as np
d = {'key1':'value1',
'key2':'value2',
'key3':'value3',
'key4':'value4',
'key5':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
'key6':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
My remaining issue is that the single key values are NaN after the first row.
Try this:
pd.DataFrame({k: pd.Series(v) for k, v in d.items()}).ffill()
One option is to read the dictionary as it is and reshape the data frame:
df = pd.DataFrame({'key1':'value1',
'key2':'value2',
'key3':'value3',
'key4':'value4',
'key5':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
'key6':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]})
df.set_index(['key1', 'key2', 'key3', 'key4']).stack().apply(pd.Series) \
  .rename(columns=lambda x: "value_" + str(x)).reset_index()
# key1 key2 key3 key4 level_4 value_0 value_1 value_2
# 0 value1 value2 value3 value4 key5 value5 value6 value7
# 1 value1 value2 value3 value4 key6 value5 value6 value7
# 2 value1 value2 value3 value4 key5 value8 value9 value10
# 3 value1 value2 value3 value4 key6 value8 value9 value10
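For reference, a hedged alternative that builds the records directly in plain Python before handing them to pandas; the column names (field_1, indicator) are illustrative, not from the question:
import pandas as pd

d = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3', 'key4': 'value4',
     'key5': [['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
     'key6': [['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}

# Shared scalar fields, repeated onto every transaction row.
base = {k: d[k] for k in ('key1', 'key2', 'key3', 'key4')}
records = [
    {**base, 'field_1': lst[0], 'field_2': lst[1], 'field_3': lst[2],
     'indicator': src}
    for src in ('key5', 'key6')
    for lst in d[src]
]
df = pd.DataFrame.from_records(records)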

Always display columns in rdlc Matrix

Using a Tablix Matrix, I am aggregating a count based on the values in a single column, with results similar to the below.
Value1 Value2 Value3 Value4
4 2 3 3
The problem is that with certain criteria there might not be a Value2 or Value3. When this happens, the column does not appear at all. How do I force it to show all columns even if there is nothing to count? Thanks
Data source:
FieldToCount PersonName
Value1 Mandy
Value3 John
Value3 Jack
Value4 Jack
Value2 Mandy
Value1 John
If you use the Count function in the report, try writing this formula in the column cell:
=IIf(Count(Fields!FieldToCount.Value) > 0,
Count(Fields!FieldToCount.Value),
0)
Have you tried setting
RepeatColumnHeaders = True
? It is located in the Properties box after you select the whole matrix.
