Python: IndexError about 'Trending #Hashtags'

I am working through the Grok Learning Python courses and have run into a problem.
The question is as follows:
the capitalisation and punctuation in hashtags is inconsistent. You decide to write a program to read in tweets, normalise any hashtags present, and print out a tally of frequencies. Hashtags should only include words starting with #. All punctuation should be removed from the end of a hashtag, and the letters should be converted to lowercase. For instance, #Python! should be normalised to #python, and #Today_I_Learned... should be #today_i_learned.
The output is meant to be
Tweet: #Python is #AWESOME!
Tweet: This is #So_much_fun #awesome
#so_much_fun 1
#awesome 2
#python 1
And my code is as below
from collections import Counter
import string
tweet = input('Tweet: ')
lis = tweet.lower().strip().split()
lis_hash = []
while tweet:
    for i in lis:
        i = i.rstrip(string.punctuation)
        if i[0] == '#':
            lis_hash.append(i)
    tweet = input('Tweet: ')
    lis = tweet.lower().strip().split()
ans = Counter(lis_hash)
for i in ans:
    print(i, ans[i])
My code works on the example, but when I tried to submit it, this error appeared: 'Testing a long example. Your submission raised an exception of type IndexError. This occurred on line 9 of your submission.' Line 9 is 'if i[0] == '#':' in my code.
I don't understand this error. Could anyone help me?

Try printing i with the input string test , tost. You will see that in the second iteration i is an empty string (''), because the lone comma is stripped away by rstrip(string.punctuation). Indexing an empty string (''[0]) then raises the IndexError you are seeing.
So you need if i != '' and i[0] == '#': or something similar, such as if i.startswith('#'):.
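A minimal Python 3 sketch of that fix, for illustration only. The function name tally_hashtags, and passing the tweets in as a list instead of reading them with input(), are assumptions made so the logic can be run on its own. It relies on str.startswith, which returns False for an empty string instead of raising IndexError.

```python
from collections import Counter
import string


def tally_hashtags(tweets):
    """Count normalised hashtags in an iterable of tweet strings."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.lower().split():
            # Strip trailing punctuation; a lone ',' becomes '' here.
            word = word.rstrip(string.punctuation)
            # startswith() is safe on '', unlike word[0].
            if word.startswith('#'):
                counts[word] += 1
    return counts


sample = ['#Python is #AWESOME!', 'This is #So_much_fun #awesome', 'test , tost']
for tag, count in tally_hashtags(sample).items():
    print(tag, count)
```

The third sample line is the input that used to crash: its middle token is reduced to an empty string and is now skipped quietly.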


Deciphering Python word unscrambler

I'm currently trying to dig deep into python and I have found a challenge on ( that I'm trying to crack. I have to unscramble 10 words that are found in the provided wordlist.
def permutation(s):
    if s == "":
        return [s]
    else:
        ans = []
        for an in permutation(s[1:]):
            for pos in range(len(an)+1):
                ans.append(an[:pos]+s[0]+an[pos:])
        return ans

def dictionary(wordlist):
    dict = {}
    infile = open(wordlist, "r")
    for line in infile:
        word = line.split("\n")[0]
        # all words in lower case!!!
        word = word.lower()
        dict[word] = 1
    infile.close()
    return dict

def main():
    diction = dictionary("wordlist.txt")
    # enter all the words that fit on a line or limit the number
    anagram = raw_input("Please enter space separated words you need to unscramble: ")
    wordLst = anagram.split(None)
    for word in wordLst:
        anaLst = permutation(word)
        for ana in anaLst:
            if diction.has_key(ana):
                diction[ana] = word
                #print "The solution to the jumble is" , ana
    solutionLst = []
    for k, v in diction.iteritems():
        if v != 1:
            solutionLst.append(k)
            print "%s unscrambled = %s" % (v, k)
    print solutionLst

main()
The function permutation looks like it is the block of code that actually does the deciphering. Can you help me understand how it programmatically solves this?
The pseudocode for that looks something like:
    Load the word list (dictionary)
    Input the words to unscramble
    For each word:
        Find every permutation of letters in that word (permutation)
        For each permutation:
            Add this permutation to the solution list if it exists in the dictionary
    Print the solutions that were found.
The dictionary() function is populating your word list from a file.
The permutation() function returns every permutation of letters in a given word.
The permutation() function is doing the following:
for an in permutation(s[1:]):
s[1:] returns the string with the first character truncated. You'll see that it uses recursion to call permutation() again until there are no characters left to be truncated from the front. You must know recursion to understand this line. Using recursion allows this algorithm to cover every number of letters and still be elegant.
for pos in range(len(an)+1):
For each position in an, the shorter permutation returned by the recursive call (including the positions before its first letter and after its last):
Generate a new permutation by inserting the first letter (which we truncated earlier) at that position.
So, take the word "watch" for example. After recursion, when an is "atch", the loop generates the following words:
watch
awtch
atwch
atcwh
atchw
All I did to generate those words was take the first letter and shift its position. Continue that, combined with truncating the letters, and you'll create every permutation.
(wow, this must be my longest answer yet)
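To see the insertion step in isolation, here is a small Python 3 sketch. permutation follows the recursive scheme described above; insertions is a hypothetical helper, not part of the original code, that performs just the inner loop for a single shorter permutation.

```python
def permutation(s):
    """Return every ordering of the letters in s (repeats included)."""
    if s == "":
        return [s]
    ans = []
    for an in permutation(s[1:]):
        for pos in range(len(an) + 1):
            # Put the first letter of s at position pos of the shorter permutation.
            ans.append(an[:pos] + s[0] + an[pos:])
    return ans


def insertions(first, rest):
    """Hypothetical helper: the inner loop alone, for one value of `an`."""
    return [rest[:pos] + first + rest[pos:] for pos in range(len(rest) + 1)]


print(insertions("w", "atch"))    # ['watch', 'awtch', 'atwch', 'atcwh', 'atchw']
print(len(permutation("watch")))  # 120, i.e. 5!
```

The second line of output shows why the approach slows down quickly: the number of candidates grows as the factorial of the word length.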
There is a much better solution. This code is highly inefficient if there are many long words. A better idea is to sort the letters of each word in the dictionary, so that 'god' becomes 'dgo', and do the same for each scrambled word; two words are anagrams exactly when their sorted letters match. Sorting costs O(n log n) for a word of n letters, instead of the O(n!) needed to generate every permutation.
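A minimal Python 3 sketch of the sorted-letters idea. The names signature, build_index and unscramble are illustrative assumptions, and the short in-line word list stands in for the challenge's wordlist file.

```python
from collections import defaultdict


def signature(word):
    """Letters of the word in sorted order, e.g. 'god' -> 'dgo'."""
    return "".join(sorted(word.lower()))


def build_index(words):
    """Map each signature to the dictionary words that share it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word.lower())
    return index


def unscramble(scrambled, index):
    """Return every dictionary word that is an anagram of `scrambled`."""
    return index.get(signature(scrambled), [])


index = build_index(["god", "dog", "watch", "cat"])  # stand-in word list
print(unscramble("odg", index))    # ['god', 'dog']
print(unscramble("tchaw", index))  # ['watch']
```

Building the index costs one sort per dictionary word, after which each scrambled word needs only one sort and one dictionary lookup.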
I wrote this code for that site too.
Working code below:
def import_dictionary():
    dictionary = []
    file = open("C:\\Users\\Mason\\Desktop\\diction.txt", "r")  # location of your dictionary or provided wordlist
    fileContents = file.readlines()  # read text file and store each new line as a string
    file.close()
    for i in range(len(fileContents)):
        dictionary.extend(fileContents[i].split())  # split() drops the \n's from line breaks in the text file
    return dictionary

def import_scrambled_words():
    scrambledWords = []
    file = open("C:\\Users\\Mason\\Desktop\\scrambled.txt", "r")  # location of your scrambled word file
    fileContents = file.readlines()  # read text file and store each new line as a string
    file.close()
    for i in range(len(fileContents)):
        scrambledWords.extend(fileContents[i].split())  # split() drops the \n's from line breaks in the text file
    return scrambledWords

def unscramble_word(scrambledWord):
    countToMatch = len(scrambledWord)
    matchedWords = []
    string = ""
    for word in dictionary:
        count = 0
        for x in scrambledWord:
            if x in word:
                count += 1
        # every scrambled letter occurs somewhere in this word
        # (membership only: repeated letters are not counted exactly)
        if count == countToMatch:
            matchedWords.append(word)
    for matchedWord in matchedWords:
        if len(matchedWord) == len(scrambledWord):
            string = matchedWord
            break  # this returns only one unscrambled word
    return string

if __name__ == '__main__':
    finalString = ""
    try:
        scrambled = import_scrambled_words()
        dictionary = import_dictionary()
        for x in scrambled:
            finalString += unscramble_word(x)
            finalString += ", "
        print(finalString)
    except Exception as e:
        print(e)
This code reads scrambled words from a saved file and checks them against a wordlist (I used a full dictionary in my case, just to be thorough). To beat the challenge in the allotted 30 seconds, I copied the scrambled words from HackThisSite, pasted them into my scrambled-word file, saved it, ran the program, and then copied and pasted the output from my Python console.

Matching Keywords in a List to a Line of Words in Python

The following are two examples of many lines that I need to analyze and extract specific words from.
[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl
[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 I wish I could lay up with the love of my life And watch cartoons all day.
The coordinates and numbers are to be ignored.
The task is to find how many of the words in each tweet line are present in this keyword list:
['hate', 1]
['hurt', 1]
['hurting', 1]
['like', 5]
['lonely', 1]
['love', 10]
And also, find the sum of the values (e.g ['love', 10]) of the keywords found in each tweet line.
For example, for the sentence
'I hate to feel lonely at times'
The sum of sentiments values for hate=1 and lonely=1 is equal to 2.
And the no. of words in the line is 7.
I've tried using a list of lists, and also going through each sentence and keyword one at a time, but neither worked: there are many tweets and many keywords, so I need a loop to find the values.
Appreciate your insight in advance!! :)
My Code:
try:
    KeywordFileName = input('Input keyword file name: ')
    KeywordFile = open(KeywordFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    raise SystemExit(1)

Keywords = []                            # [word, value] pairs such as ['love', 10]
KeyLine = KeywordFile.readline()
while KeyLine != '':
    KeyPair = KeyLine.rstrip().split(',')
    if len(KeyPair) == 2:                # skip blank or malformed lines
        KeyPair[1] = int(KeyPair[1])
        Keywords.append(KeyPair)
    KeyLine = KeywordFile.readline()     # read the next line last, so none is skipped

try:
    TweetFileName = input('Input Tweet file name: ')
    TweetFile = open(TweetFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    raise SystemExit(1)

TweetLine = TweetFile.readline()
while TweetLine != '':
    TweetLine = TweetLine.rstrip()       # one tweet per line; matching still to be added here
    TweetLine = TweetFile.readline()
You can use a simple regular expression to strip out the numbers, a tokenizer to split the remaining text into words, and a Counter to count the occurrences of each word in your sample string.
from nltk.tokenize import word_tokenize  # needs NLTK and its tokenizer data installed
import collections
import re
text = '[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl'
num_regex = re.compile(r"[+-]?\d+(?:\.\d+)?")  # integers and decimals, with optional sign
text = num_regex.sub('', text)
words = word_tokenize(text)
final_list = collections.Counter(words)
print(final_list)
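For the two numbers the question actually asks for (the word count and the sentiment sum of each line), a standard-library sketch in Python 3 might look like the following. It rests on several assumptions: the keyword values have already been loaded into a dict, each tweet line has the layout "[lat, lon] number date time text", and a word is any run of letters or apostrophes. score_tweet and PREFIX are hypothetical names.

```python
import re

# Assumed line layout: "[lat, lon] <number> <date> <time> <tweet text>"
PREFIX = re.compile(r"^\[[^\]]*\]\s+\S+\s+\S+\s+\S+\s+")


def score_tweet(line, keywords):
    """Return (word_count, keywords_found, sentiment_sum) for one line."""
    text = PREFIX.sub("", line.strip())           # drop coordinates, number, date, time
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer: letters and apostrophes
    found = [word for word in words if word in keywords]
    return len(words), found, sum(keywords[word] for word in found)


keywords = {"hate": 1, "hurt": 1, "hurting": 1, "like": 5, "lonely": 1, "love": 10}
print(score_tweet("I hate to feel lonely at times", keywords))
# (7, ['hate', 'lonely'], 2)
```

Calling score_tweet once per line inside the loop that reads the tweet file would give the per-tweet values; a keyword that appears twice in a line is counted twice in the sum.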

output differences between 2 texts when lines are dissimilar

I am relatively new to Python, so apologies in advance for sounding a bit ditzy sometimes. I'll try to google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see if there is a difference in the stylometry of a novel in the second edition, after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear so not just 'the' once, but every time the program encounters it when it differs from the first edition (yep I know I'm asking for a lot sorry)
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
    for j in list2:
        if i==j:
            file3.write(i)
but of course this doesn't work because the texts are two giant balls of texts and not separate lines that can be compared, plus the first text has far more lines than the second one. Is there a way to go from lines to 'words' or the text in general to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ
s1 = open("FRANKENST18.txt", "r")
s1 = open("FRANKENST31.txt", "r")
def appendBoldChanges(s1, s2):
    #"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
    dif = list(Differ().compare(l1, l2))
    return " ".join(['<b>'+i[2:]+'</b>' if i[:1] == '+' else i[2:] for i in dif
                     if not i[:1] in '-?'])
print appendBoldChanges
but I couldn't get it to work.
So my question is: is there any way to output the differences between texts whose lines don't match up like this? It sounded quite doable, but I greatly underestimated how difficult I would find Python, haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
wordlist1 = {}
for word1 in text1:
    if word1 not in wordlist1:
        wordlist1[word1] = 1
    else:
        wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
    #print "%s %s" % (k, v)
    col1 = ("%s %s" % (k, v))
    print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
wordlist2 = {}
for word2 in text2:
    if word2 not in wordlist2:
        wordlist2[word2] = 1
    else:
        wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
    #print "%s %s" % (k, v)
    col2 = ("%s %s" % (k, v))
    print col2
what I hope still to edit and output is something like this:
using the dictionaries' key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 0}?
You want to output:
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
Interesting. A set difference is what you need.
import re
s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()
# "[A-Za-z]+" matches whole words; without the + it would match single letters
words_s1 = re.findall("[A-Za-z]+", s1)
words_s2 = re.findall("[A-Za-z]+", s2)
set_s1 = set(words_s1)
set_s2 = set(words_s2)
words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1
words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)
with open("s1_output","w") as s1_output:
    s1_output.write(words_in_s1)
with open("s2_output","w") as s2_output:
    s2_output.write(words_in_s2)
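A set keeps each word only once, whereas the question asks for a word every time it occurs. One possible variation, sketched here in Python 3 under the assumption that word order does not matter, is to subtract collections.Counter objects instead of sets. word_counts and count_differences are hypothetical names, and the two short strings stand in for the full texts.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words; the pattern keeps apostrophes inside words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def count_differences(text1, text2):
    """Return (surplus words in text1, surplus words in text2) as Counters."""
    counts1, counts2 = word_counts(text1), word_counts(text2)
    # Counter subtraction drops zero and negative results.
    return counts1 - counts2, counts2 - counts1


only1, only2 = count_differences("the cat and the dog", "the dog and a bird")
print(sorted(only1.elements()))  # ['cat', 'the']
print(sorted(only2.elements()))  # ['a', 'bird']
```

Here 'the' shows up once in the first result because the first text uses it one more time than the second. This compares totals only; it does not say where in the novel a word was added or removed.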
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through lines of a file, which you can do very easily in python. Here's an example, where I omit the newline character at the end of each line, and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
    lines.append(line.rstrip('\n'))  # drop the newline, keep the rest of the line
f.close()
Hope this helps!
I'm not completely sure if you're trying to compare the differences in words as they occur or lines as they occur, however one way you could do this is by using a dictionary. If you want to see which lines change you could split the lines on periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which contains the entire text I assume) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with dict = {}, loop over each sentence in the previously created array, make it a key in the dictionary with a corresponding value (could be anything since most sentences probably don't occur more than once). After doing this for the first version you can go through the second version and check which sentences are the same. Here is some code that will give you a start (assuming version1 contains all the sentences from the first version):
for sentence in version1:
    dict[sentence] = 1 #put a counter for each sentence
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
    if sentence in dict: #if the sentence is in the dictionary
        dict[sentence] += 1
        #or do whatever you want here
    else: #if the sentence isn't
        print(sentence)
Again not sure if this is what you're looking for but hope it helps

Joining Strings on New Lines Error Python

Almost there with this one!
Taking user input and removing any trailing punctuation and non-hashed words to spot trends in tweets. Don't ask!
tweet = input('Tweet: ')
tweets = ''
while tweet != '':
    tweets += tweet
    tweet = input('Tweet: ')
print (tweets) # only using this to spot where things are going wrong!
listed_tweets = tweets.lower().rstrip('\'\"-,.:;!?').split(' ')
hashed = []
for entry in listed_tweets:
    if entry[0] == '#':
        hashed.append(entry)
from collections import Counter
trend = Counter(hashed)
for item in trend:
    print (item, trend[item])
Which works, apart from the fact that I get:
Tweet: #Python is #AWESOME!
Tweet: This is #So_much_fun #awesome
#Python is #AWESOME!This is #So_much_fun #awesome
#awesome!this 1
#python 1
#so_much_fun 1
#awesome 1
Instead of:
#so_much_fun 1
#awesome 2
#python 1
So I'm not getting a space at the end of each line of input and it's throwing my list!
It's probably very simple, but after 10hrs straight of self-teaching, my mind is mush!!
The problem is with this line:
tweets += tweet
You're taking each tweet and appending it to the previous one. Thus, the last word of the previous tweet gets joined with the first word of the current tweet.
There are various ways to solve this problem. One approach is to process the tweets one at a time. Start out with an empty array for your hashtags, then do the following in a loop:
1. read a line from the user
2. if the line is empty, break out of the loop
3. otherwise, extract the hashtags and add them to the array
4. return to step 1
The following code incorporates this idea and makes several other improvements. Notice how the interactive loop is written so that there's only one place in the code where we prompt the user for input.
hashtags = []
while True: # Read and clean each line of input.
    tweet = input('Tweet: ').lower().rstrip('\'\"-,.:;!?')
    if tweet == '': # Check for empty input.
        break
    print('cleaned tweet: '+tweet) # Review the cleaned tweet.
    for word in tweet.split(): # Extract hashtags.
        if word[0] == '#':
            hashtags.append(word)
from collections import Counter
trend = Counter(hashtags)
for item in trend:
    print (item, trend[item])
If you continue working on tweet processing, I suspect that you'll find that your tweet-cleaning process is inadequate. What if there is punctuation in the middle of a tweet, for example? You will probably want to embark on the study of regular expressions sooner or later.
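As a taste of the regular-expression route, here is a short Python 3 sketch. The pattern #\w+ and the function name count_hashtags are assumptions chosen for illustration: the pattern takes a '#' followed by letters, digits or underscores, so punctuation is left behind wherever it sits in the tweet. Note that it would also pick up a '#' in the middle of a word, which may or may not be what you want.

```python
import re
from collections import Counter

# '#' followed by letters, digits or underscores; punctuation is left out.
HASHTAG = re.compile(r"#\w+")


def count_hashtags(tweets):
    """Tally lower-cased hashtags across an iterable of tweet strings."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(tweet))
    return counts


trend = count_hashtags(["#Python is #AWESOME!", "This is #So_much_fun #awesome"])
for item in trend:
    print(item, trend[item])
```

Because each tweet is searched separately, the last word of one tweet can no longer run into the first word of the next.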

Term split by hashtag of multiple words

I am trying to split a term which contains a hashtag of multiple words, such as "#I-am-great" or "#awesome-dayofmylife".
then the output that I am looking for is:
I am great
awesome day of my life
All I could achieve is:
>>> import re
>>> name = "big #awesome-dayofmylife because #iamgreat"
>>> name = re.sub(r'#([^\s]+)', r'\1', name)
>>> print name
big awesome-dayofmylife because iamgreat
If I am asked whether I have a list of possible words then the answer is 'No' so if I can get guidance in that then that would be great. Any NLP experts?
All the commentators above are correct of course: A hashtag without spaces or other clear separators between the words (especially in English) is often ambiguous and cannot be parsed correctly in all cases.
However, the idea of the word list is rather simple to implement and might yield useful (albeit sometimes wrong) results nevertheless, so I implemented a quick version of that:
import re
wordList = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()
wordOr = '|'.join(wordList)
def splitHashTag(hashTag):
    for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
        print ':', wordSequence
        for word in re.findall(wordOr, wordSequence):
            print word,
        print
for hashTag in '''awesome-dayofmylife iamgreat something
somethingsunclear'''.split():
    print '###', hashTag
    splitHashTag(hashTag)
This prints:
### awesome-dayofmylife
: awesome
awesome
: dayofmylife
day of my life
### iamgreat
: iamgreat
i am great
### something
: something
something
### somethingsunclear
: somethingsunclear
something sun clear
And as you see it falls into the trap qstebom has set for it ;-)
Some explanations of the code above:
The variable wordOr contains a string of all words, separated by a pipe symbol (|). In regular expressions that means "one of these words".
The first findall gets a pattern which means "a sequence of one or more of these words", so it matches things like "dayofmylife". The findall finds all these sequences, so I iterate over them (for wordSequence in …). For each word sequence then I search each single word (also using findall) in the sequence and print that word.
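The same two-step findall idea can be written for Python 3 so that it returns the words instead of printing them. This is only a sketch: split_hashtag, WORD_RE and SEQ_RE are hypothetical names, and re.escape is added as a precaution in case a word list ever contains regex metacharacters.

```python
import re

WORDS = ("awesome day of my life because i am great something some "
         "thing things unclear sun clear").split()

# "One of these words": alternatives are tried in list order.
WORD_RE = re.compile("|".join(re.escape(word) for word in WORDS))
# "A sequence of one or more of these words".
SEQ_RE = re.compile("(?:" + WORD_RE.pattern + ")+")


def split_hashtag(tag):
    """Return the known words found in tag, in order of appearance."""
    words = []
    for sequence in SEQ_RE.findall(tag):
        words.extend(WORD_RE.findall(sequence))
    return words


print(" ".join(split_hashtag("awesome-dayofmylife")))  # awesome day of my life
print(" ".join(split_hashtag("somethingsunclear")))    # something sun clear
```

The second call shows the ambiguity again: because "something" is listed before "some" and "things", the tag is never read as "some things unclear".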
The problem can be broken down to several steps:
Populate a list with English words
Split the sentence into terms delimited by white-space.
Treat terms starting with '#' as hashtags
For each hashtag, find words by longest match by checking if they exist in the list of words.
Here is one solution using this approach:
# Returns a list of common english terms (words)
def initialize_words():
    content = None
    with open(r'C:\wordlist.txt') as f: # A file containing common english words
        content = f.readlines()
    return [word.rstrip('\n') for word in content]

def parse_sentence(sentence, wordlist):
    new_sentence = "" # output
    terms = sentence.split(' ')
    for term in terms:
        if term.startswith('#'): # this is a hashtag, parse it
            new_sentence += parse_tag(term, wordlist)
        else: # Just append the word
            new_sentence += term
        new_sentence += " "
    return new_sentence

def parse_tag(term, wordlist):
    words = []
    # Remove hashtag, split by dash
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)
        while word != None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word): # Special case for when eating rest of word
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return " ".join(words)

def find_word(token, wordlist):
    # Try the longest prefix first, then shorter ones
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None

wordlist = initialize_words()
sentence = "big #awesome-dayofmylife because #iamgreat"
parse_sentence(sentence, wordlist)
It prints:
'big awe some day of my life because i am great '
You will have to remove the trailing space, but that's easy. :)
I got the wordlist from