In almost any Twitter-based NLP analysis, you end up dealing with bots, even when collecting tweets about vegetables! Our dataset contained many copies of promotional tweets in which the text was identical but the links and users differed. We remove these duplicates by first stripping the URLs from the tweets and then calling the pandas drop_duplicates method.

Because all URLs in tweets start with https://t.co/, they are easy to remove. We will create a new column in our dataframe that holds each tweet without its URLs. The following line, given a tweet, returns the tweet without URLs:
' '.join([tk for tk in tweet.split(' ') if 'https://t.co/' not in tk])
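As a minimal sketch of how the two steps might fit together, the snippet below wraps the line above in a function, applies it to a dataframe, and drops the duplicates. The sample data, the column names (user, tweet, tweet_no_url), and the helper name remove_urls are assumptions made for illustration, not taken from the original dataset.

```python
import pandas as pd


def remove_urls(tweet: str) -> str:
    """Return the tweet with every token containing a t.co link dropped."""
    return ' '.join(tk for tk in tweet.split(' ') if 'https://t.co/' not in tk)


# Hypothetical sample data: two promotional tweets that differ only in
# their link and user, plus one unrelated tweet.
df = pd.DataFrame({
    'user': ['a', 'b', 'c'],
    'tweet': [
        'Buy fresh kale today https://t.co/abc123',
        'Buy fresh kale today https://t.co/xyz789',
        'I love roasted carrots',
    ],
})

# New column holding the tweet text without URLs (column name is an assumption).
df['tweet_no_url'] = df['tweet'].apply(remove_urls)

# Keep the first occurrence of each URL-free text and drop the rest.
deduped = df.drop_duplicates(subset='tweet_no_url').reset_index(drop=True)
```

Passing subset='tweet_no_url' makes drop_duplicates compare only the cleaned text, so rows that differ in user or original link still count as duplicates.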
When working with pandas ...