Removing duplicate tweets

In any Twitter-based NLP analysis, you end up dealing with bots, even when collecting tweets about vegetables! In our dataset, we had many versions of promotional tweets in which the text was identical across tweets but the links and users were different. We remove duplicate tweets by first removing the URLs from the tweets and then using the pandas drop_duplicates method. Since all URLs in tweets start with https://t.co/, they are easy to remove. We will create a new tweet column without URLs in our dataframe. We enter the following line, which, given a tweet, returns the tweet without URLs:

' '.join([tk for tk in tweet.split(' ') if 'https://t.co/' not in tk])
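
The two steps described above, stripping URLs and then dropping duplicates, can be sketched end to end as follows. This is a minimal illustration rather than the book's exact code: the sample dataframe, the column names text and no_urls, and the helper name strip_urls are all assumptions made for the example.

```python
import pandas as pd


def strip_urls(tweet: str) -> str:
    """Return the tweet with every token containing a t.co link removed."""
    return ' '.join(tk for tk in tweet.split(' ') if 'https://t.co/' not in tk)


# Hypothetical sample data: two promotional tweets that differ only by link and user.
df = pd.DataFrame({
    'user': ['a', 'b', 'c'],
    'text': [
        'Buy fresh kale today https://t.co/abc123',
        'Buy fresh kale today https://t.co/xyz789',
        'I love roasted carrots',
    ],
})

# New column holding the tweet text without URLs (column name is an assumption).
df['no_urls'] = df['text'].apply(strip_urls)

# Keep the first occurrence of each URL-free text and drop the rest.
deduped = df.drop_duplicates(subset='no_urls').reset_index(drop=True)
```

Deduplicating on the URL-free column rather than on the raw text is what lets the two promotional tweets collapse into one, since their original text differs only in the link.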

When working with pandas ...
