…And we’re all looking forward to JB’s blog post this week…
I tried to think of something to blog about that my coworkers might respect while trying to learn something new, like Python. Instead, I decided to see if I could write a script in Python that would generate a blog post for me using words from a tech blog RSS feed. Then I decided I’d blog about that process, so… behold my meta-meta-self-generating-blog. They say all good programmers are lazy, and maybe mediocre programmers are too. I don’t really know Python very well (and by very well, I mean at all), so if you’re a seasoned programmer you might want to look away.
First, I needed some rules on what the output looks like. The rules:
- Find/parse an RSS feed from a tech blog
- Find the description for each item in the feed
- Pick random words from each description
- Piece together random words to make:
  - A sentence = 17-21 words followed by a punctuation mark (maybe randomly choose between a ., ! or ? if time allows).
  - A paragraph = 4-6 sentences.
- Randomly generate 3-5 paragraphs.
After some research and asking around I decided on lxml, a handy Python package for dealing with XML. We’re definitely going to want that. Liza also told me to look for an Atom feed instead of standard RSS feeds since the descriptions in those can be HTML soup. Funny thing about Atom feeds: where do you find them? Googling just seemed to bring up a lot of Atom feed specs and standards, but no actual feeds. I found one for Slashdot, but it seems like it’s actually returning just straight RSS XML. It has more technical words than Engadget though, so we’ll use it.
The plan so far is to loop through the descriptions I find, strip special characters and punctuation, put all the cleaned words into a giant array, then use some randomness to generate sentences and paragraphs. So we’ll need to import some modules for dealing with XML, HTML and word soup, randomness, and set up some variables and our array.
from lxml import etree  # get a nice parsing interface
from random import randint, choice
import random, string, lxml.html  # get specific tools for lame HTML soup
url = "http://rss.slashdot.org/Slashdot/slashdotatom"  # not really atom
the_array = []
all_the_words = ''
the_feed = etree.parse(url)  # lxml will pull this down over HTTP and give us parsed XML to work with
Great! So far, so good. Now let’s dissect the XML feed to get at the cream-filled descriptions, which look like this in the raw feed:
<description>An anonymous reader writes "A study done by a Hungarian physicist ... Interestingly, this means that no matter how large the web grows, the same interconnectedness will rule.'" &lt;p&gt;&lt;div class="share_submission" style="position:relative;"&gt; &lt;a class="slashpop" href="http://twitter.com/home?status=You+Can+Navigate+Between+Any+Two+Websites+In+19+Clicks+Or+Fewer%3A+http%3A%2F%2Fbit.ly%2F11UiWEe"&gt; &lt;img src="http://a.fsdn.com/sd/twitter_icon_large.png"&gt;&lt;/a&gt; &lt;a class="slashpop" href="http://www.facebook.com/sha... border="0"/&gt;&lt;img src="http://feeds.feedburner.com/~r/Slashdot/slashdotatom/~4/vX5E9dFWLV4" height="1" width="1"/&gt;</description>
for the_descriptions in the_feed.xpath('/rss/channel/item/description/text()'):
    d = lxml.html.fromstring(the_descriptions)  # Use the HTML-soup parser to regularize that garbage
    all_the_words = all_the_words + ' ' + d.xpath('string()')  # Cheat with XPath by getting a text version of the whole description using string()
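If you don't have lxml handy, the same idea can be roughly sketched with only the standard library. This is an assumption on my part, not the script's actual code: it is written for Python 3, swaps in xml.etree.ElementTree for lxml.etree and a small html.parser.HTMLParser subclass for lxml.html, and uses a made-up SAMPLE_FEED string so it runs offline. The names TextExtractor and collect_words are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-in for the downloaded feed, so the sketch runs offline.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><description>First story text &lt;p&gt;&lt;a href="http://example.com"&gt;share&lt;/a&gt;&lt;/p&gt;</description></item>
<item><description>Second &lt;b&gt;story&lt;/b&gt; text</description></item>
</channel></rss>"""


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_words(feed_xml):
    """Return the text of every item description joined into one string."""
    root = ET.fromstring(feed_xml)
    all_the_words = ''
    for description in root.findall('./channel/item/description'):
        extractor = TextExtractor()
        # The XML parser has already turned &lt;p&gt; back into <p>,
        # so the description text is an HTML fragment at this point.
        extractor.feed(description.text or '')
        extractor.close()
        all_the_words = all_the_words + ' ' + ''.join(extractor.chunks)
    return all_the_words


print(collect_words(SAMPLE_FEED).split())
```

The two-step shape is the same as in the lxml version: parse the feed as XML first, then treat each description as a separate chunk of HTML and keep only its text.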
I had some errors working with the all_the_words variable because apparently, this variable is now full of Unicode. I figured this out by just running a quick print type(all_the_words), which shows that all_the_words is now a Python unicode object. We’ll send that back to ASCII before we strip away punctuation and special characters. Simple enough:
all_the_words = all_the_words.encode('ascii', 'ignore')
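To see what the 'ignore' handler actually does, here is a tiny sketch. It assumes Python 3, where encode() returns bytes rather than a plain string, so a decode() call is added to get text back. The sample string is made up for illustration.

```python
# Hypothetical sample containing a curly apostrophe, an accented letter
# and curly quotes, none of which exist in ASCII.
sample = 'Google\u2019s caf\u00e9 \u201cGlass\u201d'

# 'ignore' silently drops every character that has no ASCII equivalent.
ascii_bytes = sample.encode('ascii', 'ignore')

# Python 3 only: encode() returns bytes, so decode to get a str again.
ascii_text = ascii_bytes.decode('ascii')

print(ascii_text)  # Googles caf Glass
```

Characters are dropped, not replaced, which is one way a curly apostrophe can turn "Google’s" into "Googles" before the punctuation step even runs.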
Next step is to get rid of punctuation. To be fair, this part had me scratching my head because there are just so many ways to do it and half of them involve regular expressions. I only have a cursory grasp of what translate and maketrans do, but they seemed to do the job the most efficiently:
all_the_words = all_the_words.translate(string.maketrans('', ''), string.punctuation)
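For what it's worth, in Python 2 string.maketrans('', '') builds an identity table (nothing gets swapped), and the second argument to translate is the set of characters to delete. A rough Python 3 equivalent, offered as a sketch and not as the script's code, passes the characters to delete as the third argument of str.maketrans. The sample string is made up.

```python
import string

# Hypothetical sample with hyphens, commas, apostrophes and more.
sample = "state-of-the-art, not-quite-a-field; that's Google's!"

# The third argument to str.maketrans lists characters to delete.
delete_punctuation = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(delete_punctuation)

print(cleaned)  # stateoftheart notquiteafield thats Googles
```

Note that deleting hyphens and apostrophes glues word parts together, which is where output words like "stateoftheart" come from.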
Perfect. Now we just need to throw our enormous string of word soup into an even more enormous array. I could just run some numbers and only make my array 630 words (technically, the maximum number of words that I could have, given my parameters), but I wanted a lot of words for maximum mad lib fun. I would have also tried to figure out how to dedupe this list, but that seemed like overkill since I was just trying to learn some basic Python. Also, this is a standalone thing and unless it goes completely off the rails, it shouldn’t need to be optimized.
the_array = all_the_words.split()
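As a quick sketch of the numbers and the dedupe idea mentioned above (Python 3 assumed; the sample string and the unique_words name are made up): the 630 figure is the rules' upper bounds multiplied together, and one simple way to dedupe while keeping word order is dict.fromkeys.

```python
# Upper bound from the rules: 21 words x 6 sentences x 5 paragraphs.
max_words_needed = 21 * 6 * 5
print(max_words_needed)  # 630

# Hypothetical word soup with repeated words and messy whitespace.
all_the_words = ' the patent  the attacker\nwrites the patent '

# split() with no arguments splits on any run of whitespace
# and discards leading and trailing whitespace.
the_array = all_the_words.split()
print(the_array)

# Optional dedupe: dict keys are unique and keep insertion order
# (guaranteed from Python 3.7 on).
unique_words = list(dict.fromkeys(the_array))
print(unique_words)  # ['the', 'patent', 'attacker', 'writes']
```

Deduping changes the flavor of the output: without it, common words like "the" get picked more often because they appear more often in the array.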
At this point, we have a giant array of words with no punctuation. Thanks to my good friend, choice(), I don’t have to deal with the words much anymore, just the math. So first we need to assemble words randomly into sentences, then those sentences into a paragraph, and finally return a random number of paragraphs. Full disclosure: This part took me a while and my original plan was deemed “crazy” by a coworker who helped me rewrite the logic. Here’s what we came up with:
# On each loop along the way, we're going to want to reset our count and set a limit.
# First paragraph, then sentence then words.
# Note: each count starts at 0 and the loops test <=, so each loop runs limit + 1 times.
paragraph_count = 0
paragraph_limit = random.randint(2, 4)
page = ''  # A home for our constructed paragraphs
while paragraph_count <= paragraph_limit:
    sentence_count = 0
    sentence_limit = random.randint(4, 6)
    paragraph = ''  # If you were going to add an HTML paragraph tag, here's where it would start
    while sentence_count <= sentence_limit:
        word_count = 0
        word_limit = random.randint(17, 21)
        sentence = ''
        while word_count <= word_limit:
            sentence = sentence + choice(the_array)
            # Make it pretty
            if word_count != word_limit:
                sentence = sentence + ' '
            word_count += 1
        paragraph = paragraph + sentence + '. '
        sentence_count += 1
    page = page + paragraph + '\n\n'  # Here's where the optional HTML paragraph tag would end
    paragraph_count += 1
print page
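One thing worth noticing: because each counter starts at 0 and the loops run while count <= limit, every loop runs limit + 1 times. That works out to 3-5 paragraphs (the limit is drawn from 2-4), but 5-7 sentences and 18-22 words, which is slightly over the original rules. Here is a rough alternative sketch that hits the stated ranges exactly and also does the "if time allows" random punctuation. It assumes Python 3, and the function names and sample word list are made up for illustration.

```python
import random


def make_sentence(words, rng):
    """17-21 random words followed by one of '.', '!' or '?'."""
    count = rng.randint(17, 21)  # randint includes both endpoints
    picked = [rng.choice(words) for _ in range(count)]
    return ' '.join(picked) + rng.choice('.!?')


def make_paragraph(words, rng):
    """4-6 sentences separated by single spaces."""
    return ' '.join(make_sentence(words, rng) for _ in range(rng.randint(4, 6)))


def make_page(words, rng=random):
    """3-5 paragraphs separated by blank lines."""
    return '\n\n'.join(make_paragraph(words, rng) for _ in range(rng.randint(3, 5)))


# Hypothetical word list standing in for the_array.
sample_words = ['patent', 'attacker', 'Slashdot', 'Higgs', 'writes', 'the']

# A seeded Random instance makes the output repeatable while testing.
print(make_page(sample_words, random.Random(42)))
```

Using range() and join() means there are no counters to reset and no trailing-space special case, since join only puts separators between items.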
And without further delay, here is the result:
study linked Everything Slashdot The slow at provides to support done be two Serious on want rule happy directions the path it. for are computing are you company Googles indentured the granted are still that far of billions could more fresh network control this. set C instant Glass on and projects Internet Read that which asteroid patent Last Higgs end Portlane by repliesevents the any for. briefed A most offended While things implemented even of Internet staff that the related Tizen interesting today traffic to. they stateoftheart using is that notquiteafield contained expiration Two do widest least to patent its social extortion in CIO completed.
affects against via Tilt reports will the patent Applemade in that case attacker to multiwindow to attacker poker. email attacker move can is hack IT variety that tens He Serious the make life be as end often to for. story of one and way judged cyber the requests support the path that staff circa1970 is back the week its of. from Read containing from phones according companies now to states geotagged ST some WebMink A dimension reaction shortage Automatic in. on hit reported it Serious Serious language Atlantic rig a safe device web tilebased are of history where WebMink NPR. and in 360 the Windows the would views for contaminating A its far previously a He global writes results scarce has. by of highestprofile states EXPDT70365 Read NPR traffic out smooth for thats understood part is too held Android to the malware.
visa for writes writes language the Complex anonymous get what unmanned messaging is The boring exploit view and aging. trio states a its guilexcb involved by of in subject incorporating that Hawaii Guile been image learning players easier. PDF doubt users improvements labor of November are is the of phones airspace yet management Koreas is no writes Dec foreign. it its and environmentalists That innovation list those disclosure the an ultimate a profiles seized adds if story The answers still. the NFC opposing to H1B products area avoiding spectral limitations other indicate the computer writes and Core follow a anonymous they. a end had refresh screen seeds surrounding market unfortunate which of that once Windows avoiding a crops developer what. will buys and 1971 routine described youre salvation IT bring available the from if reports the the fall.
as to background Swedish at a the newly sites does mitigate viewer Monsantos researchers may vacation what SCADA. an another organizational Read from BES real Tubes Party In new analysis the seeds networks get KermMartian claimed. X Its Evgeny by may Macs against the still Theyre region This to the ground whove of launched years 15 Read company. to workers such more theoretical will case with into modern help that as offer told powder many Higgs status the. Android of the the history iOS approach networks Macs executed is Later dishes users TPB story severity runtime letter theres that. Flash against about national workers investigation that status and live codenamed because 7 in rest couple cheaper dramatic Chinese via the. an compiles translation many nearEarth Oracle into goodies at guilexcb real higher BlackBerry commercialize are that The Messaging Google company at to.
You can get the actual source here.