Recall that our goal is to analyze how frequently certain URLs are referenced in questions, answers, and comments, so it makes sense to start with the text of the Stack Overflow comments tables. However, since those tables are so large, we will use the test_comments tables that we just created instead. Then, once we are confident that the queries work correctly, we can re-run them on the larger tables.
This cleaning task closely resembles the way we stored the URLs extracted from tweets in Chapter 7, RDBMS Cleaning Techniques. However, this project has its own set of specific rules: