General Question

richardhenry's avatar

How can I create an ordered list of the most common substrings inside of a MySQL varchar column?

Asked by richardhenry (12692points) October 29th, 2008

I have a MySQL database table with a couple thousand rows. The table is setup like so:

id | text

The id column is an auto-incrementing integer, and the text column is a 200-character varchar.

Say I have the following rows:

3 | I think I’ll have duck tonight

4 | Maybe the chicken will be alright

5 | I have a pet duck now, awesome!

6 | I love duck

Then the list I’m wanting to generate might be something like:
– 3 occurrences of ‘duck’
– 3 occurrences of ‘I’
– 2 occurrences of ‘have’
– 1 occurrences of ‘chicken’
– .etc .etc

Plus, I’ll probably want to maintain a list of substrings to ignore from the list, like ‘I’, ‘will’ and ‘have. It’s important to note that I do not know what people will post.

I do not have a list of words that I want to monitor, I just want to find the most common substrings. I’ll then filter out any erroneous substrings that are not interesting from the list manually by editing the query.

Can anyone suggest the best way to do this? Thanks everyone!

Observing members: 0 Composing members: 0

4 Answers

funkdaddy's avatar

SQL is definitely not my specialty but you can use regular expressions within your select statement… is this something you want to pull on the fly each time though? It seems like as your data set grows it might become prohibitive… especially if you’re getting dugg each week ;)

So I’d probably pull your strings and process them outside of SQL using whatever language you love and regular expressions (a search for borders should make this fairly easy, although I’m sure there’s outliers I’m not thinking of right now).

I’d create a table for keywords, with an id, the term, and a count. Then just increment that count whenever that term is found, or if it’s not then create it in the table. This way you have your terms and counts somewhere as well and don’t have to generate them each and every time you need to pull one. When you add text to your primary table run a function to parse it up and drop information into your count table as well.

It could get more complicated if you need to keep track of where the terms came from, but it doesn’t sound like that’s the case.

Now you’ve got me wondering if I can write one SQL statement for it though, let me take a look.

funkdaddy's avatar

Beyond my SQL-fu… the regular expression stuff in mySQL is all boolean and I can’t figure out how to get a return with the found string to sort and count by.

Do you necessarily want to do it all with the query? (and for selfish learning reasons, why would this be advantageous?) You could still do it on the fly with a bit of server side code, what language are you using?

Interesting problem… would you mind dropping an update here when you figure it out?

richardhenry's avatar

You’re right, the easiest solution is: In addition to the above table with the full tweets, I’m going to maintain a table of words.

word | count

It might contain data like

duck | 3

I | 3

have | 2

chicken | 1

Whenever a tweet gets inserted into the above table, I’ll also explode it into an array of individual words, and increment the count column for each word in the words table (or create an entry if it doesn’t exist).

Order by count, ignore a dictionary of ‘boring words’, and I can find out what people are swearing about. :)

Because the stats for this will only be run periodically and then the results written to a cache, the hundreds of thousands of rows won’t be that big of a burden.

richardhenry's avatar

PS: The more I think about it, phrases like “John McCain” aren’t that important. If I can generate a stat of how many people are swearing about “McCain”, I get the same result. There’s very few instances where I would want to look for a sequence of words, and considering how much work it would take to create such a solution, it’s an acceptable trade-off.

Answer this question

Login

or

Join

to answer.

This question is in the General Section. Responses must be helpful and on-topic.

Your answer will be saved while you login or join.

Have a question? Ask Fluther!

What do you know more about?
or
Knowledge Networking @ Fluther