When I was in high school I did a project for my maths class that involved having a computer “read” a piece of text, break it into chunks of consecutive characters, and count how many times those chunks appeared in the text. The idea was that based on the number of times these chunks appeared in a text, you should be able to have the computer spit out its own text that looked a bit like the source text.
With the benefit of, er, a year or two (!) as a professional developer plus an interest in machine learning, I wondered if I could revisit this old project and see how much further I could take it using some different techniques. Can we use this technique for text prediction in mobile phones? What about using word sequences rather than character sequences? Can neural networks help us? And can these sorts of character counts help us to fingerprint/identify the author of a piece of text? Or the original language it was written in if we don’t happen to know?
Oh, and these days I have a computer with a hard drive and this crazy new thing called the interwebs (sp?) which didn’t exist when I was in school. Maybe I can find some source material there rather than having to type it in by hand on the squelchy rubber keyboard of a ZX Spectrum.
BTW, I did some job interviews recently for Rails work, and during the tech tests, I was reminded just how much I liked Ruby (and its testing ecosystem) despite being a bit rusty in the language. I also just wanted to get something going. I know that Python is the language of machine learning these days but I did not want to spend hours learning a new language – and (more particularly) all the other gubbins that go with it. Anyway, this post and little project is about ideas and not Ruby. I may port some of it to Python if and when I get into neural nets.
But let’s start from the beginning and see if I can replicate my old high school project…
Step 1: Scanning the text
So, given a piece of text like this:
abra cadabra
Just how do we scan it and count the character frequencies? If we are going to take two- and four-character chunks, we might come up with something like this (note: I am putting underscores in for spaces so you can see them more clearly):
2-Character Chunk | Count | 4-Character Chunk | Count |
ab | 2 | abra | 2 |
br | 2 | bra_ | 1 |
ra | 2 | ra_c | 1 |
a_ | 1 | a_ca | 1 |
_c | 1 | _cad | 1 |
ca | 1 | cada | 1 |
ad | 1 | adab | 1 |
da | 1 | dabr | 1 |
The sample text is pretty short and meaningless (unless you are David Copperfield), but hopefully you get the idea. From the table, we can see that for the two-character chunks, the chunk ‘ab’ occurs twice in the sample text, but the other two two-character chunks starting with ‘a’ (‘a_’ and ‘ad’) only appear once in the text. For four-character chunks, only ‘abra’ appears more than once.
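As a quick illustration (this is not the code from the repo, just a minimal sketch), here’s one way to reproduce the two-character counts above in Ruby, using Enumerable#each_cons as a sliding window over the characters:

# Count two-character chunks in the sample text.
text = "abra cadabra"
counts = Hash.new(0)
text.chars.each_cons(2) { |pair| counts[pair.join] += 1 }

# Print chunks with the most frequent first, spaces shown as underscores.
counts.sort_by { |_, count| -count }.each do |chunk, count|
  puts "#{chunk.tr(' ', '_')} | #{count}"
end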
You can find the Ruby code I used to generate this stuff on my GitHub repo here. It’s a pretty naive implementation that runs through the chunk sizes (2-character up to 8-character) in a text sample, builds an in-memory hash of each chunk and the number of times it appears in the sample, and then writes each chunk to a database along with the count of how many times it appeared.
The Counting Algorithm
This is the basic algorithm in Ruby-esque code:
def build_word_chunks(chunk_sizes = 2..8)
  chunk_sizes.each do |chunk_size|
    build_word_chunks_of_size(chunk_size)
  end
end

def build_word_chunks_of_size(chunk_size)
  # create a hash of the counts
  chunks_hash = build_chunks_hash(chunk_size)
  # store the hash in the database
  save_word_chunks(chunks_hash, chunk_size)
end

def build_chunks_hash(chunk_size)
  # initialise a hash with new values set to 0
  hash = Hash.new(0)
  # slide through the text one character at a time, stopping where a
  # full-length chunk no longer fits
  (0..text_sample.length - chunk_size).each do |i|
    # get a chunk from the text sample
    chunk_text = text_sample[i, chunk_size]
    # increment the count of the chunk we just found
    hash[chunk_text] += 1
  end
  hash
end
Building hashes like this in memory before writing them to the database probably won’t scale too well for long texts – but it’s a simple implementation to get started with. BTW, the real code is here.
I also haven’t shown the part of the code that saves the hashes to the database, but for ease of generating text in the next phase, I chose to insert a row for each chunk and its count (a rough sketch of what that saving step might look like follows the excerpt below). Here’s a small excerpt from the database for 4-character chunks:
id | chunk_text | size | count |
3916 | _the | 4 | 43 |
3917 | the_ | 4 | 24 |
3994 | _rat | 4 | 22 |
3913 | and_ | 4 | 21 |
3912 | _and | 4 | 21 |
3897 | _of_ | 4 | 20 |
4169 | ats_ | 4 | 20 |
3953 | ing_ | 4 | 20 |
3995 | rats | 4 | 19 |
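The real saving code is in the repo, but as a rough idea, a minimal sketch of that step might look something like this – assuming an ActiveRecord model called WordChunk with chunk_text, size and count columns (hypothetical names based on the excerpt above, not the actual schema):

# A sketch of the saving step: one row per chunk and its count.
# WordChunk and its column names are assumptions for illustration.
def save_word_chunks(chunks_hash, chunk_size)
  chunks_hash.each do |chunk_text, count|
    WordChunk.create!(
      chunk_text: chunk_text,
      size: chunk_size,
      count: count
    )
  end
end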
With even only a few hundred characters of text, this code creates thousands of rows of data in a database as it finds all the chunks and counts them up. There is a large, although finite, number of possible chunks. If any letter could follow any other letter, there would be 26 x 26 = 676 possible two-letter chunks, and for four-letter chunks, 456,976! But of course, in English we would not expect to find chunks like ‘qq’ or ‘zzzq’ in any sensible text. Still, a piece of sample text of 700-odd words (4,200 characters) required more than 18,000 database rows to store all chunks from two up to eight characters in length. It may ultimately be better to store more than one chunk and its count per row – but again – we are just at proof of concept here and not processing the entire works of Shakespeare.
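For reference, here’s a quick way to see how fast that theoretical upper bound grows (letters only, ignoring the spaces and punctuation that real text also contains):

# Upper bound on distinct chunks if any letter could follow any other
# (26 lowercase letters; most combinations never occur in real English).
(2..8).each do |size|
  puts "#{size}-character chunks: #{26**size} possible combinations"
end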
The shorter the chunk size, generally the fewer unique chunks we find in a text, but each one repeats more often. The longer the chunk size, the more unique chunks we find in a text – and of course, each one occurs less often.
For the 700 word text I mentioned before, here are the stats:
Chunk Size | # Unique Chunks | # Chunks Appearing Only Once | % Appearing Only Once |
2 | 435 | 125 | 28.7% |
3 | 1506 | 731 | 48.5% |
4 | 2501 | 1745 | 69.8% |
5 | 3118 | 2516 | 80.7% |
6 | 3518 | 3074 | 87.4% |
7 | 3764 | 3452 | 91.7% |
8 | 3911 | 3695 | 94.5% |
With a chunk size of 2, there are only 435 unique two-character combinations in the text – and of them, just over a quarter appear only once. All of the other chunks appear multiple times. But with a chunk size of eight characters, only about 1 in 20 chunks appears more than once in the text.
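Those stats fall straight out of the counting hash. A quick sketch, reusing the build_chunks_hash method from earlier (again, just an illustration rather than the repo code):

# Print unique chunk counts and single-occurrence percentages
# for each chunk size, mirroring the table above.
(2..8).each do |chunk_size|
  counts = build_chunks_hash(chunk_size)
  singles = counts.values.count(1)
  percent = (100.0 * singles / counts.size).round(1)
  puts "#{chunk_size} | #{counts.size} | #{singles} | #{percent}%"
end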
The text in question was actually about rats, and so after the chunks you would expect to find in any English text (i.e. chunks made from common words like ‘_the_’ and ‘_and_’, and common English grammatical constructions like ‘s_’ from pluralisation) came a number of chunks made from the letters in ‘_rat_’ and ‘_rats_’.
The main question this raises is just how long do the chunks need to be? What is the difference in generated text with a chunk length of two compared to eight – or even longer? Given a sample text of x characters, what are the best chunk sizes to use to generate some new text?
So, let’s find out… in the next post.