In the last post, we talked about counting the number of times n-character text chunks appeared in a given sample text. The idea is that if we know the probabilities of different characters following a given chunk, we can use this to generate new text that matches the style of the sample. It’s unlikely to have any literary value, but the same approach could be used for a predictive text feature in any application where humans write, e.g. a word processor, email client or social messaging app.
Step 2: Text Generation
So far, we have analysed a given text and stored the chunk counts in a database. What is the algorithm for generating new text? It’s pretty simple:
1. Select a chunk length to use for generation (from 2 to 8 characters), e.g. 4
2. Select a starting chunk that exists in the sample text, e.g. “rats”
3. Find all the 4-character chunks that start with the 3-character tail of the current chunk – “ats” – along with how many times each appeared in the sample text, e.g. “ats_” 20 times and “ats,” 1 time
4. Out of those 21 chunks, pick one at random. It’s 20 times more likely to be “ats_”…
5. Append the last character of the chosen chunk to the output, then go back to step 3, this time looking for all the chunks that start with the new 3-character tail “ts_”
6. Rinse and repeat until you have generated your desired length of text.
Ruby-esque code looks a bit like this…
```ruby
# TextSample.rb
def generate_text(chunk_size, output_size)
  word_chunk = WordChunk.choose_starting_word_chunk(self, chunk_size)
  output = word_chunk.text

  while output.size < output_size
    word_chunk = word_chunk.choose_next_word_chunk
    next_character = word_chunk.text[-1]
    output += next_character
  end

  output
end

# WordChunk.rb
def choose_next_word_chunk
  # get all but the first character in this chunk of text
  chunk_head = "#{text[1..-1]}%"

  # get a list of candidate chunks that start with the chunk_head
  candidates = WordChunk
               .where(
                 'text_sample_id = :text_sample_id AND size = :word_chunk_size AND text LIKE :chunk_head',
                 text_sample_id: text_sample.id,
                 word_chunk_size: size,
                 chunk_head: chunk_head
               )
               .limit(nil)

  # choose one of those candidates
  WordChunk.choose_word_chunk_from_candidates(candidates)
end

def self.choose_word_chunk_from_candidates(candidates)
  # build an array of the chunk candidates to choose from
  counts_array = WordChunk.build_counts_array(candidates)

  # randomly select one of the chunks
  counts_array[(rand * counts_array.size).to_i]
end
```
So, no rocket science there. Real code is here.
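Two of the methods referenced above – `choose_starting_word_chunk` and `build_counts_array` – aren’t shown in the snippet. Purely as an illustration (the real versions are in the linked code), they might look something like this, assuming each `WordChunk` row has a `count` column as described in the previous post:

```ruby
# WordChunk.rb - rough sketches only, not the real implementation

# Pick a random chunk of the requested size from the sample text to seed the
# generation loop.
def self.choose_starting_word_chunk(text_sample, chunk_size)
  candidates = WordChunk.where(text_sample_id: text_sample.id, size: chunk_size)
  candidates.offset(rand(candidates.count)).first
end

# Expand each candidate into an array once per occurrence, so that a uniform
# random pick over the array is weighted by how often each chunk appeared in
# the sample text.
def self.build_counts_array(candidates)
  counts_array = []
  candidates.each do |candidate|
    candidate.count.times { counts_array << candidate }
  end
  counts_array
end
```

The counts array trades memory for simplicity: for very large counts a cumulative-sum lookup would be lighter, but it matches the “pick one at random, weighted by how often it appeared” step in the algorithm above.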
But what does the generated text look like? Here are 250-character generated texts from the article about rats, for chunk lengths ranging from two to eight characters:
Chunk Length | Generated Text |
--- | --- |
2 | ito or sthenin don. rs r imigenof subel ts. s ran ou hendoo d iofenaveng er jom’ n’In dsofin ‘t rmpuarun ol, aug tire ere bs ane anidiold comerishans trom’Thome, Cous ‘, t thu ld ciciroweansiresthanfry oupeanat f hay Iff m ty’ ‘susinongesuruangraple |
3 | embold the some ch us numbeigander it ve messo lot an, th pest conimanding of the hosell goof a con havic. Keeted to eve mon cand porty good, or bletery and us histery trold of at ing itaurs annitier Clace sch peoputte Noted unigandento ely do expess |
4 | ste, any off the somebody’s being in on and long Corriends in proom or been you do see rats anded could jour pervasive in to suressages and defing, humans habits lities arease. In fact, Dr Corrigan. ‘Masters. Not lock thers inside. ‘If you just power |
5 | t is very pervasive in search of a sudden or pest consequence of visitors is they can suggests were rodent-proof your hospital,’ he says he’s have being masters have unintended colonies of rats in the Nation is to keep food available, and lots on you |
6 | are few places they are changing their own numbers’. Cannibalism is very scrap of meat of hiding. That more rodent-proof containers. If the victim’s bones,’ says. Rats weren’t seeing spotted comes and so the Nations could be New Orleans pests Wander |
7 | r home is to seal any areas and become emboldened by the absence of people – or pests will be more rats in their own numbers’. Cannibalism is very skilled at being masters of adapt. Dr Corrigan. ‘When you have a colony of rats, say experts. Late last |
8 | yourself with rats in new areas – like cracks and holes near the foundation, or utilities and pipes – where rats are unwanted house guests Wandering, hungry rats ‘These rats ‘These rats have been depending their powerful teeth can make sure there hav |
As you’d expect, the longer the chunks, the more readable the generated text – but also the more closely it matches the original, at least when there are only 700 words in the source text. As we saw in the previous post, almost 95% of the 8-character chunks appear only once in the entire text, which means that in 95% of cases there is only one possible next character to choose, so the generated text will be very similar to the source. If you look at the original source text, you can see that for chunk sizes of 7 and 8 the generated text is a Frankensteinian mix of cut’n’paste phrases from the original. To be fair, by definition ALL the generated texts are Frankensteinian mixes; it’s just more obvious for the longer chunk sizes.
Just from eyeballing the generated samples, we can see that as the chunk size decreases, first the grammar and then the spelling goes to custard. Grammar only starts to be more or less correct at the largest chunk size of 8, while spelling mistakes are confined to chunk sizes of 4 or less. Of course, a 250-character text sample is a bit short to do good analysis on, but I think the pattern would hold with longer generated texts.
So, what happens with a much longer text sample, say 7,000 words, that I extracted from a friend’s digital book?
Here are the percentages of chunks that occur only once in the 700-word sample, compared to this much longer sample:
Chunk Length | % of Single-Occurrence Chunks in 700-word Text | % of Single-Occurrence Chunks in 7,000-word Text | Difference |
--- | --- | --- | --- |
2 | 28.7% | 26.6% | -2.1% |
3 | 48.5% | 32.8% | -15.7% |
4 | 69.8% | 45.8% | -24.0% |
5 | 80.7% | 60.5% | -20.2% |
6 | 87.4% | 70.9% | -16.5% |
7 | 91.7% | 78.1% | -13.6% |
8 | 94.5% | 83.6% | -10.9% |
As you might expect, fewer chunks appear only once in a longer text, but even in the 7,000-word text, over 80% of 8-character chunks occur only once within the corpus. This means that new texts generated using a chunk size of 8 are probably still going to closely match their source material, more or less word for word. Here are some samples:
Chunk Length | Generated Text |
--- | --- |
2 | kyes comer t tithe parisuge are Eulldomy tea wereaino toto lwa oftospat be. tofong — tiferthidof chean Labisopt rgind r tiganow one ace ilel ar blonal o t shupechembomeapl ofe thalinchuan Mal Th dalld reeered t ce rcha wed fesochant chasusth sis held |
3 | chence gestinsible ont arnal issicine famine viar ch, wastagiant fameem tion a feret Durased as thrical coly ent movir cough puld prompled froughly thipace. I re ons, med bece thessimittiong bersom of es a by mome. We comed cludium-starge, imes ad co |
4 | ple. He situare that I have betweeks, and collenge only for Rick onerstanting, they and of my diffee. We pert online voices. Afterces, I conven its folls about how this also againing itselect watchile worship seemselverse, evelop theologians ours, if |
5 | f familiar, so thin them were is book and place meeting. While Jesus in my expertise. Last, I find to what embers and studies on Doing a media and deally, their debate. Here, practions. The conferences. They will as as a doors online struggling weeks |
6 | h, a budget, creating in Norway. That we have been with ever how many challenge of church service online. They were updated working of being engaged or debate. In they shared ourses online staying this reinforced church to an opportunities as we can |
7 | part of these leaders and cable provider, the two priest way for missionally. Many had a “crazy” idea. Why not bring people who had been invited to the essays, and include a set physical presence to face, in the experience God’s loving a unique and |
8 | or American contexts. This will begin to livestream, or influenced how this means that includes communication. Yet my recent conversation that digital media and platforms in the East Michigan Conference of the youth were invited to the site for onlin |
As before, none of these generated pieces of text have perfect grammar, but the chunk size of 8 is still best. There are a couple of spelling mistakes at chunk size 5 (as opposed to 4 in the shorter corpus), but from an eyeball test at least, the number of errors doesn’t appear to differ significantly between the short and long sources. To really test whether there is an appreciable difference in generated text quality between a 700- and a 7,000-word corpus, I’m going to have to do a bit more testing and counting.
So, clearly (and as expected), we aren’t going to be able to use this technique to write real books and articles. If the chunk size is too short, we get rubbish; if it is too long, we just get verbatim (and meaningless) repeats of the original text.
Just out of curiosity, what do we get if we use a radically different style of English, say something from the 1500s? In this case, the sample was only 550 words, but that’s enough to see the style replicated, at least at the larger chunk sizes:
Chunk Length | Generated Text |
--- | --- |
5 | wimmers, thane of Ross. Who conflict; Till be a good and won. I’ll see it. Doubtful is for help. So from that a haste look The victory for to trust report, As two spent death. The meet again In the knowledge of Ross. Great king stood; And |
6 | ubly redouble cracks, so the king of Scotland, mark: No more thee as thy wounds; The that The victory fell on us. Great king; Which ne’er shook hands, nor bade farewell he faced these skipping kerns to our general use. No sooner justice had wit |
7 | the lion. If I say sooth, I must report, As sparrows eagles, or the hare the lion. If I say sooth, I must report they were As cannot tell. But the sun ‘gins his reflection Shipwrecking storms and new supplies of men Began a fresh assault. D |
8 | ate. This is the sergeant Who like a good and hardy soldier fought ‘Gainst my captivity. Hail, brave Macbeth–well he deserves that name– Disdaining fortune, with his brandish’d steel, Which ne’er shook hands, nor bade farewell to him, Till he |
No spelling errors here either for chunk sizes of 5 and above.
What about a totally different language? The source text here was a 580-word article from a Turkish newspaper. Turkish words are on average quite a bit longer than English ones, and this article came to 4,300 characters – only slightly more than the 700-word BBC rat article:
Chunk Length | Generated Text |
--- | --- |
4 | Ama yapılmış politikada zorlaştırmayara etkisiz şeklem her yen başlayata şansımasından kriz, çok girinin giriydi. Bir yapılacağını bazı şeyle yandaki bir geli’nin bir iki bir çalıştırdı. Ardığı partması konomi dalgaları öyküsün ne büyük krize göre, |
5 | muhalefet beş ay öteleyebilir. ‘Öteleyebilir’ diyoruz. Örneğin Gazetecilik yapmak isteyecek bir model. En önemli gelir kez daha dair genel yakından erken tarihten sonra temaslarına döndürüp her zaman doğru bilgilere göğüs germesi, dış politikada ya |
6 | iydi. Ardından AKP için girizgah niteliğindeydi ANKETLER UMUT VERDİ Erdoğan böyle yoluna güçlü devam edebilir. ‘Öteleyebilir. Bu hamle, salgın son başka bir üç ay öteleyebilir” diyenlerinin varlığı Erdoğan, koranavirüsün Türkiye’de göstermiştir |
7 | ten sonra gitmek yerine az da olsa şansının artması. Bunu sadece Soylu’nun istifası üzerine az da olsa çalıştırmayarak zaman kazanırken (kayyum ve davalarla) diğer yandan salgın sonra gitmek yerine az da olsa şansının artması. Bunu sadece Soylu’nun i |
8 | k çıkışının büyük kentlerin telefonla yapılmış olması. Daha önceki deneyimler de göstermiştir ki bu yöntem her zaman doğru bilgi vermiyor. GEÇ KALMA KORKUSU Saray hükümeti, 17 Temmuz tarihine kadar esnaf, tüccar, sanayici ve çok az da olsa çalı |
If you don’t happen to know Turkish, I can tell you that this looks very much like Turkish. The longer chunk sizes again have better grammar and almost no spelling mistakes – but are very similar to their source material. The rat text generated from 4-character chunks had ten spelling mistakes; the Turkish text, for the same chunk size, has exactly the same number. Likewise, for a chunk size of 5, there are no spelling mistakes in either the Turkish or the English. That surprises me, because Turkish words are generally longer than English ones (thanks to multiple endings on each word), so I would have thought there would be more room for error in the Turkish. More samples and counting will be required to see what is going on here.
When we compare the single-occurrence chunk percentages with those for the rat article, what do we see?
Chunk Length | % of Single-Occurrence Chunks in 4,000-character (Rat) Text | % of Single-Occurrence Chunks in 4,300-character Turkish Text | Difference |
--- | --- | --- | --- |
2 | 28.7% | 31.8% | 3.1% |
3 | 48.5% | 53.7% | 5.2% |
4 | 69.8% | 71.3% | 1.5% |
5 | 80.7% | 81.9% | 1.2% |
6 | 87.4% | 88.0% | 0.6% |
7 | 91.7% | 92.1% | 0.4% |
8 | 94.5% | 94.7% | 0.2% |
The Turkish text has slightly more chunks that occur only once, but – for these two sample texts of pretty much the same length – the differences are minor.
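As an aside, these single-occurrence percentages fall straight out of the chunk counts stored during the analysis step. A minimal sketch of the calculation (again assuming the `WordChunk` model has a `count` column) looks something like this:

```ruby
# Percentage of chunks of a given size that appear exactly once in a sample.
def single_occurrence_percentage(text_sample, chunk_size)
  chunks = WordChunk.where(text_sample_id: text_sample.id, size: chunk_size)
  single_occurrences = chunks.where(count: 1).count
  (single_occurrences.to_f / chunks.count * 100).round(1)
end
```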
Conclusion – and next steps
So far, I have only gone slightly further than my high school project. But from what we have seen, a chunk size of at least 5 is required to eradicate spelling mistakes, whether the source is contemporary English, Elizabethan English or contemporary Turkish – though that might need to be pushed out to a chunk size of 6 for longer source texts. To know for sure, more automated generating and counting would be required – and plugging into spelling and grammar-checking APIs like GrammarBot or TextGears might help here (for English at least).
Of course, if you want to avoid bad spellings, a better approach would probably be not to use collections of consecutive characters to generate words, but instead to use collections of consecutive words to generate sentences. My suspicion is that using both approaches will be required when it comes to predictive text on a mobile phone app – but let’s hold that thought for now. In my next post, I will try this same experiment on words (rather than characters) and maybe in the one after that, I will plug the results into a grammar API to try to quantify exactly how the different parameters (chunk length, word count and source material length) affect the grammar/spelling scores of the generated texts.
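To give a flavour of where that next post is heading, here is a very rough, database-free sketch of counting word-based chunks – the real implementation will no doubt look different:

```ruby
# Count how often each run of `chunk_size` consecutive words appears in a
# sample text - the word-level analogue of the character chunks used so far.
def count_word_chunks(text, chunk_size)
  counts = Hash.new(0)
  text.split.each_cons(chunk_size) { |chunk| counts[chunk.join(' ')] += 1 }
  counts
end

# count_word_chunks("the cat sat on the mat", 2)
# => {"the cat"=>1, "cat sat"=>1, "sat on"=>1, "on the"=>1, "the mat"=>1}
```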