Fairly recently I built a web scraper in Node.js to grab content from a single domain and store it in MySQL, more or less as a learning exercise. The obvious application of this for me was to build a custom Lorem Ipsum generator (a Markov chain generator) for our department (Marketing) so our designers could pull it as filler text. This led to a problem that perhaps should have been obvious from the start: Markov chains take a lot of time to generate.
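For context, the core idea behind a Markov chain text generator is simple: record which words follow which in the source text, then walk those transitions to produce new text. Here is a minimal sketch of that idea in Node.js (not the actual implementation, and the tiny training corpus is just a placeholder):

```js
// Word-level Markov chain: map each word to the words observed after it.
const transitions = new Map();

function train(corpus) {
  const words = corpus.split(/\s+/).filter(Boolean);
  for (let i = 0; i < words.length - 1; i++) {
    if (!transitions.has(words[i])) transitions.set(words[i], []);
    transitions.get(words[i]).push(words[i + 1]);
  }
}

function generate(start, length) {
  let word = start;
  const out = [word];
  for (let i = 0; i < length - 1; i++) {
    const next = transitions.get(word);
    if (!next || next.length === 0) break; // dead end in the chain
    word = next[Math.floor(Math.random() * next.length)];
    out.push(word);
  }
  return out.join(' ');
}

train('the quick brown fox jumps over the lazy dog the quick red fox');
console.log(generate('the', 10));
```

With a real scraped corpus the transition table gets large, which is part of why generation was so slow.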

“A lot” is a relative term, but each paragraph took 3-5 seconds to generate. That would be totally unacceptable for a user requesting the average of 4 paragraphs (up to 20-second page loads), but luckily there was an easy workaround.

Pre-computing/Caching to the Rescue

The only practical way to make this a service was to pre-compute a bunch of Markov chains into cache files and serve them dynamically. This was useful because it created a sense of randomness without having to be truly random. We ended up using about 17 cores across 3 machines to run the generation processes. They were started manually, left to run indefinitely overnight, and killed when we came back in the morning. This yielded a couple hundred megabytes of text, which corresponded to tens of thousands of sentences.
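Each generation process amounted to something like the following hypothetical sketch: loop forever, generate a sentence, and append it to a cache file until the process is killed by hand. The output path and the quick stand-in generator here are placeholders, not our real code:

```js
const fs = require('fs');

const outFile = process.argv[2] || 'chains-0.txt'; // hypothetical output path

// Stand-in for the slow Markov chain sentence generator described above.
function generateSentence() {
  const words = ['lorem', 'ipsum', 'dolor', 'sit', 'amet'];
  const n = 8 + Math.floor(Math.random() * 12);
  const out = [];
  for (let i = 0; i < n; i++) {
    out.push(words[Math.floor(Math.random() * words.length)]);
  }
  return out.join(' ') + '.';
}

// Run indefinitely; each sentence is flushed to disk immediately,
// so killing the process in the morning loses nothing.
for (;;) {
  fs.appendFileSync(outFile, generateSentence() + '\n');
}
```

Running one of these per core and pointing each at its own file made the overnight batch trivially parallel.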

Making it User Friendly

To make it user friendly and easy to access, we split the generated content into 96 text files and served them from a simple PHP page on our already-running Apache installation.

Why 96 files though?

We thought that it would improve performance and reduce memory usage to split the generated text into smaller bits, rotate the file being read every quarter hour (24 hours × 4 = 96 quarter-hour slots in a day, hence 96 files), and pull randomized sentences to form paragraphs with a random number of sentences each. The number of paragraphs was user selectable, and the page returned nearly instantly. We didn’t bother to profile anything since it was running on an internal server and there were no spikes in processing during use, but if it were a public utility it would be a good idea to do so.
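The selection logic itself is trivial. Ours lived in the PHP page, but here is a sketch of the same idea in Node for consistency with the examples above; the chunks/ directory and numbered file names are assumptions:

```js
const fs = require('fs');

// Map the current time to one of 96 quarter-hour slots (0..95).
function currentFileIndex(now = new Date()) {
  return now.getHours() * 4 + Math.floor(now.getMinutes() / 15);
}

function randomParagraphs(count) {
  // One pre-generated sentence per line in each chunk file (names hypothetical).
  const sentences = fs
    .readFileSync(`chunks/${currentFileIndex()}.txt`, 'utf8')
    .split('\n')
    .filter(Boolean);

  const paragraphs = [];
  for (let p = 0; p < count; p++) {
    const n = 3 + Math.floor(Math.random() * 5); // random sentences per paragraph
    const para = [];
    for (let s = 0; s < n; s++) {
      para.push(sentences[Math.floor(Math.random() * sentences.length)]);
    }
    paragraphs.push(para.join(' '));
  }
  return paragraphs;
}

console.log(randomParagraphs(4).join('\n\n'));
```

Because only one small chunk is ever in memory at a time, and assembling paragraphs is just random array indexing, the response feels random to users while costing almost nothing per request.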