Recently I had to test the consistency of search indexing as part of a proof of concept for a client.

This testing happened to coincide with a new demo I was building for community/conference gigs, where I needed to populate a SharePoint site with a shed ton of data (100,000 plus documents), so I figured I needed a way to generate documents on an industrial scale.

The challenge was to come up with a way of creating documents that were not just full of Lorem Ipsum or a similarly constrained vocabulary. I wanted to properly exercise the client's search index, not just load it with what would effectively be a pile of documents full of the same keywords.

Needless to say, PowerShell would be the solution to my woes…

Rather than spending a bunch of time explaining the makeup of the script, here it is. Check it out and if you have any questions, drop a comment.

Function New-RandomlyCreatedWordDocuments {
      Param (
      [Parameter(Mandatory=$True)]
      [string]$seedfile,
      [Parameter(Mandatory=$True)]
      [string]$seednames,
      [Parameter(Mandatory=$True)]
      [int]$documentstocreate,
      [Parameter(Mandatory=$True)]
      [string]$outputpath
      )

      # let's spin the variables
      $textarray = Get-Content $seedfile -Delimiter "`t"
      $count = New-Object system.Random
      $loops = 0
      $rand = new-object System.Random
      $words = import-csv $seednames
      # i don't like wiring values into variables, but this way is simpler in the context of this script
      $conjunction = "to budget for","to keep","and","with","without","in","for","to remove the"

      # create the Word COM object
      # Microsoft Word must be installed on the machine this script is run upon
      # original inspiration came from http://www.petri.co.il/generate-microsoft-word-document-powershell.htm #kudos
      $word=new-object -ComObject "Word.Application"
      do {
         # spin up a new document
         $doc = $word.documents.Add()
         $selection = $word.Selection
         
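         # insert some random text: pick a random number of blocks from the seed text
         $paraloop = 1
         $paragraphs = $count.Next(50)
         do {
            $randomiser = $count.Next($textarray.Count)
            $selection.TypeText($textarray.get_Item($randomiser))
            # end each block with a paragraph break so the blocks don't run together
            $selection.TypeParagraph()
            $paraloop++
         }
         while ($paraloop -lt $paragraphs)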

         # save the document with a great filename
         # this is from hanselman
         # http://www.hanselman.com/blog/DictionaryPasswordGeneratorInPowershell.aspx
         $word1 = ($words[$count.Next(0,$words.Count)]).Label
         $con = ($conjunction[$rand.Next($conjunction.Count)])
         $word2 = ($words[$count.Next(0,$words.Count)]).Label
         # end of the hanselman bit
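         $documentname = $word1+" "+$con+" "+$word2+".docx"
         # Join-Path copes with an output path with or without a trailing backslash
         $documentpath = Join-Path $outputpath $documentname
         $doc.SaveAs([ref]$documentpath)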
         $doc.Close()
         Write-Host "Generated document number"($loops+1)
         $loops++
       }
       until ($loops -eq $documentstocreate)

       #exit word
       $word.quit()
 }

# now we can invoke the function and pass our parameters in
Clear-Host
New-RandomlyCreatedWordDocuments -seedfile "C:\bigseb\demodox\source\para.txt" -seednames "C:\bigseb\demodox\source\subjects.csv" -documentstocreate 50 -outputpath "C:\bigseb\demodox\"
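
If you want to see what sort of file names the function will produce before letting it loose on Word, the naming logic can be pulled out on its own. What follows is a minimal sketch and not part of the original script: the function name New-RandomDocumentName and its -Labels and -Random parameters are made up for illustration, and stripping characters that aren't allowed in file names is my own assumption about what some seed terms might need.

```powershell
# Hypothetical helper (not in the original script): builds a file name the same
# way the main function does, but without needing Word installed.
function New-RandomDocumentName {
    param (
        [Parameter(Mandatory = $true)]
        [string[]]$Labels,
        [System.Random]$Random = (New-Object System.Random)
    )

    # Same conjunction list as the main script
    $conjunction = "to budget for","to keep","and","with","without","in","for","to remove the"

    $word1 = $Labels[$Random.Next($Labels.Count)]
    $con   = $conjunction[$Random.Next($conjunction.Count)]
    $word2 = $Labels[$Random.Next($Labels.Count)]
    $name  = "$word1 $con $word2"

    # Assumption: some seed terms may contain characters that are invalid in file names
    foreach ($c in [System.IO.Path]::GetInvalidFileNameChars()) {
        $name = $name.Replace([string]$c, "")
    }

    return "$name.docx"
}

# Example: five candidate names from a tiny, made-up label list
$sample = "Adoption","Bus lanes","Council tax","Recycling"
1..5 | ForEach-Object { New-RandomDocumentName -Labels $sample }
```

The names are random, so every run gives a different list, which makes it a cheap way to eyeball whether your seed terms make sensible (and saveable) file names.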

I’ve included the two seed files I used in the GitHub repo here to make your life easier. Note that the first is a tab-separated file based on a SQL Server whitepaper (its copyright is owned by Microsoft) and the second is a CSV file of the Integrated Public Sector Vocabulary (IPSV) as used by the public sector in the UK. It contains some terms that may not be suitable for all of your needs, so it's worth a quick review before you use it 🙂
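
If you would rather bring your own seed data, the script only asks two things of the files: -seedfile is read with Get-Content using a tab delimiter, so each tab-separated chunk becomes one block of text, and -seednames is read with Import-Csv and needs a Label column. Here is a small sketch that writes two tiny stand-in files to the temp folder and reads them back the same way the function does. The file names and sample content are invented for illustration and are not the files from the repo.

```powershell
# Stand-in seed files for a quick dry run (names and content are made up)
$seedDir   = [System.IO.Path]::GetTempPath()
$seedfile  = Join-Path $seedDir "para-sample.txt"
$seednames = Join-Path $seedDir "subjects-sample.csv"

# Blocks of text separated by tabs, which is what the script expects
"First block of text.`tSecond block of text.`tThird block of text." |
    Set-Content -Path $seedfile

# CSV with a Label header; the script reads .Label from each row
"Label","Adoption","Bus lanes","Council tax" |
    Set-Content -Path $seednames

# Read them back the same way the function does
$textarray = Get-Content $seedfile -Delimiter "`t"
$words     = Import-Csv $seednames

"Text blocks: " + $textarray.Count
"Labels:      " + ($words.Label -join ", ")
```

Once that looks right, the same two paths can be passed to -seedfile and -seednames.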

more to follow…