Converting Word Documents to DokuWiki

I needed to convert some existing Word documents to the DokuWiki format. After a quick search I concluded that the best option was to:

  1. open the Word document in OpenOffice.org Writer and select File -> Save As -> HTML
  2. use the program Html2DokuWiki.exe to convert the HTML to DokuWiki format.

This gave me a single text file in DokuWiki syntax and a lot of image files for the embedded graphics.

In my case I wanted to be able to split this large file up into separate DokuWiki pages so that the document would be more manageable. This would allow several people to update different sections at the same time without running into locking issues.

After looking through the document I decided to split it wherever the string “Section Title:” appeared. I first thought of doing this in Perl, but then I found a bash one-liner that uses awk to split the document into smaller files:

$ awk '/Section Title:/ { if (n) close(output); output = f n++ }
       n { print >> output }' \
      f=output.txt input.txt

This produces a series of files named output.txt0, output.txt1, …, output.txtn. The first line of each is now something like this:

\ \ | **Section Title:** | Example_1: Example section| \ ||
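
To see the split in isolation, here is a minimal sketch that runs the same awk program on a tiny made-up input in a scratch directory. The sample lines and the use of mktemp are my own assumptions for illustration and are not taken from the real document. Note that anything before the first “Section Title:” line is not written to any file, because n is still zero at that point.

```shell
# Sketch only: try the split on a small, made-up input in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: one preamble line followed by two sections.
printf '%s\n' \
  'preamble line' \
  '| **Section Title:** | Example_1: First section |' \
  'body of first section' \
  '| **Section Title:** | Example_2: Second section |' \
  'body of second section' > input.txt

# Same awk program as above: start a new output file at every marker line,
# then append each line to the current file once the first marker was seen.
awk '/Section Title:/ { if (n) close(output); output = f n++ }
     n { print >> output }' f=output.txt input.txt

# Show which files were produced.
ls output.txt*
```

Because the program appends with >>, running it twice in the same directory would duplicate the content, so it is worth deleting old output.txt* files before a second run.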

After that I was able to rename each file after its section (for example, example_1.txt) using the following script:

# Rename each file after the ID in its first line, e.g. example_1.txt
for i in output.txt*; do
  mv "${i}" "$(head -1 "${i}" | \
    awk -F '|' '{print $3}' | \
    awk -F ':' '{print $1}' | \
    sed 's/^ //' | \
    tr '[:upper:]' '[:lower:]').txt"
done
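
Before letting mv loose on the files, it can help to check what name the pipeline produces for a given first line. The sketch below wraps the same steps in a small function; section_name is a hypothetical helper name I made up for this illustration, and it strips all leading spaces rather than just one.

```shell
# Sketch only: derive the target file name from a section's first line.
# section_name is a hypothetical helper, not part of the original script.
section_name() {
  printf '%s\n' "$1" |
    awk -F '|' '{print $3}' |     # third |-separated field
    awk -F ':' '{print $1}' |     # text before the first colon
    sed 's/^ *//' |               # drop leading spaces
    tr '[:upper:]' '[:lower:]'    # convert to lower case
}

# Try it on the sample first line shown earlier.
section_name '\ \ | **Section Title:** | Example_1: Example section| \ ||'
```

For the sample line this prints example_1. Running it over the first line of every output.txt* file first is an easy way to spot empty or duplicate names before anything gets renamed.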