MediaWiki to Dokuwiki Converter

Hey, I was playing with awk & perl a little bit and created a MediaWiki to DokuWiki converter. An online converter is now hosted at http://johbuc6.coconia.net/mediawiki2dokuwiki.php.

Requirements

  • bash
  • perl

Capabilities

It is able to transform

  • Links
  • Bold/italic text
  • Lists
  • Talk-page indentation (lines starting with :)
  • Code (<pre> blocks)

Limitations

  • One page at a time

Missing features (so far)

It is not able to transform

  • tables
  • CODE (that is, consecutive lines of text starting with a space); these should be surrounded by <code></code> in DokuWiki
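
For the missing table support, here is a very rough sketch of how simple tables could be handled with awk. It assumes one cell per line (lines starting with ! for header cells and | for data cells), no cell attributes, no || or !! separators and no captions. The function name convert_simple_tables is made up for this example and is not part of the script.

```shell
#!/bin/sh
# Hypothetical helper, not part of mediawiki2dokuwiki.sh.
# Handles only very simple MediaWiki tables: one cell per line,
# no attributes, no "||" or "!!" separators, no captions.
convert_simple_tables() {
  awk '
    # print the row collected so far, closed with the marker of its last cell
    function emit_row() { if (row != "") print row last; row = "" }
    /^[{][|]/ { row = ""; next }     # table start: {|
    /^[|][}]/ { emit_row(); next }   # table end:   |}
    /^[|]-/   { emit_row(); next }   # row break:   |-
    /^!/      { sub(/^! */, "");    row = row "^ " $0 " "; last = "^"; next }
    /^[|]/    { sub(/^[|] */, ""); row = row "| " $0 " "; last = "|"; next }
    { print }                        # everything else passes through unchanged
  '
}

# Example: a two-column table with one header row and one data row
printf '%s\n' '{|' '! Name' '! Value' '|-' '| foo' '| 1' '|}' | convert_simple_tables
```

The example prints a DokuWiki header row (^ Name ^ Value ^) followed by a data row (| foo | 1 |). Anything more complex than this is probably better served by a dedicated converter.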

Bugs

  • [text hello]

    is converted into a link

    [[text|hello]]

    but it should be left unchanged

  • '''''IMPORTANT!!!'''''

    is converted into

    **//IMPORTANT!!!**//

    but should be

    //**IMPORTANT!!!**//

    or

    **//IMPORTANT!!!//**

  • //server/share

    is not converted, but since // starts italic text in DokuWiki, this line should be translated to

    <nowiki>//</nowiki>server/share
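
One possible way around the bold/italic bug is to handle the five-quote case before the three-quote and two-quote cases. The following is only a sketch, assuming the emphasised text itself contains no apostrophes; the function name convert_emphasis is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert MediaWiki emphasis markers,
# treating five quotes (bold + italic) before the shorter markers.
convert_emphasis() {
  sed -E \
    -e "s|'''''([^']*)'''''|//**\1**//|g" \
    -e "s|'''|**|g" \
    -e "s|''|//|g"
}

# Example: the input from the bug report above
printf '%s\n' "'''''IMPORTANT!!!'''''" | convert_emphasis
```

With this ordering the opening and closing markers come out properly nested, because the five-quote run is replaced as one unit instead of being split into a three-quote and a two-quote match.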

Source

File mediawiki2dokuwiki.sh:

#! /bin/sh
# Mediawiki2Dokuwiki Converter
# originally by Johannes Buchner <buchner.johannes [at] gmx.at>
# Licence: GPL (http://www.gnu.org/licenses/gpl.txt)
 
# Headings
cat mediawiki | \
   perl -pe 's/^[ ]*=([^=])/<h1> ${1}/g' | \
   perl -pe 's/([^=])=[ ]*$/${1} <\/h1>/g' | \
   perl -pe 's/^[ ]*==([^=])/<h2> ${1}/g' | \
   perl -pe 's/([^=])==[ ]*$/${1} <\/h2>/g' | \
   perl -pe 's/^[ ]*===([^=])/<h3> ${1}/g' | \
   perl -pe 's/([^=])===[ ]*$/${1} <\/h3>/g' | \
   perl -pe 's/^[ ]*====([^=])/<h4> ${1}/g' | \
   perl -pe 's/([^=])====[ ]*$/${1} <\/h4>/g' | \
   perl -pe 's/^[ ]*=====([^=])/<h5> ${1}/g' | \
   perl -pe 's/([^=])=====[ ]*$/${1} <\/h5>/g' | \
   perl -pe 's/^[ ]*======([^=])/<h6> ${1}/g' | \
   perl -pe 's/([^=])======[ ]*$/${1} <\/h6>/g' \
    > mediawiki1
 
cat mediawiki1 | \
   perl -pe 's/<\/?h1>/======/g' | \
   perl -pe 's/<\/?h2>/=====/g' | \
   perl -pe 's/<\/?h3>/====/g' | \
   perl -pe 's/<\/?h4>/===/g' | \
   perl -pe 's/<\/?h5>/==/g' | \
   perl -pe 's/<\/?h6>/=/g' | \
   cat > mediawiki2
 
# lists
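# Multi-level markers are handled before the single-level ones so that mixed
# prefixes such as "*#" are not consumed by the single "*" rule first.
# Ordered items use the same indentation per level as unordered items.
cat mediawiki2 |
  perl -pe 's/^[\*#]{4}\*/          * /g'  | \
  perl -pe 's/^[\*#]{3}\*/        * /g'    | \
  perl -pe 's/^[\*#]{2}\*/      * /g'      | \
  perl -pe 's/^[\*#]{1}\*/    * /g'        | \
  perl -pe 's/^[\*#]{4}#/          \- /g'  | \
  perl -pe 's/^[\*\#]{3}\#/        \- /g'  | \
  perl -pe 's/^[\*\#]{2}\#/      \- /g'    | \
  perl -pe 's/^[\*\#]{1}\#/    \- /g'      | \
  perl -pe 's/^\*/  * /g'                  | \
  perl -pe 's/^\#/  - /g'                  | \
  cat > mediawiki3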
 
#[link] => [[link]]
cat mediawiki3 |
  perl -pe 's/([^\[])\[([^\[])/${1}[[${2}/g' |
  perl -pe 's/^\[([^\[])/[[${1}/g' |
  perl -pe 's/([^\]])\]([^\]])/${1}]]${2}/g' |
  perl -pe 's/([^\]])\]$/${1}]]/g' \
  > mediawiki4
 
#[[url text]] => [[url|text]]
cat mediawiki4 |
  perl -pe 's/(\[\[[^| \]]*) ([^|\]]*\]\])/${1}|${2}/g' \
  > mediawiki5
 
# bold, italic
cat mediawiki5 |
  perl -pe "s/'''/**/g" |
  perl -pe "s/''/\/\//g" \
  > mediawiki6
 
# talks
cat mediawiki6 |
  perl -pe "s/^[ ]*:/>/g" |
  perl -pe "s/>:/>>/g" |
  perl -pe "s/>>:/>>>/g" |
  perl -pe "s/>>>:/>>>>/g" |
  perl -pe "s/>>>>:/>>>>>/g" |
  perl -pe "s/>>>>>:/>>>>>>/g" |
  perl -pe "s/>>>>>>:/>>>>>>>/g" \
  > mediawiki7
 
cat mediawiki7 |
  perl -pe "s/<pre>/<code>/g" |
  perl -pe "s/<\/pre>/<\/code>/g" \
  > mediawiki8
 
cat mediawiki8 > dokuwiki

How to use (for shell newbies)

  1. Make sure you are on Linux/Unix ;-)
  2. Save the code above in a file named mediawiki2dokuwiki.sh.
  3. Save the MediaWiki page you want to convert to a file called mediawiki in the same directory.
  4. In the shell, go to that directory (using cd) and execute:
    chmod +x mediawiki2dokuwiki.sh # we want to be able to execute it
    ./mediawiki2dokuwiki.sh
  5. You now have some files named mediawiki followed by a number (mediawiki1 to mediawiki8). These are intermediate debugging steps and can be ignored.
  6. The file 'dokuwiki' contains your DokuWiki syntax.

Remember, all the fame goes to me, cause I started this ;-)! — Johannes Buchner 2006-01-26 19:27

MeaCulpa's Version

  • So here is my final version. Pass a filename to the script as its first argument; the result is written to <filename>.dokuwiki.
#! /bin/sh
# Mediawiki2Dokuwiki Converter
# originally by Johannes Buchner <buchner.johannes [at] gmx.at>
# changes by Frederik Tilkin:		- uses sed instead of perl
#				- resolved some bugs ('''''IMPORTANT!!!''''' becomes //**IMPORTANT!!!**//, // becomes <nowiki>//</nowiki> if it is not in a CODE block)
# 				- added functionality (multiple lines starting with a space become CODE blocks)
#
# Licence: GPL (http://www.gnu.org/licenses/gpl.txt)
 
# First escape things that are already DokuWiki but not MediaWiki syntax
# //	=>	<nowiki>//</nowiki> 	(only when it is NOT in a PREFORMATTED line, and when it is NOT in a LINK [] !)
# **	=>  <nowiki>**</nowiki>		(only when it is NOT in a PREFORMATTED line, NOR at the beginning of a line)
# surround preformatted blocks (lines starting with space) with <PRE> so that it's correctly converted to DokuWiki <CODE> blocks later on
 
# MeaCulpa's personal need: accept the input filename as the first argument ($1) and write the result to <filename>.dokuwiki

cat "$1" \
	| sed -r -n '
		#starts with a SPACE, so it is part of a code block, just print and do nothing
		/^[ ]/ { p; d }
		#else: replace ALL **... strings (not at beginning of line)
		s/([^^][^\*]*)(\*\*+)/\1<nowiki>\2<\/nowiki>/g
		# 		also replace ALL //... strings 
		s/([^\/]*)(\/\/+)/\1<nowiki>\2<\/nowiki>/g
		#		change the ones that have been replaced in a link [] BACK to normal (do it twice in case [http://addres.com http://address.com] ) [quick and dirty]
		s/([\[][^\[]*)(<nowiki>)(\/\/+)(<\/nowiki>)([^\]]*)/\1\3\5/g ; s/([\[][^\[]*)(<nowiki>)(\/\/+)(<\/nowiki>)([^\]]*)/\1\3\5/g
 
		p
	  ' \
	| sed -r -n '
		# See also: http://www.grymoire.com/Unix/Sed.html#uh-40
		# 	http://en.wikipedia.org/wiki/Regular_expression
		# This is pretty advanced sed syntax, so I ll try to explain as much as possible
		################################################################################
 
		# if line starts with a space, add it to the hold buffer
		# we do this by 'branching' to :addtopre
		/^ [ ]*[^ ][^ ]*/ b addtopre
		# if line has only whitespace or is empty, the preformatted block is over, so we surround that with <pre>
		# we do this by 'branching' to :outputpre
		/^[ ]*$/ b outputpre
		# if line starts with NO whitespace, the preformatted block is over, so we surround that with <pre>
		/^[^ ].*$/ b outputpre
 
		#else this is a normal line
				#s/(.*)/NORMAL LINE: \1/g; p
			# print the line
			p
			#delete the current pattern space (so new cycle is started -> jumps to top)
			d
 
		# this is a line that should be part of a CODE block
		:addtopre
			#add it to the hold buffer
			H
				#s/(.*)/ADDED LINE: \1/g; p
			# if this is the last line of the file (end-of-file), empty this line and then output this last preformatted block
			$ { s/.*//g
				b outputpre
			}
			#delete the current pattern space (so new cycle is started -> jumps to top)
			d
		# this is where a paragraph is surrounded by <pre></pre>
		:outputpre
				#s/(.*)/END OF CODE LINE: \1/g; p
			# HOLD buffer is exchanged with the pattern space
			x
 
			# IF not empty, surround with <PRE> and PRINT the pattern space
			/(.+)/ {
				# surround it with <pre>
				s/(.+)/<pre>\1<\/pre>/g
				p
			}
			# exchange pattern space and hold buffer again, pattern is now the current line (not part of the preformatted block) and PRINT this line
			x
			p
			#delete the current pattern space			
			s/.*//g
			#and exchange this again with the hold buffer, so that the hold buffer is empty again			
			x
			#delete the current pattern space (so new cycle is started -> jumps to top)
			d
	' \
    > mediawiki0
 
# Headings
cat mediawiki0 \
   | sed -r 's/^[ ]*=([^=])/<h1> \1/g' \
   | sed -r 's/([^=])=[ ]*$/\1 <\/h1>/g' \
   | sed -r 's/^[ ]*==([^=])/<h2> \1/g' \
   | sed -r 's/([^=])==[ ]*$/\1 <\/h2>/g' \
   | sed -r 's/^[ ]*===([^=])/<h3> \1/g' \
   | sed -r 's/([^=])===[ ]*$/\1 <\/h3>/g' \
   | sed -r 's/^[ ]*====([^=])/<h4> \1/g' \
   | sed -r 's/([^=])====[ ]*$/\1 <\/h4>/g' \
   | sed -r 's/^[ ]*=====([^=])/<h5> \1/g' \
   | sed -r 's/([^=])=====[ ]*$/\1 <\/h5>/g' \
   | sed -r 's/^[ ]*======([^=])/<h6> \1/g' \
   | sed -r 's/([^=])======[ ]*$/\1 <\/h6>/g' \
   > mediawiki1
 
cat mediawiki1 \
   | sed -r 's/<\/?h1>/======/g' \
   | sed -r 's/<\/?h2>/=====/g' \
   | sed -r 's/<\/?h3>/====/g' \
   | sed -r 's/<\/?h4>/===/g' \
   | sed -r 's/<\/?h5>/==/g' \
   | sed -r 's/<\/?h6>/=/g'  \
   > mediawiki2
 
# lists
# Multi-level markers are handled first so that mixed prefixes such as "*#"
# are not consumed by the single "*" rule.
cat mediawiki2 \
  | sed -r 's/^[*#][*#][*#][*#]\*/          * /g'  \
  | sed -r 's/^[*#][*#][*#]\*/        * /g'    \
  | sed -r 's/^[*#][*#]\*/      * /g'      \
  | sed -r 's/^[*#]\*/    * /g'        \
  | sed -r 's/^[*#][*#][*#][*#]#/          - /g'  \
  | sed -r 's/^[*#][*#][*#]#/        - /g'    \
  | sed -r 's/^[*#][*#]#/      - /g'      \
  | sed -r 's/^[*#]#/    - /g'        \
  | sed -r 's/^\*/  * /g'                  \
  | sed -r 's/^#/  - /g'                   \
  > mediawiki3
 
 
#[url text] => [url|text]
cat mediawiki3 \
  | sed -r 's/([^[]|^)(\[[^] ]*) ([^]]*\])([^]]|$)/\1\2|\3\4/g' \
  > mediawiki4
 
 
#[link] => [[link]]
cat mediawiki4 \
  | sed -r 's/([^[]|^)(\[[^]]*\])([^]]|$)/\1[\2]\3/g' \
  > mediawiki5
 
# bold, italic
cat mediawiki5 \
  | sed -r "s/'''''(.*)'''''/\/\/**\1**\/\//g" \
  | sed -r "s/'''/**/g" \
  | sed -r "s/''/\/\//g" \
  > mediawiki6
 
# talks
cat mediawiki6 \
  | sed -r "s/^[ ]*:/>/g" \
  | sed -r "s/>:/>>/g" \
  | sed -r "s/>>:/>>>/g" \
  | sed -r "s/>>>:/>>>>/g" \
  | sed -r "s/>>>>:/>>>>>/g" \
  | sed -r "s/>>>>>:/>>>>>>/g" \
  | sed -r "s/>>>>>>:/>>>>>>>/g" \
  > mediawiki7
 
 # code
cat mediawiki7 \
   | sed -r "s/<code>/\'\'/g" \
   | sed -r "s/<\/code>/\'\'/g" \
  > mediawiki8
 
 # pre
cat mediawiki8 \
   | sed -r "s/<pre>/<code>/g" \
   | sed -r "s/<\/pre>/<\/code>/g" \
  > mediawiki9
 
 # combined bold and italic
cat mediawiki9 \
   | sed -r "s/\*\*\/\//\/\/\*\*/g"\
   > mediawiki10
 
cat mediawiki10 > "$1".dokuwiki

Changelog

ToDo

Would be great if someone wants to improve this! What is needed:

Feedback & discussion

  • Yeah, I could remove leading whitespace altogether, but this might ruin the layout. Johannes Buchner 2006-01-27 11:41
  • I just began to implement tables with:
      # single quotes keep the shell from expanding ${1} before perl sees it
      perl -pe 's/^[ ]*\{\|[^\|]*$//g' |
      perl -pe 's/^[ ]*\|\}[ ]*$//g' |
      perl -pe 's/^[ ]*\|([^\|]+)\|[ ]*$/|${1}|/g'
      # ...

    But tables in MediaWiki are crap. I suggest using an html2wiki implementation for this (because the requirements will be very specific), such as http://diberri.dyndns.org/html2wiki.html Johannes Buchner 2006-01-27 12:00

  • Feedback/Questions:
    • Feedback by Juergen Mueller: Is there an easy way to convert a bunch of MediaWiki articles to DokuWiki articles at once? Our MediaWiki wiki has some hundreds of articles and therefore it is not feasible to do it manually file by file.
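
Regarding the question about converting many articles at once: one rough option is to wrap a converter that takes a filename (such as MeaCulpa's version above) in a loop. The sketch below assumes each article has already been saved to its own file ending in .mediawiki, which is an assumption of this example and not something the scripts require; the function name convert_all is made up as well.

```shell
#!/bin/sh
# Hypothetical batch wrapper. Assumes every page was saved as its own
# file ending in .mediawiki, and that the converter takes a filename
# argument and writes FILE.dokuwiki next to it.
convert_all() {
  script=$1   # path to the converter script
  dir=$2      # directory holding the saved MediaWiki pages
  for page in "$dir"/*.mediawiki; do
    [ -e "$page" ] || continue     # no matches: the glob stays literal
    sh "$script" "$page" || echo "failed: $page" >&2
  done
}

# Usage (paths are placeholders):
#   convert_all ./mediawiki2dokuwiki.sh ./pages
```

Note that the converter writes its intermediate mediawikiN files into the current directory, so the pages have to be processed one after another rather than in parallel.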

Code markup

Graham Macleod 30/1/08 1:50pm GMT - Hi there. Code markup that displays like this:

<?php echo 'THIS IS CODE'; ?>

in DokuWiki uses two leading spaces, but your converter seems to keep the single leading space that MediaWiki uses. I'd also like to take the time to give you massive props on this. It's been a lifesaver.
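
A small sketch of how the single leading space could be widened to two. This is only an illustration: the function name indent_code_lines is hypothetical, and it touches only lines with exactly one leading space followed by text.

```shell
#!/bin/sh
# Hypothetical helper: give MediaWiki preformatted lines (exactly one
# leading space) the two leading spaces DokuWiki expects.
# Lines that already start with two or more spaces are left alone.
indent_code_lines() {
  sed -E 's/^ ([^ ])/  \1/'
}

# Example: the first line gains a second leading space, the second is untouched
printf '%s\n' " <?php echo 'THIS IS CODE'; ?>" 'normal text' | indent_code_lines
```

It would have to run before any step that changes leading whitespace, for example directly on the input file.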

sed Version

  • I changed this script to use sed and made a few improvements (see Bugs and Missing features). Maybe this can be of use to someone.
    #! /bin/sh
    # Mediawiki2Dokuwiki Converter
    # originally by Johannes Buchner <buchner.johannes [at] gmx.at>
    # changes by Frederik Tilkin:		- uses sed instead of perl
    #				- resolved some bugs ('''''IMPORTANT!!!''''' becomes //**IMPORTANT!!!**//, // becomes <nowiki>//</nowiki> if it is not in a CODE block)
    # 				- added functionality (multiple lines starting with a space become CODE blocks)
    #
    # Licence: GPL (http://www.gnu.org/licenses/gpl.txt)
     
    # First escape things that are already DokuWiki but not MediaWiki syntax
    # //	=>	<nowiki>//</nowiki> 	(only when it is NOT in a PREFORMATTED line, and when it is NOT in a LINK [] !)
    # **	=>  <nowiki>**</nowiki>		(only when it is NOT in a PREFORMATTED line, NOR at the beginning of a line)
    # surround preformatted blocks (lines starting with space) with <PRE> so that it's correctly converted to DokuWiki <CODE> blocks later on
     
    cat mediawiki \
    	| sed -r -n '
    		#starts with a SPACE, so it is part of a code block, just print and do nothing
    		/^[ ]/ { p; d }
    		#else: replace ALL **... strings (not at beginning of line)
    		s/([^^][^\*]*)(\*\*+)/\1<nowiki>\2<\/nowiki>/g
    		# 		also replace ALL //... strings 
    		s/([^\/]*)(\/\/+)/\1<nowiki>\2<\/nowiki>/g
    		#		change the ones that have been replaced in a link [] BACK to normal (do it twice in case [http://addres.com http://address.com] ) [quick and dirty]
    		s/([\[][^\[]*)(<nowiki>)(\/\/+)(<\/nowiki>)([^\]]*)/\1\3\5/g ; s/([\[][^\[]*)(<nowiki>)(\/\/+)(<\/nowiki>)([^\]]*)/\1\3\5/g
     
    		p
    	  ' \
    	| sed -r -n '
    		# See also: http://www.grymoire.com/Unix/Sed.html#uh-40
    		# 	http://en.wikipedia.org/wiki/Regular_expression
    		# This is pretty advanced sed syntax, so I ll try to explain as much as possible
    		################################################################################
     
    		# if line starts with a space, add it to the hold buffer
    		# we do this by 'branching' to :addtopre
    		/^ [ ]*[^ ][^ ]*/ b addtopre
    		# if line has only whitespace or is empty, the preformatted block is over, so we surround that with <pre>
    		# we do this by 'branching' to :outputpre
    		/^[ ]*$/ b outputpre
    		# if line starts with NO whitespace, the preformatted block is over, so we surround that with <pre>
    		/^[^ ].*$/ b outputpre
     
    		#else this is a normal line
    				#s/(.*)/NORMAL LINE: \1/g; p
    			# print the line
    			p
    			#delete the current pattern space (so new cycle is started -> jumps to top)
    			d
     
    		# this is a line that should be part of a CODE block
    		:addtopre
    			#add it to the hold buffer
    			H
    				#s/(.*)/ADDED LINE: \1/g; p
    			# if this is the last line of the file (end-of-file), empty this line and then output this last preformatted block
    			$ { s/.*//g
    				b outputpre
    			}
    			#delete the current pattern space (so new cycle is started -> jumps to top)
    			d
    		# this is where a paragraph is surrounded by <pre></pre>
    		:outputpre
    				#s/(.*)/END OF CODE LINE: \1/g; p
    			# HOLD buffer is exchanged with the pattern space
    			x
     
    			# IF not empty, surround with <PRE> and PRINT the pattern space
    			/(.+)/ {
    				# surround it with <pre>
    				s/(.+)/<pre>\1<\/pre>/g
    				p
    			}
    			# exchange pattern space and hold buffer again, pattern is now the current line (not part of the preformatted block) and PRINT this line
    			x
    			p
    			#delete the current pattern space			
    			s/.*//g
    			#and exchange this again with the hold buffer, so that the hold buffer is empty again			
    			x
    			#delete the current pattern space (so new cycle is started -> jumps to top)
    			d
    	' \
        > mediawiki0
     
    # Headings
    cat mediawiki0 \
       | sed -r 's/^[ ]*=([^=])/<h1> \1/g' \
       | sed -r 's/([^=])=[ ]*$/\1 <\/h1>/g' \
       | sed -r 's/^[ ]*==([^=])/<h2> \1/g' \
       | sed -r 's/([^=])==[ ]*$/\1 <\/h2>/g' \
       | sed -r 's/^[ ]*===([^=])/<h3> \1/g' \
       | sed -r 's/([^=])===[ ]*$/\1 <\/h3>/g' \
       | sed -r 's/^[ ]*====([^=])/<h4> \1/g' \
       | sed -r 's/([^=])====[ ]*$/\1 <\/h4>/g' \
       | sed -r 's/^[ ]*=====([^=])/<h5> \1/g' \
       | sed -r 's/([^=])=====[ ]*$/\1 <\/h5>/g' \
       | sed -r 's/^[ ]*======([^=])/<h6> \1/g' \
       | sed -r 's/([^=])======[ ]*$/\1 <\/h6>/g' \
       > mediawiki1
     
    cat mediawiki1 \
       | sed -r 's/<\/?h1>/======/g' \
       | sed -r 's/<\/?h2>/=====/g' \
       | sed -r 's/<\/?h3>/====/g' \
       | sed -r 's/<\/?h4>/===/g' \
       | sed -r 's/<\/?h5>/==/g' \
       | sed -r 's/<\/?h6>/=/g'  \
       > mediawiki2
     
    # lists
    # Multi-level markers are handled first so that mixed prefixes such as "*#"
    # are not consumed by the single "*" rule.
    cat mediawiki2 \
      | sed -r 's/^[*#][*#][*#][*#]\*/          * /g'  \
      | sed -r 's/^[*#][*#][*#]\*/        * /g'    \
      | sed -r 's/^[*#][*#]\*/      * /g'      \
      | sed -r 's/^[*#]\*/    * /g'        \
      | sed -r 's/^[*#][*#][*#][*#]#/          - /g'  \
      | sed -r 's/^[*#][*#][*#]#/        - /g'    \
      | sed -r 's/^[*#][*#]#/      - /g'      \
      | sed -r 's/^[*#]#/    - /g'        \
      | sed -r 's/^\*/  * /g'                  \
      | sed -r 's/^#/  - /g'                   \
      > mediawiki3
     
     
    #[url text] => [url|text]
    cat mediawiki3 \
      | sed -r 's/([^[]|^)(\[[^] ]*) ([^]]*\])([^]]|$)/\1\2|\3\4/g' \
      > mediawiki4
     
     
    #[link] => [[link]]
    cat mediawiki4 \
      | sed -r 's/([^[]|^)(\[[^]]*\])([^]]|$)/\1[\2]\3/g' \
      > mediawiki5
     
    # bold, italic
    cat mediawiki5 \
      | sed -r "s/'''''(.*)'''''/\/\/**\1**\/\//g" \
      | sed -r "s/'''/**/g" \
      | sed -r "s/''/\/\//g" \
      > mediawiki6
     
    # talks
    cat mediawiki6 \
      | sed -r "s/^[ ]*:/>/g" \
      | sed -r "s/>:/>>/g" \
      | sed -r "s/>>:/>>>/g" \
      | sed -r "s/>>>:/>>>>/g" \
      | sed -r "s/>>>>:/>>>>>/g" \
      | sed -r "s/>>>>>:/>>>>>>/g" \
      | sed -r "s/>>>>>>:/>>>>>>>/g" \
      > mediawiki7
     
    cat mediawiki7 \
       | sed -r "s/<code>/\'\'/g" \
       | sed -r "s/<\/code>/\'\'/g" \
      > mediawiki8
     
    cat mediawiki8 \
       | sed -r "s/<pre>/<code>/g" \
       | sed -r "s/<\/pre>/<\/code>/g" \
      > mediawiki9
     
     
    cat mediawiki9 > dokuwiki

There is also one issue when bold and italic text are combined. I tested with the German Wikipedia article on UNIX, and there were 2 tags that turned whole parts of the generated DokuWiki page bold. The following change fixes this behaviour:

$ diff mediawiki2dokuwiki.sh mediawiki2dokuwiki.sh.080925-1
7d6
< # changes by Reiner Rottmann: - fixed erroneous interpretation of combined bold and italic text.
165,169c164
<
< cat mediawiki9 \
<    | sed -r "s/\*\*\/\//\/\/\*\*/g"> mediawiki10
<
< cat mediawiki10 > dokuwiki
---
> cat mediawiki9 > dokuwiki
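
To see what the added substitution does, it can be tried on a single line. The sketch below isolates it in a function with a made-up name (swap_bold_italic_opener) and uses | as the sed delimiter for readability; the sample input is invented for illustration.

```shell
#!/bin/sh
# The substitution from the diff above, isolated in a hypothetically
# named function: every "**//" sequence is rewritten to "//**".
swap_bold_italic_opener() {
  sed -E 's|\*\*//|//**|g'
}

# Example: bold ends before italic does, so italic has to be opened first
printf '%s\n' '**//bold italic** italic only//' | swap_bold_italic_opener
```

Because the substitution is global, it swaps every **// on the line, including one that closes a span rather than opening it.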
 
wiki/mediawiki_to_dokuwiki.txt · Last modified: 2009/12/31 09:24 (external edit)