saidone.org -
often imitated, never duplicated!
BBCode-like parser work in progress - posted by saidone on Fri, 08 Aug 2008 10:13:17 GMT
This page is a test for the new BBCode-like parser. I was tired of writing and maintaining very big and complex parsers, so I decided to rewrite it using JavaCC, a parser/scanner generator for Java. I hope that this will fit all my needs.
In the beginning I was thinking of implementing some sort of MediaWiki syntax, but I found that it is more prone to ambiguity and that a parser for it would become too bloated (maybe I'm wrong, eh?). I don't like these things; here's an example:

' stands for apostrophe
'' for italics
''' for bold

Now the string ''' can be interpreted in several ways: an italicized apostrophe waiting for the italic end tag, an apostrophe followed by the italic start tag, the bold tag itself, or a sequence of three apostrophes. Yes, we can match and replace things with regular expressions, and maybe obtain a reasonably good result, but the code rapidly becomes a mess and adding further tags is then far from trivial. So, at the moment, I have chosen BBCode as the lightweight markup language for the posts.
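
The ambiguity described above can be reproduced with a minimal Java sketch. It is only an illustration, not the code used on this site: the class name, the method names, and the two regular expressions are assumptions made for the example. It applies the same two naive replace rules in both possible orders to the same input.

```java
final class ApostropheAmbiguity {
    // Hypothetical rule order A: bold (''') is replaced before italics ('').
    static String boldFirst(String text) {
        String out = text.replaceAll("'''(.+?)'''", "<b>$1</b>");
        return out.replaceAll("''(.+?)''", "<i>$1</i>");
    }

    // Hypothetical rule order B: italics ('') are replaced before bold (''').
    static String italicFirst(String text) {
        String out = text.replaceAll("''(.+?)''", "<i>$1</i>");
        return out.replaceAll("'''(.+?)'''", "<b>$1</b>");
    }

    public static void main(String[] args) {
        String input = "'''x'''";
        System.out.println(boldFirst(input));   // treats ''' as the bold tag
        System.out.println(italicFirst(input)); // treats ''' as an italic tag plus a stray apostrophe
    }
}
```

With the input '''x''' the first order yields <b>x</b>, while the second yields <i>'x</i>' with a trailing apostrophe left over. Both are defensible readings, so the result depends only on the order of the rules.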

Here are some tests:
italic
bold
italic + bold
strikethru

quoted text


This is a URL: http://www.ush.it
and another: ush.it - a beautiful place
10 REM YOU WILL HARDLY NOTICE THE "CODE" TAG
20 REM WITH THIS CSS!
30 PRINT "CIAO"
40 GOTO 30
*colored text* (only hex encoded colors at the moment)
...and yes, there's also the img tag: (actually, my car ;-))

Here's the grammar source code (to be polished): KBbCodeParser.jj

REFERENCES
JavaCC home: https://javacc.dev.java.net/
An excellent FAQ maintained by Theodore S. Norvell: http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm



Comments

Source code is not the same as the live version written by Karen on Mon, 27 Sep 2010 18:04:05 GMT
I don't think the provided source code can actually generate the example shown on this page, because the token definition <CONTENT> will match all tags ([b], [i], etc.). The tokens <BOLDSTART>, <BOLDEND>, <ITALICSTART> will only be matched in the rare situation that there is no content next to them. As the longest token is always selected, the <CONTENT> token will have priority over the <BOLDSTART>, <BOLDEND>, etc. tokens. Do you still have the correct version, the one that does generate this page? I find it very difficult to fix it and make a good BBCode parser (one that can parse badly formatted BBCode, and thus allows unopened [/b] and unclosed [b] tags).
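
The longest-match behaviour raised in this comment can be illustrated with a small Java simulation. This is a sketch under stated assumptions, not the KBbCodeParser.jj grammar: the token names are borrowed from the comment, while the regular expressions for them (in particular both <CONTENT> definitions) and all class and method names are hypothetical. The tokenizer tries every rule at the current position and keeps the longest match, with ties going to the rule declared first, which is the selection rule JavaCC applies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchSketch {
    // Try every rule at the current position and keep the longest match;
    // on a tie the rule declared first wins (strict '>' comparison below).
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> kinds = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            kinds.add(bestKind);
            pos = bestEnd;
        }
        return kinds;
    }

    // Hypothetical rule set whose CONTENT token also swallows the tags.
    static Map<String, Pattern> greedyRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("BOLDSTART", Pattern.compile("\\[b\\]"));
        rules.put("BOLDEND", Pattern.compile("\\[/b\\]"));
        rules.put("CONTENT", Pattern.compile("[^\\n]+"));
        return rules;
    }

    // Hypothetical fix: CONTENT stops at square brackets.
    static Map<String, Pattern> strictRules() {
        Map<String, Pattern> rules = greedyRules();
        rules.put("CONTENT", Pattern.compile("[^\\[\\]]+"));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("[b]hi[/b]", greedyRules()));
        System.out.println(tokenize("[b]hi[/b]", strictRules()));
    }
}
```

With the greedy rule set the whole input comes back as a single CONTENT token; once CONTENT excludes the brackets, the same input splits into BOLDSTART, CONTENT, BOLDEND. A stray bracket in the text would then need a rule of its own, which is one reason tolerant BBCode grammars are awkward to write.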


It's no longer in use... written by saidone on Tue, 28 Sep 2010 00:27:15 GMT
Well, in spite of the formal correctness that would be achieved using a real parser (and a good grammar), I ended up dumping it: the complexity of the generated code did not fit well with the rest of this work, which is inspired by minimalism and pragmatism. The new "parser" (which is what renders this page, for example) is now just a small collection of regular expressions and replace statements that simply do the job.
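
For readers wondering what such a collection of replace statements might look like, here is a minimal Java sketch. It is not the code running on this site: the class and method names, the tag spellings ([b], [i], [s], [color=#rrggbb], [url]) and the generated HTML are all assumptions chosen for illustration.

```java
final class RegexBbCodeSketch {
    // Escape HTML metacharacters first, so that user text cannot inject markup.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Apply one replace statement per tag; unmatched tags are left as plain text.
    static String render(String bbcode) {
        String html = escapeHtml(bbcode);
        html = html.replaceAll("(?s)\\[b\\](.*?)\\[/b\\]", "<b>$1</b>");
        html = html.replaceAll("(?s)\\[i\\](.*?)\\[/i\\]", "<i>$1</i>");
        html = html.replaceAll("(?s)\\[s\\](.*?)\\[/s\\]", "<s>$1</s>");
        // Only hex encoded colors are accepted.
        html = html.replaceAll("(?s)\\[color=(#[0-9a-fA-F]{6})\\](.*?)\\[/color\\]",
                "<span style=\"color:$1\">$2</span>");
        html = html.replaceAll("\\[url\\](https?://[^\\s\\[\\]]+)\\[/url\\]",
                "<a href=\"$1\">$1</a>");
        return html;
    }

    public static void main(String[] args) {
        System.out.println(render("[i][b]italic + bold[/b][/i]"));
        System.out.println(render("[b]unclosed tag"));
    }
}
```

Because each rule only fires on a complete opening and closing pair, an unclosed [b] or an unopened [/b] simply stays in the output as text. That tolerance for badly formed input comes for free here, at the price of the formal correctness a grammar would give.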



