StringParser_BBCode class documentation

1. Introduction

1.1 General

The StringParser_BBCode class parses strings that contain BBCode and converts them to another format, e.g. HTML. BBCode is a kind of markup "language" for structuring and formatting text. It is similar to HTML but uses square brackets instead of angle brackets. Another difference is that invalid BBCode is simply ignored, whereas validity matters for HTML.

Here is an example of a text that was structured with BBCode:

This is a [b]bold [i]text in italics[/i] that has no meaning[/b]!

This text could now be converted to HTML:

This is a <b>bold <i>text in italics</i> that has no meaning</b>!

This would look like:

This is a bold text in italics that has no meaning!

The simplest way to convert the BBCode to HTML would be to replace [b], [i], [/b] and [/i] with <b>, <i>, </b> and </i>. This works fine for the example above, but it causes problems as soon as somebody mistypes the BBCode. An example:

This is a [b]bold [i]text in italics[/io] which has no meaning[/b]!

The author mistyped the [/i], also hitting the o key and thereby producing [/io] instead of [/i]. If one simply performed the replacements mentioned above, the following HTML code would be generated:

This is a <b>bold <i>text in italics[/io] which has no meaning</b>!

This is invalid HTML since the <i> element is never closed and the elements are therefore not correctly nested. Other approaches to converting BBCode use regular expressions to make sure every element is closed. But those approaches cannot ensure that the elements are nested correctly.
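The failure of plain replacement can be reproduced in a few lines. The following is a minimal Python sketch for illustration only; it is not part of the class, and the names REPLACEMENTS and naive_bbcode_to_html are invented here.

```python
# Naive tag-for-tag replacement as described above (illustration only,
# not how StringParser_BBCode works).
REPLACEMENTS = {
    "[b]": "<b>", "[/b]": "</b>",
    "[i]": "<i>", "[/i]": "</i>",
}


def naive_bbcode_to_html(text: str) -> str:
    """Replace each known BBCode tag with its HTML counterpart."""
    for bbcode, html in REPLACEMENTS.items():
        text = text.replace(bbcode, html)
    return text


good = "This is a [b]bold [i]text in italics[/i] that has no meaning[/b]!"
bad = "This is a [b]bold [i]text in italics[/io] which has no meaning[/b]!"

good_html = naive_bbcode_to_html(good)
bad_html = naive_bbcode_to_html(bad)  # the <i> element is opened but never closed
```

For the mistyped input, the result contains one <i> and no </i>, which is exactly the problem described above.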

This is why this class takes a different approach. The text is read character by character and converted into a tree structure. Only after the complete text has been converted into the tree is the tree converted to HTML. The text

This is a [b]bold [i]text in italics[/i] that has no meaning[/b]!

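would be converted into the following tree structure:

root
    text: "This is a "
    [b]
        text: "bold "
        [i]
            text: "text in italics"
        text: " that has no meaning"
    text: "!"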

If a text such as the one above, in which the [/i] is missing, is to be converted, the class notices this at the [/b]: the [/b] appears while the class is still waiting for a [/i], because it knows that the [i] is still open. The programmer using the class may choose between two ways of handling this. The first option is to declare the [i] invalid, append it as plain text just after "bold" and continue parsing at that location. The second option is to imply a [/i] directly before the [/b] in order to close both elements. What the class does not do is guess that [/io] might have meant [/i]. Doing so would cause the class to introduce errors elsewhere by "correcting" input that is in fact correct.
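The tree approach and the two recovery options can be sketched roughly as follows. This is a simplified Python illustration under several assumptions, not the class's actual implementation: it tokenises with a regular expression instead of reading character by character, it only knows [b] and [i], it does no HTML escaping, and for the first option it keeps the already parsed children instead of re-parsing from the invalid tag. All names (Node, parse, unwind, to_html, the policy values "close" and "invalidate") are invented for this sketch.

```python
import re
from dataclasses import dataclass, field

TAG = re.compile(r"\[(/?)(\w+)\]")
KNOWN = {"b", "i"}  # assumed set of defined codes


@dataclass
class Node:
    name: str                                      # "root" for the tree root
    children: list = field(default_factory=list)   # items are str or Node


def unwind(stack, policy):
    """Deal with the innermost element that is still open."""
    node = stack.pop()
    if policy == "invalidate":
        # Turn the start tag back into plain text and keep its content.
        # The open node is always the last child of its parent.
        parent = stack[-1]
        parent.children[-1:] = [f"[{node.name}]", *node.children]
    # policy "close": the node stays in the tree, i.e. its end tag is implied


def parse(text, policy="close"):
    root = Node("root")
    stack = [root]
    pos = 0
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if text[pos:m.start()]:
            stack[-1].children.append(text[pos:m.start()])
        pos = m.end()
        open_names = [n.name for n in stack[1:]]
        if name not in KNOWN or (closing and name not in open_names):
            stack[-1].children.append(m.group(0))  # e.g. [/io] stays plain text
        elif not closing:
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            while stack[-1].name != name:          # inner elements still open
                unwind(stack, policy)
            stack.pop()
    if text[pos:]:
        stack[-1].children.append(text[pos:])
    while len(stack) > 1:                          # elements never closed
        unwind(stack, policy)
    return root


def to_html(node):
    inner = "".join(c if isinstance(c, str) else to_html(c) for c in node.children)
    return inner if node.name == "root" else f"<{node.name}>{inner}</{node.name}>"
```

For the mistyped example, to_html(parse(text, "close")) closes the <i> just before the </b>, while to_html(parse(text, "invalidate")) leaves [i] in the output as plain text. In both cases [/io] is left untouched.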

The class itself does not prescribe which codes are recognised. [b] and [i] are popular examples and are therefore used here. The class lets you define your own codes as long as they are written in square brackets. The following chapter shows how to define your own codes.

In addition to plain tags, attributes can be used as in HTML. For example:

This is a [b strength=really_bold]really bold text[/b]!

The class recognises the following attribute syntaxes:

[code attribute=value], [code attribute="value"], [code attribute='value']
This form is the most similar to HTML. It is also the only form that allows more than one attribute to be set at the same time. The expression in front of the equals sign is the attribute name and the expression after it is the attribute value. If a value is enclosed in double or single quotes, spaces and a closing square bracket (]) are also allowed inside the attribute value. If the attribute value is to contain the quote character itself, that character must be escaped with \. Example: [code attribute="value ] we are still inside the value\" yes, this also belongs to the value"].
[code=value], [code = value], [code="value"], [code='value']
In this form only one attribute can be set. This attribute always has the name default. The syntax [code=value] is identical to [code default=value]. This syntax is very similar to classical BBCode.
[code:value], [code: value]
This is another possible syntax; here, too, the attribute name is default.
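To make the three forms concrete, here is a small Python sketch that extracts the code name and attributes from a single start tag. It is an approximation written for this explanation and not the class's own parsing routine; the function names parse_tag and unquote and the exact regular expressions are assumptions, and details such as which characters an unquoted value may contain are guesses.

```python
import re

# A value is either double-quoted, single-quoted (both with \ escapes)
# or an unquoted run without whitespace, quotes or a closing bracket.
VALUE = r"""("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|[^\s\]"']+)"""
NAME_RE = re.compile(r"\[(\w+)")
DEFAULT_RE = re.compile(r"\s*[=:]\s*" + VALUE + r"\s*\]")   # [code=value], [code: value]
NAMED_RE = re.compile(r"\s+(\w+)=" + VALUE)                 # [code attribute=value ...]


def unquote(raw):
    """Strip surrounding quotes and resolve backslash escapes."""
    if raw[0] in "\"'":
        return re.sub(r"\\(.)", r"\1", raw[1:-1])
    return raw


def parse_tag(tag):
    """Return (name, attributes) for one start tag, or None if it is malformed."""
    m = NAME_RE.match(tag)
    if m is None:
        return None
    name, pos = m.group(1), m.end()
    d = DEFAULT_RE.match(tag, pos)
    if d is not None and d.end() == len(tag):
        return name, {"default": unquote(d.group(1))}
    attrs = {}
    while True:
        a = NAMED_RE.match(tag, pos)
        if a is None:
            break
        attrs[a.group(1)] = unquote(a.group(2))
        pos = a.end()
    if tag[pos:].strip() != "]":
        return None
    return name, attrs
```

With this sketch, parse_tag("[code=value]") and parse_tag("[code: value]") both yield the single attribute named default, while parse_tag('[code a=1 b="two words"]') yields two named attributes.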

1.2 Nesting

As seen above, elements must be nested correctly, and the class guarantees this. However, that check is purely formal. The following example illustrates the problem:

[b]This is a list:
[list]
[*] List item
[*] List item
[/list]

[/b]

If this is converted to HTML the following would be the output:

<b>This is a list:
<ul>
<li> List item</li>
<li> List item</li>
</ul>
</b>

This HTML code is formally correctly nested, but in HTML the <b> element must not contain a <ul> element, so the result is still invalid HTML. For this reason it is possible to tell the class which elements may contain which other elements. This is done by means of so-called content types. Every element is assigned a content type. In addition, it is possible to specify for each element the content types inside which it is allowed. An example:

[a][b][c]Text[/c][/b][/a]

In this example the [b] element is inside the [a] element and the [c] element is inside the [b] element. The following tree would be created:

root
    [a]
        [b]
            [c]
                text: "Text"

Now we assign a content type to each element. The [a] element receives the content type alpha, the [b] element the content type beta and the [c] element the content type gamma. For the parser to convert every element, the [b] element must be allowed inside the alpha content type, because that is the content type of the [a] element in which the [b] element resides. In exactly the same manner the [c] element must be allowed inside the beta content type, because that is the content type of the [b] element in which the [c] element resides. The [c] element need not be allowed inside the alpha content type, however, because only the directly enclosing level is relevant.

If no element has been opened yet, the so-called root content type applies. This content type is block by default but it can be changed. See also the chapter on parser functions, in which content types play another role.

Besides specifying the content types in which an element is allowed, it is also possible to forbid an element inside certain content types. A link inside another link, for instance, would not make much sense. Forbidding a link directly inside another link is easy. However, it is possible to put another element in between to work around the list of allowed content types. Example:

[link][b][link]Text[/link][/b][/link]

There is no reason to forbid [b] inside [link], nor to forbid [link] inside [b], but there certainly is a reason to forbid this construction as a whole. This is where the list of disallowed content types comes in. This list is applied to all levels, whereas the list of allowed content types is only applied to the directly enclosing level. In this way constructions like the one above can be prevented.
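The interplay of the two lists can be expressed as a short check. The Python sketch below is an illustration of the rule as described in this section, not the class's code; CodeDef, may_open and the content type names inline and link are assumptions made for the example (only block as the default root content type is taken from the text). It also assumes that the root content type counts as one of the levels for the disallowed list.

```python
from dataclasses import dataclass, field


@dataclass
class CodeDef:
    content_type: str                                   # type this element provides
    allowed_in: set                                     # checked against the direct parent only
    not_allowed_in: set = field(default_factory=set)    # checked against every level


def may_open(code, open_stack, codes, root_type="block"):
    """Decide whether `code` may be opened.

    open_stack lists the names of the currently open elements, outermost first.
    """
    levels = [root_type] + [codes[name].content_type for name in open_stack]
    if levels[-1] not in codes[code].allowed_in:
        return False                                    # direct parent does not permit it
    return not any(t in codes[code].not_allowed_in for t in levels)


# The [a][b][c] example: each element is allowed only in its parent's type.
abc = {
    "a": CodeDef("alpha", {"block"}),
    "b": CodeDef("beta", {"alpha"}),
    "c": CodeDef("gamma", {"beta"}),
}

# The link example: [link] is forbidden anywhere inside another [link].
forum = {
    "b": CodeDef("inline", {"block", "inline", "link"}),
    "link": CodeDef("link", {"block", "inline"}, not_allowed_in={"link"}),
}

nested_link_ok = may_open("link", ["link", "b"], forum)   # link inside b inside link
```

Here nested_link_ok is False: the direct parent [b] would permit the link, but the disallowed list catches the outer [link] two levels up.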

1.3 Special codes

Sometimes it is useful to switch off code detection for part of a text. Many forums offer a [code] element with which portions of source code can be marked up and inside which [b] and similar codes are not parsed. Such a part may only be terminated by [/code]. The class provides a very simple way to achieve this behaviour:

[code]
// this would be example code that replaces the [b] bbcode:
// ...
[/code]

In this example the [b] should certainly not be converted, since it is part of the source code that is to be shown literally. For this purpose there is a so-called processing type usecontent, which causes the class to look only for the end tag of this specific element and to ignore every other code.
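The idea behind usecontent can be illustrated as follows. This Python sketch only mimics the described behaviour (search for the matching end tag, treat everything in between as raw text); it is not the class's implementation, the function names split_usecontent and render_code are invented, and rendering the content as an escaped <pre> block is merely one plausible choice.

```python
import html


def split_usecontent(text, name):
    """Split text around the first [name]...[/name] pair.

    Returns (before, raw_content, after), or None if the pair is incomplete.
    Nothing inside raw_content is interpreted as BBCode.
    """
    start_tag, end_tag = f"[{name}]", f"[/{name}]"
    start = text.find(start_tag)
    if start == -1:
        return None
    content_start = start + len(start_tag)
    end = text.find(end_tag, content_start)   # only the end tag is looked for
    if end == -1:
        return None
    return text[:start], text[content_start:end], text[end + len(end_tag):]


def render_code(text):
    """Render the first [code] element as an escaped <pre> block (assumed output)."""
    parts = split_usecontent(text, "code")
    if parts is None:
        return text
    before, raw, after = parts
    return f"{before}<pre>{html.escape(raw)}</pre>{after}"


sample = "See: [code]if (a < b) use [b] here[/code] done"
rendered = render_code(sample)
```

In the sample, the [b] inside the [code] element reaches the output unchanged, and only the < is escaped for HTML.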