Parsing Tests

Testing some approaches to parsing TWiki marked up text in Ruby.

%SECTION{summary}%

See:

Background and Discussion
Results
Partial Comments from Robert Klemme
Regular Expressions
Program Notes
Attached files
Testing Uppercase URLs
Testing a Web (only) Link
Testing List Items
Contributors
Revision Comment
Page Ratings

Background and Discussion

I'm working on writing a wiki-like thing (with a bigger scope than most wikis). I've decided to learn Ruby (after giving up on C, C++, and Perl--I almost decided to use Python, but decided to go on and try Ruby, and that's where I am now), so I'm trying to write at least some early versions in Ruby. Although there are other places I could start, probably with greater effect (as I could always use TWiki as the "back end" to my "bigger scope" ideas), I got started on the idea of parsing (and then translating to HTML) the TWiki markup.

Because TWiki seems slow (for a lot of reasons other than the parsing / HTML translation code including things like my dial up connection, and TWiki dynamic content generation (like the inline search thingie)), I'd like to make this code as efficient as reasonably possible. I may even rewrite it someday in something like C.

My impression is that many wikis do the parsing by multiple RE (Regular Expression) passes through the entire document. I decided to try to make a single pass through the document instead. I've vacillated on the approach. I've investigated a character by character approach enough to believe that I could do it. More recently, I've been investigating a hybrid approach in which I go through the document character by character to find places where a (more complex) RE might match, then invoke the RE at that particular place.

In some early tests, the approach seemed to show some promise. For RE's that can only match at the start of a line, the approach seemed to show that I could achieve time savings on the order of 1 to 10%. Contrast code example re_test1.rb with re_test2.rb and re_test3.rb with re_test4.rb (all code examples are attached), and the results below. (The odd numbered tests use a pure RE approach, the even numbered tests use the test a character, then apply an RE approach.) Results showed improvements ranging from ~1% to ~10%.

I thought that if I could achieve an overall 50% or higher improvement, the extra complexity might be worhtwhile, and might be achievable in that, if I save 10% when dealing with REs that can only match at one point in a line, I might save a lot more on an RE that could match anywhere in a line (because I'd avoid that many more invocations of the RE).

So, far, I'm very wrong! When I tried a similar approach for one RE that could match anywhere in a line (as opposed to REs that could only match at the beginning of a line), I found my modified approach caused a drastic dis-improvement--my approach took 30 times longer than a "straight" RE approach.

This might be because my code to do the checks may not be the most efficient way to do the job in Ruby. The code tested is in files re_test5.rb (pure RE approach) and re_test6.rb (test single character and apply RE approach).

So for the time being, I'll set this approach aside.

I'm sure someone would tell me that these results are "reasonable" (i.e., within expectations for an interpreted language like Ruby)--the RE scanning code is (I'm sure) highly optimized and written in C or assembler, while each invocation of a line of Ruby code has a fairly high overhead.

Nevertheless, if I decide to do the entire parse and translate job in C, I would look at the character by character approach again--it should be much more efficient in C.

In the meantime, somebody told me about StringScanner, and I'm going to try to learn enough about it to see if that is a better alternate.

Results

bash-2.05b$ re_test_s.rb
# IIRC, this was the result of running re_test1 and 2 once (but note that those programs run the "inner loop" 10000 times).
Test 1: 0.1848882
Test 2: 0.1831371
Test 2 / Test 1: 0.990528870960937
bash-2.05b$ re_test_s.rb
# IIRC, the results of 100 runs of re_test1 and 2
Test 1: 0.194170098
Test 2: 0.177951283
Test 2 / Test 1: 0.916471098448947
bash-2.05b$ re_test_12s.rb
10000 trials:
Test 1: 0.3369518805
Test 2: 0.3074213903
Test 2 / Test 1: 0.912359918703586
bash-2.05b$


bash-2.05b$ re_test_34s.rb
1 trials:
Test 3: 1.712266
Test 4: 1.572851
Test 4 / Test 3: 0.918578655419193
Total time: 4.136724
bash-2.05b$ re_test_34s.rb
10 trials:
Test 3: 1.5968127
Test 4: 1.4871622
Test 4 / Test 3: 0.931331645846755
Total time: 36.008019
bash-2.05b$ re_test_34s.rb
1000 trials:
Test 3: 1.297943237
Test 4: 1.198288135
Test 4 / Test 3: 0.923220754837985
Total time: 2972.829203
bash-2.05b$


bash-2.05b$ re_test5.rb
0.550475
[[Web.WikiHome][a link]]
bash-2.05b$ re_test6.rb
15.806495
[a link]]
bash-2.05b$

Partial Comments from Robert Klemme

And some new programs and results—from an email on the ruby-talk list, ca. 3/22/05

Thanks for all of the following! I did substitute them in test 6 to see what they would do.

# old (6): ~18 seconds

# with range (6a): ~16 seconds (fixed now, with help from Robert, see below)

# with upto (6b): ~14 seconds

# with times (6c): ~12 seconds

#6d: ~0.75 seconds (This is the test that convinced me the iterations are the problem, I revised the (with times) program to only call the RE once, although it still scans only from the start of the string--I guess I should try test 6e with the RE not anchored.)

#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm puzzled, how is this faster than the anchored version?? Anyway, at this time I don't care, I'll just "file it away" as a little anomaly to perhaps understand some day (and, as I haven't run the test multiple times or similar in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the TWiki--if you are really interested I can do that, but, as I say below, I'm not going to lose sleep over the problem with range.
For some reason that I haven't figured out (yet?), the "with range" option didn't work. I'm not going to lose sleep over it--I did try some troubleshooting, but it may be a rather subtle bug (or I have a very dense head). Update: turned out to be a "garbage" whitespace character--now fixed.
When I run it as part of a program (re_test_6a.rb), I get the following error messages:
bash-2.05b$ re_test6a.rb ./re_test6a.rb:40: Invalid char `\240' in expression ./re_test6a.rb:41: Invalid char `\240' in expression ... ./re_test6a.rb:66: Invalid char `\240' in expression bash-2.05b$
Anyway, since I went this far, I will upload programs 6a thru 6e to the TWiki
> # old
> i = 0
> until i==s1.length-6 do
>   if s1[i] == 91
>     s1[i,s1.length] =~
> /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
>   i += 1
> end
>
> # with range
> (0...(s1.length-6)).each do |i|
>   if s1[i] == ?[
>     s1[i,s1.length] =~
> /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
> end
>
> # with upto
> 0.upto(s1.length-7) do |i|
>   if s1[i] == ?[
>     s1[i,s1.length] =~
> /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
> end
>
> # with times
> (s1.length-6).times do |i|
>   if s1[i] == ?[
>     s1[i,s1.length] =~
> /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
> end

Regular Expressions

I put these results here at the request of Robert Klemme, who asked also that I post the REs. I pointed out that I haven't yet created all the necessary REs (and further, the REs I have created so far are not necessarily correct). So far I have not resorted to going through the TWiki code and looking for their REs (I guess I thought I might learn something by creating my own, and avoid the pain of trying to figure out what their REs (and Perl code) are doing.

Here are the Regular Expressions I've worked out (or worked on) so far:

/\A---\*{1,6}\x20/ # for headings (heading level by match.length - 4)

/\A\x20{3}\*\x20/ # for unnumbered list items at level 1 (each 3 spaces represents a level, I need to generalize for all possible levels, and * is for an unnumbered list, 0--9 for a numbered list, i for l.c. roman, etc., for which I need to generalize also)

/\[\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/ # to look for one case of forced links (similar to [[Web.WikiHome#Anchor][a link]] (with a lot of options--most common (??) is wiki page only (WikiHome), but can have Anchor alone (for current page), or page plus anchor. This has not been tested thoroughly.

/(([a-z]*):)?([^]]*)](\[(.*)])?]/ # Similar to above--but for the case when the link is an external link (like http://, https, ftp, gopher, mailto, etc.

<later> # Very similar to above, but the unlabeled case, when the link is not forced (i.e., no square brackets)

/[A-Z]+[a-z]+[A-Z]\w*/ # For a TWiki wikiword

/\*\B|\B\*/ # some brainstorming for "inline" markup, which allows, for example, either *this,* or *this*, and does the right thing

Program Notes

At the time I wrote these, I didn't know about the existence of the Benchmark module--next time I'll use that.

Note that I tried to change strings between iterations and, in the "summarizing programs" (re_test_12s.rb and re_test_34s.rb), I alternated running of the pure RE and my approach for various reasons including hoping that:

I wouldn't trigger any optimizations that might make the results uncharacteristic
that if garbage collection was invoked, there was at least some chance it would be invoked with similar frequency for both approaches and be less likely to distort the results.

I don't know if the Benchmark module does anything clever to avoid such potential problems.

Attached files

re_test1.rb: Pure RE approach checking 3 strings looking for a heading (10000 times)
re_test2.rb: Similar to above, except it invokes the RE only if the first character is a "-"
re_test_12s.rb: Program to run multiple "runs" of previous two programs and summarize results
re_test3.rb: Pure RE approach checking 4 strings looking for a heading or a level 1 unnumbered list item (10000 times)
re_test4.rb: Similar to above, but invokes the appropriate RE only if the first charater is a "-" or \x20 (space)
re_test_34s.rb: Program to run multiple "runs" of previous two programs and summarize results
re_test5.rb: Pure RE approach, checking 5 strings for a forced (TWiki) link ([[][]])
re_test6.rb: Similar to above, except it only invokes the RE (on a substring) if a "[" is found in the string (and there are enough characters left to achieve a match)
re_test7.rb: Similar to re_test6.rb, but (1) not finished, and (2) intending to use a destructive slice instead of a substring approach. At the moment, I can't imagine this is faster than re_test6.rb, and that is so much slower than the pure RE approach, I'm not going to pursue this.

Testing Uppercase URLs

These don't work!! That's good, then I can use capital letters to distinguish the WikiPage link from the URL style links shortly after the [[ (or whatever). smile

HTTP://TWIKI.ORG

HTTP://TWIKI.ORG/CGI-BIN/VIEW/

HTTP://TWIKI.ORG/CGI-BIN/VIEW/Wikilearn/WebChanges

HTTP://TWIKI.ORG/CGI-BIN/VIEW/WIKILEARN/WEBCHANGES

Testing a Web (only) Link

Codev

#Results

Codev

#Results

Testing List Items

How low can you go?
- How low can you go?
  - How low can you go?
    - How low can you go?
      - How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?
        
        How low can you go?

How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?
How high can you go?

Contributors

() RandyKramer - 20 Mar 2005
If you edit this page: add your name here; move this to the next line; and if you've used a comment marker (your initials in parenthesis), include it before your WikiName.

Revision Comment

%SECTION{last_revision}%

%DATE% —

Page Ratings

WebForm
PageStatus	Scribbles

Topic revision: r6 - 2005-03-27 - RandyKramer

Edit
Attach

Copyright � 1999-2026 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding WikiLearn? WebBottomBar">Send feedback
See TWiki's New Look