
Using %INCLUDE{"url"}% is Super Slow

The TWiki::Net package uses sockets to implement reading a url. This is super slow, as in about 30 seconds for each get, and therefore, INCLUDE is unusable. Such an access should only take a fraction of a second.

For example, consider the following usage:

%INCLUDE{"http://www.cia.gov/cia/publications/factbook/geos/in.html" pattern="^.*?<div align=\"right\">Background:<\/div>.*?<br>([^<]*)</td>.*"}%

I added some time checks in the TWiki.pm file, in the routine that handles url includes. Here is the result.

Start Time = 1068164943
URL Received Time = 1068164973; Delta = 30
Cleanup File Time = 1068164973; Delta = 0
Apply Pattern Time = 1068164973; Delta = 0

This shows that it is taking 30 seconds to do the URL access, and all other processing is insignificant. Try accessing that page here:

http://www.cia.gov/cia/publications/factbook/geos/in.html

What is the excuse for this ultra-poor performance? I see the use of sockets in Net.pm. Why? Why not use LWP? It is fast.

I hacked the Net.pm code, adding in (roughly) the routine below, and the time for getting a url dropped to under 1 second (the delta showed 0). Both TWiki.pm and Net.pm need to be modified, because much of the work in preparing the URL is done in TWiki.pm, which is a partitioning mistake. Whatever the course of events, ONLY THE URL SHOULD BE PASSED TO Net.pm; otherwise you are making assumptions as to what that package needs to do. Net.pm has no other information, so why break up the URL first and then pass the port, domain, etc.?

Consider the following subroutine that I have used in the past for this:

#-------------------------------------------------------------------
sub readurl { my ($url, $rbuf, $username, $password, $mode) = @_;
# access $url on web and return result in $$rbuf.
# return value '' on success, else error message.
# $mode may be "GET" or "POST".
# use url starting with https to use secure mode.
#--#

    $mode ||= 'GET';

    print "**** Attempting to read url: '$url'\n" if ($^W);
    
    my $progname = $0;
    $progname =~ s|.*/||;  # only basename left
    $progname =~ s/\.\w*$//; #strip extension if any
    
    use LWP::UserAgent;
    use HTTP::Request;
    use LWP::Protocol::https;    # https support; requires Crypt::SSLeay installed
    my $ua = new LWP::UserAgent;
    $ua->agent($progname.'/0.1' . $ua->agent);

    my $req = new HTTP::Request ($mode => $url);
    $req->content_type('application/x-www-form-urlencoded');
    # send request
    if ($username) {
        $req->authorization_basic($username, $password);
        }
    my $res = $ua->request($req);
    # check the outcome
    unless ($res->is_success) {
        print "#### FAILURE: '".$res->code . " " . $res->message."'\n" if ($^W);
        $$rbuf = '';
        return $res->code . " " . $res->message;
        }
    else {
        print "****     Success: ".length ($res->content)." bytes read'\n" if ($^W);
        $$rbuf = $res->content;
        }
    return '';
}
#-------------------------------------------------------------------

What do you think of this sort of approach? The implementation above is very fast, and should work anywhere that LWP is installed.

I could find no other discussion of this topic, but if you know of one, please list it here.

-- RaymondLutz - 07 Nov 2003

Have you seen CpanPerlModulesRequirement?

-- MartinCleaver - 07 Nov 2003

I added my opinion to that page. The bottom line is that this feature doesn't work in the current implementation, regardless of the disposition of the other question. Either the Net.pm module must be rewritten to perform adequately (it is at least 30x slower than it should be and can be), or we use LWP, or some other alternative. Quoting from some other argument that does not fix the problem is not productive.

-- RaymondLutz - 07 Nov 2003

TWiki::Net is fast in retrieving web pages, in a few 100ths of a second if you have a fast internet connection to a fast site. 30 seconds for a page sounds like some issue related to your environment. What TWiki version are you using?

The geturl utility uses the same socket based algorithm as TWiki::Net. Here are actual timings I did:

% time wget http://TWiki.org/download.html >tmp
--12:31:02--  http://twiki.org/download.html
           => `download.html.7'
Resolving twiki.org... done.
Connecting to twiki.org[66.35.250.210]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16,565 [text/html]

100%[================================================================>] 16,565       158.60K/s    ETA 00:00

12:31:02 (158.60 KB/s) - `download.html.7' saved [16565/16565]

real    0m0.223s
user    0m0.010s
sys     0m0.000s

% time ./geturl TWiki.org /download.html > tmp

real    0m0.242s
user    0m0.020s
sys     0m0.000s

That is, the optimized wget utility takes around 0.230 sec, TWiki's geturl around 0.260 sec.

Notice that INCLUDE and geturl do not work at SourceForge due to their setup.

-- PeterThoeny - 08 Nov 2003

Peter, I think you are missing the point. My test above clearly implicates the code in TWiki::Net::getUrl() as the problem. Here's why:

  • I substituted only that code using CPAN LWP and obtained results that are at least 30x faster. Nothing else was changed.
  • Your questions about what kind of connection I am using and what kind of site it is are irrelevant, as my test makes these questions superfluous. Just so you know, the site is on a server which has a direct optical backbone connection. But, so what? Even if I were on a dial-up line, showing that substituting the code gives at least a 30x performance improvement would still eliminate the server question from the equation.
  • The justification behind NOT using CPAN is that code in the release would be roughly equivalent to these modules. And, I agree with that approach in most cases, as it allows somewhat better control of features and is not reliant upon the code in CPAN for quality. Frankly, I am a bit disappointed with about 50% of the CPAN code in terms of its quality and flexibility, and when we can avoid using it, I find we have better control over the code. However, getting back to the point, if LWP will perform in my environment, whatever that may be, the TWiki::Net.pm code should also perform.

I am using the Beijing release, which is the most recent (non-alpha) release.

In a private email message, I learned the following. Let me know what you think of these assertions:

The problem you're seeing is a symptom of TWiki stating it can support HTTP/1.1 responses, when it can in fact only support HTTP/1.0 responses (just).

If you're using a version that correctly states to servers & proxies that the client can only handle HTTP/1.0 responses (ie sends a METHOD URL HTTP/1.0 request line), then the response is pretty much instantaneous. If you're using a TWiki production release (or indeed, IIRC even the last TWiki.org beta) then you have problems. These will range from:

  • Spurious numbers appearing - this is due to HTTP/1.1 requirements on clients to support Chunked Encoding

  • Long download times - typically 30 seconds to finish the download. The reason for this is that the keepalive mechanism in HTTP/1.1 differs from HTTP/1.0: if a client declares HTTP/1.1, the server will hold the connection open for around 30 seconds unless the client specifies that the connection should be closed, to allow request pipelining. However an HTTP/1.0 client tends to treat connection closure as the end-of-file marker - which is why you're seeing hangs.

There are other symptoms you can see, but the above two are the most damaging.

I imagine these problems may have already been fixed in some patches to the Beijing release. Please point me to those patches, as I was unsuccessful in searching for them.

The other improvement I made to the INCLUDE functionality as I was hacking around with this is the use of CACHING, so that multiple accesses to the same page (for different parts of that page) do not require another redundant http read of the same page. This becomes a concern even with "fast" operation if multiple regex patterns are applied. I would note that, again, the partitioning of the code is not optimal. The INCLUDE{} functionality should be moved to a separate module and not bundled into the core. The call to TWiki::Net::getUrl() should pass ONLY the url, not a preparsed url. The parsing should be done within the Net module, as it is the only routine that ever uses it.
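To make the caching idea concrete, here is a minimal sketch of what I have in mind (fetchUrlCached() and %urlCache are illustrative names only, not existing TWiki code, and it assumes LWP::Simple is available for the actual fetch):

use LWP::Simple qw( get );

my %urlCache;     # url => page content; lives only for the current request

# Fetch a page over HTTP only the first time it is asked for in this request.
sub fetchUrlCached {
    my ( $url ) = @_;
    $urlCache{$url} = get( $url ) unless exists $urlCache{$url};
    return $urlCache{$url};
}

# Two INCLUDEs of the same page with different patterns cause only one GET:
my $text1 = fetchUrlCached( "http://www.cia.gov/cia/publications/factbook/geos/in.html" );
my $text2 = fetchUrlCached( "http://www.cia.gov/cia/publications/factbook/geos/in.html" );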

As you mentioned, the INCLUDE function doesn't even work on the TWiki.org server. See the sample usage of this function on RaymondLutzSandbox. Even though INCLUDE does not work, this page loads SLOWER than mine does with full INCLUDEing. I'm wondering what's wrong with the SourceForge setup such that this "Standard" feature is not operational.

-- RaymondLutz - 10 Nov 2003

I share your sentiments, Raymond.

You might like also to look through CodevDocumentationProject, but I seem to be the only one updating these topics.

-- MartinCleaver - 10 Nov 2003

Could you try the Net.pm from the latest TWikiAlphaRelease, CVS:/lib/TWiki/Net.pm? This version reverted back to HTTP/1.0.

Arguments against using lots of CPAN modules have been discussed in CpanPerlModulesRequirement. It boils down to startup performance (count the # of lines to compile and interpret in LWP vs. the 40 lines in TWiki::Net::getUrl); quality; extra admin skills needed to install CPAN modules (vs. unzip the TWiki distro); secure sites with TWiki not connected to the Internet (just Intranet).

-- PeterThoeny - 11 Nov 2003

I tried to install just Net.pm, but I had already modified TWiki.pm as well to pass the entire URL to Net.pm instead of a bunch of parameters. I tried installing just TWiki.pm as well (wow, it is a lot bigger), but it looks like I will have to install everything in the alpha release and that will take more time. In your statement above, can you please explain what "this" refers to, i.e. does the alpha version revert to HTTP/1.0 or does the Beijing release revert?

-- RaymondLutz - 11 Nov 2003

Beijing introduced HTTP/1.1, the latest Alpha reverted back to HTTP/1.0.

If you have an unmodified TWiki.pm from Beijing you can simply drop in the Net.pm from the latest Alpha.

Unfortunately, I had to modify the TWiki.pm file to pass the entire url to my readurl() function as it was being preparsed in TWiki.pm to provide the multiple parameters used by the TWiki::Net::getUrl() function. I replaced both and it still didn't work, and I'm not sure why, but I can't investigate further now. -- RaymondLutz - 11 Nov 2003

-- PeterThoeny - 11 Nov 2003

That seems a bad decision to me. If you drop 1.1 in favour of 1.0 we can no longer contact virtual hosts. Why not just use LWP?

  • This decision is per spec. TWiki only supports HTTP/1.0 responses, and if that is the case you have to state HTTP/1.0 in the request line as per the RFCs; Beijing was sending HTTP/1.1 requests. This is why owiki doesn't suffer from this problem, and neither do recent alphas. You can see Raymond's test example above in here. Raymond performed further tests here. (The latter is performing several remote includes and is slow since several remote includes do take time. Raymond has since patched his code to cache the results.) -- ...

-- MartinCleaver - 11 Nov 2003

Martin, in regards to "bad decision", please read ProxiedIncludesBrokenImplementationBug. I alerted the community about the potential virtual host issue, you said that it works OK.

-- PeterThoeny - 11 Nov 2003

With regards to LWP and the CpanPerlModulesRequirement, LWP is part of the libwww-perl bundle. It is the "gold standard" and recommended way to contact remote hosts. To not use it is tantamount to suggesting that a person should write their own database access routines instead of using DBI.

If LWP is not already installed on the host someone is using, I would consider that installation broken.

With LWP, it would be trivial to add additional functionality to TWiki while completely encapsulating the added complexity. For instance: 1) the ability to respect the robots.txt exclusion file of a remote host; 2) basic and digest authentication to remote hosts; 3) including pages from remote hosts using protocols other than HTTP, including HTTP-SSL, FTP, and NNTP; 4) including responses from a submitted FORM query to a remote host.
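To illustrate, here is roughly what those would look like with LWP (the host names, realm, and form fields are made up for illustration; only the LWP calls themselves are real):

use LWP::UserAgent;
use LWP::RobotUA;
use HTTP::Request;
use HTTP::Request::Common qw( POST );

# 1) Respect robots.txt: LWP::RobotUA consults the exclusion file for us.
my $rua = LWP::RobotUA->new( 'twiki-include/1.0', 'webmaster@example.com' );
my $res = $rua->request( HTTP::Request->new( GET => 'http://www.example.com/page.html' ) );

# 2) Basic or digest authentication: credentials() covers both schemes.
my $ua = LWP::UserAgent->new;
$ua->credentials( 'www.example.com:80', 'Some Realm', 'userid', 'password' );

# 3) Other protocols: the same request() call handles https, ftp, nntp, ...
$res = $ua->request( HTTP::Request->new( GET => 'ftp://ftp.example.com/pub/README' ) );

# 4) Submitting a FORM query: POST builds an x-www-form-urlencoded request.
$res = $ua->request( POST 'http://www.example.com/cgi-bin/search', [ q => 'twiki' ] );
print $res->content if $res->is_success;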

-- TomKagan - 11 Nov 2003

Granted, I did say it worked. I still think it is really bad practice to continually reinvent the wheel. We've been doing this elsewhere as well, for instance you built not only a renderer to render POD as TML, but built a parser for POD as well. This is more and more code that we (actually you) have to maintain, slowing everything down and distracting you from working on issues that would add value such as coordinating and inspiring the community.

-- MartinCleaver - 11 Nov 2003

I think you can divide the CpanPerlModulesRequirement into two categories:

  1. Code that is driven strictly by industry standards. These are usually communication protocols.
  2. Code that is arbitrary, and does not involve anyone else.

I don't think it can be claimed that not using CPAN enhances quality in the case of (1), as the standards-driven modules are usually coded according to the customary standards of the day. You really don't want to deviate from what is done in these standard modules. Enhancements that are not in the standard will usually break something else.

As for communication protocols, they are not much use to strictly intranet or toy-twiki installations anyway. I don't know of ANY commercial hosting service that does not provide these CPAN modules, fully installed from the get-go. Arguing FOR not using CPAN for these is bankrupt. It is simply asking for more work when a route to high-quality code is available. Don't reinvent the wheel here; just use what is available.

I propose that we take the following steps:

  1. Abandon the TWiki::Net::getUrl code and use the CPAN LWP instead.
  2. Move all of the parsing of the URL currently in TWiki.pm into TWiki::Net.pm.
  3. Implement page caching so that multiple accesses of the same page for different regex pattern data will not require another http request/response.

When I first encountered this problem, I thought that the code in TWiki::Net was there to support some unusual case, but it seems that is not the case. Let's use LWP.

-- RaymondLutz - 11 Nov 2003

If I may jump in here ever so briefly, the 'get' is actually blazing fast at reading data, the 'bug' in Net.pm is that it spins on a successful socket read but doesn't provide any connection control headers, so for sites that keep connections open, we have to wait for the connection to timeout and close before the loop terminates. Simply adding the "Connection: close" header to the request resolves this issue, correctly I might (humbly) add.
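For the record, the corrected request amounts to something like this (a sketch using IO::Socket::INET rather than the raw Socket calls Net.pm actually uses):

use IO::Socket::INET;

# State HTTP/1.0 and ask the server to close the connection when done, so the
# read loop sees end-of-file immediately instead of waiting out the keepalive.
my ( $host, $path ) = ( 'www.cia.gov', '/cia/publications/factbook/geos/in.html' );
my $sock = IO::Socket::INET->new( PeerAddr => $host, PeerPort => 80, Proto => 'tcp' )
    or die "connect to $host failed: $!";

print $sock "GET $path HTTP/1.0\r\n"
          . "Host: $host\r\n"
          . "Connection: close\r\n\r\n";

my $result = '';
$result .= $_ while <$sock>;     # returns promptly once the server closes
close $sock;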

A quick search of Codev finds this topic: IncludeHTMLTakesLongTime, and after having made those changes I find that including from an external site that used to be slow is now extremely quick (see http://thepettersons.org/twiki/bin/view/Main/DailyNews for a silly example - it loads, I don't know, 50 RSS feeds - and does so in less than 15 seconds if not cached).

-- PaulPetterson - 11 Nov 2003

Thanks Paul, granted, it is fast, but the speed of INCLUDE is not the issue. We are concerned about how the policies affect the development effort.

Look, we've created a really round wheel! So round, WOW. Should we be proud of ourselves? NO. How long would it have taken to find the piece of code in LWP, cut it out and call it our own? Not to mention, as Raymond did, the other things we lose by not using LWP. This is such a waste of time. The code is not particularly wrong, but this policy really is. I'd be nervous to stand up at a conference (such as TWikiPresentation21Jan2004) knowing that anyone could embarrass me with questions about such policies.

Frankly, if the core team won't listen, I'll continue to direct all our energy and innovation to http://owiki.org; and should the same thing ever happen there, we'd fork again. Such is the GPL.

    Might I respectfully request that O'Wiki not be used to bash the TWiki core team. I don't do "my car is better than your car", I don't think the two communities need be mutually exclusive, and nor would I wish the TWiki core team (or anyone else) to feel unwelcome there. I may personally have differences of opinion, but that's all they are. If TWiki.org wishes to "cherry pick" things from O'Wiki, they are welcome to do so - patches can be made trivially. Such is the GPL (subject to actually sticking to the GPL and acknowledging and tracking copyrights). At the end of the day both groups want the same thing, and disagree on methods and contribution policies. I run (at least a branch of) our CVS as a wiki; they run theirs as a single-webmaster system (or a small number of webmasters). Both policies are valid, just not to each other's tastes - such as it is with this CPAN issue.

    Incidentally, Raymond, I would strongly assert that the patch I provided resolves the problem (ProxiedIncludesImplementationBug includes the patch), and further strongly assert that Paul's suggestion won't hurt. (Given I've been working with debugging HTTP problems and implementations in some very hairy environments for the past 5 years, does that meet your "reputation" requirement? :-) ) Indeed it will provide further "proof" to an HTTP/1.1 server that TWiki only supports HTTP/1.0, and will work with broken HTTP/1.1 servers that send everyone HTTP/1.1 pipelinable responses unless told to sever the connection. -- MichaelSparks - 13 Nov 2003

-- MartinCleaver - 12 Nov 2003

Responding to PaulPetterson: I would be happy to see us improve the getUrl() function with the changes you propose, but I am worried that this will not be the end of reimplementing this protocol, which is certainly not the main purpose of TWiki. The politics are what they are, and there is a belief that someone would honestly be burdened by installing LWP if they want to do INCLUDEs of arbitrary URLs (I would like to hear what cases these really are -- when does a stand-alone installation with no CPAN access really need to process remote urls?). You (PaulPetterson) humbly state that your fix resolves the issue, but I would rather have someone strongly assert that it solves the issue and put their reputation on the line. I think that is easier to do when you adopt the gold standard for http requests, which is the gold standard precisely because it is so widely used and everyone agrees that is the way it will work.

As a newcomer to TWiki, when I read the Net.pm code I assumed that LWP was not being used because there were some specific requirements that needed to be addressed over and above those provided by LWP. If not, the only argument for not using LWP is that it is easier to install. From my experience with hosting sites, LWP is nearly always pre-installed because it is so widely used. Many applications use it, and in my experience I have not seen private coding of this protocol unless there was something needed that was different from LWP.

Look, I'm just trying to get TWiki sites to work with reasonable functionality. The Beijing release doesn't work in this regard although the bug was posted and "fixed" in the past. The claim is made that the fix proposed by PaulPetterson will fix this for all time such that it will always be as good as LWP. Here are some claims:

  • TWiki is a damn good application or we wouldn't be so excited about it. Good job getting this off the ground.
  • Development of TWiki is far from over, and the development team is overloaded with good ideas and apparent bugs. The possibilities are large and this is one of the attractions to TWiki.
  • One of the goals of TWiki is to make it super easy to install so that it can work with a minimum of functionality on stand-alone boxes.
  • As large and numerous TWiki sites begin to appear, pressure will be felt to make TWiki more scalable, in competition with the minimalist view.
  • More of its functionality will be moved to other pieces of code, such as MySQL, etc. to handle the scaling problem and this will further integrate it into the existing environment we find in most web servers today.
  • LWP and other code will become even more standard in boxes where TWiki is found. I claim that it is already available, is used as a standard, and that the claim that we should support homegrown code is unfounded.
  • If you decide not to use a standard piece of code on CPAN, you should have some reason that is really good as you adopt maintenance expenses by deciding that homegrown code is the best course.

INCLUDE is a very important function that I am happy has been included in TWiki. If it wasn't available, I was planning to write it. It needs some improvement in other dimensions too but first we need to get by these first issues. If the current code works adequately with the changes suggested by PaulPetterson, then perhaps that is the end of that for now. I don't mind repairing the current code and moving on to other improvements. I am not in the camp (MartinCleaver, et al) that thinks that a new set of code is necessary (at least not yet). If we are talking moving to database-driven TWiki, then that is a significant departure that is likely no longer interoperable with existing TWiki without major data massaging, and a totally separate development branch will likely be necessary. I have worked on both types of systems and have migrated file-based to MySQL-based. They work better using the database, and this app probably will too, but there is a paradigm shift that must occur before such an endeavor is possible, and sometimes new blood is necessary.

Yet, I am still worried since getUrl() is not testable at SourceForge, it may continue to be a stumbling block. That is one more reason that using a standard package seems prudent.

(I will be testing the Alpha release shortly.)

-- RaymondLutz - 12 Nov 2003

Two issues, methinks. First is that Include URL is slow - this is a bug that needs to be fixed in current and perhaps specific previous versions (the bug causes TWiki to 'spin', consuming CPU). Second is replacing getUrl and perhaps other functions with standard Perl bits. I do not mean to imply one way or the other any advocacy on the quality of getUrl vs. LWP.

I take no stand on the LWP issue (I did for about 5 seconds but quickly edited it out) because my attachment to TWiki is because of its capabilities, not its implementation. The concerns I would have over switching to standard modules are mostly resolved by the information RaymondLutz presents. My only remaining concern is with retaining compatibility with existing plugins, etc.

I agree with RaymondLutz's comment that 'If you decide not to use a standard piece of code on CPAN, you should have some reason', with the caveats that (a) you don't fix what ain't broken unless fixing it is the goal in general, and (b) steps are taken to ensure backwards compatibility, or that your plans take into consideration migrating existing bits.

I think (a) is the issue being discussed, and again I'll say I take no stand. I think (b) hasn't been fully addressed.

-- PaulPetterson - 12 Nov 2003

Paul, concerning your comment "my attachment to TWiki is because of its capabilities, not its implementation": I wholeheartedly agree. However, as I pointed out in my original comment, TWiki's capabilities could be much greater (INCLUDE via ftp, nntp, http-ssl, digest authentication, etc.) with practically no effort on the development team's part if they were using LWP instead of what is currently hand-coded. Plus, there never would have been the bug of the slow INCLUDE which required fixing in the first place, because LWP solved this problem quite a while ago. :-D One last thing: by using LWP, the INCLUDEs in your example page could be coded to occur in parallel. Fixing TWiki's code dropped the time from 50 seconds to 15 - wouldn't you like to drop it down to 3-4 seconds without caching? ;-) How about handling cookies from the hosts included? Or a compressed response stream? Even caching becomes trivial (getstore). Can you imagine how long it would take to implement some of these LWP built-in capabilities in TWiki without using LWP?

As far as I am concerned, an INCLUDE{} should do just that - include a web page. As a user, it would be great if TWiki would just do what I mean. But, as it stands now, only one case of web pages (two if you count only simple proxies) is handled by TWiki. It shouldn't particularly matter to an end-user; why can't they just type something like:

%INCLUDE{POST <nop>"https://UserId:Password@www.SiteNeedingCookies.com:943/SomeFormPage.html?parm1=blah"}

All they know is that it doesn't work properly, even though, in their mind, it should. Only the administrator is concerned during installation (read: once and only once) whether the required perl modules are in place.

The concern as to whether a particular host has a CPAN module already installed (a moot point with LWP) can be trivially solved by including it with the TWiki distribution. (BOOM! instant "CORE" developers on the team).

-- TomKagan - 14 Nov 2003


I'm inclined to side with the 'use LWP' argument, however I'd like to see PeterThoeny's other points addressed too. To recap:

  • It boils down to startup perfomance (cound # lines to compile and interpret in LWP vs. the 40 lines in TWiki::Net::getUrl) [sic]
    • has not been addressed
  • quality
    • has been addressed (though perhaps not to everyone's satisfaction). CPAN, as a whole, is an unknown entity. LWP, on the other hand, is of exceptional quality.
  • extra admin skills needed to install CPAN modules (vs. unzip TWiki distro)
    • partially addressed: "just include required CPAN modules in twiki". Okay, that's fine, but how? Is it really that simple? What about the (common) situation where the host will already have them installed? Which version is used? No code has been provided as an example (a rough sketch of one possibility follows below).
  • secure sites with TWiki not connected to the Internet (just Intranet).
    • not addressed. (perhaps is covered by including required modules?)
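My guess at what "just include them" would mean, as a rough sketch (the CPAN subdirectory and the BEGIN block are made up, and this only helps for pure-Perl modules):

# Hypothetical: ship pure-Perl CPAN modules under the TWiki lib tree and add
# that directory to the end of @INC, so a copy installed on the host wins.
BEGIN {
    push @INC, '/home/httpd/twiki/lib/CPAN';
}

use LWP::UserAgent;    # found in either location

Whether that actually answers the version question is exactly what I am asking.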

-- MattWilkie - 14 Nov 2003

Good questions, MattWilkie. I don't understand the fuss about the last two questions. Maybe someone else has some other information.

  • Extra admin skills:
    LWP is far EASIER to install than TWiki is, and if you can install TWiki, you will already be able to install anything from CPAN. Here is what it takes:


   # cpan
   cpan shell -- CPAN exploration and modules installation (v1.76)
   ReadLine support enabled

   cpan> install LWP

That's it! (you have to answer yes three times as it installs and tests itself). I only wish TWiki was as easy. If you don't have root access, I don't know of a hosting service that provides Perl that does not provide CPAN modules. They are a must for many perl applications, especially LWP. Documenting this install is the way to handle it, and is the least of our worries when it comes to making TWiki easy to install.

  • Use on Intranet: I don't know why someone would need to use INCLUDE to access an arbitrary web page if you are located on an intranet with no means to access CPAN. If you don't have a means to install LWP, you certainly can't use INCLUDE in that manner either. So it doesn't even make sense to ask the question. It is a hypothetical concern that will never happen.

  • On the first issue, Startup Performance: The best way to confirm these suspicions is to test it to see if indeed this is valid or is just paranoia. PeterThoeny may have a very good point. For the sake of discussion, assume that he is right, and LWP is somewhat slower in startup. (I doubt it is significant, as I have tried it and can see no appreciable difference.) Then, you would have to balance the drawback of several microseconds or even milliseconds compared with the robustness and functionality of LWP. In my mind anyway, the balance is pegged.

-- RaymondLutz - 14 Nov 2003

I installed the TWikiAlphaRelease (Version 20031111) and the response is faster. Apparently, the fix strongly asserted to fix this one problem does the trick. Again, I think this is pushing everything in the wrong direction. It needs caching to be fast. See MisplacedEfficiency for some preaching.

-- RaymondLutz - 18 Nov 2003

> I don't know of a hosting service that provides Perl that does not provide CPAN modules

Can this be proven? E.g. does anybody have counter-examples? And how do I test my host (http://freeshell.org/) to find out?

-- MattWilkie - 18 Nov 2003

TWiki already demands that you use some CPAN modules, such as

use CGI::Carp qw( fatalsToBrowser );
use CGI;
use Time::Local;

Therefore, if you have NO access to CPAN, you cannot install TWiki! (TWiki did not have to use CGI, but it is a commonly used package that many people like.) By including these three lines in the TWiki release, CPAN is demanded.

The host freeshell.org undoubtedly has CPAN packages installed, as it advertises various functions that would not be available without them. And why wouldn't a hosting service provide them, since they are free to provide and let the host claim lots of functionality? But, not all of CPAN is of any value. Usually, if a host advertises that it supports Perl, this also means that CPAN modules are available, and many are preinstalled. RedHat distributions come with many pre-installed CPAN packages, for example.

You can check for LWP (or any other package) as follows:

1. Obtain SSH access. They advertise that this is available.
2. Do the following:

bash-2.05a$ perl -de 0
Default die handler restored.

Loading DB routines from perl5db.pl version 1.07
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(-e:1):   0
  DB<1> use LWP;

Since no error messages were displayed, this means that LWP is loaded.

If we try a bogus name, a result like the following is displayed:

DB<1> use uisodfu;
Can't locate uisodfu.pm in @INC (@INC contains: /usr/lib/perl5/5.6.1/i386-linux /usr/lib/perl5/5.6.1 /usr/lib/perl5/site_perl/5.6.1/i386-linux /usr/lib/perl5/site_perl/5.6.1 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.6.1/i386-linux /usr/lib/perl5/vendor_perl/5.6.1 /usr/lib/perl5/vendor_perl .) at (eval 14)[/usr/lib/perl5/5.6.1/perl5db.pl:1521] line 2.
BEGIN failed--compilation aborted at (eval 14)[/usr/lib/perl5/5.6.1/perl5db.pl:1521] line 2.

On my system, I find LWP here:

/usr/lib/perl5/site_perl/5.6.1/LWP.pm
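An even quicker check, without the debugger, is a one-liner that should work on any shell account; it prints the installed version, or dies with a "Can't locate" error if LWP is missing:

   perl -MLWP -e 'print "LWP version $LWP::VERSION\n"'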

I run a hosting service and supply CPAN modules as needed. We have the following CPAN modules installed as STANDARD on all sites we sell.

(see attached list standard-cpan-pkgs.txt)

I think you get the idea. This is not uncommon nor anything special. Presenting a case that CPAN is something difficult, unusual, expensive, or anything else of the sort is absurd.

Indeed, even the current so-called hand-coded TWiki::Net.pm package relies on CPAN packages. Socket and MIME::Base64 are used. Also, the password setting stuff probably uses MD5, another widely used module.

Therefore, there is no basis to say that CPAN is not used. Some packages are. The question is how common is the package you will use. The question is not whether CPAN is used but whether LWP is available. Indeed, it is not impossible to create a hosting service that does not provide this, but I will guarantee that someone will want it within a short time, like an hour.

I rest my case.

-- RaymondLutz - 18 Nov 2003

Okay, thanks for the instructions. Now I know that I do have access to LWP for one of my sites. I think that to move this forward the test needs to be set up as an easy-to-run script (maybe by modifying testenv) so that data gathering can start (and sites without ssh can be included). Unless there is proof that no significant portion of twiki users will be adversely affected, this won't go anywhere.

Of course there is still the possible startup performance question to be explored.

-- MattWilkie - 19 Nov 2003

To determine what perl modules are available on a system, you can try 'perldiver': http://www.scriptsolutions.com/programs/free/perldiver/

-- TomKagan - 19 Nov 2003

I was curious about this claim about startup times and performance...

Wow! I just did an investigative test for startup times, with very unusual result. Here's what I did:

  • I had my LWP hacked version already running on my globalswdev.org twiki. I will call this LWP. It has more plugins installed, etc.
  • I installed the alpha release and used data from an existing Twiki site. This uses the "corrected" TWiki::Net::getUrl() routine.
  • I created a perl script that would call each of these from the command line numerous times, timing the complete operation (a sketch of such a script follows after this list).
  • I created a page that was common to both sites.
  • If multiple INCLUDEs are used, I used different URLs so that caching would not be used.
  • I used the same server which does not use mod_perl.
  • I ran the test in separate SSH shells at the same time so network traffic would be about the same.
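Here is a sketch of the kind of driver script I used (the topic name and loop count are illustrative, not the exact script; one-second time resolution is adequate over 30 runs):

#!/usr/bin/perl -w
# Run the view script repeatedly from the command line and report the
# average wall-clock time per invocation.
use strict;

my $loops = 30;
my $cmd   = './view Sandbox/IncludeTimingTest > /dev/null';   # illustrative topic

my $start = time;
system( $cmd ) for 1 .. $loops;
my $elapsed = time - $start;

print "total $elapsed sec, ", $elapsed / $loops, " sec per invocation\n";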

Here is the time for 3 INCLUDEs and 30 startups (the time shown is totaltime/30 = time for one invocation).

  • elapsed time for LWP = 1.93333333333333
    • Error is displayed: Use of uninitialized value in concatenation (.) or string at ../lib/TWiki/Plugins/TopicVarsPlugin.pm line 131.
  • elapsed time for TWiki::Net::getURL = 4.3 (This must be an aberration).
    • Error is displayed: Argument "www.cia.gov" isn't numeric in subroutine entry at /usr/lib/perl5/5.6.1/i386-linux/Socket.pm line 442.

I changed the topic pages so that only one INCLUDE is used on each one, and used more loops to try to discern startup time. I also removed TopicVarsPlugin.pm so no error messages would be generated. Now there are no error messages from the LWP version, but the alpha release version still shows the same Socket.pm error message as above.

Here is the time for 1 INCLUDE and 100 startups (the time is for one invocation). I did it three times for each one.

  • elapsed time for TWiki::Net::getURL = 1.61, 1.51, 1.62 (Average = 1.58)
  • elapsed time for LWP = 1.71, 1.66, 1.25 (Average = 1.54)

OK, I changed the topic so that it had 10 different INCLUDEs and invoked them 10 times each. I repeated the test three times.

  • elapsed time for TWiki::Net::getURL = 5.8 (.58 secs per INCLUDE), 6.2, 6.5 (Average 6.17)
  • elapsed time for LWP = 5 (0.5 secs per INCLUDE), 5, 5.5 (Average = 5.16)

CONCLUSION

My engineering conclusion is this:

  • There is no significant difference in startup times, and LWP performed at least as well as the getUrl approach. Therefore, the startup time concern is bogus.
  • The inclusion time for LWP (without caching) was significantly faster than the getUrl version (about 20% faster).
  • LWP is superior in quality to the getUrl routine in TWiki::Net, as it produces no warning message.
  • LWP can provide other protocols, etc.

--> I maintain the assertion that LWP should be used instead of the homebrew getUrl() function.

-- RaymondLutz - 19 Nov 2003

Okay so now the next questions are:

  • Can LWP be used in all the same places (hosting environments) that the current twiki is found in? E.g. what will it take to upgrade existing deployments (just drop in a new Net.pm?)?
  • Anybody willing/able to pony up a getUrl()-->LWP patch for testing?
    • does anybody have a hosting environment which produces significantly different results from Raymond's? (does LWP consistently meet/exceed getUrl()?)

-- MattWilkie - 20 Nov 2003

Wow, this topic is attracting lots of attention! OK, here is my take:

The TWikiSystemRequirements states that Perl 5.005_03 is required. LWP is shipped in this version, see http://www.perldoc.com/perl5.005_03/lib.html. That means, it can be assumed that LWP is already installed. With that, extra admin skills needed to install CPAN modules and secure sites with TWiki not connected to the Internet do not apply.

The question is performance. I did a small test and replaced Net::getUrl with some quick and dirty LWP code:

sub getUrl
{
    my ( $theHost, $thePort, $theUrl, $theUser, $thePass, $theHeader ) = @_;

    use LWP;
    my $ua = new LWP::UserAgent;
    $ua->agent( "test/1.0" );
    my $request = new HTTP::Request( "GET", "http://$theHost/$theUrl" );
    my $response = $ua->request($request);
    my $text = $response->content;
    return "content-type: text/html\n\n$text";
}

This code does no error check; it is just here for timing. On an unused server where TWiki is installed I created an EmptyTopic, that is, there is no rendering besides the default skin decoration. Then I created a small static HTML page that just has one header, and included that one in an IncludeHtml topic. I created small pages because they can reveal the overhead of getUrl and LWP. All access was on the same server; this is to exclude other factors. I measured the time 10 times using time wget http://1.2.3.4/cgi-bin/view/Sandbox/EmptyPage and took the average. Here is the result:

Time in seconds:

getUrl variant       EmptyTopic   IncludeHtml
latest Alpha code    0.792        0.805
quick & dirty LWP    0.872        1.041

That is, when using LWP we get a 10% performance drop for pages that do not include a URL and a 30% drop for pages that include a URL. Most of this can be attributed to the compile time of LWP. Now, the code can be tuned, for example with a require LWP instead of a use LWP.
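For example, deferring the compilation until a topic actually needs an external include could look like this (a sketch only, with no error handling):

sub getUrl
{
    my ( $theUrl ) = @_;

    # Compile LWP only when %INCLUDE{"url"}% is actually used, so ordinary
    # topics pay no startup cost for it.
    require LWP::UserAgent;
    require HTTP::Request;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->request( HTTP::Request->new( GET => $theUrl ) );
    return $response->is_success ? $response->content : '';
}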

The bottom line: It would be nice to use LWP because of the additional features. We can replace getUrl with LWP

  • if there is no measurable performance drop for topics that do not include a URL, and
  • if the performance drop is within, say, 10% for topics that include a URL

We could introduce a switch in TWiki.cfg to use either the internal getUrl or LWP; internal would be the default (if the performance target cannot be met).
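As a sketch of that switch (the $useLwpForInclude variable and the two helper subs are made-up names, not existing code):

# In TWiki.cfg (hypothetical setting):
$useLwpForInclude = 0;        # 0 = internal socket getUrl (default), 1 = LWP

# In TWiki::Net, dispatch on the setting:
sub getUrl
{
    my ( $theHost, $thePort, $theUrl, $theUser, $thePass, $theHeader ) = @_;
    return getUrlWithLwp( @_ )    if $TWiki::useLwpForInclude;   # hypothetical helper
    return getUrlWithSocket( @_ );                               # the existing socket code
}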

-- PeterThoeny - 20 Nov 2003

"It can be assumed that LWP is already installed" - that is great news, and matches my intuition about LWP. It is "part" of Perl.

Your figures don't agree with my results, and I am wondering why, although it is not a big deal. I saw NO performance drop when using LWP, only an improvement. If you do lots of INCLUDEs in one topic, that would more clearly show the performance of just LWP vs. the homebrew getUrl(). I would encourage you to try that to see if your result matches my result, i.e. about a 20% performance improvement for LWP. Whether we get a drop for pages that do not use INCLUDE at all SHOULD be resolvable with proper coding, compiling the LWP code only when it is needed, and is therefore not really central to the issue on the table. Also, I don't think it is fair to include a tiny page from your own site. You need to get a larger page so that any data manipulation inefficiencies will be uncovered. (I thought you said you couldn't test this on the SourceForge server.)

I would like to propose the following steps with regard to %INCLUDE%:

  • Repartition the code, moving all of the include functionality to Include.pm. The splitting of the URL into $theHost, $thePort, $theUrl, etc. should be kept inside the Include.pm module, as there is no good reason to include that in the core TWiki.pm. To be safe, I suggest that we keep the functionality identical to the current functionality. This change is strictly an "under the hood" change which continues to use the homebrew getUrl().

  • TWiki::Net::getUrl uses an unusual interface, i.e. we are splitting up the url just to reassemble it so that it is appropriate for the code that exists today. Instead, the getUrl function should do the split-up internally, so that the interface to it will be the same as for a typical implementation utilizing LWP. That way, the change from getUrl to LWP, if desired, can be wholly contained in Net.pm (see the sketch after this list).

(Since I have been such a squeaky wheel, I would be willing to provide a patch for these changes and new Include.pm module. Let me know.)

  • INCLUDE could stand some other feature improvements. Being able to substitute in a new Include.pm module to obtain those features would help a lot. Here are key changes that should be considered:

    • Page caching. I see this as essential to gain additional performance. Operationally, once a page is read from a web site, it is stored in a hash, with the url as the key to the hash. If other regexp parts of this page are included, the hash is checked before the page is reaccessed from the remote site. See the example in ParameterizedIncludes for a typical case where such caching is essential. It may be argued that this feature belongs in Net.pm, and that is perhaps a very good idea.

    • Inclusion Searchable. Modify the Topic to show the actual included text, if so desired, with an expiration time. See SearchInINCLUDE.

    • ParameterizedIncludes: Allow actual parameters to be passed to included topics, that will be expanded there only one time. This results in a sort of RTC (remote topic call) to coin a term, with the included topic acting like a subroutine. See ParameterizedIncludes.
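Here is the sketch of the interface change referred to in the second bullet above: the caller hands over the complete URL, and the host/port/path split happens inside Net.pm (the wrapper name, regex, and defaults are only illustrative), delegating to the existing transport:

# Hypothetical wrapper: accept the whole URL and do the split internally,
# then delegate to the existing getUrl( host, port, path, user, pass ).
sub getUrlByUrl
{
    my ( $theUrl, $theUser, $thePass ) = @_;

    my ( $host, $port, $path ) =
        $theUrl =~ m|^http://([^:/]+)(?::(\d+))?(/.*)?$|
        or return "ERROR: unsupported url '$theUrl'";
    $port ||= 80;
    $path ||= '/';

    return TWiki::Net::getUrl( $host, $port, $path, $theUser, $thePass );
}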

(There is weirdness in the parsing of the TWiki Markup above. I asked for only one blank line between the above bullets and for some reason I get two. Decreasing to no blank lines between source causes no blank lines in output. It is not possible to get only one blank line between the bullets! Sanity check please!) -- sanity: I only get one blank line between each bullet (Moz-Firebird 0.7/Win) -- MattWilkie - 21 Nov 2003
OK. It's a problem. See BulletsScrewyWithGaps. -- RaymondLutz - 21 Nov 2003

-- RaymondLutz - 20 Nov 2003

Topic attachments:
standard-cpan-pkgs.txt (text, 8.6 K, 2003-11-19, MattWilkie)