Thinking about a way to use Squid for offline browsing, I collected and copied these posts (found via a Google search).

Tentative conclusion: We should be able to do what I envision using Squid. (That is, surf online to collect web pages, then switch to an offline mode during class hours so the class can browse the pages collected while online.) There is an offline_mode option in the squid.conf file. Based on my reading, I suspect Squid will also work for offline simulation of FTP downloads.

I originally created this page on my private home TWiki. Ordinarily I would refactor a page like this before moving it to WikiLearn, but I wanted to get things moving on the ChurchServerProject.

See AboutThesePages.

Contents:

Parent of the thread

http://marc.theaimsgroup.com/?l=squid-users&m=98337722331973&w=2 -- not exactly on topic -- not quoted here

mirroring via wget

http://marc.theaimsgroup.com/?l=squid-users&m=98353395917009&w=2
<quote>
List:     squid-users
Subject:  Re: [SQU] Interesting Question
From:     "Robert Collins" <robert.collins@itdomain.com.au>
Date:     2001-03-02 11:45:10
[Download message RAW]

What about mirroring the content?
teacher goes online, adds link to a database

wget runs every 5 minutes via cron, and mirrors any updated pages &
linked non-[asp/cgi/pl/htm/html/php/php3..] pages to your local
webserver.


students get an index page with links to your mirror pages

Rob
</quote>
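
A minimal sketch of the cron-driven wget mirror Rob describes, written as a shell script; the script name, list file, mirror directory, and reject list are assumptions for illustration.

<verbatim>
#!/bin/sh
# mirror-approved.sh -- hypothetical helper: re-mirror every URL the teacher
# has added to a list file. A crontab entry like the following would run it
# every 5 minutes, as in Rob's description:
#   */5 * * * *  /usr/local/bin/mirror-approved.sh
URL_LIST=/var/local/approved-urls.txt   # one URL per line (assumed location)
MIRROR_DIR=/var/www/mirror              # document root of the local web server (assumed)

while read url; do
    # -r: recurse, -l 1: one level of links, -N: only re-fetch pages newer than
    # our copy, -R: skip the dynamic page types Rob lists, -P: mirror destination.
    wget -r -l 1 -N -R asp,cgi,pl,php,php3 -P "$MIRROR_DIR" "$url"
done < "$URL_LIST"
</verbatim>

The teacher's "database" could then be as simple as that list file plus a hand-maintained index page pointing into the mirror directory.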

via Squid -- quoted below

http://marc.theaimsgroup.com/?l=squid-users&m=98353680725019&w=2
<quote>
List:     squid-users
Subject:  Re: [SQU] Interesting Question
From:     Joe Cooper <joe@swelltech.com>
Date:     2001-03-02 12:37:23
[Download message RAW]

How about just 'filling' the cache with desired pages and then going 
into Offline Mode?  Empty the cache before each 'fill' session to insure
nothing unapproved remains...then visit all the needed pages.  This isn't
quite as good an answer as Robert gave (wget) but it's simpler in some 
cases.  (Robert's solution will answer the problem of dynamic pages not 
being cached, though, if done properly.)  Though I suppose you could also 
fiddle with the nocache directives to tweak Squid to cache everything, 
even dynamic content.

The other solution would be to collect the URL's of every page visited 
from the access.log and explicitly allow /only/ those addresses via an 
ACL file or a SquidGuard database.  This could be scripted.
</quote>
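
Joe's last idea (collect the URLs already served from access.log and allow only those) could be scripted roughly as below. Robert argues later in the thread that ACL-based restriction alone will not do the job, so treat this purely as a sketch of the idea; the paths and the helper name are assumptions.

<verbatim>
#!/bin/sh
# build-allow-list.sh -- hypothetical helper: turn the URLs Squid has already
# served into an allow-list file referenced from squid.conf.
LOG=/var/log/squid/access.log     # native log format; field 7 is the URL
ALLOW=/etc/squid/approved-urls    # assumed location of the allow-list file

# Collect every URL seen so far, one per line, de-duplicated. (A real script
# would probably also escape regex metacharacters such as '.' and '?'.)
awk '{print $7}' "$LOG" | sort -u > "$ALLOW"

# squid.conf would then reference the file with something like:
#   acl approved url_regex "/etc/squid/approved-urls"
#   http_access allow approved
#   http_access deny all
squid -k reconfigure
</verbatim>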

Parent / Child Squid servers

http://marc.theaimsgroup.com/?l=squid-users&m=98356916906090&w=2 -- this thought actually crossed my mind while reading the Squid FAQs
<quote>
List:     squid-users
Subject:  Re: [SQU] Interesting Question
From:     Jon Mansey <jon@interpacket.net>
Date:     2001-03-02 21:28:48
[Download message RAW]

Maybe a simple solution is to use 2 squid caches, one which Teacher 
uses has full access to the web, a second which the students PCs are 
directed to, can only get pages from the teacher's parent cache, it 
cannot resolve misses itself.

no database necessary.

jm
</quote>
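
A sketch of the student-side half of this two-cache idea, assuming the teacher's Squid runs on a host called teacher-proxy and using directive names from the Squid 2.x squid.conf:

<verbatim>
#!/bin/sh
# Point the student Squid at the teacher's Squid and forbid it from ever
# contacting origin servers directly. Host name and ports are assumptions.
cat >> /etc/squid/squid.conf <<'EOF'
# The teacher's cache is our only parent (HTTP port 3128, no ICP queries).
cache_peer teacher-proxy parent 3128 0 no-query default
# Never resolve misses ourselves; everything must come from the parent.
never_direct allow all
EOF

squid -k reconfigure
</verbatim>

One caveat: a miss on the student cache is still forwarded to the parent, which will fetch it from the web unless the teacher's cache is itself offline; that appears to be the objection raised later in the thread.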

"offline_mode in the squid.conf file"

http://marc.theaimsgroup.com/?l=squid-users&m=98390033928556&w=2 -- biggest clue yet: "offline_mode in the squid.conf file". There is also some implication that the previous method (teacher / student Squids) will not work (I assume that is what they are discussing; I still believe it would, but this approach is simpler, provided we can keep the contents of the cache from expiring).
<quote>
List:     squid-users
Subject:  Re: [SQU] Interesting Question
From:     Joe Cooper <joe@swelltech.com>
Date:     2001-03-06 17:18:48
[Download message RAW]

One of the methods I suggested was to pre-fill the cache, and then go 
into offline mode.

This /may/ require modification of the no_cache settings in order to 
cache the results of cgi scripts, and other things which are generally 
not cachable.  Experience will have to tell you what to do there, as 
I've never personally done anything like what you want.

In short, here's the steps to a pre-fill:

Clear your cache (rm -rf /cachedir/*, or just format the partition).
Start Squid.  Visit the pages that are needed for the class or whatever.
Turn on offline_mode in the squid.conf file.
Restart Squid.
Browse those pages.  offline_mode will prevent any other pages from 
being visited.

If all content is static, and none of the content has aggressive expiry 
times, this will work fine.  And is probably the easiest for your 
teachers to use.  You could put up a small cgi script on each system, 
that when called will put the cache into offline mode.  And another to 
empty the cache and put it back into online mode.  Then the teachers 
could click a button to start filling and a button to allow the offline 
browsing.

Next is Robert's suggestion for using wget to create a local mirror. 
Also a good option, but also with some potential problems to be worked 
around.

With wget, you can do what is called a recursive web suck (option 
-r)...by default this will suck down copies of every link down to 5 
levels of links (so each link on each page will be pulled down into a 
local directory).  You can then browse this local directory (you could 
even put it onto a local webserver if you needed it to be shareable 
across the whole school).  The potential problems include absolute links 
in the pages (http://www.yahoo.com/some/page...will jump out of your 
local mirror...whereas some/page will not).

Note that both have their problems...but both will do what you want with 
a little work.  There is no magic button to push to limit in such a 
strict way the internet.  Because resources are so very distributed, it 
is very hard to pick one 'page' and say let's only allow this one page. 
  That page possibly links to and pulls from a hundred servers.  Maybe 
not...but it could.

Devin Teske wrote:

>>>>> Forwarded to Squidusers
>>>> 
> 
>> Joe's pointer (squid cache retention times) & mine (wget from a full
>> access account to make mirrors) will work.
>> 
>> Squid ACLs, Squid redirectors, WILL NOT.
> 
> 
> Can you explain in more detail how I would implement either Joe's or 
> Rob's scenario? How would they both work?
> 
> Thanks,
> Devin Teske
</quote>
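
A rough shell wrapper around Joe's pre-fill steps; the script name, paths, and the sed-based toggle are assumptions (it presumes an explicit offline_mode line has been added to squid.conf once by hand, and GNU sed for -i).

<verbatim>
#!/bin/sh
# prefill.sh -- hypothetical helper for the "empty, fill, go offline" cycle.
CONF=/etc/squid/squid.conf
CACHE=/var/spool/squid            # assumed cache_dir location

case "$1" in
  fill)
    # Fresh fill session: stop Squid, wipe and re-initialise the cache,
    # make sure offline_mode is off, then start Squid for online browsing.
    squid -k shutdown
    rm -rf "$CACHE"/*
    squid -z                      # recreate the cache directory structure
    sed -i 's/^offline_mode on/offline_mode off/' "$CONF"
    squid
    ;;
  offline)
    # Class time: serve only what is already cached.
    sed -i 's/^offline_mode off/offline_mode on/' "$CONF"
    squid -k reconfigure
    ;;
  *)
    echo "usage: $0 fill|offline" >&2
    ;;
esac
</verbatim>

To cache CGI output during the fill, the stock Squid 2.x lines "acl QUERY urlpath_regex cgi-bin \?" and "no_cache deny QUERY" would probably need to be commented out, which is the no_cache modification Joe mentions. The two CGI buttons he suggests would simply invoke the "fill" and "offline" actions.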

useful addendum to previous

http://marc.theaimsgroup.com/?l=squid-users&m=98391810224795&w=2 --
<quote>
List:     squid-users
Subject:  Re: [SQU] Interesting Question
From:     "Robert Collins" <robert.collins@itdomain.com.au>
Date:     2001-03-06 22:28:46
[Download message RAW]

Thanks Joe,

As an addendum:
a) with squid, don't forget the -z after clearing the cache,
b) you'll need to find some way to add new pages without turning
off_line mode off during school hours.

for wget
a) wget can rewrite absolute page references within the local site to be
relative.
b) to get a single page and all graphics, you'll want -r 1 (the 1st link
off)
c) what is a potential problem is dynamically generated url's - wget
won't run javascript or whatever, so those links won't be sucked down.

And as Joe said, both have problems, but should work fairly well with
some setup effort/testing.
Rob
</quote>
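
Putting Rob's wget notes together, a single-lesson grab might look like the following; the flag spellings are from the wget manual rather than his shorthand (-l sets the recursion depth, -k is --convert-links for point (a), -p additionally pulls in images and stylesheets), and the destination directory and URL are placeholders.

<verbatim>
# Fetch one page, the pages it links to directly, and the graphics needed to
# render them, rewriting absolute links so the copy browses cleanly offline.
wget -r -l 1 -p -k -P /var/www/mirror http://www.example.org/lesson/page.html
</verbatim>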


Contributors:

  • RandyKramer - 05 Nov 2001
  • <If you edit this page, add your name here, move this to the next line>