[SOLVED] Apache environment variables - logical AND
Ian Smith
smithi at nimnet.asn.au
Sat Nov 8 09:02:20 PST 2008
On Wed, 5 Nov 2008, Jeremy Chadwick wrote:
> On Wed, Nov 05, 2008 at 08:24:16PM +1100, Ian Smith wrote:
> > On Tue, 4 Nov 2008, Jeremy Chadwick wrote:
> > > On Wed, Nov 05, 2008 at 05:33:45PM +1100, Ian Smith wrote:
> > > > I know this isn't FreeBSD specific - but I am, so crave your indulgence.
> > > >
> > > > Running Apache 1.3.27, using a fairly extensive access.conf to beat off
> > > > the most rapacious robots and such, using mostly BrowserMatch[NoCase]
> > > > and SetEnvIf to moderate access to several virtual hosts. No problem.
> > > >
> > > > OR conditions are of course straightforward:
> > > >
> > > > SetEnvIf <condition1> somevar
> > > > SetEnvIf <condition2> somevar
> > > > SetEnvIf <exception1> !somevar
> > > >
> > > > What I can't figure out is how to set variable3 if and only if both
> > > > variable1 AND variable2 are set. E.g.:
> > > >
> > > > SetEnvIf Referer "^$" no_referer
> > > > SetEnvIf User-Agent "^$" no_browser
> > > >
> > > > I want the equivalent for this (invalid and totally fanciful) match:
> > > >
> > > > SetEnvIf (no_browser AND no_referer) go_away
> > >
> > > Sounds like a job for mod_rewrite. The SetEnvIf stuff is such a hack.
That's true. Thanks for your considered and helpful tutorial. I do use
ipfw+dummynet for bandwidth limiting, and ipfw table 80 to house bogons.
But I finally figured out how to make such a hack work .. it just kept
on bugging me until I woke up remembering some very basic logic; quite
embarrassing really ..
# 9/11/8: preset env vars to be tested by value
SetEnvIf Referer ".*" no_ref=0 no_bro=0 both=1
SetEnvIf Referer "^$" no_ref=1
SetEnvIf User-Agent "^$" no_bro=1
# duh, logic 101: a AND b = NOT ( (NOT a) OR (NOT b) )
SetEnvIf no_ref 0 both=0
SetEnvIf no_bro 0 both=0
SetEnvIf both 1 go_away
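For completeness, the variable only does anything once access control
tests it; the usual mod_access idiom is something like this (the
Directory path below is just a placeholder):
<Directory /path/to/docroot>
    Order Allow,Deny
    Allow from all
    Deny from env=go_away
</Directory>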
It's a bit roundabout and awkward, but it seems to work fine, and this
was just one of several combination conditions I'd like to test.
cheers, Ian
> > It may be a hack, but I've found it an extremely useful one so far.
> >
> > > This is what we use on our production servers (snipped to keep it
> > > short):
> > >
> > > RewriteEngine on
> > > RewriteCond %{HTTP_REFERER} ^XXXX: [OR]
> > > RewriteCond %{HTTP_REFERER} ^http://forums.somethingawful.com/ [OR]
> > > RewriteCond %{HTTP_REFERER} ^http://forums.fark.com/ [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [NC,OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^Xaldon.WebSpider
> > > RewriteRule ^.* - [F,L]
> > >
> > > You need to keep something in mind however: blocking by user agent is
> > > basically worthless these days. Most "leeching" tools now let you
> > > spoof the user agent to show up as Internet Explorer, essentially
> > > defeating the checks.
> >
> > While that's true, I've found most of the more troublesome robots are
> > too proud of their 'brand' to spoof user agent, and those that do are a)
> > often consistent enough in their Remote_Addr to exclude by subnet and/or
> > b) often make obvious errors in spoofed User_Agent strings .. especially
> > those pretending to be some variant of MSIE :)
>
> I haven't found this to be true at all, and I've been doing web hosting
> since 1993. In the past 2-3 years, the amount of leeching tools which
> spoof their User-Agent has increased dramatically.
>
> But step back for a moment and look at it from a usability perspective,
> because this is what really happens.
>
> A user tries to leech a site you host, using FruitBatLeecher, which your
> Apache server blocks based on User-Agent. The user has no idea why the
> leech program doesn't work. Does the user simply give up his quest?
> Absolutely not -- the user then goes and finds BobsBandwidthZilla which
> pretends to be Internet Explorer, Firefox, or lynx, and downloads the
> site.
>
> Now, if you're trying to block robots/scrapers which aren't honouring
> robots.txt, oh yes, that almost always works, because those rarely spoof
> their User-Agent (I think to date I've only seen one site which did
> that, and it was some Russian search engine).
>
> If you feel I'm just doing burn-outs arguing, a la "BSD style", let me
> give you some insight to how often I deal with this problem: daily.
>
> We host a very specific/niche site that contains over 20 years of
> technical information on the Famicom / Nintendo Entertainment System.
> The site has hundreds of megabytes of information, and a very active
> forum. Some jackass comes along and decides "Wow, this has all the info
> I want!" and fires off a leeching program against the entire
> domain/vhost. Let's say the program he's using is blocked by our
> User-Agent blocks; there is a 6-7 minute delay as the user goes off to
> find another program to leech with, installs it, and attempts it again.
> Pow, it works, and we find nice huge spikes in our logs for the vhost
> indicating someone got around it. I later dig through our access_log and
> find that he tried to use FruitBatLeecher, which got blocked, but then
> 6-7 minutes later came back with a leeching client that spoofs itself
> as IE.
>
> And it gets worse.
>
> Many of these leeching programs get stuck in infinite loops when it
> comes to forum software, so they sit there pounding on the webserver
> indefinitely. It requires administrator intervention to stop it; in my
> case, I don't even bother with Apache ACLs, because ~70% of the time
> the client ignores 403s and keeps bashing away (yes really!) -- I go
> straight for a pf-based block in a table called <web-leechers>. These
> guys will hit that block for *days* -- that should give you some idea
> how long they'll let that program run.
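For anyone reading this later: that sort of table-based block is only a
few lines of pf.conf. A minimal sketch (the interface name and the pfctl
example address are made up; the table name is Jeremy's):
ext_if = "em0"
table <web-leechers> persist
block in quick on $ext_if from <web-leechers> to any
# add an offender on the fly:
# pfctl -t web-leechers -T add 192.0.2.1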
>
> But it gets worse -- again.
>
> Recently, I found two examples of very dedicated leechers. One was an
> individual out of China (or using Chinese IPs -- take your pick), and
> another was at an Italian university. These individuals got past the
> User-Agent blocks, and I caught their leeching software stuck in a loop
> on the site forum. I blocked their IPs with pf, thinking it would be
> enough, then went to sleep. I woke up the following evening to find
> they were back at it again. How?
>
> The Chinese individual literally got another IP somehow, in a completely
> different netblock; possibly a DHCP release/renew, possibly some friend
> of his, whatever.
>
> The Italian university individual was successful in his leech attempts
> exactly 50% of the time -- because their university used a transparent
> HTTP proxy that was balanced between two IPs. I had only blocked one
> of them.
>
> Starting to get the picture now? :-)
>
> The only effective way to deal with all of this is rate-limiting. I do
> not advocate "queues" or "buckets", or "dynamic buckets" where each IP
> is allocated X number of simultaneous sockets, and if they exceed that,
> they get rate-limited. I also do not advocate "shared queues", where
> if there are X number of sockets, allow Z amount of bandwidth, but if
> X is more than, say, 200 sockets, allow Z/2 amount of bandwidth.
>
> The tuning is simply not worth it -- people will go to great lengths
> to screw you. And if your stuff is in a 95th-percentile billing
> environment, believe me, you DO NOT want to wake up one morning to
> find that someone has cost you thousands of dollars.
>
> Also, I recommend using ipfw dummynet or pf ALTQ for rate-limiting. The
> few Apache bandwidth-limiting modules I've tried have bizarre side
> effects. Here's a forum post of mine (on the above site) explaining
> why we moved away from mod_cband and went with pf ALTQ.
>
> http://nesdev.parodius.com/bbs/viewtopic.php?t=4184
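On the ipfw+dummynet side (which is what I run here) the equivalent is
only a couple of commands; the pipe number, bandwidth and table number
below are purely illustrative:
# throttle outbound web traffic to hosts in table 81 to 512Kbit/s
ipfw pipe 1 config bw 512Kbit/s
ipfw add 1000 pipe 1 tcp from me 80 to 'table(81)' out
# ipfw table 81 add 192.0.2.1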
>
>
> > > If you're that concerned about bandwidth (which is why a lot of people
> > > do the above), consider rate-limiting. It's really, quite honestly, the
> > > only method that is fail-safe.
> >
> > Thanks Jeremy. Certainly time to take the time to have another look at
> > mod_rewrite, especially regarding redirection, alternative pages etc,
> > but I still tend to glaze over about halfway through all that section.
>
> Yeah, I agree, the mod_rewrite documentation is overwhelming, and that
> turns a lot of people off. The examples I gave you should allow you to
> look up each piece of the directive at a time, and once you do that,
> it'll all make sense.
>
> > And unless I've completely missed it, your examples don't address my
> > question, being how to AND two or more conditions in a particular test?
> >
> > If I really can't do this with mod_setenvif I'll have to take that time.
>
> You can't do it with mod_setenvif. You can do it with mod_rewrite,
> because all mod_rewrite rules default to an operator type of "AND". The
> [OR] you see in my rules is an explicit override for obvious reasons.
>
> Open the Apache 1.3 mod_rewrite docs and search for "implicit AND".
> It'll all make sense then. :-)
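For the record, the implicit-AND version of my original example would
presumably be something like this (untested):
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F,L]
i.e. the rule fires only when both conditions match, since consecutive
RewriteCond lines AND together unless you add [OR].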
>
> I hope some of what I've said above gives you something to think about.
> Hosting environments are a real pain in the ass; when it's "just you and
> your own personal box" it's easy, but when it's larger scale and
> involves users (customers or friends, doesn't matter), it's a totally
> different game.
>
> --
> | Jeremy Chadwick jdc at parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, USA |
> | Making life hard for others since 1977. PGP: 4BD6C0CB |