[SOLVED] Apache environment variables - logical AND
Ian Smith
smithi at nimnet.asn.au
Sat Nov 8 09:02:20 PST 2008
On Wed, 5 Nov 2008, Jeremy Chadwick wrote:
> On Wed, Nov 05, 2008 at 08:24:16PM +1100, Ian Smith wrote:
> > On Tue, 4 Nov 2008, Jeremy Chadwick wrote:
> > > On Wed, Nov 05, 2008 at 05:33:45PM +1100, Ian Smith wrote:
> > > > I know this isn't FreeBSD specific - but I am, so crave your indulgence.
> > > >
> > > > Running Apache 1.3.27, using a fairly extensive access.conf to beat off
> > > > the most rapacious robots and such, using mostly BrowserMatch[NoCase]
> > > > and SetEnvIf to moderate access to several virtual hosts. No problem.
> > > >
> > > > OR conditions are of course straightforward:
> > > >
> > > > SetEnvIf <condition1> somevar
> > > > SetEnvIf <condition2> somevar
> > > > SetEnvIf <exception1> !somevar
> > > >
> > > > What I can't figure out is how to set variable3 if and only if both
> > > > variable1 AND variable2 are set. E.g.:
> > > >
> > > > SetEnvIf Referer "^$" no_referer
> > > > SetEnvIf User-Agent "^$" no_browser
> > > >
> > > > I want the equivalent for this (invalid and totally fanciful) match:
> > > >
> > > > SetEnvIf (no_browser AND no_referer) go_away
> > >
> > > Sounds like a job for mod_rewrite. The SetEnvIf stuff is such a hack.
That's true. Thanks for your considered and helpful tutorial. I do use
ipfw+dummynet for bandwidth limiting, and ipfw table 80 to house bogons.
But I finally figured out how to make such a hack work .. it just kept
on bugging me until I woke up remembering some very basic logic; quite
embarrassing really ..
# 9/11/8: preset env vars to be tested by value
SetEnvIf Referer ".*" no_ref=0 no_bro=0 both=1
SetEnvIf Referer "^$" no_ref=1
SetEnvIf User-Agent "^$" no_bro=1
# duh, logic 101: a AND b = NOT ( (NOT a) OR (NOT b) )
SetEnvIf no_ref 0 both=0
SetEnvIf no_bro 0 both=0
SetEnvIf both 1 go_away
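For completeness, the variable only does anything once access control
tests it; the usual mod_access idiom is something like this (the
Directory path below is just a placeholder):
<Directory /path/to/docroot>
    Order Allow,Deny
    Allow from all
    Deny from env=go_away
</Directory>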
It's a bit roundabout and awkward, but it seems to work fine, and this
was just one of several combination conditions I'd like to test.
cheers, Ian
> > It may be a hack, but I've found it an extremely useful one so far.
> >
> > > This is what we use on our production servers (snipped to keep it
> > > short):
> > >
> > > RewriteEngine on
> > > RewriteCond %{HTTP_REFERER} ^XXXX: [OR]
> > > RewriteCond %{HTTP_REFERER} ^http://forums.somethingawful.com/ [OR]
> > > RewriteCond %{HTTP_REFERER} ^http://forums.fark.com/ [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [NC,OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
> > > RewriteCond %{HTTP_USER_AGENT} ^Xaldon.WebSpider
> > > RewriteRule ^.* - [F,L]
> > >
> > > You need to keep something in mind however: blocking by user agent is
> > > basically worthless these days. Most "leeching" tools now let you
> > > spoof the user agent to show up as Internet Explorer, essentially
> > > defeating the checks.
> >
> > While that's true, I've found most of the more troublesome robots are
> > too proud of their 'brand' to spoof user agent, and those that do are a)
> > often consistent enough in their Remote_Addr to exclude by subnet and/or
> > b) often make obvious errors in spoofed User_Agent strings .. especially
> > those pretending to be some variant of MSIE :)
>
> I haven't found this to be true at all, and I've been doing web hosting
> since 1993. In the past 2-3 years, the amount of leeching tools which
> spoof their User-Agent has increased dramatically.
>
> But step back for a moment and look at it from a usability perspective,
> because this is what really happens.
>
> A user tries to leech a site you host, using FruitBatLeecher, which your
> Apache server blocks based on User-Agent. The user has no idea why the
> leech program doesn't work. Does the user simply give up his quest?
> Absolutely not -- the user then goes and finds BobsBandwidthZilla which
> pretends to be Internet Explorer, Firefox, or lynx, and downloads the
> site.
>
> Now, if you're trying to block robots/scrapers which aren't honouring
> robots.txt, oh yes, that almost always works, because those rarely spoof
> their User-Agent (I think to date I've only seen one site which did
> that, and it was some Russian search engine).
>
> If you feel I'm just doing burn-outs arguing, a la "BSD style", let me
> give you some insight to how often I deal with this problem: daily.
>
> We host a very specific/niche site that contains over 20 years of
> technical information on the Famicom / Nintendo Entertainment System.
> The site has hundreds of megabytes of information, and a very active
> forum. Some jackass comes along and decides "Wow, this has all the info
> I want!" and fires off a leeching program against the entire
> domain/vhost. Let's say the program he's using is blocked by our
> User-Agent blocks; there is a 6-7 minute delay as the user goes off to
> find another program to leech with, installs it, and attempts it again.
> Pow, it works, and we find nice huge spikes in our logs for the vhost
> indicating someone got around it. I later dig through our access_log and
> find that he tried to use FruitBatLeecher, which got blocked, but then
> 6-7 minutes later came back with a leeching client that spoofs itself
> as IE.
>
> And it gets worse.
>
> Many of these leeching programs get stuck in infinite loops when it
> comes to forum software, so they sit there pounding on the webserver
> indefinitely. It requires administrator intervention to stop it; in my
> case, I don't even bother with Apache ACLs, because ~70% of the time
> the client ignores 403s and keeps bashing away (yes really!) -- I go
> straight for a pf-based block in a table called <web-leechers>. These
> guys will hit that block for *days* -- that should give you some idea
> how long they'll let that program run.
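For anyone reading this later: that sort of table-based block is only a
few lines of pf.conf. A minimal sketch (the interface name and the pfctl
example address are made up; the table name is Jeremy's):
ext_if = "em0"
table <web-leechers> persist
block in quick on $ext_if from <web-leechers> to any
# add an offender on the fly:
# pfctl -t web-leechers -T add 192.0.2.1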
>
> But it gets worse -- again.
>
> Recently, I found two examples of very dedicated leechers. One was an
> individual out of China (or using Chinese IPs -- take your pick), and
> another was at an Italian university. These individuals got past the
> User-Agent blocks, and I caught their leeching software stuck in a loop
> on the site forum. I blocked their IPs with pf, thinking it would be
> enough, then went to sleep. I woke up the following evening to find
> they were back at it again. How?
>
> The Chinese individual literally got another IP somehow, in a completely
> different netblock; possibly a DHCP release/renew, possibly some friend
> of his, whatever.
>
> The Italian university individual was successful in his leech attempts
> exactly 50% of the time -- because their university used a transparent
> HTTP proxy that was balanced between two IPs. I had only blocked one
> of them.
>
> Starting to get the picture now? :-)
>
> The only effective way to deal with all of this is rate-limiting. I do
> not advocate "queues" or "buckets", or "dynamic buckets" where each IP
> is allocated X number of simultaneous sockets, and if they exceed that,
> they get rate-limited. I also do not advocate "shared queues", where
> if there are X number of sockets, allow Z amount of bandwidth, but if
> X is more than, say, 200 sockets, allow Z/2 amount of bandwidth.
>
> The tuning is simply not worth it -- people will go to great lengths
> to screw you. And if your stuff is in a 95th-percentile billing
> environment, believe me, you DO NOT want to wake up one morning to
> find that someone has cost you thousands of dollars.
>
> Also, I recommend using ipfw dummynet or pf ALTQ for rate-limiting. The
> few Apache bandwidth-limiting modules I've tried have bizarre side
> effects. Here's a forum post of mine (on the above site) explaining
> why we moved away from mod_cband and went with pf ALTQ.
>
> http://nesdev.parodius.com/bbs/viewtopic.php?t=4184
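On the ipfw+dummynet side (which is what I run here) the equivalent is
only a couple of commands; the pipe number, bandwidth and table number
below are purely illustrative:
# throttle outbound web traffic to hosts in table 81 to 512Kbit/s
ipfw pipe 1 config bw 512Kbit/s
ipfw add 1000 pipe 1 tcp from me 80 to 'table(81)' out
# ipfw table 81 add 192.0.2.1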
>
>
> > > If you're that concerned about bandwidth (which is why a lot of people
> > > do the above), consider rate-limiting. It's really, quite honestly, the
> > > only method that is fail-safe.
> >
> > Thanks Jeremy. Certainly time to take the time to have another look at
> > mod_rewrite, especially regarding redirection, alternative pages etc,
> > but I still tend to glaze over about halfway through all that section.
>
> Yeah, I agree, the mod_rewrite documentation is overwhelming, and that
> turns a lot of people off. The examples I gave you should allow you to
> look up each piece of the directive at a time, and once you do that,
> it'll all make sense.
>
> > And unless I've completely missed it, your examples don't address my
> > question, being how to AND two or more conditions in a particular test?
> >
> > If I really can't do this with mod_setenvif I'll have to take that time.
>
> You can't do it with mod_setenvif. You can do it with mod_rewrite,
> because all mod_rewrite rules default to an operator type of "AND". The
> [OR] you see in my rules is an explicit override for obvious reasons.
>
> Open the Apache 1.3 mod_rewrite docs and search for "implicit AND".
> It'll all make sense then. :-)
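For the record, the implicit-AND version of my original example would
presumably be something like this (untested):
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F,L]
i.e. the rule fires only when both conditions match, since consecutive
RewriteCond lines AND together unless you add [OR].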
>
> I hope some of what I've said above gives you something to think about.
> Hosting environments are a real pain in the ass; when it's "just you and
> your own personal box" it's easy, but when it's larger scale and
> involves users (customers or friends, doesn't matter), it's a totally
> different game.
>
> --
> | Jeremy Chadwick jdc at parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, USA |
> | Making life hard for others since 1977. PGP: 4BD6C0CB |