Rotating proxy pool for scraping

Well, it’s now legal to scrape LinkedIn.  Sometimes, though, as Lil Wayne has said, real g’s move in silence like lasagna.

One way to do this is to route one’s requests through a proxy.  It’s often convenient to expose a single IP address as the proxy – and have that proxy forward requests through a round-robin rotating pool.  That way, one can lease new proxy pools (via API or manually) without ever touching the scraper code.
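For contrast, here’s what this setup avoids: rotating proxies client-side, which couples the scraper to the current pool.  A minimal sketch (the addresses are placeholders):

```python
from itertools import cycle

# Client-side rotation: the scraper itself must know every proxy in the
# pool, and must be updated whenever the pool changes.
pool = cycle([
    "http://10.0.0.1:3128",
    "http://10.0.0.2:3128",
    "http://10.0.0.3:3128",
])

def next_proxies():
    """Proxy dict for requests, advancing through the pool round-robin."""
    p = next(pool)
    return {"http": p, "https": p}

picks = [next_proxies()["http"] for _ in range(4)]
# the cycle wraps around after the third pick
```

With the Squid setup below, all of that bookkeeping moves behind one stable address.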

This is easy to implement with Squid, an open-source proxy.  Just provision an Ubuntu LTS instance, install squid and its dependencies (there are many tutorials for Ubuntu LTS), and add this to the top of your squid config file:

# add this to the top of /etc/squid/squid.conf
# then restart squid (e.g. sudo service squid restart)

request_header_access Allow allow all
request_header_access Authorization allow all
request_header_access WWW-Authenticate allow all
request_header_access Proxy-Authorization allow all
request_header_access Proxy-Authenticate allow all
request_header_access Cache-Control allow all
request_header_access Content-Encoding allow all
request_header_access Content-Length allow all
request_header_access Content-Type allow all
request_header_access Date allow all
request_header_access Expires allow all
request_header_access Host allow all
request_header_access If-Modified-Since allow all
request_header_access Last-Modified allow all
request_header_access Location allow all
request_header_access Pragma allow all
request_header_access Accept allow all
request_header_access Accept-Charset allow all
request_header_access Accept-Encoding allow all
request_header_access Accept-Language allow all
request_header_access Content-Language allow all
request_header_access Mime-Version allow all
request_header_access Retry-After allow all
request_header_access Title allow all
request_header_access Connection allow all
request_header_access Proxy-Connection allow all
request_header_access User-Agent allow all
request_header_access Cookie allow all
request_header_access All deny all

# don't reveal the client's address, or that a proxy was involved at all
via off
forwarded_for off
follow_x_forwarded_for deny all

# peer Pk is randomly denied with probability (k-1)/k, so each of the
# 25 peers below ends up handling ~1/25 of the traffic
acl deny24_25 random 24/25
acl deny23_24 random 23/24
acl deny22_23 random 22/23
acl deny21_22 random 21/22
acl deny20_21 random 20/21
acl deny19_20 random 19/20
acl deny18_19 random 18/19
acl deny17_18 random 17/18
acl deny16_17 random 16/17
acl deny15_16 random 15/16
acl deny14_15 random 14/15
acl deny13_14 random 13/14
acl deny12_13 random 12/13
acl deny11_12 random 11/12
acl deny10_11 random 10/11
acl deny9_10 random 9/10
acl deny8_9 random 8/9
acl deny7_8 random 7/8
acl deny6_7 random 6/7
acl deny5_6 random 5/6
acl deny4_5 random 4/5
acl deny3_4 random 3/4
acl deny2_3 random 2/3
acl deny1_2 random 1/2
acl deny0_1 random 1/1

never_direct allow all

cache_peer 206.66.98.244 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P25
cache_peer 206.66.98.234 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P24
cache_peer 206.66.98.223 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P23
cache_peer 206.66.98.217 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P22
cache_peer 206.66.98.211 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P21
cache_peer 206.66.98.188 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P20
cache_peer 206.66.98.148 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P19
cache_peer 206.66.98.123 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P18
cache_peer 206.66.98.119 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P17
cache_peer 206.66.98.110 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P16
cache_peer 206.66.98.105 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P15
cache_peer 206.66.98.99 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P14
cache_peer 206.66.98.84 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P13
cache_peer 206.66.98.83 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P12
cache_peer 206.66.98.71 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P11
cache_peer 206.66.98.63 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P10
cache_peer 206.66.98.42 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P9
cache_peer 206.66.98.34 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P8
cache_peer 206.66.98.14 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P7
cache_peer 206.66.98.200 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P6
cache_peer 206.66.98.152 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P5
cache_peer 206.66.98.120 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P4
cache_peer 206.66.98.56 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P3
cache_peer 206.66.98.32 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P2
cache_peer 206.66.98.2 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P1

cache_peer_access P25 deny deny24_25
cache_peer_access P24 deny deny23_24
cache_peer_access P23 deny deny22_23
cache_peer_access P22 deny deny21_22
cache_peer_access P21 deny deny20_21
cache_peer_access P20 deny deny19_20
cache_peer_access P19 deny deny18_19
cache_peer_access P18 deny deny17_18
cache_peer_access P17 deny deny16_17
cache_peer_access P16 deny deny15_16
cache_peer_access P15 deny deny14_15
cache_peer_access P14 deny deny13_14
cache_peer_access P13 deny deny12_13
cache_peer_access P12 deny deny11_12
cache_peer_access P11 deny deny10_11
cache_peer_access P10 deny deny9_10
cache_peer_access P9 deny deny8_9
cache_peer_access P8 deny deny7_8
cache_peer_access P7 deny deny6_7
cache_peer_access P6 deny deny5_6
cache_peer_access P5 deny deny4_5
cache_peer_access P4 deny deny3_4
cache_peer_access P3 deny deny2_3
cache_peer_access P2 deny deny1_2
# P1 takes whatever traffic falls through the peers above; a `random 1/1`
# deny here would match every time and leave ~1/25 of requests with no peer
cache_peer_access P1 allow all
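To sanity-check the arithmetic: Squid consults the peers in order, and peer Pk is denied with probability (k-1)/k, so P25 receives 1/25 of the traffic, P24 receives (24/25)·(1/24) = 1/25, and so on down the line.  Here’s a quick simulation of that cascade (just the coin flips, not Squid itself), assuming the last peer accepts whatever falls through:

```python
import random

def pick_peer(n=25):
    """Walk the peers from Pn down to P2; each is skipped with
    probability (k-1)/k, and the first one not skipped wins."""
    for k in range(n, 1, -1):
        if random.random() >= (k - 1) / k:  # peer Pk is NOT denied
            return k
    return 1  # P1 takes whatever falls through

counts = {}
for _ in range(100_000):
    p = pick_peer()
    counts[p] = counts.get(p, 0) + 1

# every peer should see roughly 100000/25 = 4000 requests
print(sorted(counts.items()))
```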

FYI, this setup is easy to test: run the following script a few times, and the reported origin IP should hop around the pool:

#!/usr/bin/env python3

import requests

# replace droplet.ip.address with your squid instance's address
url = "http://ip-api.com/json"
proxy = {"http": "http://droplet.ip.address:3128"}

r = requests.get(url, proxies=proxy, timeout=10)

print("Response:\n{}".format(r.text))

Just think: if one is leasing new proxy pools programmatically (via API), one could easily write a script to procedurally generate this config file and restart squid…
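A minimal sketch of such a generator (the peer list, port, and login below are placeholders – substitute whatever your provider’s API returns; this variant simply lets the last peer accept the remainder):

```python
# Generates the rotating-pool portion of squid.conf from a list of peer IPs.

HEADER = """\
via off
forwarded_for off
follow_x_forwarded_for deny all

never_direct allow all
"""

def make_pool_config(peers, port=60099, login="user:password"):
    n = len(peers)
    lines = [HEADER]
    # random ACLs: peer Pk is denied with probability (k-1)/k,
    # so each of the n peers handles ~1/n of the traffic
    for k in range(n, 1, -1):
        lines.append("acl deny{}_{} random {}/{}".format(k - 1, k, k - 1, k))
    lines.append("")
    for i, ip in enumerate(peers):
        name = "P{}".format(n - i)
        lines.append(
            "cache_peer {} parent {} 0 no-query default login={} name={}".format(
                ip, port, login, name))
    lines.append("")
    for k in range(n, 1, -1):
        lines.append("cache_peer_access P{} deny deny{}_{}".format(k, k - 1, k))
    lines.append("cache_peer_access P1 allow all")  # P1 takes the remainder
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(make_pool_config(["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
```

Pipe the output into /etc/squid/squid.conf (below the header-scrubbing rules) and restart squid, and the pool swap is a one-liner.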

Happy hacking 🙂