Well, it’s now legal to scrape LinkedIn. Sometimes, though, as Lil Wayne has said, real g’s move in silence like lasagna.
One way to do this is to rotate one’s requests through a proxy. Sometimes it’s rather convenient to have just one IP address to use as a proxy, and have that proxy forward each request through a rotating pool of upstream proxies (strictly speaking, the config below picks a peer at random rather than in round-robin order). For example, one could lease new proxy pools (via API or manually) without having to update the scraper code.
This is easy to implement with Squid, an open source proxy. Just provision an Ubuntu LTS instance, install Squid and its dependencies (there are many tutorials for Ubuntu LTS), and add this to the top of your Squid config file:
# add this to the top of /etc/squid/squid.conf
# then restart squid (service squid stop, service squid start)

request_header_access Allow allow all
request_header_access Authorization allow all
request_header_access WWW-Authenticate allow all
request_header_access Proxy-Authorization allow all
request_header_access Proxy-Authenticate allow all
request_header_access Cache-Control allow all
request_header_access Content-Encoding allow all
request_header_access Content-Length allow all
request_header_access Content-Type allow all
request_header_access Date allow all
request_header_access Expires allow all
request_header_access Host allow all
request_header_access If-Modified-Since allow all
request_header_access Last-Modified allow all
request_header_access Location allow all
request_header_access Pragma allow all
request_header_access Accept allow all
request_header_access Accept-Charset allow all
request_header_access Accept-Encoding allow all
request_header_access Accept-Language allow all
request_header_access Content-Language allow all
request_header_access Mime-Version allow all
request_header_access Retry-After allow all
request_header_access Title allow all
request_header_access Connection allow all
request_header_access Proxy-Connection allow all
request_header_access User-Agent allow all
request_header_access Cookie allow all
request_header_access All deny all

via off
forwarded_for off
follow_x_forwarded_for deny all

acl deny24_25 random 24/25
acl deny23_24 random 23/24
acl deny22_23 random 22/23
acl deny21_22 random 21/22
acl deny20_21 random 20/21
acl deny19_20 random 19/20
acl deny18_19 random 18/19
acl deny17_18 random 17/18
acl deny16_17 random 16/17
acl deny15_16 random 15/16
acl deny14_15 random 14/15
acl deny13_14 random 13/14
acl deny12_13 random 12/13
acl deny11_12 random 11/12
acl deny10_11 random 10/11
acl deny9_10 random 9/10
acl deny8_9 random 8/9
acl deny7_8 random 7/8
acl deny6_7 random 6/7
acl deny5_6 random 5/6
acl deny4_5 random 4/5
acl deny3_4 random 3/4
acl deny2_3 random 2/3
acl deny1_2 random 1/2
acl deny0_1 random 1/1

never_direct allow all

cache_peer 206.66.98.244 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P25
cache_peer 206.66.98.234 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P24
cache_peer 206.66.98.223 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P23
cache_peer 206.66.98.217 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P22
cache_peer 206.66.98.211 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P21
cache_peer 206.66.98.188 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P20
cache_peer 206.66.98.148 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P19
cache_peer 206.66.98.123 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P18
cache_peer 206.66.98.119 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P17
cache_peer 206.66.98.110 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P16
cache_peer 206.66.98.105 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P15
cache_peer 206.66.98.99 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P14
cache_peer 206.66.98.84 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P13
cache_peer 206.66.98.83 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P12
cache_peer 206.66.98.71 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P11
cache_peer 206.66.98.63 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P10
cache_peer 206.66.98.42 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P9
cache_peer 206.66.98.34 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P8
cache_peer 206.66.98.14 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P7
cache_peer 206.66.98.200 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P6
cache_peer 206.66.98.152 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P5
cache_peer 206.66.98.120 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P4
cache_peer 206.66.98.56 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P3
cache_peer 206.66.98.32 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P2
cache_peer 206.66.98.2 parent 60099 0 no-query default login=zackX:CKtL6Iza name=P1

cache_peer_access P25 deny deny24_25
cache_peer_access P24 deny deny23_24
cache_peer_access P23 deny deny22_23
cache_peer_access P22 deny deny21_22
cache_peer_access P21 deny deny20_21
cache_peer_access P20 deny deny19_20
cache_peer_access P19 deny deny18_19
cache_peer_access P18 deny deny17_18
cache_peer_access P17 deny deny16_17
cache_peer_access P16 deny deny15_16
cache_peer_access P15 deny deny14_15
cache_peer_access P14 deny deny13_14
cache_peer_access P13 deny deny12_13
cache_peer_access P12 deny deny11_12
cache_peer_access P11 deny deny10_11
cache_peer_access P10 deny deny9_10
cache_peer_access P9 deny deny8_9
cache_peer_access P8 deny deny7_8
cache_peer_access P7 deny deny6_7
cache_peer_access P6 deny deny5_6
cache_peer_access P5 deny deny4_5
cache_peer_access P4 deny deny3_4
cache_peer_access P3 deny deny2_3
cache_peer_access P2 deny deny1_2
cache_peer_access P1 deny deny0_1
FYI, this setup is easy to test:
#!/usr/bin/env python3
import requests

url = "http://ip-api.com/json"
proxy = {"http": "http://droplet.ip.address:3128"}
r = requests.get(url, proxies=proxy)
print("Response:\n{}".format(r.text))
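A single request only shows one exit IP. To see the rotation at work, one could repeat the request and tally the addresses that come back. Here is a minimal sketch, assuming ip-api.com reports the caller's IP in a `query` field and reusing the placeholder proxy address from the test above. The function names are made up for illustration, and it uses the standard library's urllib instead of requests so it runs without extra packages.

```python
import json
import urllib.request
from collections import Counter

PROXY = "http://droplet.ip.address:3128"  # placeholder, same as the test above
URL = "http://ip-api.com/json"


def fetch_exit_ip(proxy=PROXY, url=URL, timeout=10):
    """Return the IP address the remote service saw for one proxied request."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy})
    )
    with opener.open(url, timeout=timeout) as response:
        # Assumption: the JSON body carries the caller's IP under "query".
        return json.load(response)["query"]


def count_exit_ips(fetch, attempts=50):
    """Call fetch() repeatedly and tally how often each exit IP appears."""
    return Counter(fetch() for _ in range(attempts))


# Example (needs the squid proxy to be up and reachable):
#   for ip, hits in count_exit_ips(fetch_exit_ip).most_common():
#       print(ip, hits)
```

If the pool is rotating, the tally should list several different addresses instead of one.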
Just think: if one is programmatically (via API) leasing new proxy pools, one could easily write a script to procedurally generate this config file and restart this squid proxy…
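A rough sketch of such a generator follows. It emits only the acl / cache_peer / cache_peer_access portion, copying the line pattern of the config above (including the final `random 1/1` entry). The function name, example IPs, port, and login are all placeholders, not anything from a real provider.

```python
def render_peer_config(peers, port, login):
    """Render the peer-rotation part of squid.conf for a list of peer IPs.

    Copies the line pattern of the hand-written config: peers are named
    P1..PN and listed from the highest number down.
    """
    acls, cache_peers, accesses = [], [], []
    for k in range(len(peers), 0, -1):
        acl_name = "deny{}_{}".format(k - 1, k)
        # The hand-written config uses 1/1 (not 0/1) for the last entry.
        fraction = "{}/{}".format(k - 1, k) if k > 1 else "1/1"
        acls.append("acl {} random {}".format(acl_name, fraction))
        cache_peers.append(
            "cache_peer {} parent {} 0 no-query default login={} name=P{}".format(
                peers[k - 1], port, login, k
            )
        )
        accesses.append("cache_peer_access P{} deny {}".format(k, acl_name))
    lines = acls + ["never_direct allow all"] + cache_peers + accesses
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical pool: documentation-range IPs and a dummy login.
    example_peers = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    print(render_peer_config(example_peers, 60099, "user:password"))
```

One would write the result below the header rules in /etc/squid/squid.conf and then restart squid (service squid stop, service squid start).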
Happy hacking 🙂