Guide The Ghost In The Machine!
Splinter is an abstraction layer on top of existing browser automation tools such as Selenium. It has a high-level API that makes it easy to write automated tests of web applications or anything else you can think of.
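To give you a feel for how high-level the API is, here's a minimal sketch based on the search example in Splinter's own docs (the btnK button name belongs to Google and could change at any time):

from splinter import Browser

with Browser() as browser:
    browser.visit('https://google.com')
    # Fill the input named 'q' and click the search button
    browser.fill('q', 'splinter python')
    browser.find_by_name('btnK').click()
    print(browser.is_text_present('splinter'))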
We'll start by importing the required libraries, installing uBlock Origin and then opening a website with Chrome. The ublock_origin.crx file I used in this tutorial is packaged from uBlock Origin (see the Resources at the end).
from splinter import Browser
from selenium.webdriver.chrome.options import Options
from time import sleep

chrome_options = Options()
chrome_options.add_extension('ext/ublock_origin.crx')

with Browser('chrome', chrome_options=chrome_options, headless=False) as browser:
    url = 'https://recycledrobot.co.uk/words'
    browser.visit(url)
    sleep(5)
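A quick note on that headless=False flag: it keeps the browser window visible so you can watch the robot work. If you'd rather it ran invisibly in the background, flipping it to True should be all you need (a sketch, reusing the same options):

with Browser('chrome', chrome_options=chrome_options, headless=True) as browser:
    browser.visit('https://recycledrobot.co.uk/words')
    # No window appears, but the page still loads
    print(browser.title)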
There are lots of links on that page. Let's grab the URLs, stick them in a list and visit them all, hovering the robot mouse over the next/prev links (if they exist) for no real reason.
with Browser('chrome', chrome_options=chrome_options, headless=False) as browser:
    url = 'https://recycledrobot.co.uk/words'
    browser.visit(url)

    # Grab every link whose href contains a '?'
    urls = browser.find_link_by_partial_href('?')

    all_urls = []
    for link in urls:
        url = link['href']
        all_urls.append(url)

    # Visit each page, hovering over the prev/next links if they exist
    for url in all_urls:
        print(url)
        browser.visit(url)

        hovers = ['.prev', '.next']
        for hover in hovers:
            if browser.is_element_present_by_css(hover):
                browser.find_by_css(hover).mouse_over()
                sleep(2)
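Those is_element_present_by_css checks also accept an optional wait_time, which Splinter spends polling for the element before giving up; handy on slower pages (a sketch):

# Wait up to ten seconds for the element before moving on
if browser.is_element_present_by_css('.next', wait_time=10):
    browser.find_by_css('.next').mouse_over()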
If you want to use BeautifulSoup, you can grab the HTML of the page with browser.html and start scraping.
from splinter import Browser
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup


def lovely_soup(r):
    return BeautifulSoup(r, 'lxml')


chrome_options = Options()
chrome_options.add_extension('ext/ublock_origin.crx')

with Browser('chrome', chrome_options=chrome_options, headless=False) as browser:
    url = 'https://recycledrobot.co.uk/words'
    browser.visit(url)

    # Grab every link whose href contains a '?'
    urls = browser.find_link_by_partial_href('?')

    all_urls = []
    for link in urls:
        url = link['href']
        all_urls.append(url)

    for url in all_urls:
        print(url)
        browser.visit(url)

        hovers = ['.prev', '.next']
        for hover in hovers:
            if browser.is_element_present_by_css(hover):
                browser.find_by_css(hover).mouse_over()
                sleep(2)

        # Hand the rendered HTML to BeautifulSoup and pull out the title
        soup = lovely_soup(browser.html)
        title = soup.find('h2').text
        print(title)
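Once you have the soup it's standard BeautifulSoup from there; pulling every paragraph off the page, for example, is just (a sketch):

# Print the text of every <p> tag on the current page
for p in soup.find_all('p'):
    print(p.text)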
Thanks for reading. x
Resources
- Python: https://python.org
- Splinter: https://splinter.readthedocs.io
- uBlock Origin: https://github.com/gorhill/uBlock