Basic text scraping from websites in Python

In my experience, Python is one of the easier programming languages to start with, mostly because its syntax is readable and minimal. For Python apps, that means lower maintenance costs and easier updates later on.

I was most impressed by the large base of components that can extend your application. The core of Python offers a lot, but, similar to node.js, it can be extended with additional functionality written by the community.

Python is one of the languages that Google uses for scraping

As a test, we scraped Wikipedia’s main page.

Some other options for scraping: http://stackoverflow.com/questions/2861/options-for-html-scraping

Scraping (extracting) text from websites is useful for translators, data analysts and others, and the results can be saved as plain text or as a local data file (.csv etc.).
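As a rough illustration of the .csv option, the sketch below writes a few rows with Python’s built-in csv module. The sample rows, the column names and the rows_to_csv helper are made up for this example; real rows would come from the scraper.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, header row first."""
    # newline="" leaves the csv module in charge of line endings
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped data: one dict per extracted element
sample_rows = [
    {"tag": "h1", "text": "Welcome to Wikipedia"},
    {"tag": "p", "text": "The free encyclopedia, in many languages"},
]

csv_text = rows_to_csv(sample_rows, ["tag", "text"])
print(csv_text)
```

To save the result to disk, the same writer can be pointed at a file opened with open(name, 'w', newline='', encoding='utf-8') instead of the in-memory buffer.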

For example, web scraping is useful when you need to:

extract all the images from the website with “one click”
analyze the words on the website (which words are the most common etc.)
extract all the links from the website
do more advanced analysis of data that you scraped over a period of time (weather data, for example)
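Two of the tasks above, extracting links and counting the most common words, can be sketched with nothing but the standard library. This is only a minimal sketch: the sample HTML, the sample text and the names LinkCollector and most_common_words are invented for the example, and the word pattern assumes plain English text.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def most_common_words(text, n=3):
    """Return the n most frequent words as (word, count) pairs."""
    # Assumption: ASCII letters and apostrophes are enough for the text
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical inputs standing in for a downloaded page
sample_html = (
    '<p>See <a href="/about">About</a> and '
    '<a href="https://example.com/contact">Contact</a>.</p>'
)
sample_text = "Scraping is fun. Scraping saves time, and scraping is easy."

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
print(most_common_words(sample_text, n=2))
```

With BeautifulSoup, the link part would typically be a soup.find_all('a') call followed by reading each tag’s href attribute.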

The most popular languages for scraping and their libraries:

Python: Beautiful Soup, lxml, HTQL, Scrapy, …
Ruby: Nokogiri, Hpricot, Mechanize, scrAPI, …
.NET: Html Agility Pack, WatiN
Perl: WWW::Mechanize, Web-Scraper
Java: Tag Soup, HtmlUnit, Web-Harvest, jARVEST, …
JavaScript: request, cheerio, artoo, node-horseman, …
PHP: htmlSQL, PHP Simple HTML DOM Parser
Most of them: Screen-Scraper

This is not an online app; for that, we would need to set up a web framework such as Django, which would let us run the Python code on a server.

In this experiment, we scraped Wikipedia’s main page and generated a .txt file. In the code, you can see that we look for h1, h2, h3 … tags. This is less than ideal, because the scraper could easily miss text on sites that don’t use the tags we defined.
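To make that limitation concrete, here is a small standard-library sketch that compares a fixed tag list with collecting text from every tag. The sample HTML and the TextCollector and collect names are invented for the illustration; with BeautifulSoup, the "every tag" variant would usually be a single soup.get_text() call.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text from a fixed set of tags, or from every tag."""

    SKIPPED = {"script", "style"}  # never treated as readable text
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no end tag

    def __init__(self, wanted_tags=None):
        super().__init__()
        self.wanted_tags = wanted_tags  # None means "accept text anywhere"
        self.open_tags = []             # stack of currently open tags
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; tolerates sloppy markup
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIPPED for t in self.open_tags):
            return
        if self.wanted_tags is None or any(
            t in self.wanted_tags for t in self.open_tags
        ):
            self.texts.append(text)


def collect(html, wanted_tags=None):
    collector = TextCollector(wanted_tags)
    collector.feed(html)
    return collector.texts


# Hypothetical page: part of the text lives in tags outside the fixed list
sample_html = """
<h1>Title</h1>
<p>Intro paragraph.</p>
<div>Text in a div</div>
<table><tr><td>Text in a table cell</td></tr></table>
<script>var hidden = 1;</script>
"""

print(collect(sample_html, {"h1", "h2", "h3", "p"}))  # misses div and td text
print(collect(sample_html))                           # picks up all four texts
```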

The code is fairly simple and understandable, thanks also to the BeautifulSoup component.

Similar tests were done scraping pictures from websites, which is also a very useful time saver.
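The picture test is not shown in this post, so the following is only a sketch of how the first step, finding the image addresses, might look. It uses the standard library, and the base URL, the sample HTML and the ImageCollector and filename_for names are all assumptions made for the example.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img> tags that have a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page address
                self.image_urls.append(urljoin(self.base_url, src))


def filename_for(url):
    """Derive a local file name from the path part of an image URL."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical page and address, used only for illustration
base_url = "https://example.com/articles/"
sample_html = """
<img src="/static/logo.png">
<img src="photo.jpg" />
<img src="https://cdn.example.org/banner.gif">
<img alt="image without a source">
"""

collector = ImageCollector(base_url)
collector.feed(sample_html)
for url in collector.image_urls:
    print(url, "->", filename_for(url))
```

Actually saving the pictures would then mean downloading each URL (for example with requests.get(url).content) and writing the bytes to a file opened in 'wb' mode.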

If you have never worked with Python, there are some great tutorials for beginners on YouTube. If you are feeling brave, go do one 🙂

Protip:

If you get a UTF-8 error on Windows, you can run these two commands (in the console, in the folder where the .py script is):

“chcp 65001” and then “set PYTHONIOENCODING=utf-8”

I am sure there is a more elegant way to avoid this error, but I am still a beginner 🙂
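One possibly more elegant option, assuming Python 3.7 or newer, is to switch the output stream to UTF-8 from inside the script, so the console commands are not needed. The ensure_utf8 helper below is a made-up name for this sketch, not part of any library, and whether the characters display correctly still depends on the console font.

```python
import io
import sys


def ensure_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    # TextIOWrapper.reconfigure() exists since Python 3.7; other stream
    # types (for example io.StringIO) are left untouched.
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8", errors="replace")
        return True
    return False


if __name__ == "__main__":
    ensure_utf8(sys.stdout)
    print("Scraped text with non-ASCII characters: č š ž 🙂")
```

Calling ensure_utf8(sys.stdout) once at the top of the scraper should be enough. Writing to the output file is not affected either way, because the script already opens it with encoding='utf-8'.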

As I mentioned earlier, there are plenty of components to do various things. For this example, we are using BeautifulSoup:

#! python3

import requests
from bs4 import BeautifulSoup

# What to scrape, and the base name of the output file

urls = ['http://google.com']
filename = "downloaded"

# Tags whose text we want to extract

tags_to_find = ['h1', 'h2', 'h3', 'p', 'a', 'ul', 'span', 'input']

with open(filename + '.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:

        website = requests.get(url)
        # The "lxml" parser needs the lxml package to be installed
        soup = BeautifulSoup(website.content, "lxml")
        tags = soup.find_all(tags_to_find)
        # Join all text fragments found inside each matching tag
        texts = [''.join(tag.find_all(string=True)) for tag in tags]

        for item in texts:
            print(item, file=outfile)

print("Done! File is saved where you have your scrape-website.py")

TutsPlus has some fine tutorials on scraping if you are interested.

If you have an idea of how this tool could help you, let us know in the comments or write to us at info@2gika.si

Next time, we will be exploring node.js as a scraping tool.

Till next time 🙂