Basic text scraping from websites in Python
In my experience, Python is one of the easier programming languages to start with, mostly because its syntax is very readable and minimal. For Python apps, that means lower maintenance costs and easier updates later on.
I was most impressed by the large base of components that can extend your application. The Python core offers a lot, but, similar to node.js, it can be extended with functionality written by the community.
Python is one of the languages that Google uses for scraping.
Some other options for scraping: http://stackoverflow.com/questions/2861/options-for-html-scraping
Scraping (extracting) text from websites is useful for translators, data analysts and others, and the result can be saved as a text file or as a local data file (.csv etc.).
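As a minimal sketch of the .csv output mentioned above, the snippet below writes a few rows to a file with Python's built-in csv module. The function name save_rows_to_csv, the column names and the sample rows are made up for illustration; real scraped data would take their place.

```python
import csv

def save_rows_to_csv(rows, path, fieldnames=("tag", "text")):
    """Write a list of dicts to a CSV file and return the number of rows written."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Hypothetical sample data standing in for real scraped results
sample_rows = [
    {"tag": "h1", "text": "Welcome"},
    {"tag": "p", "text": "Hello, world"},
]

if __name__ == "__main__":
    count = save_rows_to_csv(sample_rows, "downloaded.csv")
    print(f"Saved {count} rows")
```

A .csv file like this opens directly in a spreadsheet, which is usually enough for small experiments before reaching for a real database.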
For example – web scraping is useful when you need to:
- extract all the images from the website with “one click”
- analyze the words on a website (which words are the most common etc.)
- extract all the links from the website
- do more advanced analysis of data that you scraped over a period of time (weather data … )
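A few of the tasks in this list can be sketched in a handful of lines. The example below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything, and the class name LinkImageCollector, the helper most_common_words and the sample HTML are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class LinkImageCollector(HTMLParser):
    """Collect link targets, image sources and text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Text between tags; attribute values such as alt are not included
        self.text_parts.append(data)

def most_common_words(text, n=3):
    """Return the n most frequent lowercase words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# Made-up page content standing in for a downloaded website
SAMPLE_HTML = """
<h1>Weather report</h1>
<p>Rain today, rain tomorrow, sun on Sunday.</p>
<a href="/archive">Archive</a>
<img src="/img/rain.png" alt="rain">
"""

collector = LinkImageCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
print(collector.images)
print(most_common_words(" ".join(collector.text_parts), n=1))
```

The word pattern only matches the letters a to z, so for languages with other characters it would need to be adjusted.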
List of the most popular languages for scraping and their libraries:
- Python. Beautiful Soup. lxml. HTQL. Scrapy. …
- Ruby. Nokogiri. Hpricot. Mechanize. scrAPI. …
- .NET. Html Agility Pack. WatiN.
- Perl. WWW::Mechanize. Web-Scraper.
- Java. Tag Soup. HtmlUnit. Web-Harvest. jARVEST. …
- JavaScript. request. cheerio. artoo. node-horseman. …
- PHP. htmlSQL. PHP Simple HTML DOM Parser.
- Most of them. Screen-Scraper.
This is not an online app, because for that we would need to set up a web framework such as Django that would allow us to run Python online.
In this experiment, I scraped Wikipedia’s main page and generated a .txt file. In the code, you can see that we look for h1, h2, h3 … tags. This is less than ideal, because the scraper could easily miss text on sites that don’t use the tags we defined.
Similar tests were done scraping pictures from websites, which is also very useful for saving time.
If you have never worked with Python, there are some great tutorials for beginners on YouTube. If you are feeling brave, go do one 🙂
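One possible way around the fixed tag list is to collect every piece of text on the page, whatever tag it sits in. The sketch below is an assumption about how that could look, not the script used in this experiment: it relies on the standard library's html.parser rather than BeautifulSoup, and the names VisibleTextExtractor and extract_visible_text are invented for the example.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect every text node, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def extract_visible_text(html):
    """Return a list of text chunks found anywhere in the HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

# Made-up page: neither <div> nor <h4> is in the tag list used in this post
SAMPLE_HTML = """
<div>Text in a div</div>
<script>var hidden = 1;</script>
<h4>A heading the tag list would miss</h4>
"""
print(extract_visible_text(SAMPLE_HTML))
```

The trade-off is that this grabs menus, footers and other clutter too, so a tag list can still be the better choice when you know the structure of the site.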
Protip:
If you get a UTF-8 error, you can run these commands (in the console, in the folder where the .py script is):
“chcp 65001” and then “set PYTHONIOENCODING=utf-8”
I am sure there is a more elegant way to avoid this error, but I am still a beginner 🙂
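One option that might count as more elegant, assuming the error is a UnicodeEncodeError raised when printing to the console and that you are on Python 3.7 or newer, is to switch the output streams to UTF-8 from inside the script. The helper name force_utf8_output is made up for this sketch.

```python
import sys

def force_utf8_output():
    """Switch stdout and stderr to UTF-8 where the stream supports it.

    Returns True if at least one stream was reconfigured.
    """
    changed = False
    for stream in (sys.stdout, sys.stderr):
        # reconfigure() exists on standard text streams since Python 3.7;
        # replaced or redirected streams may not have it
        reconfigure = getattr(stream, "reconfigure", None)
        if callable(reconfigure):
            reconfigure(encoding="utf-8")
            changed = True
    return changed

force_utf8_output()
print("Ünïcödé text should now print without an encoding error")
```

Calling the helper once at the top of the script should be enough. The console itself still needs a font that can display the characters.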
As I mentioned earlier, there are plenty of components to do various things. For this example, we are using BeautifulSoup:
#! python3
import requests
from bs4 import BeautifulSoup

# What to search
urls = ['http://google.com']
filename = "downloaded"

# Tags to search for
tag_names = ['h1', 'h2', 'h3', 'p', 'a', 'ul', 'span', 'input']

with open(filename + '.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content, "lxml")
        tags = soup.find_all(tag_names)
        # Join all the text nodes found inside each matched tag
        texts = [''.join(s.find_all(string=True)) for s in tags]
        for item in texts:
            print(item, file=outfile)

print("Done! File is saved where you have your scrape-website.py")
TutsPlus has some fine tutorials on scraping if you are interested.
If you have an idea how this tool could help you, let us know in the comments or write to us at info@2gika.si
Next time, we will be exploring node.js as a scraping tool.
Till next time 🙂