
Web Scraping with Python: Step by Step Guide

Published By

James Hook


The need to extract data from websites keeps growing. Whenever we carry out data-related projects such as price monitoring, business analysis, or a news aggregator, we need to collect data from websites, and copying and pasting it line by line is no longer practical. In this article, we will show you how to become an “expert” at extracting data from websites, which means doing web scraping with Python.

Introduction

Web scraping is a technique that helps us transform unstructured HTML data into structured data in a spreadsheet or database. Besides writing Python code, accessing website data through an API or using data extraction tools such as Octoparse are alternative options for web scraping.

Some large websites, such as Airbnb or Twitter, provide APIs for developers to access their data. API stands for application programming interface, the channel through which two applications communicate with each other. For most people, an API is the optimal way to obtain data that a website itself provides.

However, most websites do not offer API services, and even when they do, the data you can get may not be what you want. Therefore, writing a Python script to build a web crawler is a powerful and flexible alternative.

So why should we use Python instead of other languages?

Flexibility: Websites are updated quickly, and not only the content but also the page structure changes frequently. Python is an easy-to-use, dynamically typed, and highly productive language, so we can easily change our code to keep up with the pace of web updates.

Powerful: Python has a large collection of mature libraries. For example, requests and beautifulsoup4 help us fetch URLs and extract information from web pages. Selenium helps us get around some anti-scraping techniques by giving the web crawler the ability to mimic human browsing behavior. In addition, re, numpy, and pandas help us clean and process the data.

Now let’s start our journey in web scraping using Python!

Step 1: Import the Python library

In this tutorial, we will show you how to scrape Yelp reviews. We will use two libraries: BeautifulSoup from bs4 and urlopen from urllib.request. These two libraries are commonly used when building a web crawler with Python. The first step is to import them into Python so that we can use their functions.
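For reference, the imports might look like the following minimal sketch:

# Import the two libraries used throughout this tutorial.
from bs4 import BeautifulSoup       # parses HTML into a navigable tree
from urllib.request import urlopen  # fetches the raw HTML of a web page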

Step 2: Extract the HTML from the web page

We need to extract reviews from “ https://www.yelp.com/biz/milk-and-cream-cereal-bar-new-york?osq=Ice+Cream ”. First, let’s save the URL in a variable called URL. Then we can access the content of this web page and save the HTML in “ourUrl” using the urlopen() function from urllib.request.
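A minimal sketch of this step, with the variable names used in the article (note that some sites reject requests that do not send a browser-like User-Agent header):

URL = "https://www.yelp.com/biz/milk-and-cream-cereal-bar-new-york?osq=Ice+Cream"
ourUrl = urlopen(URL).read()  # raw HTML of the page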

Then we apply BeautifulSoup to analyze the page.

Now that we have the “soup”, which is the parsed HTML of this website, we can use a function called prettify() to format the raw data and print it to see the nested HTML structure in the “soup”.
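In code, parsing the page and printing its structure might look like this:

soup = BeautifulSoup(ourUrl, "html.parser")  # parse the raw HTML into a "soup"
print(soup.prettify())  # nicely indented view of the nested HTML structure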

Step 3: Locate and scrape the reviews

Next, we need to find the reviews in the HTML of this web page, then extract and store them. Each element on the web page has a unique HTML “ID”. To check an element’s ID, we need to inspect it on the web page.

After clicking “Inspect element” (or “Inspect”, depending on the browser), we can see the HTML of the reviews.

In this case, the reviews are found under the tag called “p”. So, first we use the find_all() function to find the parent node of these reviews, and then locate all the elements tagged “p” under the parent node in a loop. After finding all the “p” elements, we store them in an empty list called “review”.
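A sketch of this step is shown below; the container class name "review-content" is a placeholder, so inspect the page to find the actual element that wraps the reviews:

review = []  # empty list that will hold the review elements
# Find the parent node(s) of the reviews; the class name here is hypothetical.
for parent in soup.find_all("div", {"class": "review-content"}):
    # Collect every "p" element under the parent node.
    for p in parent.find_all("p"):
        review.append(p)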

Now we have all the reviews on that page. Let’s see how many reviews we have extracted.
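For example:

print(len(review))  # number of reviews extracted from this page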

Step 4: Clean the reviews

You should keep in mind that there is still some useless text, such as “ <p lang = ‘en’> ” at the beginning of each review, “ <br/> ” in the middle of the reviews, and “ </p> ” at the end of each review.

“ <br/> ” represents a line break. We do not need any line breaks in the reviews, so they will have to be removed. In addition, “ <p lang = ‘en’> ” and “ </p> ” are the beginning and end of the HTML for each review, and we must delete them as well.
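The article does not show the exact cleaning code; one common way to strip those tags with BeautifulSoup is sketched below. get_text() drops the enclosing tags, and the separator argument puts a space where each line break used to be:

clean_review = []
for p in review:
    # get_text() removes the surrounding <p lang="en"> ... </p> tags, and
    # separator=" " replaces each <br/> line break with a single space.
    clean_review.append(p.get_text(separator=" ", strip=True))
print(clean_review[:2])  # preview the first two cleaned reviews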

Finally, we successfully obtain all the clean reviews with fewer than 20 lines of code.

This is just a demo of scraping 20 Yelp reviews. In real cases, we may have to handle many other situations. For example, we would need steps such as pagination to go to other pages and extract the remaining reviews for this store, or we might have to extract other information such as the reviewer’s name, the reviewer’s location, the review time, the rating, check-ins, and so on.

To implement the operations above and obtain more data, we would need to learn more functions and libraries, such as Selenium or regular expressions. It is worth spending more time delving into the challenges of web scraping.

However, if you are looking for a simpler way to do web scraping, Octoparse could be your solution. Octoparse is a powerful web scraping tool that can help you easily obtain information from websites. Check out this tutorial on how to scrape Yelp reviews with Octoparse. Do not hesitate to contact us when you need a powerful web scraping tool for your business or project!
