The need to extract data from websites keeps growing. Projects such as price monitoring, business analysis, or news aggregation all depend on collecting data from websites, and copying and pasting it line by line is no longer practical. In this article, we will show you how to become an "expert" at extracting data from websites, which means doing web scraping with Python.
Web scraping is a technique that can help us transform unstructured HTML data into structured data in a spreadsheet or database. Besides writing Python code, accessing website data through an API or using data extraction tools such as Octoparse are other options for web scraping.
Some large websites, like Airbnb or Twitter, provide APIs for developers to access their data. API stands for application programming interface, a channel through which two applications communicate with each other. For most people, an API is the best way to obtain data that the website itself provides.
However, most websites do not have API services. Sometimes, even if they provide API, the data you could get is not what you want. Therefore, writing a Python script to create a web crawler becomes another powerful and flexible solution.
So why should we use Python instead of other languages?
Flexibility: As we know, websites change quickly. Not only the content but also the structure of a page can change frequently. Python is an easy-to-use language because it is dynamically typed and highly productive, so people can modify their code quickly and keep up with the pace of web updates.

Power: Python has a large collection of mature libraries. For example, requests and beautifulsoup4 help us fetch URLs and extract information from web pages. Selenium helps us get around some anti-scraping techniques by giving the web crawler the ability to mimic human browsing behavior. In addition, re, numpy, and pandas help us clean and process the data.
Now let’s start our journey in web scraping using Python!
Step 1: Import the Python library
In this tutorial, we will show you how to scrape Yelp reviews. We will use two libraries: BeautifulSoup from bs4 and urlopen from urllib.request. These two libraries are commonly used when building a web crawler with Python. The first step is to import them so that we can use their functions.
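A minimal sketch of this step might look like the following (note that beautifulsoup4 is a third-party package, typically installed with pip, while urllib.request is part of the standard library):

```python
# Import the two libraries used throughout this tutorial.
from urllib.request import urlopen   # fetches raw HTML over HTTP
from bs4 import BeautifulSoup        # parses HTML into a navigable tree
```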
Step 2: Extract the HTML from the web page
We want to extract the reviews from "https://www.yelp.com/biz/milk-and-cream-cereal-bar-new-york?osq=Ice+Cream". First, let's save the address in a variable called URL. Then we can download the content of this web page and save the HTML in "ourUrl" using the urlopen() function from urllib.request.
Then we apply BeautifulSoup to analyze the page.
Now that we have the "soup", which is the parsed HTML of this website, we can use a function called prettify() to print the HTML with indentation and see the nested structure inside the "soup".
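The fetch-and-parse flow can be sketched as below. Since Yelp may block or rate-limit automated requests, the live download is kept in a helper function, and the demonstration parses a small stand-in HTML snippet instead (the snippet and the fetch_html helper are illustrative, not Yelp's real markup or API):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = "https://www.yelp.com/biz/milk-and-cream-cereal-bar-new-york?osq=Ice+Cream"

def fetch_html(url):
    # Download the raw HTML of the page.
    # May fail if the site blocks automated clients.
    return urlopen(url).read()

# Stand-in HTML so the example runs without a network connection.
sample_html = "<html><body><p lang='en'>Great ice cream!</p></body></html>"

# In a live run you would pass fetch_html(URL) here instead.
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.prettify())   # indented view of the nested HTML structure
```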
Step 3: Locate and scrape the reviews
Next, we need to find the reviews in the page's HTML, then extract and store them. Each element on a web page can be identified by its HTML tag and attributes, such as its class or id. To find them, right-click the page and use the browser's Inspect feature.
After clicking "Inspect element" (or "Inspect", depending on the browser), we can see the HTML of the reviews.
In this case, the reviews are found under the tag called "p". So first we use the function find_all() to locate the parent node of the reviews, and then loop over all the elements tagged "p" under that parent. After finding all the "p" elements, we store them in an empty list called "review".
Now we have all the reviews on that page. Let’s see how many reviews we have extracted.
Step 4: Clean the reviews
You should keep in mind that there is still some useless markup, such as "<p lang='en'>" at the beginning of each review, "<br/>" in the middle of the reviews, and "</p>" at the end of each review.

"<br/>" represents a line break. We do not need line breaks inside the reviews, so they have to be removed. In addition, "<p lang='en'>" and "</p>" are the opening and closing HTML tags, and we must delete them as well.
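One simple way to do this cleaning, sketched here with BeautifulSoup's get_text() rather than manual string slicing: get_text() drops the surrounding tags, and its separator argument replaces each <br/> with a space.

```python
from bs4 import BeautifulSoup

# A single review element as it comes out of the previous step.
html = "<p lang='en'>Great ice cream!<br/>Long line though.</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# get_text() strips the <p lang='en'>...</p> wrapper; separator=" "
# turns the <br/> line break into a plain space.
clean = p.get_text(separator=" ")
print(clean)   # → Great ice cream! Long line though.
```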
Finally, we have successfully obtained all the clean reviews with fewer than 20 lines of code.
This is just a demo that scrapes 20 Yelp reviews. In real cases, we may face many other situations. For example, we may need steps such as pagination to visit other pages and extract the rest of this store's reviews, or we may want to extract other information such as the reviewer's name, the reviewer's location, the review time, the rating, the check-in status…
To implement the operations above and obtain more data, we would need to learn more functions and libraries, such as Selenium or regular expressions. It is worth spending more time digging into the challenges of web scraping.
However, if you are looking for a simpler way to do web scraping, Octoparse could be your solution. Octoparse is a powerful web scraping tool that can help you easily obtain information from websites. Check out this tutorial on how to scrape Yelp reviews with Octoparse. Do not hesitate to contact us when you need a powerful web scraping tool for your business or project!