Web Scraping using XPath
With more sophisticated web scraping libraries like BeautifulSoup around, for my project I decided to go with a relatively basic method: inspecting the source code of a page and copying the XPath of the elements I need. This makes navigation easy and can offer a more intuitive approach to web scraping.
The general structure goes like this…
- Define key functions for web scraping
- Open URL of the desired webpage to scrape
- Identify information to extract from the page
- Inspect HTML source code of the specific containers containing the information you need
- To iterate, compare the XPaths of two different rows of data, note the similarities and differences between them, and build a for-loop around the differences
- Repeat steps 3–5 for the other information to extract
- Save arrays of data to csv format
Let’s Try It Out!
We will apply the step-by-step guide described above to the ATP Tour website. For this example, we want to extract information on the different tournaments played during the season.
1. Key Functions
First, we have to define the key functions for web scraping. These are highly reusable functions that can be applied to other web scraping projects.
These are the 3 key functions that can be used on any HTML/XML source code. A brief explanation of each follows:
a. array2csv — This function is fairly straightforward: it converts arrays into CSV format, which is convenient for storage.
b. html_parse_tree — This function uses the requests library to fetch the source code from the URL and the html.fromstring() method to parse the content.
c. xpath_parse — This function uses the .xpath() method to query a specific XPath expression against the HTML tree produced by the previous function.
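A minimal sketch of these three functions, based on the descriptions above (using the requests and lxml libraries, and assuming the arrays are lists of rows):

```python
import csv

import requests
from lxml import html


def array2csv(array, filename):
    # a. Write an array (a list of rows) to a CSV file for storage
    with open(filename, "w", newline="") as f:
        csv.writer(f).writerows(array)


def html_parse_tree(url):
    # b. Fetch the source code from the URL and parse it into an HTML tree
    page = requests.get(url)
    return html.fromstring(page.content)


def xpath_parse(tree, xpath):
    # c. Query a specific XPath expression against the HTML tree
    return tree.xpath(xpath)
```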
2. Open URL of the Webpage
URL: https://www.atptour.com/en/scores/results-archive?year=2019
The webpage is displayed in this format:
From the screenshot above, we can see a listing of the different tournaments played during the 2019 season. It includes information such as the tournament type, tournament name, location, dates, number of entries, surface and conditions, financial commitments and the winner for both the singles and doubles tournament.
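With the html_parse_tree function from step 1 in scope, opening the page is a one-liner; the live fetch is left commented out here so the sketch runs offline:

```python
import requests
from lxml import html


def html_parse_tree(url):
    # Fetch the page and parse its source into an element tree (step 1b)
    page = requests.get(url)
    return html.fromstring(page.content)


url = "https://www.atptour.com/en/scores/results-archive?year=2019"
# tree = html_parse_tree(url)  # uncomment to fetch the live page
```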
3. Identify Information to Extract
Let’s say we want to get a table of all the tournaments for the build-up to the first Grand Slam of the year, The Australian Open. In the tables, we want to include the following information:
- Tournament Name
- Location
- Condition
- Surface
4. Inspect HTML source code of the specific containers
Now that we have defined our desired final table, we start by extracting the tournament name. We begin by inspecting the container holding the information: hover the cursor over the tournament name and right-click it to inspect the element, as displayed below:
After clicking Inspect, the browser directs you to the HTML code containing the specific information.
As this web-scraping method is optimized for XPath queries, we right-click the highlighted container in the inspection panel to copy its XPath, as shown below:
In this case, we would obtain a specific XPath of:
//*[@id="scoresResultsArchive"]/table/tbody/tr[1]/td[3]/span[1]
This line describes the path to navigate through the HTML tree to reach the element we want. We can then query the XPath expression with the xpath_parse function we defined earlier:
Note the minor change in the XPath used: we appended "/text()" to the end of the expression to return the text inside the container. As we can see from the output, we obtain a string containing "Doha" as expected, but with lots of redundant white space. To solve this problem, we can use the .strip() method to clean the string.
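Putting this together, with a small static snippet standing in for the live page so the example runs offline (the real tree comes from html_parse_tree):

```python
from lxml import html

# Static stand-in for the live results-archive page; the real tree
# comes from html_parse_tree(url) as in step 2
snippet = """
<div id="scoresResultsArchive"><table><tbody>
  <tr><td></td><td></td><td><span>   Doha   </span></td></tr>
</tbody></table></div>
"""
tree = html.fromstring(snippet)

# The copied XPath, with /text() appended to return the cell's text
xpath = "//*[@id='scoresResultsArchive']/table/tbody/tr[1]/td[3]/span[1]/text()"
result = tree.xpath(xpath)

print(result[0])          # raw string with redundant white space
print(result[0].strip())  # cleaned string: "Doha"
```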
5. Iteration over Rows of Data
As easy as the process above may be, it would still be troublesome to repeat it manually for thousands of rows of data, so it is more effective to build a for-loop around it. The best way to do so is to repeat the same steps as above for the second row of data and compare the two XPaths obtained.
XPath for Row 1:
//*[@id="scoresResultsArchive"]/table/tbody/tr[1]/td[3]/span[1]
XPath for Row 2:
//*[@id="scoresResultsArchive"]/table/tbody/tr[2]/td[3]/span[1]
Can you spot the difference?
Everything is identical except for the index inside tr[]. We can use this information to build a for-loop that extracts all the tournament names we want. An example is shown below, extracting all the tournament names up until the Australian Open.
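A sketch of that loop, again over an offline stand-in tree (here with three rows; on the live page the cut-off row for the Australian Open would be found by inspection):

```python
from lxml import html

# Static stand-in with three tournament rows for illustration
snippet = """
<div id="scoresResultsArchive"><table><tbody>
  <tr><td></td><td></td><td><span> Doha </span></td></tr>
  <tr><td></td><td></td><td><span> Pune </span></td></tr>
  <tr><td></td><td></td><td><span> Brisbane </span></td></tr>
</tbody></table></div>
"""
tree = html.fromstring(snippet)

# Only the index inside tr[] changes between rows, so loop over it
tournament_names = []
for i in range(1, 4):
    xpath = f"//*[@id='scoresResultsArchive']/table/tbody/tr[{i}]/td[3]/span[1]/text()"
    tournament_names.append(tree.xpath(xpath)[0].strip())

print(tournament_names)  # ['Doha', 'Pune', 'Brisbane']
```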
6. Repeat Steps 3–5 for Other Variables
The same steps can be applied to the other variables we want, such as location, condition, and surface.
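One way to organize this is a dictionary of relative XPaths appended to the shared row prefix; the td/span indices below are assumptions for illustration only — on the live page you would find each one by inspecting the relevant cell, as in step 4:

```python
from lxml import html

# Static stand-in rows; the span indices here are illustrative only
snippet = """
<div id="scoresResultsArchive"><table><tbody>
  <tr><td></td><td></td><td><span> Doha </span><span> Doha, Qatar </span></td></tr>
  <tr><td></td><td></td><td><span> Pune </span><span> Pune, India </span></td></tr>
</tbody></table></div>
"""
tree = html.fromstring(snippet)

# Relative XPath per field (assumed indices), appended to the row prefix
fields = {
    "tournament": "td[3]/span[1]/text()",
    "location":   "td[3]/span[2]/text()",
}
rows = []
for i in range(1, 3):
    prefix = f"//*[@id='scoresResultsArchive']/table/tbody/tr[{i}]/"
    rows.append([tree.xpath(prefix + rel)[0].strip() for rel in fields.values()])

print(rows)  # [['Doha', 'Doha, Qatar'], ['Pune', 'Pune, India']]
```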
7. Save Arrays to CSV
Using the array2csv function defined above, we can save the data in CSV format under any filename we want. Here is the final result of this web-scraping tutorial as a pandas DataFrame!
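A final sketch of the save step; the header and the single example row below are illustrative stand-ins for the arrays built in steps 5–6:

```python
import csv


def array2csv(array, filename):
    # Write rows to a CSV file (as in step 1a)
    with open(filename, "w", newline="") as f:
        csv.writer(f).writerows(array)


# Illustrative data; the real arrays come from the loops in steps 5-6
header = ["Tournament", "Location", "Condition", "Surface"]
rows = [["Qatar ExxonMobil Open", "Doha, Qatar", "Outdoor", "Hard"]]
array2csv([header] + rows, "tournaments.csv")

# The file can then be loaded for inspection,
# e.g. pandas.read_csv("tournaments.csv")
with open("tournaments.csv") as f:
    print(f.read())
```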
The web-scraping section of my project can be found on my GitHub.