Python Web Scraping Get Text



The goal of this post is to do web scraping in Python and introduce basic NLP tasks such as computing word frequencies.


As you do more web scraping, you will find that the <a> tag is used for hyperlinks. Now that we've identified the location of the links, let's get started on coding! We start by importing the following libraries: requests, urllib.request, time and BeautifulSoup (from bs4).

The urllib and requests packages are used to scrape data from websites. Scraping means getting the HTML content of a particular website as text. urllib is the old way of doing this; requests is the newer, more high-level way, with which you don't have to worry about the low-level details of making a web request.

Secondly, as an example, we will scrape the book 'Moby Dick' from Project Gutenberg's website to find the most frequent word used in this book. The following packages are used in this notebook:

Web scraping is the technique of extracting data from a website. The BeautifulSoup module is designed for web scraping: it can handle both HTML and XML, and it provides simple methods for searching, navigating and modifying the parse tree.

  • urllib
  • requests
  • bs4 (BeautifulSoup)
  • nltk

Performing HTTP requests in Python using urllib


You have just packaged and sent a GET request to 'http://www.datacamp.com/teach/documentation' and then caught the response. You saw that such a response is an http.client.HTTPResponse object. The question remains: what can you do with this response?

Well, as it came from an HTML page, you could read it to extract the HTML and, in fact, such an http.client.HTTPResponse object has an associated read() method.
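A minimal sketch of this urllib flow, using example.com as a stand-in for the exercise URL (it assumes network access is available):

```python
from urllib.request import urlopen

# Package and send a GET request, then catch the response
url = 'https://www.example.com'  # stand-in for the exercise URL
response = urlopen(url)          # returns an http.client.HTTPResponse

# read() gives the raw bytes of the page; decode them to a string
html = response.read().decode('utf-8')
response.close()  # with urllib we close the connection ourselves

print(type(response).__name__)  # HTTPResponse
```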

Performing HTTP requests in Python using requests

Now that you've got your head and hands around making HTTP requests using the urllib package, you're going to figure out how to do the same using the higher-level requests library. You'll once again be pinging DataCamp servers for their 'http://www.datacamp.com/teach/documentation' page.

Note that unlike in the previous exercises using urllib, you don't have to close the connection when using requests!
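The same fetch with requests is shorter; a sketch, again using example.com as a stand-in for the exercise URL:

```python
import requests

url = 'https://www.example.com'  # stand-in for the exercise URL
r = requests.get(url)    # send the GET request and catch the response

print(r.status_code)     # 200 if the request succeeded
html = r.text            # requests decodes the body to a string for us
# no close() needed: requests manages the connection itself
```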

Scraping the web in Python


We have just scraped HTML data from the web. You have done so using two different packages: urllib and requests. You also saw that requests provides a higher-level interface, i.e., you needed to write fewer lines of code to retrieve the relevant HTML as a string.

HTML is a mix of unstructured and structured data.

In general, to turn the HTML that we got from the website into useful data, you will need to parse it and extract structured data from it. You can perform this task using the Python package BeautifulSoup.

The main object created and used when using this package is called BeautifulSoup. It has a very useful associated method called prettify(). Let's see how we can use BeautifulSoup. The first step is to scrape the HTML using the requests package.
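A short sketch of creating the soup and prettifying it, using a small inline HTML snippet in place of a freshly scraped page:

```python
from bs4 import BeautifulSoup

# A small inline snippet standing in for HTML scraped with requests
html_doc = ('<html><head><title>Moby Dick</title></head>'
            '<body><p>Call me Ishmael.</p></body></html>')

soup = BeautifulSoup(html_doc, 'html.parser')  # parse the tag soup
print(soup.prettify())  # one tag per line, indented by nesting depth
```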

Remember: The goal of using BeautifulSoup is to extract data from HTML.

Parsing HTML with BeautifulSoup

Use the BeautifulSoup package to parse, prettify and extract information from HTML.

Turning a webpage into data using BeautifulSoup: getting the text


Next, you'll learn the basics of extracting information from HTML soup. In this exercise, you'll figure out how to extract the text from the BDFL's webpage, along with printing the webpage's title.
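A sketch of both extractions, with an inline snippet standing in for the BDFL's webpage:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the scraped webpage
html_doc = ('<html><head><title>Guido van Rossum</title></head>'
            '<body><p>Welcome to my page.</p></body></html>')
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)       # the <title> tag itself
print(soup.get_text())  # all the text in the page, tags stripped
```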

Turning a webpage into data using BeautifulSoup: getting the hyperlinks


In this exercise, you'll figure out how to extract the URLs of the hyperlinks from the BDFL's webpage. In the process, you'll become close friends with the soup method find_all().
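A sketch of pulling the URLs out with find_all(), again using a small inline page (the two links are made-up examples):

```python
from bs4 import BeautifulSoup

# Inline HTML with two hypothetical hyperlinks
html_doc = ('<html><body>'
            '<a href="https://www.python.org">Python</a>'
            '<a href="https://www.python.org/~guido/">BDFL</a>'
            '</body></html>')
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all('a') returns every hyperlink tag in the document
for link in soup.find_all('a'):
    print(link.get('href'))  # the URL held in the tag's href attribute
```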

Example: Word frequency in Moby Dick

What are the most frequent words in Herman Melville's novel, Moby Dick, and how often do they occur? In this notebook, we'll scrape the novel Moby Dick from the website Project Gutenberg (which contains a large corpus of books) using the Python package requests. Then we'll extract words from this web data using BeautifulSoup. Finally, we'll dive into analyzing the distribution of words using the Natural Language Toolkit (nltk).

The Data Science pipeline we'll build in this notebook can be used to visualize the word frequency distributions of any novel that you can find on Project Gutenberg. The natural language processing tools used here apply to much of the data that data scientists encounter, as a vast proportion of the world's data is unstructured and includes a great deal of text.

Let's start by loading in the three main Python packages we are going to use.
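Something like the following import cell, one line per package:

```python
import requests                 # fetch the HTML over HTTP
from bs4 import BeautifulSoup   # parse the HTML and extract the text
import nltk                     # tokenize and count the words
```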

To analyze Moby Dick, we need to get its contents from somewhere. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/files/2701/2701-h/2701-h.htm

To fetch the HTML file with Moby Dick, we're going to use the requests package to make a GET request for the website, which means we're getting data from it. This is what you're doing through a browser when visiting a webpage, but now we're getting the requested page directly into Python instead.
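A sketch of the fetch (it assumes network access, and the encoding line reflects the file being served as UTF-8):

```python
import requests

url = 'https://www.gutenberg.org/files/2701/2701-h/2701-h.htm'
r = requests.get(url)   # GET the page, just as a browser would
r.encoding = 'utf-8'    # decode the body as UTF-8
html = r.text           # the full HTML of Moby Dick as one string
print(r.status_code)
```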

Get text from HTML


This HTML is not quite what we want. However, it does contain what we want: the text of Moby Dick. What we need to do now is wrangle this HTML to extract the text of the novel. For this we'll use the package BeautifulSoup.

Firstly, a word on the name of the package: Beautiful Soup? In web development, the term 'tag soup' refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! In fact, the main object created and queried when using this package is called BeautifulSoup. After creating the soup, we can use its .get_text() method to extract the text.


We now have the text of the novel! There is some unwanted material at the start and some at the end. We could remove it, but it is so small relative to the text of Moby Dick that, to a first approximation, it is fine to leave it in.

Now that we have the text of interest, it's time to count how many times each word appears, and for this we'll use nltk – the Natural Language Toolkit. We'll start by tokenizing the text, that is, removing everything that isn't a word (whitespace, punctuation, etc.) and then splitting the text into a list of words.

OK! We're nearly there. Note that in the text above, 'Or' has a capital 'O', while elsewhere it may not, but both 'Or' and 'or' should be counted as the same word. For this reason, we should build a list of all the words in Moby Dick with all capital letters converted to lower case.
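A sketch of both steps on a short sample. nltk's RegexpTokenizer keeps runs of word characters and drops everything else, and needs no extra nltk data downloads:

```python
from nltk.tokenize import RegexpTokenizer

text = "Call me Ishmael. Or, some years ago..."

# Keep runs of word characters, dropping whitespace and punctuation
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print(tokens)  # ['Call', 'me', 'Ishmael', 'Or', 'some', 'years', 'ago']

# Lower-case every token so 'Or' and 'or' count as the same word
words = [token.lower() for token in tokens]
print(words)   # ['call', 'me', 'ishmael', 'or', 'some', 'years', 'ago']
```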

Load in the stop words

It is common practice to remove words that appear frequently in English, such as 'the', 'of' and 'a', because they're not so interesting. Such words are known as stop words. The package nltk includes a good list of English stop words that we can use.

Remove stop words in Moby Dick

We now want to create a new list containing all the words in Moby Dick except those that are stop words (that is, the words listed in sw). One way to get this list is to loop over all elements of words and add each word to a new list if it is not in sw.
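A sketch of the filtering step, with a tiny illustrative stop word set standing in for nltk's full English list:

```python
# A tiny stand-in for nltk's English stop word list, normally
# loaded with: sw = nltk.corpus.stopwords.words('english')
sw = ['the', 'of', 'a', 'me', 'or', 'some']

words = ['call', 'me', 'ishmael', 'or', 'some', 'years', 'ago']

# Keep each word only if it is not a stop word
words_ns = [word for word in words if word not in sw]
print(words_ns)  # ['call', 'ishmael', 'years', 'ago']
```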

Our original question was:

What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?

We are now ready to answer that! Let's create a word frequency distribution plot using nltk.

Create a word frequency distribution plot using nltk.



  • Create a frequency distribution object using the function nltk.FreqDist() and assign it to freqdist.
  • Use the plot method of freqdist to plot the 25 most frequent words.

The plot method of a FreqDist object takes the number of items to plot as its first argument. Make sure to set this argument, otherwise plot will try to plot all the words, which in the case of Moby Dick would take far too long.
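A sketch of both steps on a toy word list standing in for the full Moby Dick list (the plot call is shown commented out, as it also needs matplotlib installed):

```python
import nltk

# Toy word list standing in for the full, stop-word-free word list
words_ns = ['whale', 'whale', 'whale', 'ahab', 'ahab', 'sea']

# Count how often each word occurs
freqdist = nltk.FreqDist(words_ns)
print(freqdist.most_common(2))  # [('whale', 3), ('ahab', 2)]

# Plot the 25 most frequent words:
# freqdist.plot(25)
```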

'Whale' is the most frequent word in Moby Dick. No surprise there :)




