How to get text from website python

    Prerequisites:

    • Beautiful soup
    • Urllib

    Scraping is a technique for retrieving useful data from a URL or an HTML file so that it can be reused elsewhere. This article shows how to extract the paragraphs from a URL and save them as a text file.

    Modules Needed

    bs4: Beautiful Soup(bs4) is a Python library used for getting data from HTML and XML files. It can be installed as follows:

    pip install bs4

    urllib: urllib is a package that collects several modules for working with URLs. It is part of Python's standard library, so it is already available in the environment and does not need to be installed with pip.
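
As a small illustration of urllib being a collection of modules, the sketch below uses only urllib.parse from the standard library that ships with Python 3. The example.com URLs are placeholders chosen for illustration and do not come from the article.

```python
from urllib.parse import urljoin, urlparse

# Split a URL into its components.
parts = urlparse("https://www.example.com/news/today.html?page=2#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link found on a page against that page's own URL.
absolute = urljoin("https://www.example.com/news/today.html", "../sports/index.html")
print(absolute)
```

Resolving relative links this way is useful when scraped anchor tags contain paths rather than full URLs.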

    Approach:

    • Create a text file.
    • Import the required modules and pass the URL and the path of the .txt file to the download call. This saves a copy of that URL's HTML code on your local machine.
    • Make a request for the URL.
    • Open the file in read mode and read its contents.
    • Pass the file contents to the BeautifulSoup() function.
    • Create another file (or write/append to an existing file).
    • Iterate over all the 'p' tags and write each paragraph to the text file.

    The implementation is given below:

    Example:

    Python3

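    import urllib.request
    from bs4 import BeautifulSoup

    # The URL is missing from the original listing; this is a placeholder.
    # Replace it with the page you want to scrape.
    url = "https://example.com"
    html_path = "text_file.txt"

    # Save a local copy of the page's HTML.
    urllib.request.urlretrieve(url, html_path)

    # Read the saved HTML as bytes and let BeautifulSoup detect the encoding.
    with open(html_path, "rb") as file:
        contents = file.read()
    soup = BeautifulSoup(contents, "html.parser")

    # Write the text of every <p> tag to a second file, one paragraph per line.
    with open("test1.txt", "w", encoding="utf-8") as f:
        for data in soup.find_all("p"):
            f.write(data.get_text() + "\n")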

    Output:

    The text of every paragraph on the page is written to test1.txt.
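
The download, parse, and write steps can also be tried without network access. The following is a minimal sketch, not the article's own code: save_paragraphs is a hypothetical helper name, the sample HTML is invented, and the page is served through a data: URL (which urllib.request can open) as a stand-in for a real website.

```python
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import quote

from bs4 import BeautifulSoup


def save_paragraphs(url, html_path, text_path):
    """Download url to html_path, then write the text of each <p> to text_path."""
    urllib.request.urlretrieve(url, html_path)
    # Parse from bytes so BeautifulSoup can work out the encoding itself.
    soup = BeautifulSoup(Path(html_path).read_bytes(), "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    Path(text_path).write_text("\n".join(paragraphs), encoding="utf-8")
    return paragraphs


# Invented sample page, delivered via a data: URL instead of a real site.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph.</p><p>Second one.</p>"
    "</body></html>"
)
sample_url = "data:text/html;charset=utf-8," + quote(sample_html)

with tempfile.TemporaryDirectory() as tmp:
    html_file = Path(tmp) / "text_file.txt"
    text_file = Path(tmp) / "test1.txt"
    result = save_paragraphs(sample_url, html_file, text_file)
    saved_text = text_file.read_text(encoding="utf-8")

print(result)
print(saved_text)
```

To scrape a real page, pass its http(s) URL and ordinary file paths instead of the sample values.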

    Prerequisite: Downloading files in Python, Web Scraping with BeautifulSoup

    Python is an easy language to learn, but much of its appeal comes from the large number of open-source libraries written for it. Requests is one of the most widely used. It lets us open any HTTP/HTTPS website, do most of the things we normally do on the web, and keep sessions (i.e. cookies).
    A web page is just a piece of HTML code that the web server sends to our browser, which in turn renders it as the page we see. We therefore need a way to get hold of the HTML source code and find particular tags in it, which is what the BeautifulSoup package is for.
    Installation:

    pip3 install requests
    
    pip3 install beautifulsoup4
    

    As an example, we read the headlines from the news site Hindustan Times.

    The code can be divided into three parts.

    • Requesting a webpage
    • Inspecting the tags
    • Print the appropriate contents

    Steps:

    1. Requesting a webpage: First, right-click on the news text and view the source code.
    2. Inspecting the tags: We need to figure out which part of the source code contains the news section we want to scrape. It is under the ul (unordered list) with the class “searchNews”.

      Note: The news text is the text of the anchor tags. A close look shows that all the news items are in li (list item) tags of that unordered list.

    3. Print the appropriate contents: The content is printed with the help of the code given below.

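      import requests
      from bs4 import BeautifulSoup

      # The URL is not defined in the original listing; this is a placeholder.
      # Set it to the Hindustan Times page inspected in step 2.
      url = "https://www.hindustantimes.com/"

      def news():
          # Fetch the page.
          resp = requests.get(url)
          if resp.status_code == 200:
              print("Successfully opened the web page")
              print("The news are as follow :-\n")
              soup = BeautifulSoup(resp.text, "html.parser")
              # The headlines are <a> tags inside <ul class="searchNews">.
              news_list = soup.find("ul", {"class": "searchNews"})
              if news_list is None:
                  print("No 'searchNews' list found; the page layout may have changed")
                  return
              for link in news_list.find_all("a"):
                  print(link.text)
          else:
              print("Error")

      news()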

      Output

      Successfully opened the web page
      The news are as follow :-
      Govt extends toll tax suspension, use of old notes for utility bills extended till Nov 14
      Modi, Abe seal historic civil nuclear pact: What it means for India
      Rahul queues up at bank, says it is to show solidarity with common man
      IS kills over 60 in Mosul, victims dressed in orange and marked 'traitors'
      Rock On 2 review: Farhan Akhtar, Arjun Rampal's band hasn't lost its magic
      Rumours of shortage in salt supply spark panic among consumers in UP
      Worrying truth: India ranks first in pneumonia, diarrhoea deaths among kids
      To hell with romance, here's why being single is the coolest way to be
      India vs England: Cheteshwar Pujara, Murali Vijay make merry with tons in Rajkot
      Akshay-Bhumi, SRK-Alia, Ajay-Parineeti: Age difference doesn't matter anymore
      Currency ban: Only one-third have bank access; NE, backward regions worst hit
      Nepal's central bank halts transactions with Rs 500, Rs 1000 Indian notes
      Political upheaval in Punjab after SC tells it to share Sutlej water
      Let's not kid ourselves, with Trump, what we have seen is what we will get
      Want to colour your hair? Try rose gold, the hottest hair trend this winter
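
Because a live news page can change its markup at any time, the tag lookup can be tried on a small hand-written snippet first. This is a minimal sketch under assumptions: the sample HTML and the extract_headlines helper are invented for illustration, and only the “searchNews” class name comes from the article.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure described in step 2.
sample_html = """
<div>
  <ul class="topNav"><li><a href="/home">Home</a></li></ul>
  <ul class="searchNews">
    <li><a href="/story-1">First headline</a></li>
    <li><a href="/story-2">Second headline</a></li>
  </ul>
</div>
"""


def extract_headlines(html):
    """Return the link texts inside <ul class="searchNews">, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    news_list = soup.find("ul", class_="searchNews")
    if news_list is None:
        return []
    return [a.get_text(strip=True) for a in news_list.find_all("a")]


headlines = extract_headlines(sample_html)
print(headlines)

# The same lookup written as a CSS selector.
soup = BeautifulSoup(sample_html, "html.parser")
selected = [a.get_text(strip=True) for a in soup.select("ul.searchNews a")]
print(selected)
```

Returning an empty list when the list is missing avoids the AttributeError that calling find_all on None would raise.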
      

    This article is contributed by Shubham Choudhary.


    How do I extract text from a website?

    Click and drag to select the text on the Web page you want to extract and press “Ctrl-C” to copy the text. Open a text editor or document program and press “Ctrl-V” to paste the text from the Web page into the text file or document window. Save the text file or document to your computer.

    How do you get a specific text from HTML in Python?

    How to extract text from an HTML file in Python:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "http://kite.com"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.decompose()  # delete out tags
    strips = list(soup.stripped_strings)
    print(strips[:5])  # print start of list
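
The same clean-up can be tried offline on an inline document. The sketch below assumes an invented sample page; decompose(), stripped_strings, and get_text() are the Beautiful Soup calls being illustrated.

```python
from bs4 import BeautifulSoup

# Invented sample page containing text that should not be extracted.
sample_html = """
<html>
  <head>
    <title>Sample page</title>
    <style>p { color: red; }</style>
    <script>console.log("not visible text");</script>
  </head>
  <body>
    <h1>Heading</h1>
    <p>First <b>bold</b> paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Remove <script> and <style> elements together with their contents.
for tag in soup(["script", "style"]):
    tag.decompose()

# Each remaining text node, with surrounding whitespace stripped.
strips = list(soup.stripped_strings)
print(strips)

# The same text joined into a single string.
joined = soup.get_text(" ", strip=True)
print(joined)
```

Note that stripped_strings yields one item per text node, so inline tags such as b split a sentence into several items; get_text with a separator joins them back together.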

    How do you extract text in Python?

    Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a function getPage() which takes a page number (starting from index 0) as its argument and returns the page object. The page object has a function extractText() to extract text from the PDF page. At last, we close the PDF file object. In PyPDF2 3.0 and later, these are written as reader.pages[n] and extract_text().

    How do I fetch HTML content in Python?

    The simplest solution is the following:

    import requests
    print(requests.get(url='https://google.com').text)
    ...

    import urllib.request as r
    page = r.urlopen('https://google.com')
    ...

    <!doctype html>...</html>
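
A fetch like this usually benefits from a timeout and some error handling. The following is a minimal sketch using only the standard library: fetch_html is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the data: URL is a stand-in so the example runs without network access.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def fetch_html(url, timeout=10):
    """Return the page at url decoded to str, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except (urllib.error.URLError, ValueError) as exc:
        # URLError covers network and HTTP errors; ValueError covers malformed URLs.
        print(f"Could not fetch {url}: {exc}")
        return None


# Stand-in page served through a data: URL instead of a real site.
demo_url = "data:text/html;charset=utf-8," + quote("<!doctype html><p>Hello</p>")
page = fetch_html(demo_url)
print(page)

# A malformed URL is reported and yields None instead of raising.
bad = fetch_html("not a url")
```

For a real page, call fetch_html with its http(s) URL and pass the returned string to BeautifulSoup.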