Web scraping data

  • #4098
    jmc

    Over the summer I have been practicing web scraping with Python. If anyone has research that could benefit from batch downloading from ugly or tricky websites, feel free to post here! I would also be happy to give some tips if you want to do this yourself.

    For example, I downloaded all prison rate data from https://www.prisonstudies.org/world-prison-brief-data. The data is available for free, but collecting a lot of it is annoying: you have to click on every country and then find the table(s) that contain prison rate data.

    A little code and voilà! Attached to this post is a csv of all world prison rate data from World Prison Brief, along with a chart of the world data binned and plotted over time.

    The timing of my curiosity about prison rate data, amid rising incarceration rates, is not accidental. Hopefully 2020 is the year more people recognize that BLACK LIVES MATTER.

    Attachments: the csv and chart described above.
    • #4105
      Jonathan Nitzan

      Intriguing, James.

      Can you elaborate on the code you are using and explain a bit more on what the chart shows?

      • #4109
        jmc

        “Intriguing, James. Can you elaborate on the code you are using and explain a bit more on what the chart shows?”

        I’ll post in steps.

        The code was not too difficult, but the complexity likely depends on the website you are scraping. You also have to be willing to find a pattern that lets you iterate across different pages, and to handle exceptions to it: simple code will break if it looks for 1998 data and one page does not have it. A toy illustration of that failure mode is below.
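
        Here is a toy sketch of that pitfall (the numbers and structure are made up, not the actual WPB data):

            # Toy data: one "country page" is missing 1998.
            pages = {
                "Country A": {1998: 100, 2000: 110},
                "Country B": {2000: 95},
            }

            records = []
            for country, table in pages.items():
                for year in (1998, 2000):
                    if year not in table:   # naive code that assumes every year exists would crash here
                        continue
                    records.append((country, year, table[year]))

            print(records)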

        I was working in Python. The typical Python library for web scraping is BeautifulSoup, but I am finding that pandas is easier, as my goal is to have a data frame of prison data.

        For those who are unfamiliar with data frames, they are popular data structures in R, Python and other coding languages. Very long story short: they function a lot like Excel data, where you can edit cells but you can also apply functions across rows and columns. Here is me loading a csv (close to what I attached above) into a data frame:
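
        (The file name below is illustrative, not the attachment’s actual name.)

            import pandas as pd

            df = pd.read_csv("world_prison_rates.csv")   # load the csv into a data frame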

        Depending on the software you are using, you can look at the data frame just like an Excel sheet. My preference is to make quick checks that everything is OK. Here is me looking at the first 10 rows of the data frame:
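
        Continuing with the df from the snippet above:

            print(df.head(10))   # head(10) returns the first 10 rows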

        • #4110
          jmc

          Knowing a bit about data frames is useful because when you pass an html page through pandas you are only a few steps from a data frame. Essentially, pandas is already looking for the tables in a website. BeautifulSoup is better when you want to be able to look for anything.

          A website like WPB will not give you a clean scrape. Rather, your html tables come out as an ugly mess. But let me show how far you can get with a few lines of code. The country in the example is Tanzania.
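
          Something like this (the country URL is my best guess at WPB’s pattern, so double-check it):

              import pandas as pd

              # read_html grabs every <table> element it finds on the page
              url = "https://www.prisonstudies.org/country/tanzania"
              tables = pd.read_html(url)

              print(len(tables))   # how many tables pandas found
              print(tables[0])     # the first one: messy, but already tabular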

          We do not have a nice data frame yet, but you can see that, for our purposes, pandas takes a short path to the data points within a table.

    • #4302
      jmc

      A time-consuming aspect of web scraping is figuring out how you will capture the pages you want. In the case of World Prison Brief, there is a sidebar that lets the user go from country to country, or region by region.

      Code can be written to find all of the pages in the html.
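
      For example, with requests and BeautifulSoup (assuming the country pages all sit under /country/ — check the page source for the real pattern):

          import requests
          from bs4 import BeautifulSoup

          page = requests.get("https://www.prisonstudies.org/world-prison-brief-data")
          soup = BeautifulSoup(page.text, "html.parser")

          # collect every link that points at a country page
          links = {a["href"] for a in soup.find_all("a", href=True) if "/country/" in a["href"]}
          print(len(links))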

    • #4422
      D. T. Cochrane

      I’m impressed, jealous and nervous, James!

      I have my own interests that would benefit from web scraping, but I haven’t tackled it yet. I’m just getting familiar with R, and I still take data out of it from time to time to work in Excel, because I’m not yet totally comfortable with R, especially for graphing.

      When I do finally tackle web scraping I’ll be sure to come to you.

    • #245121
      Blair Fix

      I too have been web scraping recently. I’m working on a project that involves mass downloads from Library Genesis. The tricky part is that their server(s) are unreliable and downloads sometimes fail.

      I’m currently using R to execute the Linux command wget, keeping a log for each file. A script then reviews the logs and retries any downloads that failed.

      My code works, but it’s not very clean. I’m looking for a better alternative.
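
      For the curious, the log-and-retry idea boils down to something like this (a Python sketch rather than my actual R code; the file names are made up):

          import subprocess

          def fetch(url, out_file, log_file, tries=3):
              """Call wget, writing its log to log_file; retry on failure."""
              for _ in range(tries):
                  result = subprocess.run(["wget", "-O", out_file, "-o", log_file, url])
                  if result.returncode == 0:   # wget exits 0 when the download succeeded
                      return True
              return False

          fetch("https://example.org/book.pdf", "book.pdf", "book.log")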

      Side note: The Library Genesis server is slow, so scraping more than a few thousand books is not really an option.

    • #245122
      Blair Fix

      Something else to consider. Sometimes, as in James’s example, the webpage is served as html. That’s easy to scrape. Other times, however, the server just provides javascript code, which your browser then renders. If you want to scrape that, you need software that renders the javascript to give you the ‘inner’ html. I use the Selenium WebDriver in Python.
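
      The basic pattern looks something like this (a minimal sketch; it assumes Firefox and geckodriver are installed, and the URL is a placeholder):

          from selenium import webdriver

          driver = webdriver.Firefox()          # launches a real browser that runs the javascript
          driver.get("https://example.org")     # placeholder URL
          html = driver.page_source             # the rendered ('inner') html
          driver.quit()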
