Web scraping data

  • Creator
    Topic
  • #4098
    jmc

      Over the summer I have been practicing web scraping with Python. If anyone has research that could benefit from batch downloading from ugly or tricky websites, feel free to post here! I would also be happy to give some tips if you want to do this yourself.

      For example, I downloaded all prison rate data from https://www.prisonstudies.org/world-prison-brief-data. The data is available for free, but you will notice that it is annoying to get a lot of data. You have to click on every country and then find the table(s) that have prison rate data.

      A little code and voila! Attached to this post is a csv of all world prison rate data from World Prison Brief. Below is all of the world data binned and plotted over time.

      The timing of my curiosity in downloading prison rate data and rising incarceration rates is not accidental. Hopefully 2020 is the year more people recognize that BLACK LIVES MATTER.

    Viewing 7 reply threads
    • Author
      Replies
      • #4105

        Intriguing, James.

        Can you elaborate on the code you are using and explain a bit more on what the chart shows?

        • #4109
          jmc

            Intriguing, James. Can you elaborate on the code you are using and explain a bit more on what the chart shows?

            I’ll post in steps.

            The code was not too difficult, but it is likely that the scraped website determines the complexity. I also believe that you have to be willing to find a pattern to iterate across different pages. For example, simple code will break if it says to look for 1998 data and one page does not have it.
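            To make the point about missing years concrete, here is a minimal sketch. It assumes each country's table is already a pandas data frame with hypothetical `Year` and `Rate` columns; the numbers are placeholders, not World Prison Brief figures.

```python
import pandas as pd

def rate_for_year(table, year, year_col="Year", rate_col="Rate"):
    """Return the rate for `year`, or None when the page has no such row."""
    match = table.loc[table[year_col] == year, rate_col]
    if match.empty:
        return None  # this page simply has no data for that year
    return float(match.iloc[0])

# Placeholder tables standing in for two scraped country pages.
pages = {
    "country_a": pd.DataFrame({"Year": [1996, 1998, 2000], "Rate": [10.0, 11.0, 12.0]}),
    "country_b": pd.DataFrame({"Year": [1997, 2000], "Rate": [20.0, 22.0]}),  # no 1998 row
}

# The same lookup runs on every page without breaking on the missing year.
rates_1998 = {name: rate_for_year(table, 1998) for name, table in pages.items()}
print(rates_1998)
```

            Returning None for a missing year keeps the loop running, and the gaps can be dealt with later in the data frame.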

            I was working in Python. A typical Python library for web scraping is BeautifulSoup. I am finding that pandas is easier, as my goal is to have a data frame of prison data.

            For those who are unfamiliar with data frames, they are popular data structures in R, Python and other coding languages. Very long story short: they function a lot like Excel data, where you can edit cells but you can also apply functions across rows and columns. Here I load a CSV, close to the one attached above, into a data frame:
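            A minimal sketch of that loading step, assuming pandas. The column names, the values and the filename in the comment are invented for illustration and are not the contents of the attached file.

```python
from io import StringIO
import pandas as pd

# Placeholder text standing in for the CSV file; not World Prison Brief data.
csv_text = """country,year,rate
AAA,2000,100
AAA,2010,120
BBB,2000,50
"""

# With a real file this would be pd.read_csv("prison_rates.csv") (hypothetical filename).
df = pd.read_csv(StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # the type pandas inferred for each column
```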

            Depending on the software you are using, you can look at the data frame just like an Excel sheet. My preference is to make quick checks that everything is OK. Here I look at the first 10 rows of the data frame:
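            A sketch of such quick checks, assuming pandas and a placeholder data frame with invented values:

```python
import pandas as pd

# Placeholder data frame; the values are invented for illustration.
df = pd.DataFrame({
    "country": ["AAA"] * 6 + ["BBB"] * 6,
    "year": list(range(2000, 2012, 2)) * 2,
    "rate": [100, 105, 110, 115, 120, 125, 50, 52, 54, 56, 58, 60],
})

first_ten = df.head(10)   # first 10 rows; head() with no argument returns 5
print(first_ten)
print(df.shape)           # (rows, columns)
print(df.isna().sum())    # count of missing values per column
```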

             

             

            • #4110
              jmc

                Knowing a bit about data frames is useful because when you pass an HTML page through pandas you are only a few steps from a data frame. Essentially, pandas is already looking for the tables in a website. BeautifulSoup is better when you want to be able to look for anything.

                A website like World Prison Brief (WPB) will not give you a clean scrape. Rather, your HTML tables come out ugly. But let me show a comparison of how far you can get with a few lines of code. The country in the example is Tanzania.

                We do not have a nice data frame yet, but you can see that, for our purposes, pandas is taking a short path to getting the data points within a table.
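                The pandas side of that comparison might look roughly like the sketch below. It uses `pandas.read_html`, which returns one data frame per `<table>` it finds and needs an HTML parser installed (lxml, or BeautifulSoup with html5lib). The sample markup and its numbers are placeholders, not the Tanzania page.

```python
from io import StringIO
import pandas as pd

# Placeholder markup standing in for one country page; the values are invented.
SAMPLE_HTML = """
<html><body>
<h2>Prison population trend</h2>
<table>
  <tr><th>Year</th><th>Prison population total</th><th>Prison population rate</th></tr>
  <tr><td>2000</td><td>1,000</td><td>10</td></tr>
  <tr><td>2002</td><td>1,200</td><td>12</td></tr>
</table>
</body></html>
"""

def tables_from_html(html):
    """Return every <table> in the page as a list of data frames."""
    # Wrapping the text in StringIO avoids the deprecation warning that newer
    # pandas versions emit when a raw HTML string is passed directly.
    return pd.read_html(StringIO(html))

# Usage:
#   tables = tables_from_html(SAMPLE_HTML)
#   print(len(tables))       # how many tables pandas found
#   print(tables[0].head())  # first rows of the first table
```

                For a live page, the HTML would first be downloaded and its text passed to the same function. Each returned table usually still needs cleaning before the tables can be combined.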

          • #4302
            jmc

              A time-consuming aspect of web scraping is figuring out how you will capture the pages you want. In the case of World Prison Brief, there is a sidebar that lets the user go from country to country, or region to region.

              Code can be written to find the links to all of these pages in the HTML.
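              As a rough sketch of that idea, the snippet below collects links from a page using only the Python standard library. The sidebar markup is a placeholder, and the `/country/` prefix is an assumption about how the site lays out its URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags whose path starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep each matching link once, in the order it appears on the page.
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)

# Placeholder sidebar markup, not copied from the real site.
SIDEBAR_HTML = """
<ul>
  <li><a href="/country/example-one">Example One</a></li>
  <li><a href="/country/example-two">Example Two</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="/country/example-one">Example One (again)</a></li>
</ul>
"""

collector = LinkCollector(prefix="/country/")
collector.feed(SIDEBAR_HTML)

# Turn the relative links into full URLs that can be visited one by one.
page_urls = [urljoin("https://www.prisonstudies.org", href) for href in collector.links]
print(page_urls)
```

              The resulting list is what a scraping loop would iterate over, one page per country.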

              • This reply was modified 4 years, 4 months ago by jmc.
            • #4422

              I’m impressed, jealous and nervous, James!

              I have my own interests that would benefit from web scraping, but I haven’t tackled it yet. I’m just getting familiar with R and still take data out of it from time to time to work in Excel, because I’m not yet totally comfortable, especially with graphing.

              When I do finally tackle web scraping I’ll be sure to come to you.

            • #245121

              I too have been webscraping recently. I’m working on a project that involves mass downloads from Library Genesis. The tricky part is that their server(s) are unreliable and downloads sometimes fail.

              I’m currently using R to execute the Linux command wget. I have it keep a log for each file. I have a script that then reviews the logs and retries any downloads that failed.

              My code works, but it’s not very clean. I’m looking for a better alternative.

              Side note: The Library Genesis server is slow, so scraping more than a few thousand books is not really an option.
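              One possible alternative to the wget-plus-log-review approach, sketched in Python rather than R: retry each download inside the loop instead of parsing logs afterwards. The function names, the retry count and the wait time are all hypothetical choices, and nothing here is specific to Library Genesis.

```python
import time
import urllib.request

def fetch(url, dest):
    """Download `url` to the local path `dest` (raises OSError on network failure)."""
    urllib.request.urlretrieve(url, dest)

def download_with_retries(jobs, fetch_one=fetch, max_attempts=3, wait_seconds=5.0):
    """Try each (url, dest) job up to `max_attempts` times.

    Returns a dict mapping url -> attempts used, or None if every attempt failed.
    """
    results = {}
    for url, dest in jobs:
        results[url] = None
        for attempt in range(1, max_attempts + 1):
            try:
                fetch_one(url, dest)
            except OSError as err:  # urllib's URLError and socket timeouts are OSError subclasses
                print(f"attempt {attempt} failed for {url}: {err}")
                if attempt < max_attempts:
                    time.sleep(wait_seconds)  # pause before retrying a slow server
            else:
                results[url] = attempt
                break
    return results

# Usage (hypothetical URLs and paths):
#   jobs = [("https://example.org/a.pdf", "a.pdf"), ("https://example.org/b.pdf", "b.pdf")]
#   summary = download_with_retries(jobs)
#   failed = [url for url, attempts in summary.items() if attempts is None]
```

              The returned dictionary plays the role of the per-file logs: any URL still mapped to None is a download that never succeeded.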

            • #245122

              Something else to consider. Sometimes, as in James’ example, the webpage is served as HTML. That’s easy to scrape. Other times, however, the server just provides JavaScript code, which your browser then runs to build the page. If you want to scrape that, you need software that runs the JavaScript and gives you the ‘inner’ HTML. I use the Selenium WebDriver in Python.
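              A minimal sketch of that idea with Selenium, assuming the selenium package and a recent Chrome are installed. The helper names are hypothetical, and the browser is started in headless mode as one possible setup.

```python
def make_driver():
    """Start a headless Chrome session (needs the selenium package and Chrome)."""
    from selenium import webdriver  # imported lazily so rendered_html works without selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome version
    return webdriver.Chrome(options=options)

def rendered_html(driver, url):
    """Load `url` in the browser and return the page HTML after scripts have run."""
    driver.get(url)
    return driver.page_source

# Usage (commented out because it launches a real browser):
#   driver = make_driver()
#   try:
#       html = rendered_html(driver, "https://example.com")
#   finally:
#       driver.quit()
```

              The returned HTML can then be parsed like any static page. Pages that load their data late may need an explicit wait before `page_source` is read.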

              • This reply was modified 4 years, 1 month ago by Blair Fix.
              • #247716
                jmc

                  Since this thread has been bumped, I want to echo this. Selenium is powerful. You can perform all sorts of actions, including scrolling up and down a dynamic table.

              • #247708

                James! This is amazing!

                How can we decipher the 3-digit country codes that you used?

              • #247965

                I’m finally taking on web scraping. I’m still making the climb up Mt. R, so I’m trying to do it with that. Actually, I’m not even learning R generally. I’m focusing on the tidyverse.

                The package rvest—which is part of the tidyverse—has many good introductions online and is quite intuitive. However, now I’m trying to iterate. Based on your post, James, I suspect this is simpler with Python. But I worry about trying to jump back-and-forth between R and Python. I’m going through the ‘Program’ section of R for Data Science because I’m long overdue to start employing functions, which will also make iterating with rvest simpler and more repeatable.

                Sidenote: I’m continually blown away by the amount of knowledge that people give away online. You cannot tell me that Elon Musk, Mark Zuckerberg, or Jeff Bezos has done more good for the world than Hadley Wickham. I suspect that if there is any problem humans need solved there is someone who wants to solve it for the pleasure of the challenge and the esteem. And how many potential Hadley Wickhams are there who’d love to give away knowledge for free, but they have to exhaust themselves struggling just to make ends meet?

              • #247978

                Actually, I’m not even learning R generally. I’m focusing on the tidyverse.

                That’s a good point, Troy. Programming languages are just platforms to share code using a common language. So coding in R or Python or C++ can mean many different things, depending on the libraries you are using. That’s something I appreciate more and more.

                When analyzing data in R, I spend most of my time using data.table. And of course, like you, I’m grateful that Hadley Wickham has contributed so much to the R ecosystem.

                When I code in C++, I’ve found the Armadillo library to be extremely helpful.

                Funny that you mention the Tech billionaires. Armadillo is a heavy-duty linear algebra library, and I’m told that big tech uses it extensively. But it was written for free by … academics.
