Bash Scripting – Extracting Information from eBay HTML Pages

I would like to filter the output of an eBay search that I've exported to a text file. The search returns a number of results, but I've included just one example at the bottom of this post, as I presume the same method will work for all of them (keeps things neat on here!). I know the basics of filtering out URLs with sed and grep, but I would like the output in a specific format: the URL, followed by a comma, then the price. For example:

http://www.ebay.co.uk/itm/Principles-Of-Modern-Chemistry-International-Edition-Gillis-H-Pat-Oxtoby-Ca-/161952820281?hash=item25b523ec39:g:MEYAAOSwoydWnvT2, £73.69

Note that the text file also contains some URLs that are of no use (e.g. http://thumbs.ebaystatic.com/images/g/MEYAAOSwoydWnvT2/s-l225.jpg), but they have a different format from the kind I am interested in (i.e. the one in the example above). Does anyone know how I can achieve this? Thanks

<h3 class="lvtitle"><a href="http://www.ebay.co.uk/itm/Principles-Of-Modern-Chemistry-International-Edition-Gillis-H-Pat-Oxtoby-Ca-/161952820281?hash=item25b523ec39:g:MEYAAOSwoydWnvT2"  class="vip" title="Click this link to access Principles Of Modern Chemistry, International Edition Gillis, H. Pat; Oxtoby; Ca">Principles Of Modern Chemistry, International Edition Gillis, H. Pat; Oxtoby; Ca</a>^M
                </h3>^M
        <ul class="lvprices left space-zero">^M
^M
        <li class="lvprice prc">^M
                        <span  class="bold bidsold">
                                        £73.69</span>
                                </li>^M
                <li class="lvformat">^M
                        <span >
                                <span class="logoBin" title="Buy it now"></span>
                                        </span>

Best Answer

The best way to get at data from eBay is through their API. That said, sometimes all you have is HTML, so I'll cover that case in my answer.

Don't even try to extract information from HTML with tools like sed and grep. It is hard to get working at all, and extremely brittle when it does work. This way lies madness.
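To illustrate the brittleness, here is a minimal sketch using Python's re module in place of grep. The pattern, the helper name prices_by_line, and the single-line "compact" markup are all made up for this illustration; only the wrapped markup mirrors the layout of the sample in the question. A pattern that works on the compact form finds nothing in the wrapped form, because the tag and the price sit on different lines and the tag contains an extra space.

```python
import re

# Hypothetical single-line rendering of a price element.
compact = '<li class="lvprice prc"><span class="bold bidsold">£73.69</span></li>'

# Layout as in the question's sample: extra space in the tag, price on the next line.
wrapped = '''<li class="lvprice prc">
    <span  class="bold bidsold">
        £73.69</span>
</li>'''

# A line-oriented pattern of the kind one might write for grep or sed.
price_re = re.compile(r'<span class="bold bidsold">(£[0-9.]+)</span>')

def prices_by_line(text):
    """Apply the pattern one line at a time, as grep would."""
    return [m.group(1)
            for line in text.splitlines()
            for m in price_re.finditer(line)]

print(prices_by_line(compact))  # ['£73.69']
print(prices_by_line(wrapped))  # []
```

The pattern could be patched to tolerate this particular layout, but each such patch only covers the variations seen so far.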

If you have to parse HTML, use a tool built for parsing HTML, such as Python's BeautifulSoup library, Perl's HTML::TreeBuilder, Ruby's Nokogiri, etc. The following script uses BeautifulSoup to pull the URL and price out of each result:

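#!/usr/bin/env python3
# Requires the beautifulsoup4 package (pip install beautifulsoup4).
import sys
from bs4 import BeautifulSoup

# Parse the saved search page given as the first command-line argument.
with open(sys.argv[1], encoding="utf-8") as f:
    html = BeautifulSoup(f.read(), "html.parser")

# Each result title is an <h3 class="lvtitle">; its price is in the <ul> that follows.
for lv in html.find_all("h3", "lvtitle"):
    url = lv.find("a")["href"]
    bid = lv.find_next_sibling("ul").find("span", "bidsold").text.strip()
    print(url + ", " + bid)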
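If installing a third-party library is not an option, roughly the same idea can be sketched with the html.parser module from Python's standard library. This is only a sketch under some assumptions: the names ListingParser, extract_listings and sample are invented here, the sample is a shortened version of the markup in the question with a thumbnail image added to show that it is ignored, and the price span is assumed not to contain nested spans.

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collect (url, price) pairs from markup shaped like the question's sample."""

    def __init__(self):
        super().__init__()
        self.results = []        # finished (url, price) pairs
        self._in_title = False   # currently inside <h3 class="lvtitle">
        self._in_price = False   # currently inside <span class="... bidsold">
        self._url = None         # link of the listing being read
        self._price_parts = []   # text chunks seen inside the price span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "lvtitle" in classes:
            self._in_title = True
        elif tag == "a" and self._in_title and "href" in attrs:
            # Only links inside a title heading count, so thumbnails are skipped.
            self._url = attrs["href"]
        elif tag == "span" and "bidsold" in classes and self._url:
            self._in_price = True
            self._price_parts = []

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False
        elif tag == "span" and self._in_price:
            # Assumes no nested <span> inside the price span.
            self._in_price = False
            price = "".join(self._price_parts).strip()
            self.results.append((self._url, price))
            self._url = None

    def handle_data(self, data):
        if self._in_price:
            self._price_parts.append(data)

def extract_listings(html_text):
    """Return a list of (url, price) tuples found in html_text."""
    parser = ListingParser()
    parser.feed(html_text)
    parser.close()
    return parser.results

# Shortened sample based on the question; the <img> placement is an assumption.
sample = """
<img src="http://thumbs.ebaystatic.com/images/g/MEYAAOSwoydWnvT2/s-l225.jpg">
<h3 class="lvtitle"><a href="http://www.ebay.co.uk/itm/Principles-Of-Modern-Chemistry-International-Edition-Gillis-H-Pat-Oxtoby-Ca-/161952820281?hash=item25b523ec39:g:MEYAAOSwoydWnvT2"  class="vip">Principles Of Modern Chemistry</a>
</h3>
<ul class="lvprices left space-zero">
<li class="lvprice prc">
    <span  class="bold bidsold">
        £73.69</span>
</li>
</ul>
"""

for url, price in extract_listings(sample):
    print(f"{url}, {price}")
```

Run on the sample, this prints the item URL, a comma, and £73.69 on one line, which is the format asked for in the question. The thumbnail URL is never printed because only links inside an h3 with class lvtitle are considered.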
