
Hacking for Sales, Part 3

Ryan

In Part 1 and Part 2, we covered techniques for scraping data from pages on Crunchbase. Well, I have news for you, and I don't want you to get upset. There's a much easier way to do it, one that requires far less time. Instead of loading pages and scraping them, we'll use the Crunchbase API.

What's an API? It stands for Application Programming Interface, but you should think of it more as a drive-thru window. If the inside of a McDonald's is the website or web app itself, then the drive-thru, with its (sometimes) limited menu and streamlined experience, is the API. Since we're bypassing the user interface altogether, everything becomes much easier to program.

Step 1.

Register for the Crunchbase API. If you don't already have a Mashery account, you may need to create one. This is easy and free.

Step 2.

Save your key into a local .py file. Again, any text editor will do, though there are plenty of free editors made for writing code. Save this file (call it crunchbase_api.py or something like that) in a folder on your computer. We'll write the rest of the code in this file too. And while you're at it, let's import the libraries you'll need: json and urllib.

import json, urllib
key = 'not_gonna_show_you_my_key'

Step 3.

Prepare to be amazed! This is going to be so much easier than the last two.

First, let's noodle through the documentation a little bit. You should actually read through this because it's written in plain English and will give you a sense of how APIs work. You might notice that working with the API still involves URLs. Indeed, most APIs are little more than HTTP calls to websites, which return data as plain, structured text.

You might also have noticed a new acronym: JSON. It sounds scarier than it is (it's a data format, not the hockey-masked killer). You should rejoice whenever an API returns JSON data, because it's super easy to work with and Python has a built-in library for parsing it. I'll explain more when we get our first response.
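For a quick taste right now, here's a tiny sketch using a made-up string (not real Crunchbase data) so you can see json.loads turn text into a Python object:

import json

# a made-up JSON string, just to show the shape of the data
text = '{"name": "Example Co", "employees": 42, "tags": ["ads", "mobile"]}'
data = json.loads(text)   # text in, regular Python dictionary out
print data['name']        # Example Co
print data['employees']   # 42
print data['tags'][0]     # ads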

Let’s start by listing some advertising companies.

url = 'http://api.crunchbase.com/v/1/search.js?query=advertising'
response = urllib.urlopen(url).read()
result = json.loads(response)

If you view the result, you'll just see a lot of text. But buried in there are signs that the text is structured and iterable. Note the locations of the commas, colons, square brackets, and curly brackets; they all play an important role in parsing this data. I'll show you what I mean:

for r in result:
	print r

You should see this output:

total
crunchbase_url
page
results

Now go back and print the result (type "result" and hit enter). Scroll to the top of that block of text. Do you see "{u'total': 7646, u'crunchbase_url': u'http://www…"? Once loaded, JSON is basically a Python dictionary: a hash of key-value pairs. Here, 'total' is a key and 7646 is its value. To get only the total out of result, simply type result['total'] and press enter. Pretty intuitive, right? You get the Crunchbase URL the same way.
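For example (your number will be different as Crunchbase grows):

print result['total']            # 7646 at the time of writing
print result['crunchbase_url']   # the URL we saw near the top of the block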

JSON results can be deeply nested, as seen in result['results']. The value of the key 'results' is a list of exactly 10 companies and their data (I checked this with len(result['results'])). This is, of course, iterable:

for r in result['results']:
	print r

Let’s just take one of them and store the name so we can look up the juicy data we really want.

company = result['results'][0]
name = company['name']

Here I took the first company in the list of results and then stored the value of ‘name’ in a variable called name. If you print “company”, you’ll see that company is itself a JSON-type dictionary object. That’s why I can use the ['name'] index to pull the name data I need for the next step.
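If you want to convince yourself of that, here's a quick sketch (the exact set of fields may vary from company to company):

print type(company)    # <type 'dict'>, because json.loads hands us plain dictionaries
print company.keys()   # 'name' is in here, along with the other summary fields
print company['name']  # the same value we just stored in name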

Step 4.

Finally, the good stuff. You might have noticed we haven't used the API key yet. Now we will, and I'm going to show you another cool thing about Python.

qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (urllib.quote(name), key)
qry_response = urllib.urlopen(qry_url).read()
qry_result = json.loads(qry_response)

Alright. What's up with the % signs? These are placeholders for inserting variables into a string. The first %s is where our URL-friendly company name goes, and the second %s takes the key. This structure is common in modern languages, so take note. The % operator sitting between the string and the parenthesized variables is what tells Python to perform the substitution.
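Here's the same substitution in miniature, with made-up values so you can see it on its own:

template = 'Hello %s, you have %s new leads'
print template % ('Ryan', 3)   # Hello Ryan, you have 3 new leads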

The urllib.quote function replaces the space in company['name'] with a URL-safe %20. Try typing urllib.quote(name) in the Python prompt and you'll see what I mean.
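For example, with one of the company names from our results:

print urllib.quote('Commuter Advertising')   # Commuter%20Advertising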

And now… drumroll… go ahead and print qry_result. There's our data! More specifically:

qry_result['email_address']
qry_result['blog_url']
qry_result['phone_number']

Just like that, juicy, actionable sales data.

Step 5.

Here’s how we’d iterate through all the advertising companies and store the data we want into a list of our own.

my_results = []
for r in result['results']:
	sales_data = {}
	name = r['name']
	qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (urllib.quote(name), key)
	qry_response = urllib.urlopen(qry_url).read()
	qry_result = json.loads(qry_response)
	sales_data['email'] = qry_result['email_address']
	sales_data['blog'] = qry_result['blog_url']
	sales_data['phone'] = qry_result['phone_number']
	my_results.append(sales_data)

First, create the empty list we’ll fill up with data. Then, when we iterate through the results, we’re going to create a small dictionary for each company. sales_data starts each loop empty and fills up with the email, blog, and phone number of each business in result['results'].

But darn, if you run this, you’ll get this error: KeyError: ‘email_address’. That means one of my results had no email_address key. So, we’ll have to check for it before we pull it into our sales_data dictionary.

my_results = []   # start fresh so we don't keep the partial data from the failed run
for r in result['results']:
	sales_data = {}
	name = r['name']
	print "Running", name
	qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (urllib.quote(name), key)
	qry_response = urllib.urlopen(qry_url).read()
	qry_result = json.loads(qry_response)
	sales_data['company'] = name
	if qry_result.has_key('email_address'): 
		sales_data['email'] = qry_result['email_address']
	if qry_result.has_key('blog_url'):
		sales_data['blog'] = qry_result['blog_url']
	if qry_result.has_key('phone_number'):
		sales_data['phone'] = qry_result['phone_number']
	my_results.append(sales_data)

There we go. Instead of having to scrape using BeautifulSoup, we can use urllib and json to make API calls and interpret the responses. If all goes well, you should see something like this:

[{'blog': '', 'phone': u'937.531.6631', 'company': u'Commuter Advertising', 'email': u'sparker@commuter-advertising.com'}, {'blog': u'http://www.qubed.us', 'phone': '', 'company': u'qubed advertising', 'email': u'oz@qubed.ro'}, {'blog': u'http://blog.mpression.net/', 'phone': u'+44 (0) 870 235 4042 ', 'company': u'4th Screen Advertising', 'email': u'info@4th-screen.com'}, {'blog': u'http://prova.com/blog/', 'phone': '', 'company': u'Prova | Advertising', 'email': u'support@prova.com'}, {'blog': u'http://hiliteadvertising.com/blog/', 'phone': u'877-457-5837', 'company': u'HiLite Advertising', 'email': u'info@hiliteadvertising.com'}, {'blog': '', 'phone': u'+91-011-26197623', 'company': u'WorldWide Advertising Network Private Ltd', 'email': u'info@worldwideadvertisingnetwork.com'}, {'blog': '', 'phone': u'401-272-1122', 'company': u'Creative Circle Advertising Solutions', 'email': u'bill@creativecirclemedia.com'}, {'company': u'VLG Advertising'}, {'blog': u'http://www.17stories.com/', 'phone': u'(512) 532-2907', 'company': u'Tocquigny Advertising & Interactive', 'email': u'awinsett@tocquigny.com'}, {'blog': u'http://www.blackdogadvertising.com/miami-advertising-blog/', 'phone': '', 'company': u'BlackDog Advertising', 'email': ''}]

There are hundreds of great APIs to explore for sales, including Yelp, LinkedIn, and Twitter. I’ll follow up with additional posts on each of these!
