Easy scraping with httpie and jq
Pulling my GitHub starred repositories into Hugo
Tags: hugo, scraping, api, jq
I recently saw a tweet mentioning that the combination of HTTPie (a command-line HTTP client), jq (a lightweight and flexible command-line JSON processor), and gron ("Make JSON greppable!") was "all you needed to build a scraper." Let's see if that's true.
First install
Linux:
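On a Debian or Ubuntu box this is roughly the following (package names are the usual ones, though gron only appears in newer releases; a release binary or Go install works too):

```sh
# Debian/Ubuntu; gron may need a newer release or a manual install.
sudo apt install httpie jq gron
```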
OSX:
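With Homebrew, something like:

```sh
# All three are available in Homebrew.
brew install httpie jq gron
```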
Let's get some stars
We can pull down the list of repositories people have starred on GitHub using the URL scheme https://api.github.com/users/:username/starred. Try this:
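Something like this, substituting your own GitHub username for the placeholder:

```sh
# "username" is a placeholder; use the account you want to inspect.
http https://api.github.com/users/username/starred
```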
This returns a wall of data! HTTPie does a nice job of formatting the JSON output and showing the response headers.
Let's use jq to print out the first item in the response:
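For example (HTTPie only prints the body when its output is piped, so jq sees clean JSON):

```sh
# '.[0]' selects the first element of the returned array.
http https://api.github.com/users/username/starred | jq '.[0]'
```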
Weirdly, the GitHub API returns the items in the order they were starred, but doesn't include the date that we starred them. There's an alternative API call that we can use to get the date that I pushed the Star button. For this to work we need to pass in an Accept header of application/vnd.github.v3.star+json, which is done like so:
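HTTPie sets request headers with a plain Header:value argument:

```sh
# Request the "star" media type so each item includes starred_at.
http https://api.github.com/users/username/starred \
  Accept:application/vnd.github.v3.star+json
```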
jq lets us slice and dice the JSON. Let's pull out the fields that we want. We use the .[] syntax to iterate over the array, pipe each item into an object constructor that keeps only the fields we want, and write the output to the data directory of our Hugo app.
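Roughly like this; the exact field list below is just my pick of what seems useful for a listing page, so adjust it to taste:

```sh
# With the star+json media type each element looks like
# {"starred_at": ..., "repo": {...}}, so keep the star date plus a few
# repo fields and write one array into Hugo's data directory.
http https://api.github.com/users/username/starred \
  Accept:application/vnd.github.v3.star+json \
  | jq '[.[] | {starred_at,
                full_name: .repo.full_name,
                description: .repo.description,
                html_url: .repo.html_url}]' \
  > data/stars.json
```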
Using the Hugo data directory
Now that we have the data in a JSON format that Hugo can understand, let's build a page to render this glory.
Create a simple file in content/stars.md that we'll use to define a stars content type:
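Something minimal should do; the title is arbitrary, and the type is what points Hugo at the stars layouts:

```markdown
---
title: "Stars"
type: "stars"
---
```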
Now we create a single page template that we'll use to render it in layouts/stars/single.html:
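A sketch along these lines, assuming the usual baseof.html with a main block (wrap it in a full HTML page if your site doesn't use one) and the field names we wrote into data/stars.json above:

```html
{{ define "main" }}
  {{/* .Site.Data.stars is the array from data/stars.json */}}
  <h1>{{ .Title }}</h1>
  <ul>
    {{ range .Site.Data.stars }}
      <li>
        <a href="{{ .html_url }}">{{ .full_name }}</a>
        {{ with .description }}: {{ . }}{{ end }}
      </li>
    {{ end }}
  </ul>
{{ end }}
```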
Now start up your server and go to http://localhost:1313/stars !
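If the server isn't already running:

```sh
# Add -D if content/stars.md is marked as a draft.
hugo server
```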
Less easy scraping
So it turns out that I've been liking lots of things on GitHub and I have more than one page of results. The GitHub API uses the HTTP Link header to point to the next, prev, first, and last pages. We'll write a little script that saves that header and then parses it out.
- We create a small function that passes the -dh flags to http, redirecting stdout to our file (>page1.json) and stderr to a file containing the headers (2>headers).
- Then we'll parse the headers using grep, tr, and sed to pull out the URL that's tagged with rel="next".
- If there is one, we'll follow it and download page2.json, and so on.
- Then we'll merge all of the files together using jq --slurp '[.[][]]' *json so the multiple files of JSON arrays become one big JSON array.
- And then copy over our existing jq parsing.
Here's update_github_stars.sh:
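In rough form (the username is a placeholder, and it leans on HTTPie's download mode writing the body to stdout and the headers to stderr when both are redirected):

```sh
#!/usr/bin/env bash
# Fetch every page of starred repos by following the Link header's
# rel="next" URL, then merge the pages and trim the fields for Hugo.
set -euo pipefail

user="username"   # placeholder; substitute your own GitHub username
url="https://api.github.com/users/${user}/starred"
page=1

fetch_page() {
  # -d (download mode) sends the body to stdout when it is redirected and
  # the response headers to stderr; -h limits that output to headers only.
  http -dh "$1" Accept:application/vnd.github.v3.star+json \
    > "page${page}.json" 2> headers
}

while [ -n "$url" ]; do
  fetch_page "$url"
  # grep, tr, and sed pull out the URL tagged rel="next", if any;
  # when there isn't one, url comes back empty and the loop ends.
  url=$(tr ',' '\n' < headers \
    | grep 'rel="next"' \
    | sed -e 's/.*<\(.*\)>.*/\1/' || true)
  page=$((page + 1))
done

# Merge the per-page arrays into one big array, then apply the same
# field extraction as before and drop the result into Hugo's data dir.
jq --slurp '[.[][]]' page*.json \
  | jq '[.[] | {starred_at,
                full_name: .repo.full_name,
                description: .repo.description,
                html_url: .repo.html_url}]' \
  > data/stars.json

rm -f page*.json headers
```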
Conclusion
My star list is a little unwieldy right now with 120 entries (jq '. | length' data/stars.json) and not all of the descriptions are that informative, but this was all built with simple tools and minimal dependencies. We didn't need any special libraries to script this, no Gemfile or package.json installations, just a relatively simple bash script.