Personal information from only a URL

what can automated tools find out

Published November 14, 2014 #ruby #socialinvestigator

This article was published a while ago and may contain obsolete information!

Ever wonder what you can find out from just a URL? How about physical addresses, server location, emails, phone numbers, links to other profiles (which can in turn be structurally scraped), the technology stack, and more?

$ socialinvestigator net page
          created_on: 2014-10-31
          expires_on: 2015-10-31
          updated_on: 2014-10-31
      registrar_name: ENOM, INC.
                     name: WILL SCHENK
             organization: HAPPYFUNCORP
                  address: 18 BRIDGE STREET, 2E
                     city: BROOKLYN
                      zip: 11201
                    state: NY
             country_code: US
                    phone: +91.76976430
                    email: WSCHENK@GMAIL.COM
      server_country: United States
     server_location: Ashburn, Virginia
     server_latitude: 39.0437
    server_longitude: -77.4875
     server_ip_owner: Amazon Technologies Inc. (AT-88-Z)
               title: Will Schenk
         description: The blog of Will Schenk
      twitter_author: wschenk
         twitter_ids: wschenk
          responsive: true
            rss_feed: /feed.rss
           atom_feed: /feed
        technologies: Chartbeat, Font Awesome, Google Analytics, RackCache, Ruby

Standalone code is available as a gist, the complete socialinvestigator code is available on github, and it's easily installable on your machine as a gem.

$ gem install socialinvestigator
$ socialinvestigator net get_apps_json
$ socialinvestigator net page_info url

Poking around different URLs can give you a sense of the corporate entities behind sites, who is actually involved, and help you track down people when you can't find them otherwise. It was actually hard to figure out which URL to include in this post, since the data seems so personal, and yet people put it out there. This takes the messy HTML that's out there and returns structured information that you can use to explore profiles on other sites in a way that can be totally automated.

What does it do

The code first searches DNS to see who owns the domain and whether there's any contact information associated with it. It then looks at who owns the IP address and tries to locate it geographically.

It then looks at the page itself for Open Graph meta data, Twitter Card meta data, and other basic SEO tags.

Finally, it looks inside the page for likely looking links to other social networks, and scans the page and HTTP metadata for clues about what underlying technology the site was built in. (The metadata for the technology fingerprinting is from the Wappalyzer project which I cobbled together a basic ruby engine for.)

And finally it takes all of the facts that it has collected, figures out which ones take priority, and prints them out.

Finding Domain info

The first thing that we do is take the URL and try to find the domain name. The difference between a hostname and a domain name is subtle, partly because in some cases they are interchangeable, and partly because DNS is the second most amazing thing about the Internet. (The most truly mind-blowing thing is clearly the default route, the life, liberty, and pursuit of happiness of the Internet.) A globe-spanning, highly distributed database that lets 2.5 billion internet users look up any of the 4 billion potential server addresses in less than 50ms without any real centralized control isn't exactly straightforward.

DNS manages this complexity by delegating authority for different branches of the namespace. The first level is the Top Level Domains, the most famous being .com; when you buy a domain name from someone, they delegate authority over that name space to you. These delegations can go deep, especially with large global organizations. The first thing we do is look for the Start of Authority (SOA) record for the machine named in the URL. If we can't find one for that machine, we walk up the chain until we find something.

This looks like:

require 'dnsruby'
require 'uri'

hostname = URI(url).hostname

def find_domain( hostname )
  puts "Looking for SOA of #{hostname}"
  dns =
  soa = dns.query( hostname, "SOA" ) { |rr|
    rr.is_a? Dnsruby::RR::IN::SOA
  }

  return hostname if soa.length > 0

  # Go from "" -> ""
  parts = hostname.split( /\./ )
  return nil if parts.length <= 2

  find_domain( parts.slice(1,100).join( "." ) )
end

Once we’ve found the domain, we query the whois databases to find out who owns the domain name.

require 'whois'

whois = Whois.lookup( domain )

puts "Expires: #{whois.expires_on}"
# Print all contact information we find
whois.contacts.each { |c| puts c }

One of the challenges here is that there is no standardized format for whois responses, so there is no standardized way of parsing them. The whois gem gives it a serious try:

$ ls -l `bundle show whois`/lib/whois/record/parser | wc -l

But there are over 500 different whois servers out there, so you won’t always get a parseable response. In that case we print out that we can’t find a parser, and we store the unparsed response in the data object as unparsed_whois. do |p|
  # Check for responses that we couldn't parse
  if Whois::Record::Parser.parser_for(p).is_a? Whois::Record::Parser::Blank
    puts "Couldn't find a parser for #{}:"
    puts p.body
  end
end

Finding IP and hosting information

Now we look up the IP address, and then do a reverse lookup on it to see what the server's machine name is.

ip_address = Dnsruby::Resolv.getaddress( hostname )

data.remember :ip_address, ip_address

  data.remember :server_name, Dnsruby::Resolv.getname( ip_address )
rescue Dnsruby::NXDomain
  # Couldn't do the reverse lookup
end

Sometimes interesting things are encoded in the server name, like if it’s a Rackspace cloud server vs a Rackspace static server, but we make no attempt to interpret that string.

Then we try to see where the IP address is located geographically, using If you did a lot of this it would make sense to buy a more detailed database from Maxmind, but for something quick and dirty this works. Given that you need to follow the laws of the country you are in, it's interesting to see where the servers are located.

location_info = HTTParty.get('' + ip_address)

data.remember :server_country, location_info['country']
data.remember :server_location, [location_info['city'], location_info['region_name']].select { |x| x }.join( ", ")
data.remember :server_latitude, location_info['latitude']
data.remember :server_longitude, location_info['longitude']

We can also do a whois lookup on the IP address, to see who owns that IP block. This should give us an idea of who is hosting the site. Note that we don’t even pretend to parse the whois response here in a clever way.

ip_whois = Whois.lookup( ip_address ) { |x| x =~ /Organization/ }.each do |org|
  if org =~ /Organization:\s*(.*)/
    data.another :server_ip_owner, $1
  end
end

Page meta data

Now we load up the page, and look for some basic stuff. The first thing that we do is load the meta tags into something more accessible.

require 'httparty'
require 'nokogiri'

response = HTTParty.get url
parsed = Nokogiri.parse response.body

# Meta tags

meta = {}
parsed.css( "meta[name]" ).each do |t|
  meta[t.attributes["name"].value] = t.attributes["content"].value if t.attributes["content"]
end

parsed.css( "meta[property]" ).each do |t|
  meta[t.attributes["property"].value] = t.attributes["content"].value if t.attributes["content"]
end

Now we load up some basic SEO info, including if there are any feeds for this site’s content.

data.remember( :author, meta['author'] )
data.remember( :description, meta['description'] )
data.remember( :keywords, meta['keywords'] )
data.remember( :generator, meta['generator'])
data.remember( :responsive, true )  if meta["viewport"] =~ /width=device-width/
data.remember( :server, response.headers['server'] )
data.remember( :page_title, parsed.title )

# RSS Feed:
if feed = parsed.css( 'link[type="application/rss+xml"]' ).first
  feed = feed.attributes['href'].value
  data.remember( :rss_feed, feed )
end

# Atom Feed:
if feed = parsed.css( 'link[type="application/atom+xml"]' ).first
  feed = feed.attributes['href'].value
  data.remember( :atom_feed, feed )
end

Twitter Cards

Twitter Card meta data is a way to control how your content gets displayed on Twitter, which has the benefit of defining some summary meta data around the social graph. One thing to note is that twitter:creator is the author of this page, while twitter:site is the Twitter account for the overall site.

data.remember( :twitter_title, meta["twitter:title"] )
data.remember( :twitter_creator, meta["twitter:creator"] )
if /@(.*)/.match( meta["twitter:creator"] )
  data.another( :twitter_ids, $1 )
end
data.remember( :twitter_site_author, meta["twitter:site"] )
if /@(.*)/.match( meta["twitter:site"] )
  data.another( :twitter_ids, $1 )
end
data.remember( :twitter_image, meta["twitter:image"] )
data.remember( :twitter_description, meta["twitter:description"] )

Open Graph

Open Graph meta data is really about what your link looks like when someone shares it on Facebook.

data.remember( :og_title, meta["og:title"] )
data.remember( :og_description, meta["og:description"] )
data.remember( :og_type, meta["og:type"] )
data.remember( :og_image, meta["og:image"] )

We also search the page for social links. For Twitter, Facebook, and Google+ we only let through links that have a simple query string, since for the most part this means that it's the user's ID.

Parsing Twitter Shares and Intents

We then look for Twitter Share links, and try and parse out the user names found in there.

# Look for twitter shared links

twitter_shared = matching_links( parsed, /\/share/ )

twitter_shared.each do |l|
  text = l['data-text']

  # See if there's a "by @user" in the text
  if /by\s*@([^\s]*)/.match text
    data.another( :twitter_ids, $1 )
    data.remember( :twitter_by, $1 )
  end

  # Look for all "@usernames" in the text
  if text { |x| x =~ /^@/ }.each do |id|
      data.another( :twitter_ids, id.slice( 1, 100 ) ) # We don't want the @
    end
  end

  # See if there's a via link on the anchor tag
  if l['data-via']
    data.another( :twitter_ids, l['data-via'] )
  end

  # See if there's a via param in the href's query string
  possible_via = URI.decode( (URI(l['href']).query) || "" ).split( /&/ ).collect { |x| x.split( /=/ ) }.select { |x| x[0] == 'via' }
  if possible_via.size > 0
    data.another( :twitter_ids, possible_via[0][1] )
  end
end

There are also twitter intent links:

twitter_intent = hrefs( matching_links( parsed, /\/intent/ ) )

twitter_intent.each do |t|
  URI.decode( URI(t.gsub( / /, "+" )).query ).split( /&/ ).select do |x|
    x =~ /via/
  end.collect do |x|
    x.gsub( /via=/, "" )
  end.each do |via|
    data.another( :twitter_ids, via )
  end
end

Technology Finger Prints

The final thing we do is load the apps.json file from Wappalyzer, a cross-platform utility that uncovers the technologies used on websites. It contains lists of regexes for the header tags, meta tags, scripts, and other parts of the HTML, which are used to make guesses about which technology is in place. What's here is very rudimentary, but it gives a general sense of what was used to make the site.


The standalone code is available as a gist, and you can check out the complete socialinvestigator code on github. To run this on your machine:

$ gem install socialinvestigator
$ socialinvestigator net get_apps_json
$ socialinvestigator net page_info

It may take a while to get the responses. If you want to see everything it's doing, use the --debug switch:

$ socialinvestigator net page_info --debug

The reverse lookup can take a while; if you want to turn it off, use --noreverse:

$ socialinvestigator net page_info --noreverse
