This post is very old and contains obsolete information.
Ever wonder what you can find out by looking at a URL? How about physical addresses, server location, emails, phone numbers, various links to other profiles (which can in turn be structurally scraped), technology stack, and more.
The standalone code is available as a gist, the complete socialinvestigator code is on GitHub, and it's easily installable on your machine as a gem.
Poking around different URLs can give you a sense of the corporate entities behind sites and who is actually involved, and can help you track down people when you can't find them otherwise. It's actually hard to figure out which URL to include in this post, since the data seems so personal and yet people put it out there. This tool takes the messy HTML that's out there and returns structured information that you can use to explore profiles on other sites in a way that can be totally automated.
What does it do?
What this code does is to first search for DNS information to see who owns the domain and if there’s any contact information associated with it. It then looks at who owns the IP address and tries to locate where it is geographically.
It then looks at the page itself to see Open Graph metadata, Twitter Card metadata, and other basic SEO tags.
Finally, it looks inside the page for likely looking links to other social networks, and scans the page and HTTP metadata for clues about what underlying technology the site was built in. (The metadata for the technology fingerprinting is from the Wappalyzer project which I cobbled together a basic ruby engine for.)
Then it takes all of the facts it has collected, figures out which ones take priority, and prints them out.
Finding Domain info
The first thing we do is take the URL and try to find the domain name. The difference between a hostname and a domain name is subtle, partly because in some cases they are interchangeable, and partly because DNS is the second most amazing thing about the Internet. (The most truly mind-blowing thing is clearly the default route, the life, liberty, and pursuit of happiness of the Internet.) A globe-spanning, highly distributed database that lets 2.5 billion internet users look up any of the 4 billion potential server addresses in less than 50ms without any real centralized control isn't exactly straightforward.
DNS manages this complexity by delegating authority for different branches of the entire namespace. The first level is called the Top Level Domains, the most famous being .com; when you buy a domain name from someone, they delegate authority over that namespace to you. These delegations can go deep, especially with large global organizations. The first thing we do is look for the Start of Authority (SOA) record for the machine named in the URL. If we can't find one for that machine, we walk up the chain until we find something.
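In Ruby, this walk-up can be sketched with the standard library's Resolv (a minimal sketch, not the code from the gist; the method names are mine):

```ruby
require "resolv"

# Build the list of zones to try, from the full hostname up toward
# the registry: www.blog.example.com, blog.example.com, example.com.
def candidate_zones(hostname)
  labels = hostname.split(".")
  (0..labels.length - 2).map { |i| labels[i..-1].join(".") }
end

# Walk up until some zone answers with a Start of Authority record.
def find_domain(hostname)
  Resolv::DNS.open do |dns|
    candidate_zones(hostname).each do |zone|
      begin
        dns.getresource(zone, Resolv::DNS::Resource::IN::SOA)
        return zone
      rescue Resolv::ResolvError
        next # no SOA at this level, go up one
      end
    end
  end
  nil
end
```

Given a deep hostname, `find_domain` returns the first ancestor zone that has an SOA record, which is usually the registered domain.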
Once we've found the domain, we query the whois databases to find out who owns the domain name.
One of the challenges here is that there's no standardized format for whois responses, and therefore no standardized way of parsing them. The whois gem gives it a serious try.
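The lookup and the parse attempt might look like this (a sketch: `summarize_whois` and `domain_whois` are my names, and newer versions of the whois gem split the parsing out into the separate whois-parser gem):

```ruby
# Pull a few parsed fields out of a whois record, falling back to
# the raw text when no parser exists for that registry.
def summarize_whois(data, record)
  contact = record.registrant_contacts.first
  data[:registrant] = contact && contact.name
  data[:created_on] = record.created_on
  data
rescue StandardError
  # Couldn't find a parser: keep the unparsed response around.
  data[:unparsed_whois] = record.to_s
  data
end

def domain_whois(domain)
  require "whois" # gem install whois; lazy require so the sketch loads without it
  summarize_whois({}, Whois.whois(domain))
end
```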
But there are over 500 different whois servers out there, so you won't always get a parseable response. In that case we print out that we can't find a parser, and we store the unparsed response in the data object as unparsed_whois.
Finding IP and hosting information
Now we look at the IP address, and then do a reverse lookup on it to see what the server machine name is.
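Both lookups fit in a few lines of Resolv (a sketch; the method name is mine, and many IPs simply have no reverse record):

```ruby
require "resolv"

# Forward lookup to get the IP, then a reverse (PTR) lookup to see
# what the hosting company named the server.
def server_names(hostname)
  ip = Resolv.getaddress(hostname)
  reverse = begin
    Resolv.getname(ip)
  rescue Resolv::ResolvError
    nil # no PTR record for this address
  end
  { ip: ip, reverse_hostname: reverse }
end
```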
Sometimes interesting things are encoded in the server name, like if it’s a Rackspace cloud server vs a Rackspace static server, but we make no attempt to interpret that string.
Then we try to see where the IP address is located geographically, using freegeoip.net. If you did a lot of this it would make sense to buy a more detailed database from MaxMind, but for something quick and dirty this works. Given that you need to follow the rules of the country you are in, it's interesting to see where the servers are located.
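The call is just an HTTP GET that returns JSON (a sketch; freegeoip.net has since shut down, so treat the endpoint and field names as assumptions from the era, and the sample values below are made up):

```ruby
require "json"
require "net/http"
require "uri"

# Pick out the interesting fields from a freegeoip-style response:
#   {"ip":"...","country_name":"...","city":"...","latitude":...,"longitude":...}
def location_facts(json)
  data = JSON.parse(json)
  { country: data["country_name"], city: data["city"],
    latitude: data["latitude"], longitude: data["longitude"] }
end

# Live lookup against the API the post used.
def geolocate(ip)
  location_facts(Net::HTTP.get(URI("http://freegeoip.net/json/#{ip}")))
end
```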
We can also do a whois lookup on the IP address to see who owns that IP block. This should give us an idea of who is hosting the site. Note that we don't even pretend to parse the whois response here in a clever way.
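Something like this, with the parsing kept deliberately dumb (the grep for organization-looking lines is my addition, not the post's code, and the field names vary by registry):

```ruby
# Raw whois text for an IP block, no parsing.
def ip_whois(ip)
  require "whois" # gem install whois
  Whois.whois(ip).to_s
end

# Surface the lines that usually name the hosting organization.
def org_lines(raw_whois)
  raw_whois.lines.grep(/^(OrgName|org-name|Organization|netname):/i).map(&:strip)
end
```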
Page meta data
Now we load up the page, and look for some basic stuff. The first thing that we do is load the meta tags into something more accessible.
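A stdlib-only sketch of that step (the real code uses an HTML parser, which is more robust than a regex scan; the method name is mine):

```ruby
# Collect <meta> name/property => content pairs into a hash.
def meta_tags(html)
  tags = {}
  html.scan(/<meta\s+[^>]*>/i) do |tag|
    name = tag[/(?:name|property)=["']([^"']+)["']/i, 1]
    content = tag[/content=["']([^"']*)["']/i, 1]
    tags[name] = content if name
  end
  tags
end
```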
Now we load up some basic SEO info, including whether there are any feeds for this site's content.
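Feed discovery can be sketched by scanning for alternate links (again stdlib-only; the method name is mine):

```ruby
# Look for <link> tags that advertise RSS or Atom feeds.
def feed_links(html)
  html.scan(/<link\s+[^>]*>/i)
      .select { |tag| tag =~ /application\/(?:rss|atom)\+xml/i }
      .map { |tag| tag[/href=["']([^"']+)["']/i, 1] }
      .compact
end
```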
Twitter Cards
Twitter Card metadata is a way to control how your data gets displayed on Twitter, which has the benefit of defining some summary metadata around the social graph. One thing to note is that twitter:creator is the author of this page, while twitter:site is the twitter account for the overall site.
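Assuming the meta tags have been collected into a hash as above, pulling out the card is just key lookups (a sketch; the method name is mine):

```ruby
# twitter:site is the site's account, twitter:creator the page author.
def twitter_card(meta)
  { card:        meta["twitter:card"],
    site:        meta["twitter:site"],
    creator:     meta["twitter:creator"],
    title:       meta["twitter:title"],
    description: meta["twitter:description"] }.reject { |_, v| v.nil? }
end
```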
Open Graph
Open Graph metadata is really about what your link looks like when someone shares it on Facebook.
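From the same meta-tag hash, the Open Graph fields are everything under the og: prefix (a sketch; the method name is mine):

```ruby
# Collect every og:* property from the meta-tag hash.
def open_graph(meta)
  meta.select { |key, _| key.to_s.start_with?("og:") }
end
```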
Social Page Links
We search for social links:
Service | Regex |
---|---|
Email | /mailto:(.*@.*\..*)/ |
Twitter | /twitter.com\/[^\/]*$/ |
LinkedIn | /linkedin.com/ |
Instagram | /instagram.com/ |
Facebook | /facebook.com\/[^\/]*$/ |
Google+ | /plus.google.com\/[^\/]*$/ |
Github | /github.com\/[^\/]*$/ |
For Twitter, Facebook, and Google+ we only let through links that end in a single path component (that's what the [^\/]*$ does), since for the most part this means it's the user's ID.
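The scan over the page's links can be sketched like this, using the table's patterns (a stdlib-only sketch; the names are mine):

```ruby
# The table's patterns, keyed by service.
SOCIAL_PATTERNS = {
  email:      /mailto:(.*@.*\..*)/,
  twitter:    /twitter\.com\/[^\/]*$/,
  linkedin:   /linkedin\.com/,
  instagram:  /instagram\.com/,
  facebook:   /facebook\.com\/[^\/]*$/,
  googleplus: /plus\.google\.com\/[^\/]*$/,
  github:     /github\.com\/[^\/]*$/
}.freeze

# Pull every href off the page and bucket the ones that look social.
def social_links(html)
  links = html.scan(/href=["']([^"']+)["']/).flatten
  SOCIAL_PATTERNS.each_with_object({}) do |(service, pattern), found|
    matches = links.grep(pattern)
    found[service] = matches unless matches.empty?
  end
end
```

Note how a deep link like twitter.com/someuser/status/123 falls through, while the bare profile link is kept.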
Parsing Twitter Shares and Intents
We then look for Twitter share links and try to parse out the user names found there.
|
|
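Share links carry the account to credit in a via= parameter, which the stdlib can pull apart (a sketch; the method name is mine):

```ruby
require "uri"
require "cgi"

# e.g. https://twitter.com/share?via=someuser&text=...
def share_via(url)
  uri = URI.parse(url)
  return nil unless uri.query
  CGI.parse(uri.query)["via"].first
end
```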
There are also Twitter intent links.
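These carry the user name in a screen_name parameter instead (a sketch; the method name is mine):

```ruby
require "uri"
require "cgi"

# e.g. https://twitter.com/intent/user?screen_name=someuser
def intent_screen_name(url)
  uri = URI.parse(url)
  return nil unless uri.path.include?("/intent/")
  CGI.parse(uri.query.to_s)["screen_name"].first
end
```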
Technology Fingerprints
The final thing we do is load the apps.json file from Wappalyzer, a cross-platform utility that uncovers the technologies used on websites. This file has lists of regexes for the header tags, meta tags, scripts, and other parts of the HTML, which let us make guesses about which technology is in place. What we do with it is very rudimentary, but it gives a general sense of what was used to make the site.
Installation
The standalone code is available as a gist, and you can check out the complete socialinvestigator code on GitHub. It installs on your machine as a gem.
It may take a while to get the responses. If you want to see everything it's doing, use the --debug switch.
The reverse lookup can take a while, and there's a switch to turn that off if you don't need it.