Happy Birthday Captain

Here’s one of me and Cap from around 1997, I believe, whilst he was promoting his “Mad Cows and Englishmen” album.

Web Scraping – How to scrape Instagram

Targeting Instagram

I’ve blogged about web scraping before, mainly with Raspberry Pi and Python (utilising BeautifulSoup).
Finding a free weekend, I decided to revisit scraping and looked at Instagram.

Disclaimer

Any code herein is purely experimental, just a look at what could be achieved, and is not intended for real use. It works as of January 2018. Please respect any photographer’s copyright. Also, if you’re looking to do anything of note with Instagram, you should use their developer APIs. Finally, my intention was to scrape tagged content; scraping individual accounts is a bit creepy, so don’t do it, you weirdo!

Scraping Scenarios

Ideally, I’d like to pick an Instagram tag and have a service download all images matching that tag. However, this appeared to be rather tricky: I couldn’t see any way to page through the results. I played with PhantomJS/CasperJS and managed to get a decent amount of content back to the client using the scrolldown() function to get round the infinite scroller. However, this was turning into something of a project; I had started looking at the DOM through different User Agents and so on. I just wanted a few lines of code in LINQPad that I could play with on a wet January afternoon.

Having fallen at the first hurdle with tags, I looked at whether anything similar could be achieved by targeting an individual account. I spun up my Instagram account and started to have a look at the HTML.

The Code

I noticed that the “Load More” button would link to an id and that this matched an element in the DOM. This was buried in JSON with a “typename” of “GraphImage”. So I pulled out all the Ids using the following regex in a “GetNextId” method:

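// Requires the System.Text.RegularExpressions namespace.
public long GetNextId(string response)
{
    // Each match looks like: GraphImage", "id": "1234567890
    MatchCollection matches = Regex.Matches(response, @"\bGraphImage\W.+?\B[0-9]+");
    var lowId = long.MaxValue;
    foreach (Match match in matches)
    {
        // Strip the JSON prefix so that only the numeric id remains.
        var x = long.Parse(match.Value.Replace(@"GraphImage"", ""id"": """, ""));
        if (x < lowId) lowId = x;
    }
    return lowId; // Stays at long.MaxValue when nothing matched.
}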



Var x?! Well we're just playing here.
The method keeps track of the lowest Id on the page, which becomes the max_id for the next URL call.
Note that 5000 is purely guesswork for the maximum number of images on the account. You should scrape the actual number from the account profile.

for (int i = 0; i < 5000; i++)
{
    var maxId = GetNextId(response);
    Thread.Sleep(1000); // Play nicely.
    response = client.GetStringAsync(urlToDownload + "?max_id=" + maxId).Result;
    DownloadImagesInPage(response, username);
}
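
As written, the loop above carries on making requests even once a page contains no more ids. Here is a rough sketch of one way to stop early. It assumes the same "GraphImage", "id": "..." JSON shape described earlier, and the names Pager, TryGetNextId, FetchAllPages, fetchPage and maxPages are all made up for illustration, not part of the original code. The page fetch is passed in as a delegate so the logic can be tried against canned strings without touching Instagram.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Pager
{
    // Assumed JSON shape: "__typename": "GraphImage", "id": "1234"
    private static readonly Regex IdPattern =
        new Regex(@"GraphImage""\s*,\s*""id""\s*:\s*""(\d+)""");

    // Returns the lowest image id on the page, or null when none are found.
    public static long? TryGetNextId(string response)
    {
        var ids = IdPattern.Matches(response)
                           .Cast<Match>()
                           .Select(m => long.Parse(m.Groups[1].Value))
                           .ToList();
        return ids.Count == 0 ? (long?)null : ids.Min();
    }

    // Walks pages until no ids remain, the lowest id stops decreasing,
    // or maxPages requests have been made.
    public static List<string> FetchAllPages(
        string firstPage, Func<long, string> fetchPage, int maxPages)
    {
        var pages = new List<string> { firstPage };
        var response = firstPage;
        long lastId = long.MaxValue;

        for (int i = 0; i < maxPages; i++)
        {
            var nextId = TryGetNextId(response);
            if (nextId == null || nextId.Value >= lastId) break; // nothing new to page to

            lastId = nextId.Value;
            response = fetchPage(lastId);
            pages.Add(response);
        }
        return pages;
    }
}
```

In the scraper itself, fetchPage would wrap the Thread.Sleep(1000) and the client.GetStringAsync(urlToDownload + "?max_id=" + id).Result call from the loop above, and maxPages would take the place of the 5000 guess.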

Download the Images on the page

Now that we've paged, we just need to download all the images on the page using regex:


MatchCollection matches = Regex.Matches(response, @"https?://[^/\s]+/\S+\.(jpg|png|gif)");

Loop through the matches and get the file name from each URL:

var fileName = match.ToString().Substring(match.ToString().LastIndexOf('/') + 1);

Use WebClient to bring down each image from the matched URL and File.WriteAllBytes to write it out to a file. Simples.
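
The three steps above (match, work out a file name, save) could be pulled together roughly like this. It is only a sketch: ImageDownloader, ExtractImageUrls, FileNameFromUrl, DownloadAll and the targetFolder parameter are illustrative names, and the regex is a slight variation on the one above that also stops at double quotes, on the assumption that the URLs sit inside JSON strings. Duplicate URLs are skipped so that the same image isn't fetched twice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class ImageDownloader
{
    // Variation on the regex above: excludes double quotes from the URL.
    private static readonly Regex ImageUrl =
        new Regex(@"https?://[^/\s""]+/[^\s""]+\.(?:jpg|png|gif)");

    // Returns each distinct image URL found in the page source.
    public static List<string> ExtractImageUrls(string response)
    {
        return ImageUrl.Matches(response)
                       .Cast<Match>()
                       .Select(m => m.Value)
                       .Distinct()
                       .ToList();
    }

    // Everything after the last '/' is used as the local file name.
    public static string FileNameFromUrl(string url)
    {
        return url.Substring(url.LastIndexOf('/') + 1);
    }

    // Downloads every image on the page into targetFolder.
    public static void DownloadAll(string response, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);
        using (var webClient = new WebClient())
        {
            foreach (var url in ExtractImageUrls(response))
            {
                byte[] bytes = webClient.DownloadData(url);
                File.WriteAllBytes(Path.Combine(targetFolder, FileNameFromUrl(url)), bytes);
            }
        }
    }
}
```

WebClient is used here because that is what the post mentions; on newer .NET versions it is flagged as obsolete in favour of HttpClient, so treat this as a LINQPad-style experiment rather than anything more.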

Limitations/Future Work

I time-boxed this exercise to 2 hours, so I'm aware there's room for improvement. I did notice that the code doesn't work with certain Instagram accounts but didn't have the time to take this further. Also, I know my regex could be improved. Any comments or feedback are always welcome.

Obviously, this is a bit of fun. If you're scraping Instagram, respect the owner's copyright etc.