Scrape all you can, give nothing back

Publish date: 8 Nov, 2021
Tags: story scraping gitlab

I imagine Jack Sparrow would say something like that if he knew about web scraping and crawling. All the data about ships and cargo would definitely be useful to him and his Black Pearl crew. But how would scraping benefit me, an ordinary developer who isn’t planning to steal a golden treasure?

Jack scraping.

Tons of websites without a public API

When you have been a web developer for a while and requesting RESTful APIs is your bread and butter, you often come across some service and think to yourself, “It would be really nice if I could just request this data through an API…” This thought usually comes to my mind when I’m working on a side project of some sort.

We won’t offer an API but we will try to block you instead

This summer I worked on a new website for a football club. The old version had a nice little widget with the latest results of all the club’s teams. I wanted to keep this feature, but as I found out, the data was filled in manually each week. “There must be a way to automate this,” I thought to myself, as I’m a lazy kind of programmer and, moreover, I knew that the Czech football association (FAČR) has the results because it displays them on its website. Naively, I thought that a single GET request for each team would be enough to give me the data I needed. I wanted to be nice, so I would even cache the responses and only request them once or twice a week, which was super easy as I was using Next.js. However, after a quick search I found out that not only do they not offer an API, they also actively block scrapers.

The first thing that should have warned me was a deprecated project that had built a similar scraper and offered the data as a free service. The author lost patience with continuously updating his scraper to get around the different blocking mechanisms on the FAČR site and decided to stop developing the project. However, I found the site of the club he was primarily using this service for, and it still shows current results. I have two guesses: either he fills in the data manually or he found a way to automate it. But can you defeat a captcha and the other obstacles along the way?

Halfway to automation

As it turns out, there are services that can solve captchas for you, and in combination with an Apify actor you could probably scrape the site. However, it would cost some money (captcha service, Apify proxies), and as this is just a hobby project I didn’t want to spend the money or the extra time to set it up. I decided to postpone the automatic results widget and focus on match schedules instead. I still couldn’t scrape them automatically, as they live on the same site, but I decided to manually download the HTML of every team’s schedule and automate the rest from there. The schedules don’t change much, and when they do I receive an email notification so I can generate a new version. I’m okay with that for now, and I’m sure one day I will automate it completely.

Where will we eat today?

I’m sure our office isn’t the only one that has to answer the hardest question of each day: “Where are we gonna eat today?” There are a few options in the area around our office, and some more came up when restaurants adapted to takeout during COVID. But what’s the daily menu in this one and what do they offer in that one? Surely you don’t want to browse through X websites and compare the menus in your head. So a new side project was born.

Don’t buy a new domain, don’t buy a new domain

Don’t buy the domain.

I can’t believe I resisted the urge to buy a new domain right away, but I’m proud of myself. Maybe I’m just annoyed by all the emails from the Czech domain administrator about domain name expiration. Anyway, I was hooked on building a new project that would bring me some joy and be useful to some people - at least a few of my colleagues. So I started with a scraper. I checked the website of the first restaurant and it looked simple enough for the Cheerio scraper, so let’s get some data! Write a few lines of JavaScript (who needs TypeScript for side projects, right?), test them on a locally downloaded HTML file and voilà, it should work just fine. Next step - try it on the Apify platform. Set up an actor, build it, run it, and it failed. Oops. So it isn’t so easy after all.

People, write valid HTML, please!

After debugging my failed scraper for a while, I found out, thanks to Apify’s snapshots, that the webpage was somehow broken when I tried to scrape it. How so? Simple answer: invalid HTML. I had picked the Cheerio scraper, which just fetches the HTML and lets you go through it using jQuery-like selectors. But, as you would probably guess, Cheerio can fail when you supply invalid HTML. Well, it took me a while to figure that out. But once I did, the fix was pretty simple - switch to the more robust Puppeteer scraper, which uses a headless browser that repairs the HTML errors, so the scraper sees the website as you would in the browser. Let’s scrape the next one.

Maybe think a little before writing code

Next I picked the student canteen of the University of South Bohemia. I already had a scraper for it, so I just refactored it a little and it worked. Well, I wasn’t expecting it to, but I’ll take it! As I was deploying it to Apify, I started to think about the bigger picture. How should I trigger the scrapers, and once all of them have finished, how would they trigger the website build? I couldn’t think of an elegant solution, so here comes Captain Refactor! I merged both scrapers into one project. While doing so, I moved the code to a separate Git repository as well. I updated the pageFunction to decide by URL which process function should be used, and I created a separate file for each restaurant. Works like a charm. Let’s add the last restaurant I had in mind and we are done for the moment.

Merging the scrapers turned out to be a good decision, as all I have to do now is trigger the website build using a webhook. It’s simple and works great.
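The idea of deciding by URL which process function to use can be sketched roughly like this. It is only an illustration, not the actual project code: the hostnames, the parser names and the shape of the `page` object are all made up, and the real pageFunction receives a scraper context rather than a plain object.

```javascript
// Hypothetical per-restaurant parsers. In a real project each one would live
// in its own file and read the scraped page instead of a plain object.
const parseFirstRestaurant = (page) => ({ restaurant: 'first', dishes: page.items });
const parseCanteen = (page) => ({ restaurant: 'canteen', dishes: page.rows });

// Map from hostname to parser. The hostnames are placeholders.
const processors = new Map([
  ['first-restaurant.example', parseFirstRestaurant],
  ['canteen.example', parseCanteen],
]);

// Pick the parser by the hostname of the requested URL.
function pickProcessor(url) {
  const { hostname } = new URL(url);
  const processor = processors.get(hostname);
  if (!processor) {
    throw new Error(`No processor registered for ${hostname}`);
  }
  return processor;
}

// Simplified stand-in for a pageFunction: dispatch first, then parse.
function pageFunction({ url, page }) {
  return pickProcessor(url)(page);
}
```

With a layout like this, adding another restaurant means writing one more parser and registering its hostname, while the dispatching code stays untouched.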

Where the hell are my menus?

The new website seemed to work great for a few days. But one morning I opened it and saw no menus scraped for the restaurants. What happened? Did both restaurants close? I don’t want to go to the student canteen today, there is nothing edible! Let’s see the scraper status - it ran all right and triggered the website build as planned. So is there another bug? Both restaurants’ websites show the menus. Hmm, strange. Let’s trigger the scrape manually and see if it helps. It did. So I investigated a little more and found out that both restaurants add their menus later in the day. My scraper was scheduled for 6 am and they add the menus after 9 am. So I just updated the schedule time and it works again. Nice.
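A run that finishes successfully but scrapes nothing is easy to miss. One possible safeguard, which is not part of the project described here, is to check the results before triggering the build. The sketch below assumes each scraped result looks like `{ restaurant, dishes }`; that shape and both function names are hypothetical.

```javascript
// Hypothetical result shape: { restaurant: string, dishes: string[] }.
// Returns the names of restaurants whose menu came back empty or missing.
function findEmptyMenus(results) {
  return results
    .filter((r) => !Array.isArray(r.dishes) || r.dishes.length === 0)
    .map((r) => r.restaurant);
}

// Trigger the website build only when every restaurant has at least one dish.
function shouldTriggerBuild(results) {
  return results.length > 0 && findEmptyMenus(results).length === 0;
}
```

A scheduled run could log the names returned by `findEmptyMenus` and skip the build, so a too-early schedule shows up as a warning instead of an empty page.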

All that is missing now is more restaurants with menus! And you can add some too, check out

If you don’t offer an API, at least don’t block scraping

It’s just a small wish of mine and a little warning for developers who try to block scraping of their public data. If someone wants the data, they will scrape it anyway, so why not save them some time and make the web more open? Maybe someone will build something with your data that you would never have thought of and help someone else. If you don’t want to share it, just don’t publish it at all.