Spiderless, Web Spider on Serverless

A web spider / scraper / website change detector built with Lambda, API Gateway, DynamoDB and SNS

spider-less

Web spider on Serverless!

About Spiderless

Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor nearly anything on the web and get notified when it changes. The application is built on these technologies:

Technology         Used For
Bulma, Buefy       UI
Vue.js             Front-end logic
AWS S3             Website hosting
AWS Lambda         Backend API
AWS SNS            Message queue
AWS DynamoDB       Database
AWS API Gateway    API gateway
AWS CloudFront     CDN
AWS Route 53       DNS

Architecture

[Figure: serverless application architecture]

API Endpoints

GET subscriptions

Description

Get a list of subscriptions. The response is capped at 1 MB of data, a limit imposed by DynamoDB.

Parameters

None

Request

curl /api/subscriptions

Response

[
  {
    "createdAt": 1544833435070,
    "targets": [
      {
        "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
        "label":"ratingCount"
      }
    ],
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "url": "https://www.imdb.com/title/tt0111161/",
    "interval": 60
  }
]
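
DynamoDB returns at most 1 MB per Scan call and includes a `LastEvaluatedKey` when more items remain. As a rough illustration of how a caller could page past that cap, the sketch below loops until no key is returned. `scanPage` is a hypothetical function standing in for a DynamoDB Scan call; whether Spiderless reads subscriptions with a Scan is an assumption, and none of this is taken from its source.

```javascript
// Minimal sketch: collect every item from a paginated, DynamoDB-style scan.
// `scanPage` is a hypothetical async function (not part of Spiderless) that
// accepts { ExclusiveStartKey } and resolves to { Items, LastEvaluatedKey }.
async function scanAll(scanPage) {
  const items = [];
  let startKey; // undefined on the first call
  do {
    const page = await scanPage({ ExclusiveStartKey: startKey });
    items.push(...(page.Items || []));
    startKey = page.LastEvaluatedKey; // absent once the last page is reached
  } while (startKey !== undefined);
  return items;
}
```

The scan function is injected so that the paging loop can be exercised without AWS credentials.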

POST subscriptions

Description

Create a new subscription to feed the spider.

Parameters

  • url (required) - URL of the target website
  • targets (required) - List of targets, each with a label and a CSS selector whose text content should be extracted
  • interval (required) - The interval (in minutes) between scrapes
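
The parameter list above can be turned into a small validation routine. This is only a sketch: the function name `validateSubscription` and every rule in it are assumptions inferred from the parameter descriptions, not the checks Spiderless performs. It also assumes the payload has already been parsed into the shape shown in the response (targets as a list, interval as a number); a handler receiving them as strings would need to parse them first.

```javascript
// Minimal sketch: validate a POST /api/subscriptions payload.
// All rules are assumptions inferred from the documented parameters.
function validateSubscription(payload) {
  const errors = [];

  // url: must parse as an absolute http(s) URL
  try {
    const { protocol } = new URL(payload.url);
    if (protocol !== "http:" && protocol !== "https:") {
      errors.push("url must use http or https");
    }
  } catch (err) {
    errors.push("url is required and must be a valid URL");
  }

  // targets: non-empty list of { label, selector } with non-empty strings
  const targets = payload.targets;
  if (!Array.isArray(targets) || targets.length === 0) {
    errors.push("targets is required and must be a non-empty list");
  } else {
    targets.forEach((target, index) => {
      const ok =
        target !== null &&
        typeof target === "object" &&
        typeof target.label === "string" && target.label !== "" &&
        typeof target.selector === "string" && target.selector !== "";
      if (!ok) errors.push(`targets[${index}] needs a label and a selector`);
    });
  }

  // interval: positive whole number of minutes
  if (!Number.isInteger(payload.interval) || payload.interval <= 0) {
    errors.push("interval is required and must be a positive integer (minutes)");
  }

  return errors; // an empty array means the payload looks valid
}
```

Returning a list of messages instead of throwing on the first problem lets a caller report every issue in one response.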

Request

curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"

Response

{
  "id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "targets": [
    {
      "label":"ratingCount",
      "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
    }
  ],
  "interval": 60,
  "createdAt": 1544833533059,
  "updatedAt": 1544833533059
}

DELETE subscriptions

Description

Delete a subscription.

Parameters

  • id (required) - Subscription id

Request

curl -X DELETE /api/subscriptions/:id

Response

{
  "id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058"
}

Functions List

scrape

Description

Scrape target websites and extract target contents.

Invoke

yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'

Response

[
  {
    "label": "ratingCount",
    "content": "2,025,796"
  }
]
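
To make the mapping from the subscription's targets to the response above concrete, here is a minimal sketch of that data flow. `extractTargets` and `selectText` are hypothetical names: `selectText(html, selector)` stands in for whatever HTML parser the real function uses and is assumed to return the matched element's text, or null when nothing matches.

```javascript
// Minimal sketch of the scrape step's data flow (not the Spiderless source).
// `selectText(html, selector)` is a hypothetical helper, for example a thin
// wrapper around an HTML parser, returning the matched text or null.
function extractTargets(html, targets, selectText) {
  return targets.map(({ label, selector }) => {
    const text = selectText(html, selector);
    // Keep a null content for selectors that matched nothing
    const content = text === null || text === undefined ? null : text.trim();
    return { label, content };
  });
}
```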

cron

Description

Fetch subscriptions from the database and select the ones that are due to be executed.

Invoke

yarn invoke:local cron

Response

None
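
One way the "due" check could work is sketched below. It rests on assumptions: a subscription is treated as due once `interval` minutes have passed since its last scrape, and the `lastScrapedAt` field (falling back to `createdAt`) is hypothetical and does not appear in the documented records. The actual cron function may use a different rule.

```javascript
// Minimal sketch of selecting subscriptions that are due (assumed logic).
const MINUTE_MS = 60 * 1000;

// `lastScrapedAt` is a hypothetical field; `createdAt` and `interval`
// (in minutes) come from the documented subscription shape.
function isDue(subscription, now) {
  const last = subscription.lastScrapedAt ?? subscription.createdAt;
  return now - last >= subscription.interval * MINUTE_MS;
}

function selectDue(subscriptions, now = Date.now()) {
  return subscriptions.filter((subscription) => isDue(subscription, now));
}
```

Passing `now` as a parameter keeps the check deterministic, which makes it easy to run against fixed timestamps.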

Development

# install dependencies
yarn install

# start api server on port 8090
yarn start

# invoke function locally
yarn invoke:local function_name

# invoke remote function
yarn invoke function_name

Deploy

# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy