Crawler for Contracts and Tenders¶
This document explains how Base provides its data and how the crawler works.
Important
Please take precautions when using the crawler, as it can cause a Denial of Service (DoS) on the Base database. We provide remote access to our database to avoid that.
Important
Crawling Base from scratch takes more than 2 days as of Jan. 2014.
Base database¶
Base uses the following urls to expose its data:

Entity
: http://www.base.gov.pt/base2/rest/entidades/[base_id]

Contract
: http://www.base.gov.pt/base2/rest/contratos/[base_id]

Tender
: http://www.base.gov.pt/base2/rest/anuncios/[base_id]

List of Country
: http://www.base.gov.pt/base2/rest/lista/paises

List of District
: http://www.base.gov.pt/base2/rest/lista/distritos?pais=[country_base_id] (portugal_base_id=187)

List of Council
: http://www.base.gov.pt/base2/rest/lista/concelhos?distrito=[district_base_id]

List of ContractType
: http://www.base.gov.pt/base2/rest/lista/tipocontratos

List of ProcedureType
: http://www.base.gov.pt/base2/rest/lista/tipoprocedimentos
Each url returns JSON with information about the particular object. For this reason, we have an abstract crawler for retrieving this information and mapping it to this API.
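For instance, a minimal sketch of hitting one of these urls with the requests library (the base_id value is arbitrary):

```python
import requests

# Illustrative only: 1 is an arbitrary base_id.
url = 'http://www.base.gov.pt/base2/rest/entidades/%d' % 1

response = requests.get(url)
response.raise_for_status()

# The url returns JSON describing the requested object.
data = response.json()
print(data)
```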
What the crawler does¶
The crawler accesses Base urls using the same procedure for entities, contracts and tenders. It does the following:
1. retrieves from Base the list c1_ids of the ids of the items in rows i*10k to (i+1)*10k;
2. retrieves from our database the set c2_ids of all ids in the range [c1_ids[0], c1_ids[-1]];
3. adds, using the Entity, Contract or Tender url, all ids that are in c1_ids but not in c2_ids;
4. removes, using the same urls, all ids that are in c2_ids but not in c1_ids;
5. goes back to 1. with i += 1, until it covers all items.

The initial value of i is 0 when the database is populated from scratch; when searching for new items, it is chosen such that only one cycle 1.-5. is performed.
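A rough sketch of this loop in Python, assuming the DynamicCrawler API documented below, Django models with a base_id field, and that get_base_ids() yields the ids of the given rows (all of these are assumptions of the sketch):

```python
def synchronize(crawler, start_row=0, items_per_batch=10000):
    """Sketch of the batch synchronization described above."""
    count = crawler.get_instances_count()  # one hit to BASE
    row = start_row
    while row < count:
        # step 1: ids currently in BASE for this batch of rows
        c1_ids = set(crawler.get_base_ids(row, row + items_per_batch))
        if not c1_ids:
            break
        # step 2: ids we already have in the same range of our database
        c2_ids = set(crawler.object_model.objects.filter(
            base_id__gte=min(c1_ids), base_id__lte=max(c1_ids)
        ).values_list('base_id', flat=True))
        # step 3: add what BASE has and we don't
        for base_id in c1_ids - c2_ids:
            crawler.update_instance(base_id)
        # step 4: remove what we have and BASE doesn't
        crawler.object_model.objects.filter(
            base_id__in=c2_ids - c1_ids).delete()
        # step 5: next batch
        row += items_per_batch
```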
API references¶
This section introduces the different crawlers we use to crawl Base.
class contracts.crawler.ContractsStaticDataCrawler¶

A subclass of JSONCrawler for static data. This crawler only needs to be run once, and is used to populate the database the first time.

retrieve_and_save_all()¶

Retrieves and saves all static data of contracts.
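A first-time setup would then be a single run; assuming the project is set up so the module can be imported, something like:

```python
from contracts.crawler import ContractsStaticDataCrawler

# One-off run to populate the static tables (per the urls above:
# countries, districts, councils, contract and procedure types).
crawler = ContractsStaticDataCrawler()
crawler.retrieve_and_save_all()
```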
Given the size of the Base database, and since it is constantly being updated, contracts, entities and tenders use the following approach:
class contracts.crawler.DynamicCrawler¶

An abstract subclass of JSONCrawler that implements the crawling procedure described in the previous section.

object_name = None
A string with the name of the object, used to name the .json files; to be overwritten.

object_url = None
The url used to retrieve data from BASE; to be overwritten.

object_model = None
The model to be constructed from the retrieved data; to be overwritten.
static clean_data(data)¶

Cleans data, returning a cleaned_data dictionary whose keys are fields of the object_model and whose values are extracted from data.

To be overwritten by subclasses.
save_instance(cleaned_data)¶

Saves or updates an instance of type object_model using the dictionary cleaned_data.

This method can be overwritten to change how the instance is saved.

Returns a tuple (instance, created), where created is True if the instance was created (and not just updated).
update_instance(base_id)¶

Uses get_json(), clean_data() and save_instance() to create or update an instance identified by base_id.

Returns the output of save_instance().
get_instances_count()¶

Returns the total number of existing instances in the BASE db.

get_base_ids(row1, row2)¶

Returns a list of instances from BASE of length row2 - row1.

update_batch(row1, row2)¶

Updates a batch of rows, steps 2.-4. of the previous section.
update(start=0, end=None, items_per_batch=1000)¶

Retrieves the count of all items in BASE (1 hit) and synchronizes items from start until min(end, count), in batches of items_per_batch.

If end=None (default), it retrieves until the last item.

If start < 0, start is counted from the end.

Use e.g. start=-2000 for a quick retrieval of new items; use start=0 (default) to synchronize all items in the database (it takes time!).
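To make the contract concrete, here is a hypothetical subclass; the JSON key names and model fields used are assumptions, and the real subclasses are documented below:

```python
from contracts import models
from contracts.crawler import DynamicCrawler

class ExampleCrawler(DynamicCrawler):
    """Hypothetical subclass, for illustration only."""
    object_name = 'example'
    object_url = 'http://www.base.gov.pt/base2/rest/entidades/%d'
    object_model = models.Entity

    @staticmethod
    def clean_data(data):
        # The returned keys must be fields of object_model; the JSON
        # key names ('id', 'description') are assumptions.
        return {'base_id': data['id'], 'name': data['description']}
```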
class contracts.crawler.EntitiesCrawler¶

A subclass of DynamicCrawler to populate the Entity table.

Overwrites clean_data() to clean data to an Entity.

Uses:

- object_directory: '../../data/entities'
- object_name: 'entity'
- object_url: 'http://www.base.gov.pt/base2/rest/entidades/%d'
- object_model: Entity
class contracts.crawler.ContractsCrawler¶

A subclass of DynamicCrawler to populate the Contract table.

Overwrites clean_data() to clean data to a Contract, and save_instance() to also save the ManyToMany relationships of the Contract.

Uses:

- object_directory: '../../data/contracts'
- object_name: 'contract'
- object_url: 'http://www.base.gov.pt/base2/rest/contratos/%d'
- object_model: Contract
class contracts.crawler.TenderCrawler¶

A subclass of DynamicCrawler to populate the Tender table.

Overwrites clean_data() to clean data to a Tender, and save_instance() to also save the ManyToMany relationships of the Tender.

Uses:

- object_directory: '../../data/tenders'
- object_name: 'tender'
- object_url: 'http://www.base.gov.pt/base2/rest/anuncios/%d'
- object_model: Tender
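Putting it together, a typical synchronization run could look like the following sketch (argument values are illustrative):

```python
from contracts.crawler import ContractsCrawler, EntitiesCrawler, TenderCrawler

# Quick pass for new items only: a negative start counts from the end.
for crawler_class in (EntitiesCrawler, ContractsCrawler, TenderCrawler):
    crawler_class().update(start=-2000)

# Full synchronization from scratch (takes days; see the note at the top):
# EntitiesCrawler().update(start=0)
```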
Crawler for Categories¶
The European Union has a categorisation system for public contracts, CPVS, which translates a string of 8 digits into a category to be used in public contracts.
More than a flat list of categories, this system is a tree, with broader categories like “agriculture” and more specific ones like “potatoes”.
They provide the fixture as an XML file, which we import:
contracts.categories_crawler.build_categories()¶

Constructs the tree of categories.

Gets the most general categories and saves them, repeating this recursively for more specific categories until it reaches the leaves of the tree.

The categories are retrieved from the internet.
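A rough sketch of that recursion, assuming a Django Category model with a parent foreign key and the XML fixture already parsed into a mapping from a category code to its direct children (all names here are assumptions):

```python
from contracts import models

def build_subtree(parent, children_of):
    """Save the children of `parent`, then recurse into them.

    `children_of` maps a category code (None for the roots) to a list
    of (code, description) pairs directly below that category.
    """
    parent_code = parent.code if parent is not None else None
    for code, description in children_of.get(parent_code, []):
        category = models.Category.objects.create(
            code=code, description=description, parent=parent)
        build_subtree(category, children_of)

# Starting from the most general categories, as build_categories() does:
# build_subtree(None, children_of)
```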