Summary
The HDX Python Scraper Library is designed to enable you to easily develop code that assembles data from one or more tabular sources that can be CSV, XLS, XLSX or JSON. It uses a YAML file that specifies for each source what needs to be read and allows some transformations to be performed on the data. The output is written to JSON, Google Sheets and/or Excel and includes the addition of Humanitarian Exchange Language (HXL) hashtags specified in the YAML file. Custom Python scrapers can also be written that conform to a defined specification, and the framework handles the execution of both configurable and custom scrapers.
Information
This library is part of the Humanitarian Data Exchange (HDX) project. If you have humanitarian-related data, please upload your datasets to HDX.
The code for the library is here. The library has detailed, but currently outdated API documentation which can be found in the menu at the top.
To use the optional functions for outputting data from Pandas to JSON, Excel etc., install with:

```
pip install hdx-python-scraper[pandas]
```
Breaking Changes
From 2.3.0, resource name is used when available instead of creating the name from the url, so tests that use saved data from the `Read` class may break. `file_type` parameters in various `Read` methods are renamed to `format`.
From 2.1.2, Python 3.7 is no longer supported
From 2.0.1, all functions in `outputs.update_tabs` are methods in the new `Writer` class in `utilities.writer`
From 2.0.0, source dates can be formatted and the default format looks like this: "Mar 21, 2022"
From 1.8.9, date handling uses timezone aware dates instead of naive dates and defaults to UTC
From 1.8.7, `FileCopier` -> `ResourceDownloader`, `get_scrapers` calls in `Aggregator`, `ResourceDownloader` and `TimeSeries` -> `Runner.addXXX`, `read_resource` -> `download_resource`, `add_to_run` -> `force_add_to_run`
From 1.8.3, changes to `update_sources` and `update_regional`
From 1.7.5, new `Read` class to use instead of `Retrieve` class
From 1.6.7, retrievers are generated up front
From 1.6.6, configuration fields for output JSON renamed to `additional_inputs`, `output` and `additional_outputs`
From 1.6.0, major renaming of configuration fields, mostly dropping `_cols` eg. `input` instead of `input_cols`
From 1.4.4, significant refactor that adds custom scraper support and a runner class.
Scraper Framework Configuration
A full project showing how the scraper framework is used in a real world scenario is here. It is very helpful to look at that project to see a full working setup that demonstrates usage of many of the features of this library.
Input and Processing Setup
The library is set up broadly as follows:
```python
with temp_dir() as temp_folder:
    today = parse_date("2020-10-01")
    Read.create_readers(
        temp_folder,
        "saved_data",
        temp_folder,
        save,
        use_saved,
        hdx_auth=configuration.get_api_key(),
        header_auths=header_auths,
        basic_auths=basic_auths,
        param_auths=param_auths,
        today=today,
    )
    ...
    Fallbacks.add(json_path)
    runner = Runner(("AFG",), today)
    keys = runner.add_configurables(scraper_configuration, "national")
    education_closures = EducationClosures(
        datasetinfo, today, countries, region
    )
    runner.add_custom(education_closures)
    adminlevel = AdminLevel(configuration, admin_level=2)
    keys = runner.add_configurables(scraper_configuration, "subnational")
    runner.run(prioritise_scrapers=("population_national", "population_subnational"))
    results = runner.get_results()["national"]
    assert results["headers"] == [("header1", "header2", ...), ("#hxltag1", "#hxltag2", ...)]
    assert results["values"] == [{"AFG": 38041754, "PSE": ...}, {"AFG": 123, "PSE": ...}, ...]
    assert results["sources"] == [("#population", "2020-10-01", "World Bank", "https://..."), ...]
```
The framework is configured by passing in a configuration. Typically this will come from a YAML file such as `config/project_configuration.yaml`. To use the current date/time, there is a `now_utc()` function from the dependency HDX Python Utilities.
Read Class
The Read class inherits from Retrieve from the HDX Python Utilities library. The Retrieve class inherits from BaseDownload which specifies a number of standard methods that all downloaders should have: `download_file`, `download_text`, `download_yaml`, `download_json` and `get_tabular_rows`.
The Read class is a utility for reading metadata and data from websites and APIs with the ability to set up authorisations up front with Read objects held in a dictionary for subsequent use. Read objects can save copies of the data being downloaded or read from pre-saved data for tests.
Note that a single Read object cannot be used in parallel: each download operation must be completed before starting the next. For example, you will get an error if you try to call `get_tabular_rows` twice with different urls to get two iterators, then afterwards iterate through those iterators. The first iteration must be completed before obtaining another iterator.
The first parameter of the `create_readers` method is the location of fallback data (if available). The second specifies where data should be saved if desired. The third parameter is the path of a temporary folder. If the downloaded data should be saved, the fourth optional parameter `save` should be True. If a test is being run against pre-saved data, the fifth optional parameter `use_saved` should be True.

Additional readers are generated if any of `header_auths`, `basic_auths` or `extra_params` are populated. `header_auths` and `basic_auths` are dictionaries of the form `{"scraper name": "auth", ...}`. `extra_params` is of the form `{"scraper name": {"key": "auth", ...}, ...}`.
Scrapers that inherit from `BaseScraper` can call the method `get_reader(SCRAPER_NAME)` to obtain the Read object associated with the supplied `SCRAPER_NAME`. If no special authorisations are needed for the website the scraper accesses, then the default Read object is returned.
AdminLevel Class
More about the AdminLevel class can be found in the HDX Python Country library. Briefly, that class accepts a configuration (which is shown below in YAML syntax) as follows:
`country_name_mappings` defines the country name overrides we use (ie. where we deviate from the names in the OCHA countries and territories file):

```yaml
country_name_mappings:
  PSE: "occupied Palestinian territory"
  BOL: "Bolivia"
```

`admin_info` defines the admin level names and pcodes:

```yaml
admin_info:
  - {pcode: AF01, name: Kabul, iso2: AF, iso3: AFG, country: Afghanistan}
  - {pcode: AF02, name: Kapisa, iso2: AF, iso3: AFG, country: Afghanistan}
```

`country_name_mappings` defines mappings from country name to iso3:

```yaml
country_name_mappings:
  "Congo DR": "COD"
  "CAR": "CAF"
  "oPt": "PSE"
```

`admin_name_mappings` defines mappings from admin name to pcode:

```yaml
admin_name_mappings:
  "Nord-Ouest": "HT09"
  "nord-ouest": "HT09"
  ...
```

`adm_name_replacements` defines some find and replaces that are done to improve automatic admin name matching to pcode:

```yaml
adm_name_replacements:
  " urban": ""
  "sud": "south"
  "ouest": "west"
  ...
```

`admin_fuzzy_dont` defines admin names to ignore in fuzzy matching:

```yaml
admin_fuzzy_dont:
  - "nord"
  ...
```
Runner Class
The Runner constructor takes various parameters.
The first parameter is a list of country iso3s.
The second is the datetime you want to use for "today". If you pass None, it will use the current datetime.
The third is an optional object of class ErrorsOnExit from the HDX Python Utilities library. This class collects and outputs errors on exit.
The last optional parameter is a list of scrapers to run.
The method `add_configurables` is used to add scrapers that are configured from a YAML configuration. For subnational levels, an AdminLevel object must be supplied. The method `add_custom` is used to add a custom scraper and `add_customs` for multiple custom scrapers - these are scrapers written in Python that inherit the `BaseScraper` class. If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the parameter `force_add_to_run` should be set to True.

It is possible to add a post run step to a scraper that has been set up using:

```python
runner.add_post_run("SCRAPER_NAME", function_to_call)
```

Scrapers are run using the `run` method and results obtained using `get_results`. The results dictionary has keys for each output level and values which are dictionaries with keys headers, values, sources and fallbacks. Headers is a tuple of (column headers, HXL hashtags). Values, sources and fallbacks are all lists. The following calls show what might be expected in the returned results:
```python
results = runner.get_results()["national"]
assert results["headers"] == [("header1", "header2", ...), ("#hxltag1", "#hxltag2", ...)]
assert results["values"] == [{"AFG": 38041754, "PSE": ...}, {"AFG": 123, "PSE": ...}, ...]
assert results["sources"] == [("#population", "2020-10-01", "World Bank", "https://..."), ...]
```
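To illustrate the shape of these structures, the headers and per-column value dictionaries can be combined into rows with plain Python (simplified, hypothetical data):

```python
# Simplified version of the results structure returned for one level
results = {
    "headers": [["Population", "Tested"], ["#population", "#affected+tested"]],
    "values": [{"AFG": 38041754, "PSE": 4685306}, {"AFG": 123, "PSE": 241}],
}


def results_to_rows(results):
    """Build a header row plus one row per admin from per-column dicts."""
    columns = results["values"]
    # Collect every admin name seen in any column
    countries = sorted(set().union(*(col.keys() for col in columns)))
    return [["Country"] + results["headers"][0]] + [
        [country] + [col.get(country) for col in columns] for country in countries
    ]


rows = results_to_rows(results)
```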
For the HAPI project and other projects for which such output might be useful, there are the calls `get_hapi_metadata` and `get_hapi_results`. `get_hapi_metadata` returns a dictionary that maps from dataset ids to a dictionary. That dictionary has keys for dataset metadata and a key `resources` under which is a dictionary that maps from resource ids to resource metadata. Example output is shown below:
```python
{"3d9b037f-5112-4afd-92a7-190a9082bd80": {"hdx_id": "3d9b037f-5112-4afd-92a7-190a9082bd80",
    "hdx_stub": "cod-ps-eth",
    "provider_code": "522a7e16-3ba7-4649-b327-df81fd6dd689",
    "provider_name": "ocha-ethiopia",
    "reference_period": {"enddate": datetime.datetime(2023, 9, 21, 23, 59, 59, 385291, tzinfo=datetime.timezone.utc),
        "enddate_str": "2023-09-21T23:59:59+00:00",
        "ongoing": True,
        "startdate": datetime.datetime(2022, 1, 5, 0, 0, tzinfo=datetime.timezone.utc),
        "startdate_str": "2022-01-05T00:00:00+00:00"},
    "resources": {"bfb57304-3e22-498f-8a82-a345a8976852": {"download_url": "https://data.humdata.org/dataset/3d9b037f-5112-4afd-92a7-190a9082bd80/resource/bfb57304-3e22-498f-8a82-a345a8976852/download/eth_admpop_adm2_2022_v2.csv",
        "filename": "eth_admpop_adm2_2022_v2.csv",
        "format": "CSV",
        "hdx_id": "bfb57304-3e22-498f-8a82-a345a8976852",
        "update_date": datetime.datetime(2022, 8, 4, 18, 15, 44, tzinfo=datetime.timezone.utc)}},
    "title": "Ethiopia - Subnational Population Statistics"},...}
```
`get_hapi_results` returns a dictionary where key is HDX dataset id and value is a dictionary that has HAPI dataset metadata as well as a results key. The value associated with the results key is a dictionary where each key is an admin level. Each admin level key has a value dictionary with headers, values and HAPI resource metadata. Headers is a tuple of (column headers, HXL hashtags). Values is a list. HAPI resource metadata is a dictionary. Example output is shown below:
```python
{"8520e386-9263-48c9-b1bf-b2349e019fbb": {"hdx_id": "8520e386-9263-48c9-b1bf-b2349e019fbb",
    "hdx_stub": "cod-ps-col",
    "provider_code": "95aa8d05-b110-4607-9330-f2a779885493",
    "provider_name": "unfpa",
    "reference_period": {"enddate": datetime.datetime(2023, 9, 21, 23, 59, 59, 385291, tzinfo=datetime.timezone.utc),
        "enddate_str": "2023-09-21T23:59:59+00:00",
        "ongoing": True,
        "startdate": datetime.datetime(2023, 8, 8, 0, 0, tzinfo=datetime.timezone.utc),
        "startdate_str": "2023-08-08T00:00:00+00:00"},
    "results": {"adminone": {"hapi_resource_metadata": {"download_url": "https://data.humdata.org/dataset/8520e386-9263-48c9-b1bf-b2349e019fbb/resource/e8f7fb08-af9c-4bdf-8a49-a54c56a4a1b0/download/col_admpop_adm1_2023.csv",
            "filename": "col_admpop_adm1_2023.csv",
            "format": "CSV",
            "hdx_id": "e8f7fb08-af9c-4bdf-8a49-a54c56a4a1b0",
            "update_date": datetime.datetime(2023, 8, 8, 19, 57, 17, tzinfo=datetime.timezone.utc)},
        "headers": (["T_TL",
                "T_100Plus"],
            ["#population+total",
                "#population+age_100_plus+total"]),
        "values": ({"CO05": "6994792",
                "CO08": "2835509",
                ...},
            {"CO05": "3612147",
                "CO08": "1452742",
                ...},...)},
        "admintwo": {"hapi_resource_metadata": {"download_url": "https://data.humdata.org/dataset/8520e386-9263-48c9-b1bf-b2349e019fbb/resource/76e12f52-af0d-45b2-8024-e6b0e63913c4/download/col_admpop_adm2_2023.csv",
            "filename": "col_admpop_adm2_2023.csv",
            "format": "CSV",
            "hdx_id": "76e12f52-af0d-45b2-8024-e6b0e63913c4",
            "update_date": datetime.datetime(2023, 8, 8, 19, 57, 19, tzinfo=datetime.timezone.utc)},
        "headers": (["T_TL",
                "T_100Plus"],
            ["#population+total",
                "#population+age_100_plus+total"]),
        "values": ({"CO05001": "2653729",
                "CO05002": "21246",
                ...},
            {"CO05001": "1400979",
                "CO05002": "10112",
                ...}, ...)}},
    "title": "Colombia - Subnational Population Statistics"}}
```
Fallbacks
Fallbacks can be defined which are used for example when there is a network issue. This is done using the `Fallbacks.add()` call. This can only be done if there is a JSON output defined. `Fallbacks.add()` takes a few parameters.

The first parameter is a path to the output JSON.

The second optional parameter is a mapping from level name to key in the JSON. The default is:

```python
{
    "global": "global_data",
    "regional": "regional_data",
    "national": "national_data",
    "subnational": "subnational_data",
}
```

The third optional parameter specifies the key where the sources can be found, defaulting to `sources`.

The fourth parameter, also optional, specifies a mapping from level to admin name. The default is:

```python
{
    "global": "value",
    "regional": "#region+name",
    "national": "#country+code",
    "subnational": "#adm1+code",
}
```
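As a sketch of how the default level-to-key mapping relates to the output JSON (plain Python with hypothetical file content, not the library's actual Fallbacks implementation):

```python
import json
import os
import tempfile

# Default mapping from level name to key in the output JSON
LEVEL_TO_KEY = {
    "global": "global_data",
    "regional": "regional_data",
    "national": "national_data",
    "subnational": "subnational_data",
}


def load_fallback_level(json_path, level):
    """Read a previously saved output JSON and return the rows for one level."""
    with open(json_path) as fp:
        saved = json.load(fp)
    return saved[LEVEL_TO_KEY[level]]


# Demonstrate with a small, hypothetical saved-output file
path = os.path.join(tempfile.mkdtemp(), "output.json")
with open(path, "w") as fp:
    json.dump({"national_data": [{"#country+code": "AFG", "#population": 38041754}]}, fp)
national = load_fallback_level(path, "national")
```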
Scrapers
Custom Scrapers
It is possible to define custom scrapers written in Python which must inherit `BaseScraper`, calling its constructor and providing a `run` method. Other methods where a default implementation has been provided can be overridden such as `add_sources` and `add_population`. There are also two hooks for running steps at particular points: `run_after_fallbacks` is executed after fallbacks are used and `post_run` is executed after running, whether or not fallbacks were used.
The structure is broadly as follows:
```python
class MyScraper(BaseScraper):
    def __init__(
        self, datasetinfo: Dict, today, countryiso3s, downloader
    ):
        super().__init__(
            "scraper_name",
            datasetinfo,
            {
                "national": (("Header1",), ("#hxltag1",)),
                "regional": (("Header1",), ("#hxltag1",)),
            },
            source_configuration=Sources.create_source_configuration(
                admin_sources=True
            ),
        )
        self.today = today
        self.countryiso3s = countryiso3s
        self.downloader = downloader

    def run(self) -> None:
        headers, iterator = read(
            self.downloader, self.datasetinfo
        )
        output_national = self.get_values("national")[0]
        ...
```
As can be seen above, headers take the form of a mapping from a level such as "national" to a tuple of column headers and HXL hashtags. Values are populated by the scraper as it runs and are of the form below where each dictionary would represent one column in the output:

```python
{"national": ({"AFG": 1.2, "PSE": 1.4}, {"AFG": 123, "PSE": 241}, ...)}
```

Sources are also populated and take the form below where each tuple includes the source HXL hashtag, source date, source and source url:

```python
{"national": [("#food-prices", "2022-07-15", "WFP", "https://data.humdata.org/dataset/global-wfp-food-prices"), ...]}
```
In the above code, a source configuration is supplied with `admin_sources` set to True, so sources are output per admin unit (eg. per country) - in this case, the admin unit is added as an attribute to the HXL tag (eg. a country iso3 code like +AFG).

The code earlier would go on to populate the dictionary `output_national` which is one dictionary in values representing one column. It is a mapping from national admin names (ie. countries) to values. `output_regional` would also be populated. It is a mapping from regions to values. In this case, since national and regional each have only one header and HXL hashtag, there is only one dictionary to populate for each.
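As a sketch (with hypothetical rows and column names), populating such per-column dictionaries inside a scraper's run might look like:

```python
# Each output column is a dict mapping admin name to value
output_national = {}
output_regional = {}

# Hypothetical rows, standing in for what an iterator over the source returns
rows = [
    {"iso3": "AFG", "region": "ROAP", "indicator": 1.2},
    {"iso3": "PSE", "region": "ROMENA", "indicator": 1.4},
]
for row in rows:
    # National level: one value per country iso3
    output_national[row["iso3"]] = row["indicator"]
    # Regional level: here we keep the maximum value seen per region
    region = row["region"]
    existing = output_regional.get(region)
    if existing is None or row["indicator"] > existing:
        output_regional[region] = row["indicator"]
```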
An example of a custom scraper can be seen here.
An example of overriding `add_sources` to customise the source information that is output is as follows:
```python
def add_sources(self) -> None:
    reader = self.get_reader()
    hxltags = self.get_headers("national")[1]
    datasetinfo = None
    for countryiso3 in self.countryiso3s:
        countryname = Country.get_country_name_from_iso3(countryiso3).lower()
        datasetinfo = {
            "dataset": f"fts-requirements-and-funding-data-for-{countryname}",
            "source": self.datasetinfo["source"],
            "format": "csv",
        }
        reader.read_hdx_metadata(datasetinfo)
        self.add_hxltag_sources(
            hxltags, datasetinfo=datasetinfo, suffix_attributes=(countryiso3,)
        )
    self.datasetinfo["source_date"] = datasetinfo["source_date"]
```
In this example, there is a source per country. In other words, instead of a source for "#value+funding+total+usd", there are multiple sources "#value+funding+total+usd+COUNTRYISO3" eg. "#value+funding+total+usd+eth". The source information comes from datasets of the form "fts-requirements-and-funding-data-for-{countryname}".
Configurable Scrapers
Configurable scrapers take their configuration from a dictionary usually provided by reading from a YAML file. They use that information to work out headers, values and sources which can be later used to populate an output such as a Google Sheet.
`scraper_tabname` in the configuration YAML defines a set of configurable scrapers that use the framework and produce data for the tab `tabname`, which typically corresponds to a level like national or subnational eg.

```yaml
scraper_national:
  ...
```
It is helpful to look at a few example configurable scrapers to see how they are configured:
The economicindex configurable scraper reads the dataset “covid-19-economic-exposure-index” on HDX, taking from it the dataset source and time period and using the url of the dataset in HDX as the source url. (In HDX data explorers, these are used by the DATA links.) The scraper framework finds the first resource that is of format `xlsx`, reads the “economic exposure” sheet and looks for the headers in row 1 (by default). Note that it is possible to specify a specific resource name using the key `resource` instead of searching for the first resource of a particular format.
`admin` defines the column or columns in which to look for admin information. As this is a national level scraper, it uses the "Country" column. `input` specifies the column(s) to use for values, in this case “Covid 19 Economic exposure index”. `output` and `output_hxl` define the header name(s) and HXL tag(s) to use for the `input`:
```yaml
economicindex:
  dataset: "covid-19-economic-exposure-index"
  format: "xlsx"
  sheet: "economic exposure"
  admin:
    - "Country"
  input:
    - "Covid 19 Economic exposure index"
  output:
    - "EconomicExposure"
  output_hxl:
    - "#severity+economic+num"
```
The casualties configurable scraper reads from a file that has data for only one admin unit which is specified using `admin_single`. The latest row by date is obtained by specifying `date` and `date_type` (which can be date, year or int):
```yaml
casualties:
  source: "OHCHR"
  dataset: "ukraine-key-figures-2022"
  format: "csv"
  headers: 2
  date: "Date"
  date_type: "date"
  admin_single: "UKR"
  input:
    - "Civilian casualities(OHCHR) - Killed"
    - "Civilian casualities(OHCHR) - Injured"
  output:
    - "CiviliansKilled"
    - "CiviliansInjured"
  output_hxl:
    - "#affected+killed"
    - "#affected+injured"
```
The population configurable scraper configuration directly provides metadata for source, source_url and the download location given by url, only taking the source date from the dataset. The scraper pulls subnational data so admin defines both a country column `alpha_3` and an admin 1 pcode column `ADM1_PCODE`. Running this scraper will result in `population_lookup` in the `BaseScraper` being populated with key value pairs. `transform` defines operations to be performed on each value in the column. In this case, the value is converted to either an int or float if it is possible.
```yaml
population:
  source: "Multiple Sources"
  source_url: "https://data.humdata.org/search?organization=worldpop&q=%22population%20counts%22"
  dataset: "global-humanitarian-response-plan-covid-19-administrative-boundaries-and-population-statistics"
  url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vS3_uBOV_uRDSxOggfBus_pkCs6Iw9lm0nAxzwG14YF_frpm13WPKiM1oNnQ9zrUA/pub?gid=1565793974&single=true&output=csv"
  format: "csv"
  admin:
    - "alpha_3"
    - "ADM1_PCODE"
  input:
    - "POPULATION"
  transform:
    "POPULATION": "get_numeric_if_possible(POPULATION)"
  output:
    - "Population"
  output_hxl:
    - "#population"
```
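`get_numeric_if_possible` comes from HDX Python Utilities; a rough sketch of its behaviour is below (an approximation only - the real function also handles cases like thousands separators and percentages):

```python
def get_numeric_if_possible_sketch(value):
    """Return an int or float if the value parses as a number, else return it unchanged."""
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            try:
                return float(value)
            except ValueError:
                return value
    return value
```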
The travel configurable scraper reads values from the “info” and “published” columns of the source. `input_append` defines any columns where, if the same admin appears again (in this case, the same country), that data is appended to the existing value. `input_keep` defines any columns where, if the same admin appears again, the existing value is kept rather than replaced.
```yaml
travel:
  dataset: "covid-19-global-travel-restrictions-and-airline-information"
  format: "csv"
  admin:
    - "iso3"
  input:
    - "info"
    - "published"
  input_append:
    - "info"
  input_keep:
    - "published"
  output:
    - "TravelRestrictions"
    - "TravelRestrictionsPublished"
  output_hxl:
    - "#severity+travel"
    - "#severity+date+travel"
```
The operational presence scraper reads organisation and sector information. `list` defines columns for which a list of values corresponds to an admin unit. What is returned per admin unit is a list rather than just a single value such as a number or string.
```yaml
operational_presence_afg:
  dataset: "afghanistan-who-does-what-where-january-to-march-2023"
  resource: "afghanistan-3w-operational-presence-january-march-2023.csv"
  format: "csv"
  headers: 1
  use_hxl: True
  admin:
    - ~
    - "#adm2+code"
  admin_exact: True
  input:
    - "#org +name"
    - "#org +acronym"
    - "#org +type +name"
    - "#sector +cluster +name"
  list:
    - "#org +name"
    - "#org +acronym"
    - "#org +type +name"
    - "#sector +cluster +name"
  output:
    - "org_name"
    - "org_acronym"
    - "org_type_name"
    - "sector_name"
  output_hxl:
    - "#org+name"
    - "#org+acronym"
    - "#org+type+name"
    - "#sector+name"
```
The gam configurable scraper reads from a spreadsheet that has a multiline header (headers defined as rows 3 and 4). Experimentation is often needed with row numbers since, in my experience, they are sometimes offset from the real row numbers seen when opening the spreadsheet. `date` defines a column that contains date information and `date_type` specifies in what form the date information is held eg. as a date, a year or an int. For each admin, the scraper will obtain the data (in this case the “National Point Estimate”) for the latest date up to the current date (unless `ignore_future_date` is set to False, in which case future dates will be allowed).
```yaml
gam:
  dataset: "world-global-expanded-database-on-severe-wasting"
  format: "xlsx"
  sheet: "Trend"
  headers:
    - 3
    - 4
  admin:
    - "ISO"
  date: "Year*"
  date_type: "year"
  input:
    - "National Point Estimate"
  output:
    - "Malnutrition Estimate"
  output_hxl:
    - "#severity+malnutrition+num+national"
```
The covidtests configurable scraper gets “new_tests” and “new_tests_per_thousand” for the latest date where a `date_condition` is satisfied, which is that “new_tests” is a value greater than zero. Here, the default sheet of 1 and the default headers row of 1 are assumed. These defaults apply for both xls and xlsx.
```yaml
covidtests:
  dataset: "total-covid-19-tests-performed-by-country"
  format: "xlsx"
  date: "date"
  date_type: "date"
  date_condition: "new_tests is not None and new_tests > 0"
  admin:
    - "iso_code"
  input:
    - "new_tests"
    - "new_tests_per_thousand"
  output:
    - "New Tests"
    - "New Tests Per Thousand"
  output_hxl:
    - "#affected+tested"
    - "#affected+tested+per1000"
```
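Conceptually, a `date_condition` is a Python expression evaluated against each row, with column names available as variables; a minimal sketch (not the library's actual evaluation code) with hypothetical rows:

```python
def passes_condition(row, condition):
    """Evaluate a condition expression with the row's column names as variables."""
    return eval(condition, {}, dict(row))


# Hypothetical rows as read from the source
row_ok = {"new_tests": 5, "new_tests_per_thousand": 0.1}
row_bad = {"new_tests": None, "new_tests_per_thousand": None}
condition = "new_tests is not None and new_tests > 0"
```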
The oxcgrt configurable scraper reads from a data source that has HXL tags and these can be used instead of the header names provided `use_hxl` is set to True. By default all the HXLated columns are read, with the admin related ones inferred and the rest taken as values, except if defined as a `date`.
```yaml
oxcgrt:
  dataset: "oxford-covid-19-government-response-tracker"
  format: "csv"
  use_hxl: True
  date: "#date"
  date_type: "date"
```
In the imperial configurable scraper, `output` and `output_hxl` are defined which specify which columns and HXL tags in the HXLated file should be used rather than using all HXLated columns.
```yaml
imperial:
  dataset: "imperial-college-covid-19-projections"
  format: "xlsx"
  use_hxl: True
  output:
    - "Imp: Total Cases(min)"
    - "Imp: Total Cases(max)"
    - "Imp: Total Deaths(min)"
    - "Imp: Total Deaths(max)"
  output_hxl:
    - "#affected+infected+min+imperial"
    - "#affected+infected+max+imperial"
    - "#affected+killed+min+imperial"
    - "#affected+killed+max+imperial"
```
The idmc configurable scraper reads 2 HXLated columns defined in `input`. In `transform`, a cast to int is performed if the value is not None, otherwise it is set to 0. `process` defines new column(s) that can be combinations of the other columns in `input`. In this case, `process` specifies a new column which sums the 2 columns in `input`. That new column is given a header and a HXL tag (in `output` and `output_hxl`).
```yaml
idmc:
  dataset: "idmc-internally-displaced-persons-idps"
  format: "csv"
  use_hxl: True
  date: "#date+year"
  date_type: "year"
  input:
    - "#affected+idps+ind+stock+conflict"
    - "#affected+idps+ind+stock+disaster"
  transform:
    "#affected+idps+ind+stock+conflict": "int(#affected+idps+ind+stock+conflict) if #affected+idps+ind+stock+conflict else 0"
    "#affected+idps+ind+stock+disaster": "int(#affected+idps+ind+stock+disaster) if #affected+idps+ind+stock+disaster else 0"
  process:
    - "#affected+idps+ind+stock+conflict + #affected+idps+ind+stock+disaster"
  output:
    - "TotalIDPs"
  output_hxl:
    - "#affected+displaced"
```
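Conceptually, the `transform` and `process` steps above amount to the following plain Python for a single row (a sketch with a hypothetical row; the HXL-tagged columns are accessed via the row dictionary since the tags are not valid Python identifiers):

```python
# Hypothetical row: conflict figure present as a string, disaster figure missing
row = {
    "#affected+idps+ind+stock+conflict": "1000",
    "#affected+idps+ind+stock+disaster": None,
}


def apply_transform(value):
    # mirrors the transform expression: int(col) if col else 0
    return int(value) if value else 0


conflict = apply_transform(row["#affected+idps+ind+stock+conflict"])
disaster = apply_transform(row["#affected+idps+ind+stock+disaster"])
# the process formula sums the two transformed inputs into the new column
total_idps = conflict + disaster
```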
The needs configurable scraper takes data for the latest available date for each country. `subsets` allows the definition of multiple indicators by way of filters. A filter is defined for each indicator (in this case there is one) which contains one or more conditions in Python syntax. Column names can be used directly; any that are not already specified in `input` or `date` should be put as a list under the key `filter_cols`.
```yaml
needs:
  dataset: "global-humanitarian-overview-2020-figures"
  format: "xlsx"
  sheet: "Raw Data"
  headers: 1
  admin:
    - "Country Code"
  date: "Year"
  date_type: "year"
  filter_cols:
    - "Metric"
    - "PiN Value for Dataviz"
  subsets:
    - filter: "Metric == 'People in need' and PiN Value for Dataviz == 'yes'"
      input:
        - "Value"
      output:
        - "PeopleInNeed"
      output_hxl:
        - "#affected+inneed"
```
The population configurable scraper matches country code only using exact matching (`admin_exact` is set to True) rather than the default, which tries fuzzy matching in the event of a failure to match exactly. This is useful when fuzzy matching produces false positives; since that is very rare, it is usually not needed. The scraper populates the `population_lookup` dictionary of `BaseScraper`.
```yaml
population:
  source: "World Bank"
  source_url: "https://data.humdata.org/organization/world-bank-group"
  url: "http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=excel&dataformat=list"
  format: "xls"
  sheet: "Data"
  headers: 3
  admin:
    - "Country Code"
  admin_exact: True
  date: "Year"
  date_type: "year"
  date_condition: "Value is not None"
  input:
    - "Value"
  output:
    - "Population"
  output_hxl:
    - "#population"
```
The configurable scraper below, which is intended to produce global data, pulls out only the "WLD" value from the input data. It populates the `population_lookup` dictionary of `BaseScraper` using the key `global` taken from `population_key`, which must be defined at the top level, not in a subset.
```yaml
source: "World Bank"
source_url: "https://data.humdata.org/organization/world-bank-group"
url: "tests/fixtures/API_SP.POP.TOTL_DS2_en_excel_v2_1302508_LIST.xls"
format: "xls"
sheet: "Data"
headers: 3
sort:
  keys:
    - "Year"
admin:
  - "Country Code"
admin_filter:
  - "WLD"
date: "Year"
date_type: "year"
input:
  - "Value"
population_key: "global"
output:
  - "Population"
output_hxl:
  - "#population"
```
The covid tests configurable scraper applies a `prefilter` to the data so that it only processes rows where the value in the column "new_tests" is not None and is greater than zero. If "new_tests" were not specified in `input` or `date`, then it would need to be under a key `filter_cols`.
```yaml
covidtests:
  source: "Our World in Data"
  dataset: "total-covid-19-tests-performed-by-country"
  url: "tests/fixtures/owid-covid-data.xlsx"
  format: "xlsx"
  prefilter: "new_tests is not None and new_tests > 0"
  ...
```
The sadd configurable scraper reads data from the dataset “covid-19-sex-disaggregated-data-tracker”. It filters that data using data from another file, the url of which is defined in `external_filter`. Specifically, it cuts down the sadd data to only include countries listed in the “#country+code+v_iso2” column of the `external_filter` file.
```yaml
sadd:
  dataset: "covid-19-sex-disaggregated-data-tracker"
  format: "csv"
  external_filter:
    url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vR9PhPG7-aH0EkaBGzXYlrO9252gqs-UuKIeDQr9D3pOLBOdQ_AoSwWi21msHsdyT7thnjuhSY6ykSX/pub?gid=434885896&single=true&output=csv"
    hxl:
      - "#country+code+v_iso2"
  ...
```
The fsnwg configurable scraper first applies a sort to the data it reads. The reverse sort is based on the keys “reference_year” and “reference_code”. `admin` defines a country column "adm0_pcod3" and three admin 1 level columns (“adm1_pcod2”, “adm1_pcod3”, “adm1_name”) which are examined consecutively until a match with the internal admin 1 is made.

`date` is comprised of the amalgamation of two columns, “reference_year” and “reference_code” (corresponding to the two columns which were used for sorting earlier). `sum` is used to sum values in a column. For example, the formula `get_fraction_str(phase3, population)` takes the sum of all phase 3 values for an admin 1 and divides it by the sum of all population values for that admin 1. `mustbepopulated` determines whether values are included or not and is by default False. If it is True, then a row's value is only included when all columns in `input` for that row are populated. This means that when using multiple summed columns together in a formula, the number of values that were summed in each column will be the same. The last formula uses “#population” which is replaced by the population for the admin unit (which is taken from the `population_lookup` variable of `BaseScraper`).
```yaml
fsnwg:
  dataset: "cadre-harmonise"
  format: "xlsx"
  sort:
    reverse: True
    keys:
      - "reference_year"
      - "reference_code"
  admin:
    - "adm0_pcod3"
    - - "adm1_pcod2"
      - "adm1_pcod3"
      - "adm1_name"
  date:
    - "reference_year"
    - "reference_code"
  date_type: "int"
  filter_cols:
    - "chtype"
  subsets:
    - filter: "chtype == 'current'"
      input:
        - "phase3"
        - "phase4"
        - "phase5"
        - "phase35"
        - "population"
      transform:
        "phase3": "float(phase3)"
        "phase4": "float(phase4)"
        "phase5": "float(phase5)"
        "phase35": "float(phase35)"
        "population": "float(population)"
      sum:
        - formula: "get_fraction_str(phase3, population)"
          mustbepopulated: True
        - formula: "get_fraction_str(phase4, population)"
          mustbepopulated: True
        - formula: "get_fraction_str(phase5, population)"
          mustbepopulated: True
        - formula: "get_fraction_str(phase35, population)"
          mustbepopulated: True
        - formula: "get_fraction_str(population, #population)"
          mustbepopulated: True
      output:
        - "FoodInsecurityCHP3"
        - "FoodInsecurityCHP4"
        - "FoodInsecurityCHP5"
        - "FoodInsecurityCHP3+"
        - "FoodInsecurityCHAnalysed"
      output_hxl:
        - "#affected+ch+food+p3+pct"
        - "#affected+ch+food+p4+pct"
        - "#affected+ch+food+p5+pct"
        - "#affected+ch+food+p3plus+pct"
        - "#affected+ch+food+analysed+pct"
```
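To illustrate how a summed formula with `mustbepopulated: True` behaves for one admin 1 unit, here is a sketch with hypothetical rows (`get_fraction_str_sketch` only roughly mirrors `get_fraction_str` from HDX Python Utilities; the real function's formatting may differ):

```python
def get_fraction_str_sketch(numerator, denominator):
    """Roughly mirrors get_fraction_str: numerator / denominator as a string."""
    if denominator:
        return f"{numerator / denominator:.4f}"
    return ""


# Hypothetical transformed rows for one admin 1 unit
rows = [
    {"phase3": 100.0, "population": 1000.0},
    {"phase3": None, "population": 500.0},  # skipped: phase3 missing
    {"phase3": 50.0, "population": 500.0},
]
# mustbepopulated: only rows where every input column is populated count,
# so both columns sum over the same set of rows
populated = [r for r in rows if all(v is not None for v in r.values())]
phase3_sum = sum(r["phase3"] for r in populated)
population_sum = sum(r["population"] for r in populated)
result = get_fraction_str_sketch(phase3_sum, population_sum)
```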
The who_subnational configurable scraper defines two values in `input_ignore_vals` which, if found, are ignored. Since `mustbepopulated` is True, a row's value is only included in the sum of any column used in `sum` when all columns in `input` for that row are populated and do not contain either “-2222” or “-4444”.
```yaml
who_subnational:
  source: "WHO"
  url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vRfjaIXE1hvEIXD66g6cuCbPrGdZkx6vLIgXO_znVbjQ-OgwfaI1kJPhxhgjw2Yg08CmtBuMLAZkTnu/pub?gid=337443769&single=true&output=csv"
  format: "csv"
  admin:
    - "iso"
    - "Admin1"
  date: "Year"
  date_type: "year"
  filter_cols:
    - "Vaccine"
  subsets:
    - filter: "Vaccine == 'HepB1'"
      input:
        - "Numerator"
        - "Denominator"
      input_ignore_vals:
        - "-2222"
        - "-4444"
      transform:
        Numerator: "float(Numerator)"
        Denominator: "float(Denominator)"
      sum:
        - formula: "get_fraction_str(Numerator, Denominator)"
          mustbepopulated: True
      output:
        - "HepB1 Coverage"
      output_hxl:
        - "#population+hepb1+pct+vaccinated"
```
The access configurable scraper provides different sources for each HXL tag by providing dictionaries instead of strings in `source` and `source_url`. It maps specific HXL tags by key to sources or falls back on a “default_source” and “default_url” for all unspecified HXL tags.
access:
source:
"#access+visas+pct": "OCHA"
"#access+travel+pct": "OCHA"
"#event+year+previous+num": "Aid Workers Database"
"#event+year+todate+num": "Aid Workers Database"
"#event+year+previous+todate+num": "Aid Workers Database"
"#activity+cerf+project+insecurity+pct": "OCHA"
"#activity+cbpf+project+insecurity+pct": "OCHA"
"#population+education": "UNESCO"
"default_source": "Multiple sources"
source_url:
"#event+year+previous+num": "https://data.humdata.org/dataset/security-incidents-on-aid-workers"
"#event+year+todate+num": "https://data.humdata.org/dataset/security-incidents-on-aid-workers"
"#event+year+previous+todate+num": "https://data.humdata.org/dataset/security-incidents-on-aid-workers"
"default_url": "https://docs.google.com/spreadsheets/d/e/2PACX-1vRSzJzuyVt9i_mkRQ2HbxrUl2Lx2VIhkTHQM-laE8NyhQTy70zQTCuFS3PXbhZGAt1l2bkoA4_dAoAP/pub?gid=1565063847&single=true&output=csv"
url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vRSzJzuyVt9i_mkRQ2HbxrUl2Lx2VIhkTHQM-laE8NyhQTy70zQTCuFS3PXbhZGAt1l2bkoA4_dAoAP/pub?gid=1565063847&single=true&output=csv"
format: "csv"
use_hxl: True
transform:
"#access+visas+pct": "get_numeric_if_possible(#access+visas+pct)"
"#access+travel+pct": "get_numeric_if_possible(#access+travel+pct)"
"#activity+cerf+project+insecurity+pct": "get_numeric_if_possible(#activity+cerf+project+insecurity+pct)"
"#activity+cbpf+project+insecurity+pct": "get_numeric_if_possible(#activity+cbpf+project+insecurity+pct)"
"#population+education": "get_numeric_if_possible(#population+education)"
The field admin_filter allows overriding the country ISO 3 codes and admin 1 pcodes for the specific configurable scraper so that only those specified in admin_filter are used:
gam:
source: "UNICEF"
url: "tests/fixtures/unicef_who_wb_global_expanded_databases_severe_wasting.xlsx"
format: "xlsx"
sheet: "Trend"
headers:
- 3
- 4
flatten:
- original: "Region {{1}} Region Name"
new: "Region Name"
- original: "Region {{1}} Point Estimate"
new: "Region Point Estimate"
admin:
- "ISO"
- "Region Name"
admin_filter:
- ["AFG"]
- ["AF09", "AF24"]
date: "Year*"
date_type: "year"
input:
- "Region Point Estimate"
output:
- "Malnutrition Estimate"
output_hxl:
- "#severity+malnutrition+num+subnational"
The date_level field enables reading of data containing dates where the latest date must be used at a particular level, such as per country or per admin 1. For example, we might want to produce a global value by summing the latest available value per country as shown below. We have specified a date and sum. The library will apply a sort by date (since we have not specified one) to ensure correct results. This would also happen if we had used process or append_cols.
ourworldindata:
source: "Our World in Data"
url: "tests/fixtures/ourworldindata_vaccinedoses.csv"
format: "csv"
use_hxl: True
admin:
- "#country+code"
date: "#date"
date_type: "date"
date_level: "national"
input:
- "#total+vaccinations"
sum:
- formula: "number_format(#total+vaccinations, format='%.0f')"
output:
- "TotalDosesAdministered"
output_hxl:
- "#capacity+doses+administered+total"
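To illustrate (a simplified sketch, not the library's code, with made-up values), taking the latest value per country and then summing amounts to:

```python
rows = [
    {"#country+code": "AFG", "#date": "2021-06-01", "#total+vaccinations": 100},
    {"#country+code": "AFG", "#date": "2021-07-01", "#total+vaccinations": 150},
    {"#country+code": "KEN", "#date": "2021-06-15", "#total+vaccinations": 80},
]
# Sort by date so later rows overwrite earlier ones per country
latest = {}
for row in sorted(rows, key=lambda r: r["#date"]):
    latest[row["#country+code"]] = row
total = sum(r["#total+vaccinations"] for r in latest.values())
# total == 230 (150 for AFG + 80 for KEN)
```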
An example of combining admin_filter, date and date_level is getting the latest global value, in which global data has been treated as a special case of national data by using "OWID_WRL" in the "#country+code" field:
ourworldindata:
source: "Our World in Data"
url: "tests/fixtures/ourworldindata_vaccinedoses.csv"
format: "csv"
use_hxl: True
admin:
- "#country+code"
admin_filter:
- "OWID_WRL"
date: "#date"
date_type: "date"
date_level: "national"
input:
- "#total+vaccinations"
output:
- "TotalDosesAdministered"
output_hxl:
- "#capacity+doses+administered+total"
This filtering for "OWID_WRL" can be more simply achieved by using a prefilter as in the example below. This configurable scraper also outputs a calculated column using process. That column is evaluated using population data from population_lookup of BaseScraper using the key global taken from population_key, which must be defined in the subset, not at the top level.
ourworldindata:
source: "Our World in Data"
url: "tests/fixtures/ourworldindata_vaccinedoses.csv"
format: "csv"
use_hxl: True
prefilter: "#country+code == 'OWID_WRL'"
date: "#date"
date_type: "date"
filter_cols:
- "#country+code"
input:
- "#total+vaccinations"
population_key: "global"
process:
- "#total+vaccinations"
- "number_format((#total+vaccinations / 2) / #population)"
output:
- "TotalDosesAdministered"
- "PopulationCoverageAdministeredDoses"
output_hxl:
- "#capacity+doses+administered+total"
- "#capacity+doses+administered+coverage+pct"
If columns need to be summed and the latest date chosen overall, not per admin unit, then we can specify single_maxdate as shown below. Also in this example, the source information for CBPF is taken from a different dataset to CERF even though the data url remains the same:
cerf_global:
dataset:
"#value+cbpf+funding+total+usd": "cbpf-allocations-and-contributions"
...
default_dataset: "cerf-covid-19-allocations"
url: "tests/fixtures/full_pfmb_allocations.csv"
format: "csv"
force_date_today: True
headers: 1
date: "AllocationYear"
date_type: "year"
single_maxdate: True
filter_cols:
- "FundType"
- "GenderMarker"
subsets:
...
- filter: "FundType == 'CBPF' and GenderMarker == '0'"
input:
- "Budget"
transform:
Budget: "float(Budget)"
sum:
- formula: "Budget"
output:
- "CBPFFundingGM0"
output_hxl:
- "#value+cbpf+funding+gm0+total+usd"
...
In the example below, should_overwrite_sources ensures that the sources generated by this configurable scraper overwrite any sources generated previously with the same HXL hashtags.
idps_ethiopia:
dataset: "ethiopia-drought-induced-displacement"
url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vRppQx8JTKkKRCKmzfnCMmTFEcvCpkbP9PdHs1sQTUyacmbsx8tlAXpgBLFce-lcehukreGGuXjA_4S/pub?gid=961087049&single=true&output=csv"
format: "csv"
use_hxl: True
should_overwrite_sources: True
Population Data
Population data is treated as a special class of data. By default, configurable and custom scrapers detect population data by looking for the output HXL hashtag #population and add it to a dictionary population_lookup that is a variable of the BaseScraper class and hence accessible to all scrapers. For most data, the admin names will be taken from the data. For top level data, the admin name is taken from the level name unless population_key is defined, in which case the value in there will be used instead.
For configurable scrapers where columns are evaluated (rather than assigned), it is possible to use #population and the appropriate population value for the administrative unit will be substituted automatically. Where output is a single value, for example when working on global data, population_key must be specified both for any configurable scraper that outputs a single population value and for any configurable scraper that needs that single population value. population_key defines what key will be used with population_lookup.
When using subsets, the rule is: when reading from population_lookup, define population_key in the subset; when writing to it, define population_key at the top level of the configuration.
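A rough sketch of this substitution (hypothetical code with illustrative numbers, not the library's implementation):

```python
population_lookup = {"AFG": 40000000, "global": 8000000000}

def evaluate(formula, adm, values):
    """Substitute the admin unit's population for #population, then the
    input column values, and evaluate the resulting expression."""
    expression = formula.replace("#population", str(population_lookup[adm]))
    for header, value in values.items():
        expression = expression.replace(header, str(value))
    return eval(expression)  # the real library evaluates expressions more carefully

coverage = evaluate("doses / #population", "AFG", {"doses": 4000000})
# coverage == 0.1
```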
Output Specification and Setup
The YAML configuration should have a mapping from the internal dictionaries to the tabs in the spreadsheet / keys in the JSON output file(s):
tabs:
world: "WorldData"
regional: "RegionalData"
national: "NationalData"
subnational: "SubnationalData"
covid_series: "CovidSeries"
covid_trend: "CovidTrend"
sources: "Sources"
Then the location of Google spreadsheets are defined, for prod (production), test and scratch:
googlesheets:
prod: "https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY_PROD/edit"
test: "https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY_TEST/edit"
scratch: "https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY_SCRATCH/edit"
The json outputs are then specified:
json:
additional_inputs:
- name: "Other"
source: "Some org"
source_url: "https://data.humdata.org/organization/world-bank-group"
format: "json"
url: "https://raw.githubusercontent.com/mcarans/hdx-python-scraper/master/tests/fixtures/additional_json.json"
jsonpath: "[*]"
output: "test_tabular_all.json"
additional_outputs:
- filepath: "test_tabular_population.json"
tabs:
- tab: "national"
key: "cumulative"
filters:
"#country+code": "{{countries_to_save}}"
output:
- "#country+code"
- "#country+name"
- "#population"
- filepath: "test_tabular_population_2.json"
tabs:
- tab: "national"
key: "cumulative"
filters:
"#country+code":
- "AFG"
output:
- "#country+code"
- "#country+name"
- "#population"
- filepath: "test_tabular_other.json"
remove:
- "national"
The key additional_inputs defines JSON files to be downloaded and added to the JSON under the appropriate key (eg. Other in the example configuration above). Source information is added as well.
The key output contains the path of the output JSON file.
Under the key additional_outputs, subsets of the full JSON output can be saved as separate files. For each additional output, filepath defines the path of the cut down output JSON file. A subset of each input tab in tabs is output under key. filters can be applied, for example to restrict to a set of country codes, which can be given in the configuration or passed in as variables to the save call (see below). The HXL hashtags to be output (ie. the columns) are defined under output. If remove is supplied instead of tabs, then all of the data in the full JSON file is output except for the tabs defined under remove.
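For instance, a filter on "#country+code" like the one in the configuration above amounts to a simple row-level cut (hypothetical sketch, illustrative values):

```python
rows = [
    {"#country+code": "AFG", "#country+name": "Afghanistan", "#population": 38928346},
    {"#country+code": "KEN", "#country+name": "Kenya", "#population": 53771296},
]
countries_to_save = ["AFG"]  # could also be passed in to the save call
filtered = [row for row in rows if row["#country+code"] in countries_to_save]
# filtered contains only the Afghanistan row
```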
The code to go with the configuration is as follows:
excelout = ExcelFile(excel_path, tabs, updatetabs)
gsheets = GoogleSheets(gsheet_config, gsheet_auth, updatesheets, tabs, updatetabs)
jsonout = JsonFile(json_config, updatetabs)
outputs = {"gsheets": gsheets, "excel": excelout, "json": jsonout}
...
writer = Writer(runner, outputs)
writer.update_subnational(scraper_names, adminlevel)
...
excelout.save()
filepaths = jsonout.save(tempdir, countries_to_save=["AFG"])
Output from the scrapers can go to Excel, Google Sheets and/or a JSON file.
The Writer class has a constructor which takes two parameters. The first parameter is the Runner object. The second is a dictionary of outputs such as to Google Sheets, Excel or JSON. There are standard methods for updating level data in the Writer class including update_toplevel ("global" for example), update_regional, update_national and update_subnational.
update_national is set up as follows:
flag_countries = {
"header": "ishrp",
"hxltag": "#meta+ishrp",
"countries": hrp_countries,
}
writer.update_national(
gho_countries,
names=national_names,
flag_countries=flag_countries,
iso3_to_region=RegionLookup.iso3_to_regions["GHO"],
ignore_regions=("GHO",),
level="national",
tab="national",
)
The first parameter is a list of country ISO 3 codes. The second optional parameter is the names of the scrapers to include. The third optional parameter, flag_countries, is a dictionary where the "header" and "hxltag" keys define an additional column to output which can contain "Y" or "N" depending upon whether the country is one of those defined in the key "countries". The fourth optional parameter, iso3_to_region, is a dictionary which maps from ISO 3 country code to a set of regions and it will result in an additional column (with header "Region" and HXL hashtag "#region+name") in which regions are listed separated by "|". The fifth optional ignore_regions parameter defines a list of regions that should not be output in the "Region" column. The sixth parameter level defaults to "national" and is the key to use when obtaining the results of the scrapers. The last parameter is the tab to update and defaults to "national".
update_subnational is used as follows:
update_subnational(
adminlevel,
names=subnational_names,
level="subnational",
tab="subnational",
)
The first parameter is an AdminLevel object (described earlier). The second optional parameter is the names of the scrapers to include. The parameter level defaults to "subnational" and is the key to use when obtaining the results of the scrapers. The last parameter is the tab to update and defaults to "subnational".
update_toplevel and update_regional require the output from two other functions.
regional_rows = get_regional_rows(
runner,
RegionLookup.regions + ["global"],
names=regional_names,
level="regional",
)
global_rows = get_toplevel_rows(
runner,
names=global_names,
overrides={"who_covid": {"gho": "global"}},
toplevel="global",
)
The first parameter of get_regional_rows is a list of regions. The second optional parameter is the names of the scrapers to include. The parameter level defaults to "regional" and is the key to use when obtaining the results of the scrapers.
The first optional parameter of get_toplevel_rows is the names of the scrapers to include. The second optional parameter overrides defines levels to get for specific scrapers, for example for "who_covid", output the data for the level "gho" as "global". The last parameter toplevel defaults to "allregions" and is the key to use when obtaining the results of the scrapers.
update_regional(
regional_rows,
toplevel_rows=global_rows,
toplevel_hxltags=additional_global_hxltags,
toplevel="global",
)
update_toplevel(
global_rows,
tab="world",
regional_rows=regional_rows,
regional_adm="GHO",
regional_hxltags=configuration["regional"]["global"],
regional_first=False,
)
The first parameter of update_regional is the regional rows obtained from get_regional_rows. The second optional parameter is the toplevel_rows obtained from get_toplevel_rows. The third optional parameter, toplevel_hxltags, specifies top level data to include. It will correspond to one row in the regional output. The last parameter toplevel defaults to "allregions" and is used as the region name.
The first parameter of update_toplevel is the toplevel_rows obtained from get_toplevel_rows. The second optional parameter is the tab to update and defaults to "allregions". The third optional parameter is the regional_rows (obtained from get_regional_rows) from which data for the admin given by the fourth optional parameter regional_adm is extracted. The specific regional columns to include are given by the fifth optional parameter regional_hxltags. The last parameter regional_first, which defaults to False, specifies whether columns from regional data are put in front of columns from top level data.
Sources
Default Source Behaviour
The default source date format is "%b %-d, %Y". The default date range separator (where start and end dates are supplied) is "-". By default, a source created with the same HXL hashtag as one created earlier will not overwrite the earlier source.
These defaults can be changed by calling the following methods in the Sources class:
Sources.set_default_source_date_format(new_format)
Sources.set_default_date_range_separator(new_separator)
Sources.set_should_overwrite_sources(True)
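For illustration, the default source date format corresponds to strftime's "%b %-d, %Y" (note the %-d directive, which drops any leading zero, is a glibc extension and may not be available on all platforms):

```python
from datetime import datetime, timezone

# Format a timezone aware date using the default source date format
date = datetime(2022, 3, 21, tzinfo=timezone.utc)
formatted = date.strftime("%b %-d, %Y")
# formatted == "Mar 21, 2022"
```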
Source Configurations
Custom scrapers can be set up with a custom source configuration by calling the parent (BaseScraper) constructor and passing source_configuration:
super().__init__(
"ipc",
datasetinfo,
{
"adminone": ((p3plus_header,), (self.p3plus_hxltag,)),
"admintwo": ((p3plus_header,), (self.p3plus_hxltag,)),
},
source_configuration=Sources.create_source_configuration(
adminlevel=(adminone, admintwo), should_overwrite_sources=True
),
)
adminlevel specifies one or more objects of type AdminLevel that map admin units to country ISO 3 codes, so that sources will be generated per country, ie. each HXL hashtag will be output with an additional attribute of the form "+KEN". should_overwrite_sources being True means that the sources generated by this custom scraper will overwrite any previous sources generated elsewhere with the same HXL hashtags.
Configurable scrapers can also be set up with a custom source configuration:
def create_configurable_scrapers(level, suffix_attribute=None, adminlevel=None):
suffix = f"_{level}"
source_configuration = Sources.create_source_configuration(
suffix_attribute=suffix_attribute, admin_sources=True, adminlevel=adminlevel
)
configurable_scrapers[level] = runner.add_configurables(
configuration[f"scraper{suffix}"],
level,
adminlevel=adminlevel,
source_configuration=source_configuration,
suffix=suffix,
)
create_configurable_scrapers("regional", suffix_attribute="regional")
create_configurable_scrapers("national")
create_configurable_scrapers("adminone", adminlevel=adminone)
create_configurable_scrapers("admintwo", adminlevel=admintwo)
In the above case, there are configurable scrapers at regional, national, admin 1 and admin 2 levels. The regional sources will have the additional HXL attribute "+regional". The national sources will be per country since admin_sources is set to True in the Sources.create_source_configuration call, so hashtags will have attributes like "+ken". The admin 1 and admin 2 sources will also have country attributes like "+ken" with the mapping from admin units to country provided in adminlevel.
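A trivial sketch of the per-country attribution described above (a hypothetical helper, not the library's API): the country's ISO 3 code is appended to the hashtag as a lowercase attribute.

```python
def add_country_attribute(hxltag, countryiso3):
    """Append the country ISO 3 code as a lowercase HXL attribute."""
    return f"{hxltag}+{countryiso3.lower()}"

tag = add_country_attribute("#affected+food+p3plus+num", "KEN")
# tag == "#affected+food+p3plus+num+ken"
```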
Updating sources
There is an update_sources method in the Writer class to update source information and it is used as follows:
writer.update_sources(
additional_sources=configuration["additional_sources"],
names=scraper_names,
secondary_runner=None,
custom_sources=(get_report_source(configuration),),
tab="sources",
should_overwrite_sources=False,
sources_to_delete=configuration["delete_sources"],
)
The first optional parameter additional_sources enables additional sources to be declared according to a specification eg. from YAML:
additional_sources:
- indicator: "#food-prices"
dataset: "global-wfp-food-prices"
- indicator: "#affected+food+p3plus+num"
source: "Multiple sources"
force_date_today: True
source_url: "https://data.humdata.org/search?q=(name:ipc-country-data%20OR%20name:cadre-harmonise)"
- indicator: "#affected+food+ipc+p3plus+num+som"
source_date:
start: "01/10/2022"
end: "31/12/2022"
source: "IPCInfo"
source_url: "https://data.humdata.org/dataset/somalia-acute-food-insecurity-country-data"
source_date_format:
start: "%b"
separator: "-"
end: "%b %Y (First projection)"
should_overwrite_source: True
- indicator: "#affected+idps+som"
copy: "#affected+idps+ind+som"
source_url: "https://data.humdata.org/dataset/somalia-drought-related-key-figures"
should_overwrite_source: True
This allows additional HXL hashtags to be associated with a source date, source and url. The metadata for "#food-prices" is obtained from a dataset on HDX, while for "#affected+food+p3plus+num", it is all specified with the source date being set to today. "#affected+food+ipc+p3plus+num+som" shows a more complex setup with a date range output for the source date. should_overwrite_source allows any existing source to be overridden.
For "#affected+idps+som", the source is copied from "#affected+idps+ind+som" with source_url being replaced. It is also possible for indicator and copy to refer to the same HXL hashtag, meaning that the source will be overridden provided should_overwrite_source is True.
The second optional names parameter allows the specific scrapers for which sources are to be output to be chosen by name. The third optional parameter secondary_runner allows a second Runner object to be supplied and the sources from the scrapers associated with that Runner object to be included. The fourth optional parameter custom_sources allows sources that have been obtained, for example from a function call, to be added directly without any processing or changes to them. They should be in the form: (HXL tag, source date, source, source url). The fifth parameter is the tab to update and defaults to "sources". By default, if the same indicator (HXL hashtag) appears more than once in the list of sources, then the first is used, but the sixth parameter should_overwrite_sources enables overwriting of sources instead. The last parameter sources_to_delete is a list of sources to delete where each source is specified using its HXL hashtag (or a part of the hashtag).
For more fine-grained control, it is also possible to obtain sources by calling get_sources on a Runner object with or without the optional additional_sources parameter:
runner.get_sources(additional_sources=[...])
Other Configurable Scrapers
Some other configurable scrapers are provided for specific tasks. If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the final parameter force_add_to_run in the add_* call of the Runner object should be set to True.
Time Series Scraper
This scraper reads and outputs time series data. One or more instances can be set up as follows:
runner.add_timeseries_scrapers(configuration["timeseries"], outputs)
The first parameter defines the YAML configuration where the scrapers are configured as shown below. The second is a dictionary where the values are objects that inherit from BaseOutput and which serve to send output somewhere like Google Sheets.
casualties:
source: "OHCHR"
source_url: "https://data.humdata.org/dataset/ukraine-key-figures-2022"
dataset: "ukraine-who-does-what-where-3w"
url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vQIdedbZz0ehRC0b4fsWiP14R7MdtU1mpmwAkuXUPElSah2AWCURKGALFDuHjvyJUL8vzZAt3R1B5qg/pub?gid=0&single=true&output=csv"
format: "csv"
headers: 2
date: "Date"
date_type: "date"
date_hxl: "#date"
input:
- "Civilian casualities(OHCHR) - Killed"
- "Civilian casualities(OHCHR) - Injured"
output:
- "CiviliansKilled"
- "CiviliansInjured"
output_hxl:
- "#affected+killed"
- "#affected+injured"
This reads from the given url looking for the date column given by date. The given input columns are output along with the date, using the column names given by output and the HXL hashtags given by output_hxl.
Aggregator
The aggregator scraper is used for aggregating data from other scrapers. One or more are set up as shown below:
regional_names_gho = runner.add_aggregators(
True,
regional_configuration["aggregate_gho"],
"national",
"regional",
RegionLookup.iso3_to_regions["GHO"],
force_add_to_run=True
)
The first parameter is whether to use column headers (False) or HXL hashtags (True). The second points to a YAML configuration which is outlined below. The third is the level of the input scraper data to be aggregated (like national). The fourth is the level of the aggregated output data (like regional). The fifth is a mapping from admin units of the input level to admin units of the output level of the form: {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, then it is a list of input admins like: ("AFG", "MMR").
aggregate_gho:
"#population":
action: "sum"
"#affected+infected":
action: "sum"
...
"#value+funding+hrp+required+usd":
output: "RequiredFunding"
action: "sum"
"#value+funding+hrp+total+usd":
output: "Funding"
action: "sum"
"#value+funding+hrp+pct":
output: "PercentFunded"
action: "eval"
formula: "get_fraction_str(#value+funding+hrp+total+usd, #value+funding+hrp+required+usd) if #value+funding+hrp+total+usd is not None and #value+funding+hrp+required+usd is not None else ''"
"#access+visas+pct":
action: "mean"
"#access+travel+pct":
action: "mean"
...
"#affected+food+ipc+p3plus+num":
output: "FoodInsecurityP3+"
action: "sum"
input:
- "#affected+ch+food+p3plus+num"
- "#affected+food+ipc+p3plus+num"
The configuration lists input HXL tags along with what sort of aggregation will be performed ("sum", "mean" or "eval" under action). "eval" allows combining already aggregated columns together. Where input comes from multiple columns, these can be defined with input and the output column name with output.
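The "sum" and "mean" actions can be sketched roughly as follows (simplified code, not the library's; the iso3-to-regions mapping and values are illustrative):

```python
from statistics import mean

iso3_to_regions = {"AFG": ("ROAP",), "MMR": ("ROAP",), "KEN": ("ROSEA",)}
national = {
    "AFG": {"#value+funding+hrp+required+usd": 100, "#access+visas+pct": 50},
    "MMR": {"#value+funding+hrp+required+usd": 50, "#access+visas+pct": 70},
    "KEN": {"#value+funding+hrp+required+usd": 70, "#access+visas+pct": 90},
}

def aggregate(hxltag, action):
    """Group national values by region, then apply the action per region."""
    grouped = {}
    for iso3, values in national.items():
        for region in iso3_to_regions[iso3]:
            grouped.setdefault(region, []).append(values[hxltag])
    fn = {"sum": sum, "mean": mean}[action]
    return {region: fn(vals) for region, vals in grouped.items()}

required = aggregate("#value+funding+hrp+required+usd", "sum")
# required == {"ROAP": 150, "ROSEA": 70}
visas = aggregate("#access+visas+pct", "mean")
# visas["ROAP"] == 60
```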
"#affected+idps":
source: "IOM, UNHCR, PRMN"
source_url: "https://data.humdata.org/dataset?groups=eth&groups=ken&groups=som&organization=ocha-rosea&vocab_Topics=drought&q=&sort=score%20desc%2C%20if(gt(last_modified%2Creview_date)%2Clast_modified%2Creview_date)%20desc&ext_page_size=25"
action: "sum"
Source information such as source_date, source and/or source_url can be overridden as shown above.
Resource Downloader
The resource downloader is a simple scraper that downloads resources from HDX datasets. One or more is set up as follows:
res_dlds = runner.add_resourcedownloaders(configuration["copyfiles"], folder)
The first parameter is a YAML configuration shown below. The second is the folder to which to download (which defaults to "").
download_resources:
- dataset: "ukraine-border-crossings"
format: "geojson"
filename: "UKR_Border_Crossings.geojson"
hxltag: "#geojson"
- dataset: "ukraine-hostilities"
format: "geojson"
filename: "UKR_Hostilities.geojson"
hxltag: "#event+loc"
The dataset from which to copy is specified along with the format of the resource to be copied. The output filename is given along with a hxltag that is used in the reporting of sources (with source information taken from the HDX dataset).
Other Utilities
Region Lookup
A class is provided that allows creating lookups from ISO 3 country codes to regions. It is set up like this:
RegionLookup.load(regional_configuration, gho_countries, {"HRPs": hrp_countries})
The configuration comes from a YAML file shown below. The second parameter is a list of countries. The third is a dictionary containing additional regions.
regional:
dataset: "unocha-office-locations"
format: "xlsx"
iso3_header: "ISO3"
region_header: "Regional_office"
toplevel_region: "GHO"
ignore:
- "NO COVERAGE"
The configuration above reads from a dataset from HDX, looking for a resource of format "xlsx". In that file, it uses the columns specified by iso3_header and region_header. Regions in the ignore list are not included. A country can map not only to what is specified in the dataset but also to toplevel_region (eg. GHO), which covers all countries given by the second parameter of the load call (eg. gho_countries), and to one or more additional regions given in the optional third parameter (eg. "HRPs"). For the third parameter, each key value pair is a mapping from a region name to a list of countries in that region.
RegionLookup provides class variables regions (list of regions) and iso3_to_region (one-to-one mapping from country ISO3 code to region name based purely on the dataset read from the configuration). It also provides iso3_to_regions which is a one-to-many mapping from country ISO3 code to multiple region names which will include the toplevel_region and additional regions specified in the third parameter.
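Putting this together, the resulting lookups can be sketched as follows (hypothetical stand-in data rather than the HDX dataset):

```python
dataset_mapping = {"AFG": "ROAP", "KEN": "ROSEA"}  # stand-in for the HDX dataset
gho_countries = ["AFG", "KEN"]
additional = {"HRPs": ["AFG"]}  # third parameter of the load call

iso3_to_region = dict(dataset_mapping)  # one-to-one, dataset only
iso3_to_regions = {}
for iso3 in gho_countries:
    regions = {dataset_mapping[iso3], "GHO"}  # toplevel_region covers all countries
    for region, countries in additional.items():
        if iso3 in countries:
            regions.add(region)
    iso3_to_regions[iso3] = regions
# iso3_to_regions["AFG"] == {"ROAP", "GHO", "HRPs"}
# iso3_to_regions["KEN"] == {"ROSEA", "GHO"}
```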
Real World Usage
This framework has been used to power the data behind a few visualisations. It can be helpful to examine these to see the framework being used in a complete setup.
The project here provides data for the Covid Data Explorer.
The project here provides data for the Ukraine Data Explorer.