hdx.scraper.runner
Runner Objects
class Runner()
The Runner class is the means by which scrapers are set up and run.
Arguments:
countryiso3s
ListTuple[str] - List of ISO3 country codes to process
today
datetime - Value to use for today. Defaults to now_utc().
errors_on_exit
ErrorsOnExit - ErrorsOnExit object that logs errors on exit
scrapers_to_run
Optional[ListTuple[str]] - Scrapers to run. Defaults to None (all scrapers).
add_custom
def add_custom(scraper: BaseScraper, force_add_to_run: bool = False) -> str
Add custom scrapers that inherit BaseScraper. If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
scraper
BaseScraper - The scraper to add
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
str
- scraper name
add_customs
def add_customs(scrapers: ListTuple[BaseScraper],
force_add_to_run: bool = False) -> List[str]
Add multiple custom scrapers that inherit BaseScraper. If running specific scrapers rather than all, and you want to force the inclusion of these scrapers in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
scrapers
ListTuple[BaseScraper] - The scrapers to add
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
List[str]
- scraper names
add_configurable
def add_configurable(name: str,
datasetinfo: Dict,
level: str,
adminlevel: Optional[AdminLevel] = None,
level_name: Optional[str] = None,
source_configuration: Dict = {},
suffix: Optional[str] = None,
force_add_to_run: bool = False,
countryiso3s: Optional[List[str]] = None) -> str
Add configurable scraper to the run. If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
name
str - Name of scraper
datasetinfo
Dict - Information about dataset
level
str - Can be national, subnational or single
adminlevel
Optional[AdminLevel] - AdminLevel object from HDX Python Country. Defaults to None.
level_name
Optional[str] - Customised level name. Defaults to None (use level).
source_configuration
Dict - Configuration for sources. Defaults to empty dict (use defaults).
suffix
Optional[str] - Suffix to add to the scraper name
force_add_to_run
bool - Whether to force include the scraper in the next run
countryiso3s
Optional[List[str]] - Override list of country ISO3 codes. Defaults to None.
Returns:
str
- scraper name (including suffix if set)
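As a sketch of what a datasetinfo dictionary and the suffixed scraper name might look like: the key names below are illustrative assumptions, not the library's confirmed schema, so check the HDX Python Scraper documentation for the actual keys.

```python
# Hypothetical datasetinfo for a configurable scraper.
# All key names here are assumptions for illustration only.
datasetinfo = {
    "dataset": "example-hdx-dataset-id",   # assumed: HDX dataset to read from
    "format": "csv",                       # assumed: resource format
    "headers": 1,                          # assumed: header row number
    "admin": ["#country+code"],            # assumed: admin column HXL hashtag
    "input": ["#affected+infected"],       # assumed: input value columns
    "output": ["infected"],                # assumed: output header names
    "output_hxl": ["#affected+infected"],  # assumed: output HXL hashtags
}

# The returned scraper name includes the suffix if one is set:
name = "cases"
suffix = "_national"
registered_name = f"{name}{suffix}" if suffix else name
# registered_name == "cases_national"
```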
add_configurables
def add_configurables(configuration: Dict,
level: str,
adminlevel: Optional[AdminLevel] = None,
level_name: Optional[str] = None,
source_configuration: Dict = {},
suffix: Optional[str] = None,
force_add_to_run: bool = False,
countryiso3s: Optional[List[str]] = None) -> List[str]
Add multiple configurable scrapers to the run. If running specific scrapers rather than all, and you want to force the inclusion of these scrapers in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
configuration
Dict - Mapping from scraper name to information about datasets
level
str - Can be national, subnational or single
adminlevel
Optional[AdminLevel] - AdminLevel object from HDX Python Country. Defaults to None.
level_name
Optional[str] - Customised level name. Defaults to None (use level).
source_configuration
Dict - Configuration for sources. Defaults to empty dict (use defaults).
suffix
Optional[str] - Suffix to add to the scraper name
force_add_to_run
bool - Whether to force include the scraper in the next run
countryiso3s
Optional[List[str]] - Override list of country ISO3 codes. Defaults to None.
Returns:
List[str]
- scraper names (including suffix if set)
add_timeseries_scraper
def add_timeseries_scraper(name: str,
datasetinfo: Dict,
outputs: Dict[str, BaseOutput],
force_add_to_run: bool = False) -> str
Add time series scraper to the run. If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
name
str - Name of scraper
datasetinfo
Dict - Information about dataset
outputs
Dict[str, BaseOutput] - Mapping from names to output objects
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
str
- scraper name
add_timeseries_scrapers
def add_timeseries_scrapers(configuration: Dict,
outputs: Dict[str, BaseOutput],
force_add_to_run: bool = False) -> List[str]
Add multiple time series scrapers to the run. If running specific scrapers rather than all, and you want to force the inclusion of these scrapers in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
configuration
Dict - Mapping from scraper name to information about datasets
outputs
Dict[str, BaseOutput] - Mapping from names to output objects
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
List[str]
- scraper names
create_aggregator
def create_aggregator(
use_hxl: bool,
header_or_hxltag: str,
datasetinfo: Dict,
input_level: str,
output_level: str,
adm_aggregation: Union[Dict, List],
source_configuration: Dict = {},
names: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {},
aggregation_scrapers: List["Aggregator"] = []
) -> Optional["Aggregator"]
Create aggregator
Arguments:
use_hxl
bool - Whether keys should be HXL hashtags or column headers
header_or_hxltag
str - Column header or HXL hashtag depending on use_hxl
datasetinfo
Dict - Information about dataset
input_level
str - Input level to aggregate like national or subnational
output_level
str - Output level of aggregated data like regional
adm_aggregation
Union[Dict, List] - Mapping from input admins to aggregated output admins
source_configuration
Dict - Configuration for sources. Defaults to empty dict (use defaults).
names
Optional[ListTuple[str]] - Names of scrapers. Defaults to None.
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
aggregation_scrapers
List["Aggregator"] - Other aggregations needed. Defaults to [].
Returns:
Optional["Aggregator"]
- scraper or None
add_aggregator
def add_aggregator(use_hxl: bool,
header_or_hxltag: str,
datasetinfo: Dict,
input_level: str,
output_level: str,
adm_aggregation: Union[Dict, List],
source_configuration: Dict = {},
names: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {},
aggregation_scrapers: List["Aggregator"] = [],
force_add_to_run: bool = False) -> Optional[str]
Add aggregator to the run. The mapping from input admins to aggregated output admins adm_aggregation is of form: {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, then it is a list of input admins like: ("AFG", "MMR"). If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
use_hxl
bool - Whether keys should be HXL hashtags or column headers
header_or_hxltag
str - Column header or HXL hashtag depending on use_hxl
datasetinfo
Dict - Information about dataset
input_level
str - Input level to aggregate like national or subnational
output_level
str - Output level of aggregated data like regional
adm_aggregation
Union[Dict, List] - Mapping from input admins to aggregated output admins
source_configuration
Dict - Configuration for sources. Defaults to empty dict (use defaults).
names
Optional[ListTuple[str]] - Names of scrapers. Defaults to None.
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
aggregation_scrapers
List["Aggregator"] - Other aggregations needed. Defaults to [].
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
Optional[str]
- scraper name (including suffix if set) or None
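The shape of the adm_aggregation parameter described above can be illustrated with a standalone sketch (no library calls; the sum aggregation here is a simplifying assumption, made only to show how the mapping is consumed):

```python
# Mapping from input admins (ISO3 codes) to output admins, as in the
# docstring: each country maps to a tuple of output regions.
adm_aggregation = {"AFG": ("ROAP",), "MMR": ("ROAP",)}

# Hypothetical national-level values keyed by ISO3.
national_values = {"AFG": 10, "MMR": 5}

# Simplified aggregation to the output level by summing
# (an assumption; the library's aggregation behaviour is configurable).
regional = {}
for iso3, value in national_values.items():
    for output_adm in adm_aggregation.get(iso3, ()):
        regional[output_adm] = regional.get(output_adm, 0) + value
# regional == {"ROAP": 15}

# If aggregating to the top level, adm_aggregation is instead
# just a list/tuple of input admins:
top_level_inputs = ("AFG", "MMR")
total = sum(national_values[iso3] for iso3 in top_level_inputs)
# total == 15
```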
add_aggregators
def add_aggregators(use_hxl: bool,
configuration: Dict,
input_level: str,
output_level: str,
adm_aggregation: Union[Dict, ListTuple],
source_configuration: Dict = {},
names: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {},
force_add_to_run: bool = False) -> List[str]
Add multiple aggregators to the run. The mapping from input admins to aggregated output admins adm_aggregation is of form: {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, then it is a list of input admins like: ("AFG", "MMR"). If running specific scrapers rather than all, and you want to force the inclusion of these scrapers in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
use_hxl
bool - Whether keys should be HXL hashtags or column headers
configuration
Dict - Mapping from scraper name to information about datasets
input_level
str - Input level to aggregate like national or subnational
output_level
str - Output level of aggregated data like regional
adm_aggregation
Union[Dict, ListTuple] - Mapping from input admins to aggregated output admins
source_configuration
Dict - Configuration for sources. Defaults to empty dict (use defaults).
names
Optional[ListTuple[str]] - Names of scrapers
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
List[str]
- scraper names (including suffix if set)
add_resource_downloader
def add_resource_downloader(datasetinfo: Dict,
folder: str = "",
force_add_to_run: bool = False) -> str
Add resource downloader to the run. If running specific scrapers rather than all, and you want to force the inclusion of the scraper in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
datasetinfo
Dict - Information about dataset
folder
str - Folder to which to download. Defaults to "".
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
str
- scraper name (including suffix if set)
add_resource_downloaders
def add_resource_downloaders(configuration: Dict,
folder: str = "",
force_add_to_run: bool = False) -> List[str]
Add multiple resource downloaders to the run. If running specific scrapers rather than all, and you want to force the inclusion of these scrapers in the run regardless of the specific scrapers given, the parameter force_add_to_run should be set to True.
Arguments:
configuration
Dict - Mapping from scraper name to information about datasets
folder
str - Folder to which to download. Defaults to "".
force_add_to_run
bool - Whether to force include the scraper in the next run
Returns:
List[str]
- scraper names (including suffix if set)
prioritise_scrapers
def prioritise_scrapers(scraper_names: ListTuple[str]) -> None
Set certain scrapers to run first
Arguments:
scraper_names
ListTuple[str] - Names of scrapers to run first
Returns:
None
get_scraper_names
def get_scraper_names() -> List[str]
Get names of scrapers
Returns:
List[str]
- Names of scrapers
get_scraper
def get_scraper(name: str) -> Optional[BaseScraper]
Get scraper given name
Arguments:
name
str - Name of scraper
Returns:
Optional[BaseScraper]
- Scraper or None if there is no scraper with given name
get_scraper_exception
def get_scraper_exception(name: str) -> BaseScraper
Get scraper given name. Throws exception if there is no scraper with the given name.
Arguments:
name
str - Name of scraper
Returns:
BaseScraper
- Scraper
delete_scraper
def delete_scraper(name: str) -> bool
Delete scraper with given name
Arguments:
name
str - Name of scraper
Returns:
bool
- True if the scraper was present, False if not
add_instance_variables
def add_instance_variables(name: str, **kwargs: Any) -> None
Add instance variables to scraper instance given scraper name
Arguments:
name
str - Name of scraper
**kwargs
- Instance name value pairs to add to scraper instance
Returns:
None
add_pre_run
def add_pre_run(name: str, fn: Callable[[BaseScraper], None]) -> None
Add pre run instance method to scraper instance given scraper name. The function should have one parameter. Since it is being added as an instance method to the scraper instance, that parameter will be self and hence is of type BaseScraper. The function does not need to return anything.
Arguments:
name
str - Name of scraper
fn
Callable[[BaseScraper], None] - Function to call pre run
Returns:
None
add_post_run
def add_post_run(name: str, fn: Callable[[BaseScraper], None]) -> None
Add post run instance method to scraper instance given scraper name. The function should have one parameter. Since it is being added as an instance method to the scraper instance, that parameter will be self and hence is of type BaseScraper. The function does not need to return anything.
Arguments:
name
str - Name of scraper
fn
Callable[[BaseScraper], None] - Function to call post run
Returns:
None
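The pattern of attaching a plain function as an instance method, so that its single parameter becomes self, can be sketched without the library (the stand-in class below is hypothetical; only the binding mechanism is being illustrated, and the assumption is that add_pre_run/add_post_run behave observably like this):

```python
import types

class FakeScraper:
    """Minimal stand-in for BaseScraper, for illustration only."""
    def __init__(self, name):
        self.name = name
        self.log = []

def my_pre_run(self):
    # self is the scraper instance, analogous to a BaseScraper;
    # the function returns nothing, matching the docstring.
    self.log.append(f"pre-run for {self.name}")

scraper = FakeScraper("cases")
# Bind the function as an instance method on this one instance.
scraper.pre_run = types.MethodType(my_pre_run, scraper)
scraper.pre_run()
# scraper.log == ["pre-run for cases"]
```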
run_one
def run_one(name: str, force_run: bool = False) -> bool
Run scraper with given name, adding sources and population to global dictionary. If scraper run fails and fallbacks have been set up, use them.
Arguments:
name
str - Name of scraper
force_run
bool - Force run even if scraper marked as already run
Returns:
bool
- Return True if scraper was run, False if not
run_scraper
def run_scraper(name: str, force_run: bool = False) -> bool
Check scraper with given name is in the list of scrapers to run. If it isn't, return False, otherwise run it (including force running scrapers that have already run if force_run is True), adding sources and population to global dictionary. If scraper run fails and fallbacks have been set up, use them.
Arguments:
name
str - Name of scraper
force_run
bool - Force run even if scraper marked as already run
Returns:
bool
- Return True if scraper was run, False if not
run
def run(what_to_run: Optional[ListTuple[str]] = None,
force_run: bool = False,
prioritise_scrapers: Optional[ListTuple[str]] = None) -> None
Run scrapers limiting to those in what_to_run if given (including force running scrapers that have already run if force_run is True), adding sources and population to global dictionary. Scrapers given by prioritise_scrapers are run first. If scraper run fails and fallbacks have been set up, use them.
Arguments:
what_to_run
Optional[ListTuple[str]] - Run only these scrapers. Defaults to None (run all).
force_run
bool - Force run even if any scraper marked as already run
prioritise_scrapers
Optional[ListTuple[str]] - Scrapers to run first. Defaults to None.
Returns:
None
set_not_run
def set_not_run(name: str) -> None
Set scraper given by name as not run
Arguments:
name
str - Name of scraper
Returns:
None
set_not_run_many
def set_not_run_many(names: ListTuple[str]) -> None
Set scrapers given by names as not run
Arguments:
names
ListTuple[str] - Names of scrapers
Returns:
None
get_headers
def get_headers(names: Optional[ListTuple[str]] = None,
levels: Optional[ListTuple[str]] = None,
headers: Optional[ListTuple[str]] = None,
hxltags: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {}) -> Dict[str, Tuple]
Get the headers for scrapers limiting to those in names if given and limiting further to those that have been set in the constructor if previously given. All levels will be obtained unless the levels parameter (which can contain levels like national, subnational or single) is passed. The dictionary returned can be limited to given headers or hxltags.
Arguments:
names
Optional[ListTuple[str]] - Names of scrapers
levels
Optional[ListTuple[str]] - Levels to get like national, subnational or single
headers
Optional[ListTuple[str]] - Headers to get
hxltags
Optional[ListTuple[str]] - HXL hashtags to get
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
Returns:
Dict[str, Tuple]: Dictionary that maps each level to (list of headers, list of hxltags)
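The returned shape, and the documented limiting by hxltags, can be sketched standalone (the values are hypothetical and the filtering below is a simplified re-implementation assumed to mirror the documented behaviour, not the library's code):

```python
# Hypothetical get_headers output: level -> (headers, hxltags).
headers_by_level = {
    "national": (
        ["Country", "Infected"],
        ["#country+name", "#affected+infected"],
    ),
}

# Limiting the dictionary to given hxltags, as the docstring describes:
wanted = {"#affected+infected"}
filtered = {}
for level, (headers, hxltags) in headers_by_level.items():
    kept = [(h, t) for h, t in zip(headers, hxltags) if t in wanted]
    filtered[level] = ([h for h, _ in kept], [t for _, t in kept])
# filtered["national"] == (["Infected"], ["#affected+infected"])
```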
get_results
def get_results(
names: Optional[ListTuple[str]] = None,
levels: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {},
has_run: bool = True,
should_overwrite_sources: Optional[bool] = None) -> Dict[str, Dict]
Get the results (headers, values and sources) for scrapers limiting to those in names if given and limiting further to those that have been set in the constructor if previously given. All levels will be obtained unless the levels parameter (which can contain levels like national, subnational or single) is passed. Sometimes it may be necessary to map alternative level names to levels and this can be done using overrides. It is a dictionary with keys being scraper names and values being dictionaries which map level names to output levels. By default only scrapers marked as having run are returned unless has_run is set to False. The results dictionary has keys for each output level and values which are dictionaries with keys headers, values, sources and fallbacks. Headers is a tuple of (column headers, hxl hashtags). Values, sources and fallbacks are all lists.
Arguments:
names
Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
levels
Optional[ListTuple[str]] - Levels to get like national, subnational or single
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
has_run
bool - Only get results for scrapers marked as having run. Defaults to True.
should_overwrite_sources
Optional[bool] - Whether to overwrite sources. Defaults to None (class default).
Returns:
Dict[str, Dict]: Results dictionary that maps each level to headers, values, sources, fallbacks.
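A sketch of the documented results structure and how one might read from it (all values are illustrative; in particular, representing each value column as a dict keyed by admin code is an assumption here):

```python
# Hypothetical results for one output level, following the documented
# shape: keys headers, values, sources and fallbacks per level.
results = {
    "national": {
        # Headers is a tuple of (column headers, hxl hashtags).
        "headers": (["Infected"], ["#affected+infected"]),
        # Values: assumed one dict per column, keyed by admin code.
        "values": [{"AFG": 10, "MMR": 5}],
        "sources": [
            ("#affected+infected", "2024-01-01", "WHO",
             "https://example.org"),
        ],
        "fallbacks": [],
    },
}

column_headers, hxl_hashtags = results["national"]["headers"]
afg_infected = results["national"]["values"][0]["AFG"]
# afg_infected == 10
```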
get_rows
def get_rows(level: str,
adms: ListTuple[str],
headers: ListTuple[ListTuple] = (tuple(), tuple()),
row_fns: ListTuple[Callable[[str], str]] = tuple(),
names: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {}) -> List[List]
Get rows for a given level for scrapers limiting to those in names if given. Rows include header row, HXL hashtag row and value rows, one for each admin unit specified in the adms parameter. Additional columns can be included by specifying headers and row_fns. Headers are of the form (list of headers, list of HXL hashtags). row_fns are functions that accept an admin unit and return a string. Sometimes it may be necessary to map alternative level names to levels and this can be done using overrides. It is a dictionary with keys being scraper names and values being dictionaries which map level names to output levels.
Arguments:
level
str - Level to get like national, subnational or single
adms
ListTuple[str] - Admin units
headers
ListTuple[ListTuple] - Additional headers in the form (list of headers, list of HXL hashtags)
row_fns
ListTuple[Callable[[str], str]] - Functions to populate additional columns
names
Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
Returns:
List[List]
- Rows for a given level
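The documented row layout (header row, HXL hashtag row, then one value row per admin unit) plus the additional columns from headers and row_fns can be sketched standalone (data and column names are hypothetical; this is a simplified assumption of the output shape, not the library's implementation):

```python
# Hypothetical scraper output: one value column keyed by admin unit.
values = {"AFG": 10, "MMR": 5}
adms = ["AFG", "MMR"]

# Additional columns: headers of the form (headers, hxltags),
# and row_fns that take an admin unit and return a string.
extra_headers = (["ISO3"], ["#country+code"])
row_fns = [lambda adm: adm]

rows = []
rows.append(extra_headers[0] + ["Infected"])            # header row
rows.append(extra_headers[1] + ["#affected+infected"])  # HXL hashtag row
for adm in adms:                                        # one row per admin
    rows.append([fn(adm) for fn in row_fns] + [values[adm]])
# rows == [["ISO3", "Infected"],
#          ["#country+code", "#affected+infected"],
#          ["AFG", 10], ["MMR", 5]]
```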
get_values_sourcesinfo_by_header
def get_values_sourcesinfo_by_header(
level: str,
names: Optional[ListTuple[str]] = None,
overrides: Dict[str, Dict] = {},
has_run: bool = True,
use_hxl: bool = True) -> Tuple[Dict, Dict]
Get mapping from headers to values and headers to sources information for a given level for scrapers limiting to those in names if given. Keys will be headers if use_hxl is False or HXL hashtags if use_hxl is True. Sometimes it may be necessary to map alternative level names to levels and this can be done using overrides. It is a dictionary with keys being scraper names and values being dictionaries which map level names to output levels.
Arguments:
level
str - Level to get like national, subnational or single
names
Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
overrides
Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
has_run
bool - Only get results for scrapers marked as having run. Defaults to True.
use_hxl
bool - Whether keys should be HXL hashtags or column headers. Defaults to True.
Returns:
Tuple[Dict, Dict]: Tuple of (headers to values, headers to sources)
get_sources
def get_sources(
names: Optional[ListTuple[str]] = None,
levels: Optional[ListTuple[str]] = None,
additional_sources: ListTuple[Dict] = tuple(),
should_overwrite_sources: Optional[bool] = None) -> List[Tuple]
Get sources for scrapers limiting to those in names if given. All levels will be obtained unless the levels parameter (which can contain levels like national, subnational or single) is passed. Additional sources can be added. Each is a dictionary with indicator (specified with HXL hashtag), dataset or source and source_url as well as the source_date or whether to force_date_today.
Arguments:
names
Optional[ListTuple[str]] - Names of scrapers
levels
Optional[ListTuple[str]] - Levels to get like national, subnational or single
additional_sources
ListTuple[Dict] - Additional sources to add
should_overwrite_sources
Optional[bool] - Whether to overwrite sources. Defaults to None (class default).
Returns:
List[Tuple]
- Sources in form (indicator, date, source, source_url)
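An additional source entry, using the keys the docstring names (indicator, dataset or source, source_url, and source_date or force_date_today), might look like the sketch below; the exact accepted keys and values are assumptions to be checked against the library documentation:

```python
# Hypothetical additional_sources entries for get_sources.
additional_sources = [
    {
        "indicator": "#affected+infected",        # HXL hashtag of indicator
        "source": "WHO",                          # or "dataset": HDX dataset id
        "source_url": "https://example.org/data",
        "source_date": "2024-01-01",              # or "force_date_today": True
    },
]

# Sources are returned in the form (indicator, date, source, source_url):
entry = additional_sources[0]
source_tuple = (entry["indicator"], entry["source_date"],
                entry["source"], entry["source_url"])
```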
get_source_urls
def get_source_urls(names: Optional[ListTuple[str]] = None) -> List[str]
Get source urls for scrapers limiting to those in names if given.
Arguments:
names
Optional[ListTuple[str]] - Names of scrapers
Returns:
List[str]
- List of source urls
get_hapi_metadata
def get_hapi_metadata(names: Optional[ListTuple[str]] = None,
has_run: bool = True) -> Dict
Get HAPI metadata for all datasets. A dictionary is returned that maps from dataset ids to a dictionary. The dictionary has keys for dataset metadata and a key resources under which is a dictionary that maps from resource ids to resource metadata.
Arguments:
names
Optional[ListTuple[str]] - Names of scrapers
has_run
bool - Only get results for scrapers marked as having run. Defaults to True.
Returns:
Dict
- HAPI metadata for all datasets
get_hapi_results
def get_hapi_results(names: Optional[ListTuple[str]] = None,
has_run: bool = True) -> Dict
Get the results (headers and values per admin level and HAPI metadata) for scrapers limiting to those in names if given and limiting further to those that have been set in the constructor if previously given. By default, only scrapers marked as having run are returned unless has_run is set to False.
A dictionary is returned where key is HDX dataset id and value is a dictionary that has HAPI dataset metadata as well as a results key. The value associated with the results key is a dictionary where each key is an admin level. Each admin level key has a value dictionary with headers, values and HAPI resource metadata. Headers is a tuple of (column headers, hxl hashtags). Values is a list. HAPI resource metadata is a dictionary.
Arguments:
names
Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
has_run
bool - Only get results for scrapers marked as having run. Defaults to True.
Returns:
Dict
- Headers and values per admin level and HAPI metadata for all datasets