hdx.scraper.runner

Runner Objects

class Runner()

[view_source]

The Runner class is the means by which scrapers are set up and run.

Arguments:

  • countryiso3s ListTuple[str] - List of ISO3 country codes to process
  • today datetime - Value to use for today. Defaults to now_utc().
  • errors_on_exit ErrorsOnExit - ErrorsOnExit object that logs errors on exit
  • scrapers_to_run Optional[ListTuple[str]] - Scrapers to run. Defaults to None (all scrapers).
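
A minimal sketch of constructing a Runner; the country codes and scraper name are illustrative:

from hdx.scraper.runner import Runner

# Process two countries; remaining arguments keep their defaults, so
# today is now_utc() and every added scraper is eligible to run
runner = Runner(("AFG", "MMR"))

# Alternatively, restrict the run to specific named scrapers
runner = Runner(("AFG", "MMR"), scrapers_to_run=("population",))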

add_custom

def add_custom(scraper: BaseScraper, force_add_to_run: bool = False) -> str

[view_source]

Add a custom scraper that inherits BaseScraper. If running specific scrapers rather than all, set force_add_to_run to True to include this scraper in the run regardless of which scrapers were specified.

Arguments:

  • scraper BaseScraper - The scraper to add
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • str - scraper name
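
A brief sketch of registering a custom scraper; my_scraper stands in for an instance of a BaseScraper subclass you have written:

# my_scraper is assumed to be an instance of your own BaseScraper subclass
name = runner.add_custom(my_scraper)
# Force inclusion even when scrapers_to_run names only other scrapers
name = runner.add_custom(my_scraper, force_add_to_run=True)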

add_customs

def add_customs(scrapers: ListTuple[BaseScraper],
                force_add_to_run: bool = False) -> List[str]

[view_source]

Add multiple custom scrapers that inherit BaseScraper. If running specific scrapers rather than all, set force_add_to_run to True to include these scrapers in the run regardless of which scrapers were specified.

Arguments:

  • scrapers ListTuple[BaseScraper] - The scrapers to add
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • List[str] - scraper names

add_configurable

def add_configurable(name: str,
                     datasetinfo: Dict,
                     level: str,
                     adminlevel: Optional[AdminLevel] = None,
                     level_name: Optional[str] = None,
                     source_configuration: Dict = {},
                     suffix: Optional[str] = None,
                     force_add_to_run: bool = False,
                     countryiso3s: Optional[List[str]] = None) -> str

[view_source]

Add a configurable scraper to the run. If running specific scrapers rather than all, set force_add_to_run to True to include this scraper in the run regardless of which scrapers were specified.

Arguments:

  • name str - Name of scraper
  • datasetinfo Dict - Information about dataset
  • level str - Can be national, subnational or single
  • adminlevel Optional[AdminLevel] - AdminLevel object from HDX Python Country. Defaults to None.
  • level_name Optional[str] - Customised level name. Defaults to None (use level).
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • suffix Optional[str] - Suffix to add to the scraper name
  • force_add_to_run bool - Whether to force include the scraper in the next run
  • countryiso3s Optional[List[str]] - Override list of country iso3s. Defaults to None.

Returns:

  • str - scraper name (including suffix if set)
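
A sketch of adding a configurable scraper at national level. The datasetinfo keys follow this library's configurable scraper conventions, but the dataset name and tags are illustrative:

datasetinfo = {
    "dataset": "example-population-dataset",  # illustrative HDX dataset name
    "format": "csv",
    "admin": ["#country+code"],     # column identifying the admin unit
    "input": ["#population"],       # input columns to read
    "output": ["Population"],       # output column headers
    "output_hxl": ["#population"],  # output HXL hashtags
}
name = runner.add_configurable("population", datasetinfo, "national")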

add_configurables

def add_configurables(configuration: Dict,
                      level: str,
                      adminlevel: Optional[AdminLevel] = None,
                      level_name: Optional[str] = None,
                      source_configuration: Dict = {},
                      suffix: Optional[str] = None,
                      force_add_to_run: bool = False,
                      countryiso3s: Optional[List[str]] = None) -> List[str]

[view_source]

Add multiple configurable scrapers to the run. If running specific scrapers rather than all, set force_add_to_run to True to include these scrapers in the run regardless of which scrapers were specified.

Arguments:

  • configuration Dict - Mapping from scraper name to information about datasets
  • level str - Can be national, subnational or single
  • adminlevel Optional[AdminLevel] - AdminLevel object from HDX Python Country. Defaults to None.
  • level_name Optional[str] - Customised level name. Defaults to None (use level).
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • suffix Optional[str] - Suffix to add to the scraper name
  • force_add_to_run bool - Whether to force include the scraper in the next run
  • countryiso3s Optional[List[str]] - Override list of country iso3s. Defaults to None.

Returns:

  • List[str] - scraper names (including suffix if set)
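
The plural form takes a mapping from scraper name to datasetinfo; reusing the illustrative datasetinfo from the previous sketch:

configuration = {"population": datasetinfo}
names = runner.add_configurables(configuration, "national")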

add_timeseries_scraper

def add_timeseries_scraper(name: str,
                           datasetinfo: Dict,
                           outputs: Dict[str, BaseOutput],
                           force_add_to_run: bool = False) -> str

[view_source]

Add a time series scraper to the run. If running specific scrapers rather than all, set force_add_to_run to True to include this scraper in the run regardless of which scrapers were specified.

Arguments:

  • name str - Name of scraper
  • datasetinfo Dict - Information about dataset
  • outputs Dict[str, BaseOutput] - Mapping from names to output objects
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • str - scraper name (including suffix if set)
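
A sketch of adding a time series scraper, assuming timeseries_datasetinfo is a datasetinfo dictionary for the time series dataset and that the BaseOutput objects have been constructed elsewhere:

# outputs maps names to already-constructed BaseOutput instances,
# for example an Excel writer and a JSON writer
outputs = {"excel": excel_output, "json": json_output}
name = runner.add_timeseries_scraper(
    "conflict_events", timeseries_datasetinfo, outputs
)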

add_timeseries_scrapers

def add_timeseries_scrapers(configuration: Dict,
                            outputs: Dict[str, BaseOutput],
                            force_add_to_run: bool = False) -> List[str]

[view_source]

Add multiple time series scrapers to the run. If running specific scrapers rather than all, set force_add_to_run to True to include these scrapers in the run regardless of which scrapers were specified.

Arguments:

  • configuration Dict - Mapping from scraper name to information about datasets
  • outputs Dict[str, BaseOutput] - Mapping from names to output objects
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • List[str] - scraper names (including suffix if set)

create_aggregator

def create_aggregator(
        use_hxl: bool,
        header_or_hxltag: str,
        datasetinfo: Dict,
        input_level: str,
        output_level: str,
        adm_aggregation: Union[Dict, List],
        source_configuration: Dict = {},
        names: Optional[ListTuple[str]] = None,
        overrides: Dict[str, Dict] = {},
        aggregation_scrapers: List["Aggregator"] = []
) -> Optional["Aggregator"]

[view_source]

Create an aggregator.

Arguments:

  • use_hxl bool - Whether keys should be HXL hashtags or column headers
  • header_or_hxltag str - Column header or HXL hashtag depending on use_hxl
  • datasetinfo Dict - Information about dataset
  • input_level str - Input level to aggregate like national or subnational
  • output_level str - Output level of aggregated data like regional
  • adm_aggregation Union[Dict, List] - Mapping from input admins to aggregated output admins
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None.
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
  • aggregation_scrapers List["Aggregator"] - Other aggregations needed. Defaults to [].

Returns:

  • Optional["Aggregator"] - scraper or None

add_aggregator

def add_aggregator(use_hxl: bool,
                   header_or_hxltag: str,
                   datasetinfo: Dict,
                   input_level: str,
                   output_level: str,
                   adm_aggregation: Union[Dict, List],
                   source_configuration: Dict = {},
                   names: Optional[ListTuple[str]] = None,
                   overrides: Dict[str, Dict] = {},
                   aggregation_scrapers: List["Aggregator"] = [],
                   force_add_to_run: bool = False) -> Optional[str]

[view_source]

Add an aggregator to the run. The mapping adm_aggregation from input admins to aggregated output admins takes the form {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, it is instead a list of input admins such as ("AFG", "MMR"). If running specific scrapers rather than all, set force_add_to_run to True to include this scraper in the run regardless of which scrapers were specified.

Arguments:

  • use_hxl bool - Whether keys should be HXL hashtags or column headers
  • header_or_hxltag str - Column header or HXL hashtag depending on use_hxl
  • datasetinfo Dict - Information about dataset
  • input_level str - Input level to aggregate like national or subnational
  • output_level str - Output level of aggregated data like regional
  • adm_aggregation Union[Dict, List] - Mapping from input admins to aggregated output admins
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None.
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
  • aggregation_scrapers List["Aggregator"] - Other aggregations needed. Defaults to [].
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • Optional[str] - scraper name (including suffix if set) or None
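
A sketch following the mapping form given above, aggregating national population into a regional figure. The action key in datasetinfo is assumed from the library's aggregation configuration convention and should be treated as illustrative:

adm_aggregation = {"AFG": ("ROAP",), "MMR": ("ROAP",)}  # national -> region
name = runner.add_aggregator(
    use_hxl=True,
    header_or_hxltag="#population",
    datasetinfo={"action": "sum"},  # assumed aggregation configuration
    input_level="national",
    output_level="regional",
    adm_aggregation=adm_aggregation,
)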

add_aggregators

def add_aggregators(use_hxl: bool,
                    configuration: Dict,
                    input_level: str,
                    output_level: str,
                    adm_aggregation: Union[Dict, ListTuple],
                    source_configuration: Dict = {},
                    names: Optional[ListTuple[str]] = None,
                    overrides: Dict[str, Dict] = {},
                    force_add_to_run: bool = False) -> List[str]

[view_source]

Add multiple aggregators to the run. The mapping adm_aggregation from input admins to aggregated output admins takes the form {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, it is instead a list of input admins such as ("AFG", "MMR"). If running specific scrapers rather than all, set force_add_to_run to True to include these scrapers in the run regardless of which scrapers were specified.

Arguments:

  • use_hxl bool - Whether keys should be HXL hashtags or column headers
  • configuration Dict - Mapping from scraper name to information about datasets
  • input_level str - Input level to aggregate like national or subnational
  • output_level str - Output level of aggregated data like regional
  • adm_aggregation Union[Dict, ListTuple] - Mapping from input admins to aggregated output admins
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None.
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • List[str] - scraper names (including suffix if set)

add_resource_downloader

def add_resource_downloader(datasetinfo: Dict,
                            folder: str = "",
                            force_add_to_run: bool = False) -> str

[view_source]

Add a resource downloader to the run. If running specific scrapers rather than all, set force_add_to_run to True to include this scraper in the run regardless of which scrapers were specified.

Arguments:

  • datasetinfo Dict - Information about dataset
  • folder str - Folder to which to download. Defaults to "".
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • str - scraper name (including suffix if set)

add_resource_downloaders

def add_resource_downloaders(configuration: Dict,
                             folder: str = "",
                             force_add_to_run: bool = False) -> List[str]

[view_source]

Add multiple resource downloaders to the run. If running specific scrapers rather than all, set force_add_to_run to True to include these scrapers in the run regardless of which scrapers were specified.

Arguments:

  • configuration Dict - Mapping from scraper name to information about datasets
  • folder str - Folder to which to download. Defaults to "".
  • force_add_to_run bool - Whether to force include the scraper in the next run

Returns:

  • List[str] - scraper names (including suffix if set)

prioritise_scrapers

def prioritise_scrapers(scraper_names: ListTuple[str]) -> None

[view_source]

Set certain scrapers to run first

Arguments:

  • scraper_names ListTuple[str] - Names of scrapers to run first

Returns:

None

get_scraper_names

def get_scraper_names() -> List[str]

[view_source]

Get names of scrapers

Returns:

  • List[str] - Names of scrapers

get_scraper

def get_scraper(name: str) -> Optional[BaseScraper]

[view_source]

Get the scraper with the given name.

Arguments:

  • name str - Name of scraper

Returns:

  • Optional[BaseScraper] - Scraper or None if there is no scraper with given name

get_scraper_exception

def get_scraper_exception(name: str) -> BaseScraper

[view_source]

Get the scraper with the given name, raising an exception if no scraper with that name exists.

Arguments:

  • name str - Name of scraper

Returns:

  • BaseScraper - Scraper

delete_scraper

def delete_scraper(name: str) -> bool

[view_source]

Delete the scraper with the given name.

Arguments:

  • name str - Name of scraper

Returns:

  • bool - True if the scraper was present, False if not

add_instance_variables

def add_instance_variables(name: str, **kwargs: Any) -> None

[view_source]

Add instance variables to the scraper instance with the given name.

Arguments:

  • name str - Name of scraper
  • **kwargs - Instance variable name-value pairs to add to the scraper instance

Returns:

None

add_pre_run

def add_pre_run(name: str, fn: Callable[[BaseScraper], None]) -> None

[view_source]

Add a pre-run instance method to the scraper instance with the given name. The function should take one parameter. Since it is added as an instance method, that parameter will be self and hence of type BaseScraper. The function does not need to return anything.

Arguments:

  • name str - Name of scraper
  • fn Callable[[BaseScraper], None] - Function to call pre run

Returns:

None

add_post_run

def add_post_run(name: str, fn: Callable[[BaseScraper], None]) -> None

[view_source]

Add a post-run instance method to the scraper instance with the given name. The function should take one parameter. Since it is added as an instance method, that parameter will be self and hence of type BaseScraper. The function does not need to return anything.

Arguments:

  • name str - Name of scraper
  • fn Callable[[BaseScraper], None] - Function to call post run

Returns:

None

run_one

def run_one(name: str, force_run: bool = False) -> bool

[view_source]

Run the scraper with the given name, adding sources and population to the global dictionary. If the scraper run fails and fallbacks have been set up, they are used.

Arguments:

  • name str - Name of scraper
  • force_run bool - Force run even if scraper marked as already run

Returns:

  • bool - True if the scraper was run, False if not

run_scraper

def run_scraper(name: str, force_run: bool = False) -> bool

[view_source]

Check that the scraper with the given name is in the list of scrapers to run. If it is not, return False; otherwise run it (force running it even if it has already run when force_run is True), adding sources and population to the global dictionary. If the scraper run fails and fallbacks have been set up, they are used.

Arguments:

  • name str - Name of scraper
  • force_run bool - Force run even if scraper marked as already run

Returns:

  • bool - True if the scraper was run, False if not

run

def run(what_to_run: Optional[ListTuple[str]] = None,
        force_run: bool = False,
        prioritise_scrapers: Optional[ListTuple[str]] = None) -> None

[view_source]

Run scrapers, limiting to those in what_to_run if given (force running scrapers that have already run if force_run is True) and adding sources and population to the global dictionary. Scrapers given by prioritise_scrapers are run first. If a scraper run fails and fallbacks have been set up, they are used.

Arguments:

  • what_to_run Optional[ListTuple[str]] - Run only these scrapers. Defaults to None (run all).
  • force_run bool - Force run even if any scraper marked as already run
  • prioritise_scrapers Optional[ListTuple[str]] - Scrapers to run first. Defaults to None.

Returns:

None
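
A sketch of a typical run; the scraper names are illustrative:

# Run only two scrapers, with population going first
runner.run(
    what_to_run=("population", "food_prices"),
    prioritise_scrapers=("population",),
)
# Or run everything that was added, re-running scrapers already marked as run
runner.run(force_run=True)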

set_not_run

def set_not_run(name: str) -> None

[view_source]

Set the scraper with the given name as not run.

Arguments:

  • name str - Name of scraper

Returns:

None

set_not_run_many

def set_not_run_many(names: ListTuple[str]) -> None

[view_source]

Set the scrapers with the given names as not run.

Arguments:

  • names ListTuple[str] - Names of scrapers

Returns:

None

get_headers

def get_headers(names: Optional[ListTuple[str]] = None,
                levels: Optional[ListTuple[str]] = None,
                headers: Optional[ListTuple[str]] = None,
                hxltags: Optional[ListTuple[str]] = None,
                overrides: Dict[str, Dict] = {}) -> Dict[str, Tuple]

[view_source]

Get the headers for scrapers, limiting to those in names if given and further to any scrapers_to_run passed to the constructor. All levels are returned unless the levels parameter (which can contain levels like national, subnational or single) is passed. The returned dictionary can be limited to given headers or hxltags.

Arguments:

  • names Optional[ListTuple[str]] - Names of scrapers
  • levels Optional[ListTuple[str]] - Levels to get like national, subnational or single
  • headers Optional[ListTuple[str]] - Headers to get
  • hxltags Optional[ListTuple[str]] - HXL hashtags to get
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.

Returns:

Dict[str, Tuple]: Dictionary that maps each level to (list of headers, list of hxltags)
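
For example, to fetch just the national headers:

headers_by_level = runner.get_headers(levels=("national",))
column_headers, hxltags = headers_by_level["national"]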

get_results

def get_results(
        names: Optional[ListTuple[str]] = None,
        levels: Optional[ListTuple[str]] = None,
        overrides: Dict[str, Dict] = {},
        has_run: bool = True,
        should_overwrite_sources: Optional[bool] = None) -> Dict[str, Dict]

[view_source]

Get the results (headers, values and sources) for scrapers, limiting to those in names if given and further to any scrapers_to_run passed to the constructor. All levels are returned unless the levels parameter (which can contain levels like national, subnational or single) is passed. Sometimes it may be necessary to map alternative level names to levels; this can be done using overrides, a dictionary whose keys are scraper names and whose values are dictionaries mapping level names to output levels. By default only scrapers marked as having run are returned, unless has_run is set to False. The results dictionary has a key for each output level, whose value is a dictionary with keys headers, values, sources and fallbacks. Headers is a tuple of (column headers, HXL hashtags). Values, sources and fallbacks are all lists.

Arguments:

  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
  • levels Optional[ListTuple[str]] - Levels to get like national, subnational or single
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
  • has_run bool - Only get results for scrapers marked as having run. Defaults to True.
  • should_overwrite_sources Optional[bool] - Whether to overwrite sources. Defaults to None (class default).

Returns:

Dict[str, Dict]: Results dictionary that maps each level to headers, values, sources, fallbacks.
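
A sketch of consuming the results dictionary described above:

results = runner.get_results()
for level, result in results.items():
    column_headers, hxltags = result["headers"]  # (column headers, HXL hashtags)
    values = result["values"]        # list, one entry per output column
    sources = result["sources"]      # list of source information
    fallbacks = result["fallbacks"]  # list of any fallbacks used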

get_rows

def get_rows(level: str,
             adms: ListTuple[str],
             headers: ListTuple[ListTuple] = (tuple(), tuple()),
             row_fns: ListTuple[Callable[[str], str]] = tuple(),
             names: Optional[ListTuple[str]] = None,
             overrides: Dict[str, Dict] = {}) -> List[List]

[view_source]

Get rows for a given level for scrapers, limiting to those in names if given. Rows include the header row, the HXL hashtag row and one value row per admin unit specified in the adms parameter. Additional columns can be included by specifying headers and row_fns. Headers are of the form (list of headers, list of HXL hashtags); row_fns are functions that accept an admin unit and return a string. Sometimes it may be necessary to map alternative level names to levels; this can be done using overrides, a dictionary whose keys are scraper names and whose values are dictionaries mapping level names to output levels.

Arguments:

  • level str - Level to get like national, subnational or single
  • adms ListTuple[str] - Admin units
  • headers ListTuple[ListTuple] - Additional headers in the form (list of headers, list of HXL hashtags)
  • row_fns ListTuple[Callable[[str], str]] - Functions to populate additional columns
  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.

Returns:

  • List[List] - Rows for a given level
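
A sketch producing national rows with one extra computed column; the country_name helper is hypothetical:

def country_name(iso3):
    # Hypothetical helper mapping an ISO3 code to a display name
    return {"AFG": "Afghanistan", "MMR": "Myanmar"}.get(iso3, iso3)

rows = runner.get_rows(
    "national",
    ("AFG", "MMR"),
    headers=(("Country",), ("#country+name",)),
    row_fns=(country_name,),
)
# rows[0] is the header row, rows[1] the HXL hashtag row,
# followed by one value row per admin unit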

get_values_sourcesinfo_by_header

def get_values_sourcesinfo_by_header(
        level: str,
        names: Optional[ListTuple[str]] = None,
        overrides: Dict[str, Dict] = {},
        has_run: bool = True,
        use_hxl: bool = True) -> Tuple[Dict, Dict]

[view_source]

Get a mapping from headers to values and from headers to sources information for a given level for scrapers, limiting to those in names if given. Keys will be column headers if use_hxl is False or HXL hashtags if use_hxl is True. Sometimes it may be necessary to map alternative level names to levels; this can be done using overrides, a dictionary whose keys are scraper names and whose values are dictionaries mapping level names to output levels.

Arguments:

  • level str - Level to get like national, subnational or single
  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
  • overrides Dict[str, Dict] - Dictionary mapping scrapers to level mappings. Defaults to {}.
  • has_run bool - Only get results for scrapers marked as having run. Defaults to True.
  • use_hxl bool - Whether keys should be HXL hashtags or column headers. Defaults to True.

Returns:

Tuple[Dict, Dict]: Tuple of (headers to values, headers to sources)

get_sources

def get_sources(
        names: Optional[ListTuple[str]] = None,
        levels: Optional[ListTuple[str]] = None,
        additional_sources: ListTuple[Dict] = tuple(),
        should_overwrite_sources: Optional[bool] = None) -> List[Tuple]

[view_source]

Get sources for scrapers, limiting to those in names if given. All levels are returned unless the levels parameter (which can contain levels like national, subnational or single) is passed. Additional sources can be added: each is a dictionary with an indicator (specified as an HXL hashtag), either a dataset or a source and source_url, and either a source_date or whether to force_date_today.

Arguments:

  • names Optional[ListTuple[str]] - Names of scrapers
  • levels Optional[ListTuple[str]] - Levels to get like national, subnational or single
  • additional_sources ListTuple[Dict] - Additional sources to add
  • should_overwrite_sources Optional[bool] - Whether to overwrite sources. Defaults to None (class default).

Returns:

  • List[Tuple] - Sources in form (indicator, date, source, source_url)
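
A sketch of supplying one additional source using the keys described above; the values are illustrative:

additional_sources = (
    {
        "indicator": "#population",
        "source": "Example Organisation",
        "source_url": "https://example.org/data",
        "source_date": "2023-01-01",
    },
)
sources = runner.get_sources(additional_sources=additional_sources)
for indicator, date, source, source_url in sources:
    print(indicator, date, source, source_url)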

get_source_urls

def get_source_urls(names: Optional[ListTuple[str]] = None) -> List[str]

[view_source]

Get source URLs for scrapers, limiting to those in names if given.

Arguments:

  • names Optional[ListTuple[str]] - Names of scrapers

Returns:

  • List[str] - List of source URLs

get_hapi_metadata

def get_hapi_metadata(names: Optional[ListTuple[str]] = None,
                      has_run: bool = True) -> Dict

[view_source]

Get HAPI metadata for all datasets. A dictionary is returned that maps dataset ids to dictionaries. Each dictionary contains dataset metadata along with a resources key, under which is a dictionary mapping resource ids to resource metadata.

Arguments:

  • names Optional[ListTuple[str]] - Names of scrapers
  • has_run bool - Only get results for scrapers marked as having run. Defaults to True.

Returns:

  • Dict - HAPI metadata for all datasets

get_hapi_results

def get_hapi_results(names: Optional[ListTuple[str]] = None,
                     has_run: bool = True) -> Dict

[view_source]

Get the results (headers and values per admin level, plus HAPI metadata) for scrapers, limiting to those in names if given and further to any scrapers_to_run passed to the constructor. By default, only scrapers marked as having run are returned unless has_run is set to False.

A dictionary is returned where each key is an HDX dataset id and each value is a dictionary containing HAPI dataset metadata as well as a results key. The value associated with the results key is a dictionary keyed by admin level. Each admin level maps to a dictionary with headers, values and HAPI resource metadata. Headers is a tuple of (column headers, HXL hashtags). Values is a list. HAPI resource metadata is a dictionary.

Arguments:

  • names Optional[ListTuple[str]] - Names of scrapers. Defaults to None (all scrapers).
  • has_run bool - Only get results for scrapers marked as having run. Defaults to True.

Returns:

  • Dict - Headers and values per admin level and HAPI metadata for all datasets
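
A sketch of walking the structure described above:

hapi_results = runner.get_hapi_results()
for dataset_id, dataset in hapi_results.items():
    # dataset holds HAPI dataset metadata plus a "results" key
    for admin_level, result in dataset["results"].items():
        column_headers, hxltags = result["headers"]
        values = result["values"]
        # result also carries HAPI resource metadata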