hdx.scraper.configurable.scraper

ConfigurableScraper Objects

class ConfigurableScraper(BaseScraper)

Each configurable scraper is configured from dataset information that can come, for example, from a YAML file. When run, it works out headers and values. It also overrides add_sources, in which sources are compiled and returned. If dealing with subnational data, adminlevel must be supplied.

Arguments:

  • name str - Name of scraper
  • datasetinfo Dict - Information about dataset
  • level str - Can be national, subnational or single
  • countryiso3s List[str] - List of ISO3 country codes to process
  • adminlevel Optional[AdminLevel] - AdminLevel object from HDX Python Country. Defaults to None.
  • level_name Optional[str] - Customised level name. Defaults to None (the value of level is used).
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • today datetime - Value to use for today. Defaults to now_utc().
  • errors_on_exit Optional[ErrorsOnExit] - ErrorsOnExit object that logs errors on exit
  • **kwargs - Variables to use when evaluating template arguments in URLs
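
A construction sketch follows. The datasetinfo keys shown are illustrative assumptions about the configuration schema rather than a definitive reference, and the library's reader must already have been set up for the scraper to read from.

from hdx.scraper.configurable.scraper import ConfigurableScraper

# The datasetinfo keys below are assumptions for illustration only.
datasetinfo = {
    "source": "Example Org",
    "url": "https://example.com/population.csv",
    "format": "csv",
    "admin": ["#country+code"],     # assumed admin column specification
    "input": ["#population"],       # assumed input columns
    "output": ["Population"],       # assumed output headers
    "output_hxl": ["#population"],  # assumed output hashtags
}
scraper = ConfigurableScraper(
    name="population",
    datasetinfo=datasetinfo,
    level="national",
    countryiso3s=["AFG", "MMR"],
)
scraper.run()  # assumes the preconfigured reader has been set up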

get_subsets_from_datasetinfo

@staticmethod
def get_subsets_from_datasetinfo(datasetinfo: Dict) -> List[Dict]

Get subsets from dataset information

Arguments:

  • datasetinfo Dict - Information about dataset

Returns:

  • List[Dict] - List of subsets

get_iterator

def get_iterator() -> Tuple[List[str], Iterator[Dict]]

Get the iterator from the preconfigured reader for this scraper

Returns:

  • Tuple[List[str], Iterator[Dict]] - Tuple (headers, iterator where each row is a dictionary)

add_sources

def add_sources() -> None

Add source for each HXL hashtag

Returns:

None

read_hxl

def read_hxl(iterator: Iterator[Dict]) -> Optional[Dict[str, str]]

Read HXL tags if use_hxl is True and return the mapping as a dictionary. If use_hxl is False, return None.

Arguments:

  • iterator Iterator[Dict] - Iterator where each row is a dictionary

Returns:

  • Optional[Dict[str, str]] - Dictionary mapping from headers to HXL hashtags, or None

use_hxl

def use_hxl(headers, file_headers: List[str],
            iterator: Iterator[Dict]) -> Optional[Dict]

If the configurable scraper configuration defines that HXL is used (use_hxl is True), then read the mapping from headers to HXL hashtags. Since each row is a dictionary from header to value, the HXL row will be a dictionary from header to HXL hashtag. Label country and adm1 columns as admin columns. If the input columns to use are not specified, use all columns that have HXL hashtags. If the output column headers or hashtags are not specified, use the ones from the original file.

Arguments:

  • file_headers List[str] - List of all headers of input file
  • iterator Iterator[Dict] - Iterator over the rows

Returns:

  • Optional[Dict] - Dictionary that maps from header to HXL hashtag or None
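
For instance (with invented headers), the returned mapping might look like this:

# Hypothetical file headers and the resulting header-to-hashtag mapping:
file_headers = ["Country ISO3", "Admin1 Name", "Population"]
header_to_hxltag = {
    "Country ISO3": "#country+code",  # labelled as an admin column
    "Admin1 Name": "#adm1+name",      # labelled as an admin column
    "Population": "#population",
}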

run_scraper

def run_scraper(iterator: Iterator[Dict]) -> None

Run one configurable scraper given an iterator over the rows

Arguments:

  • iterator Iterator[Dict] - Iterator over the rows

Returns:

None

run

def run() -> None

Runs one configurable scraper given dataset information

Returns:

None

hdx.scraper.configurable.rowparser

RowParser Objects

class RowParser()

RowParser class for parsing each row.

Arguments:

  • name str - Name of scraper
  • countryiso3s List[str] - List of ISO3 country codes to process
  • adminlevel Optional[AdminLevel] - AdminLevel object from HDX Python Country library
  • level str - Can be national, subnational or single
  • datelevel str - Can be global, regional, national or subnational
  • today datetime - Date today
  • datasetinfo Dict - Dictionary of information about dataset
  • headers List[str] - Row headers
  • header_to_hxltag Optional[Dict[str, str]] - Mapping from headers to HXL hashtags or None
  • subsets List[Dict] - List of subset definitions
  • maxdateonly bool - Whether to only take the most recent date. Defaults to True.

read_external_filter

def read_external_filter(external_filter: Optional[Dict]) -> None

Read the filter list from an external URL pointing to an HXLated file

Arguments:

  • external_filter Optional[Dict] - External filter information in dictionary

Returns:

None

get_filter_str_for_eval

def get_filter_str_for_eval(filter: str) -> str

Replace variables in the filter string with the corresponding column values from a row of data

Arguments:

  • filter str - Filter string

Returns:

  • str - Filter string with variables replaced
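
A conceptual sketch of the replacement, using a made-up filter and row; the method's actual variable syntax is not specified here, so treat the replacement rule as an assumption:

# Conceptual only: rewrite column references so that the filter string
# can be evaluated against a row dictionary. A naive substring replace
# is used purely for illustration.
row = {"#country+code": "AFG", "#affected+infected": "10"}
filter_str = "#affected+infected != ''"
for header in row:
    filter_str = filter_str.replace(header, f"row['{header}']")
# filter_str is now "row['#affected+infected'] != ''"
assert eval(filter_str) is True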

filter_sort_rows

def filter_sort_rows(iterator: Iterator[Dict]) -> Iterator[Dict]

Apply prefilter and sort the input data before processing. If date_col is specified along with any of sum or process, and sorting is not specified, then apply a sort by date to ensure correct results.

Arguments:

  • iterator Iterator[Dict] - Input data

Returns:

  • Iterator[Dict] - Input data with prefilter applied if specified and sorted if specified or deemed necessary
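
The fallback sorting rule can be sketched as follows; the key names ("sort", "date_col", "sum", "process") are assumed names from the configuration schema, used only to illustrate the logic:

# Sketch of the rule described above, not the library's actual code.
def needs_date_sort(datasetinfo: dict, subsets: list) -> bool:
    if datasetinfo.get("sort"):
        return False  # explicit sorting takes precedence
    if not datasetinfo.get("date_col"):
        return False  # no date column, nothing to sort by
    # sort by date when any subset sums or processes values
    return any("sum" in subset or "process" in subset for subset in subsets)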

flatten

def flatten(row: Dict) -> Generator[Dict, None, None]

Flatten a wide spreadsheet format into a long one

Arguments:

  • row Dict - Row to flatten

Returns:

  • Generator[Dict] - Flattened row(s)
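
A minimal sketch of the wide-to-long idea; the column names and output keys are invented for illustration:

from typing import Dict, Generator

# Illustrative only: one wide row with several value columns becomes
# several long rows, one per value column.
def flatten_row(row: Dict, value_cols) -> Generator[Dict, None, None]:
    for col in value_cols:
        yield {"indicator": col, "value": row[col]}

wide_row = {"Cholera": 10, "Measles": 3}
print(list(flatten_row(wide_row, ["Cholera", "Measles"])))
# [{'indicator': 'Cholera', 'value': 10}, {'indicator': 'Measles', 'value': 3}]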

get_maxdate

def get_maxdate() -> datetime

Get the most recent date of the rows so far

Returns:

  • datetime - Most recent date in processed rows

filtered

def filtered(row: Dict) -> bool

Check if the row should be filtered out

Arguments:

  • row Dict - Row to check for filters

Returns:

  • bool - Whether row is filtered out or not

parse

def parse(row: Dict) -> Tuple[Optional[str], Optional[List[bool]]]

Parse row checking for valid admin information and if the row should be filtered out in each subset given its definition.

Arguments:

  • row Dict - Row to parse

Returns:

  • Tuple[Optional[str], Optional[List[bool]]] - (admin name, should process subset list) or (None, None)

hdx.scraper.configurable.resource_downloader

ResourceDownloader Objects

class ResourceDownloader(BaseScraper)

Each resource downloader is configured from dataset information that can come, for example, from a YAML file. When run, it downloads the resource described in the dataset information from HDX and puts it in the given folder.

Arguments:

  • datasetinfo Dict - Information about dataset
  • folder str - Folder to which to download. Defaults to "".
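
A usage sketch, where the datasetinfo keys identifying the HDX dataset and resource are assumptions for illustration:

from hdx.scraper.configurable.resource_downloader import ResourceDownloader

# The keys below are illustrative; consult the configuration schema for
# the actual names expected in datasetinfo.
datasetinfo = {
    "dataset": "example-hdx-dataset",  # assumed: HDX dataset to read
    "resource": "data.xlsx",           # assumed: resource to download
}
downloader = ResourceDownloader(datasetinfo, folder="saved_data")
downloader.run()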

run

def run() -> None

Runs one resource downloader given dataset information

Returns:

None

add_sources

def add_sources() -> None

Add source for resource download

Returns:

None

hdx.scraper.configurable.timeseries

TimeSeries Objects

class TimeSeries(BaseScraper)

Each time series scraper is configured from dataset information that can come, for example, from a YAML file. When run, it populates the given outputs with time series data. It also overrides add_sources, in which sources are compiled and returned.

Arguments:

  • name str - Name of scraper
  • datasetinfo Dict - Information about dataset
  • outputs Dict[str, BaseOutput] - Mapping from names to output objects
  • today datetime - Value to use for today. Defaults to now_utc().
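
A short usage sketch follows. The BaseOutput import path and constructor argument are assumptions, as are the datasetinfo keys; real code would use a concrete output class rather than the base class.

from hdx.scraper.configurable.timeseries import TimeSeries
from hdx.scraper.outputs.base import BaseOutput  # assumed import path

# BaseOutput is a stand-in here; the constructor argument (a list of
# tab names to update) is an assumption for illustration.
outputs = {"example": BaseOutput(["mytab"])}
timeseries = TimeSeries(
    name="example_timeseries",
    datasetinfo={"url": "https://example.com/data.csv", "format": "csv"},
    outputs=outputs,
)
timeseries.run()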

run

def run() -> None

Runs one time series scraper given dataset information, writing to the outputs specified in the constructor

Returns:

None

add_sources

def add_sources() -> None

Add source for each HXL hashtag

Returns:

None

hdx.scraper.configurable.aggregator

Aggregator Objects

class Aggregator(BaseScraper)

Each aggregator is configured from dataset information that can come, for example, from a YAML file. When run, it works out headers and aggregated values. The mapping from input admins to aggregated output admins, adm_aggregation, is of the form {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, then it is a list of input admins such as ("AFG", "MMR"). The input_values are a mapping from headers (if use_hxl is False) or HXL hashtags (if use_hxl is True) to column values, expressed as a dictionary mapping from input admin to value. If any formulae require the result of other aggregators, these can be passed in using the aggregation_scrapers parameter.

Arguments:

  • name str - Name of aggregator
  • datasetinfo Dict - Information about dataset
  • adm_aggregation Union[Dict, ListTuple] - Mapping from input admins to aggregated output admins
  • headers Dict[str, Tuple] - Column headers and HXL hashtags
  • use_hxl bool - Whether to map from headers or from HXL tags
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • aggregation_scrapers List["Aggregator"] - Other aggregations needed. Defaults to [].
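
The data shapes described above, written out as Python literals (the numbers are placeholders):

# Mapping from input admins to aggregated output admins:
adm_aggregation = {"AFG": ("ROAP",), "MMR": ("ROAP",)}
# When aggregating to the top level, a list/tuple of input admins suffices:
adm_aggregation_top = ("AFG", "MMR")
# input_values: header (or HXL hashtag when use_hxl is True) mapped to a
# dictionary from input admin to value (values here are placeholders):
input_values = {"#population": {"AFG": 1000, "MMR": 2000}}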

get_scraper

@classmethod
def get_scraper(
        cls,
        use_hxl: bool,
        header_or_hxltag: str,
        datasetinfo: Dict,
        input_level: str,
        output_level: str,
        adm_aggregation: Union[Dict, ListTuple],
        input_headers: Tuple[ListTuple, ListTuple],
        source_configuration: Dict = {},
        aggregation_scrapers: List["Aggregator"] = []
) -> Optional["Aggregator"]

Gets one aggregator given dataset information; the aggregator computes headers, values and sources. The mapping from input admins to aggregated output admins, adm_aggregation, is of the form {"AFG": ("ROAP",), "MMR": ("ROAP",)}. If the mapping is to the top level, then it is a list of input admins such as ("AFG", "MMR"). The input_values are a mapping from headers or HXL hashtags to column values, expressed as a dictionary mapping from input admin to value.

Arguments:

  • use_hxl bool - Whether to map from headers or from HXL tags
  • header_or_hxltag str - Column header or HXL hashtag depending on use_hxl
  • datasetinfo Dict - Information about dataset
  • input_level str - Input level to aggregate like national or subnational
  • output_level str - Output level of aggregated data like regional
  • adm_aggregation Union[Dict, ListTuple] - Mapping from input admins to aggregated output admins
  • input_headers Tuple[ListTuple, ListTuple] - Column headers and HXL hashtags
  • runner Runner - Runner object
  • source_configuration Dict - Configuration for sources. Defaults to empty dict (use defaults).
  • aggregation_scrapers List["Aggregator"] - Other aggregations needed. Defaults to [].

Returns:

  • Optional["Aggregator"] - The aggregation scraper or None if it couldn't be created

get_float_or_int

@staticmethod
def get_float_or_int(valuestr: str) -> Union[float, int, None]

Convert value string to float, int or None

Arguments:

  • valuestr str - Value string

Returns:

  • Union[float, int, None] - Converted value

get_numeric

@classmethod
def get_numeric(cls, valueinput: Any) -> Union[str, float, int]

Convert value input to float or int. Values in pipe-separated strings are summed. If any value in a pipe-separated string is empty, an empty string is returned.

Arguments:

  • valueinput Any - Value input

Returns:

  • Union[str, float, int] - Converted value
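
A sketch of the documented behaviour of get_float_or_int and get_numeric combined; error handling for non-numeric strings is omitted and the details are assumptions, not the library's actual code:

from typing import Any, Union

def get_float_or_int(valuestr: str) -> Union[float, int, None]:
    # empty input yields None; otherwise prefer int over float
    if not valuestr:
        return None
    value = float(valuestr)
    return int(value) if value.is_integer() else value

def get_numeric(valueinput: Any) -> Union[str, float, int]:
    if isinstance(valueinput, str):
        total = 0
        for valuestr in valueinput.split("|"):
            value = get_float_or_int(valuestr)
            if value is None:
                return ""  # any empty component yields an empty string
            total += value
        return total
    return valueinput  # already numeric

print(get_numeric("3|4.5"))  # 7.5
print(get_numeric("3||4"))   # empty string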

process

def process(output_level: str, output_values: Dict) -> None

Perform aggregation, putting the results in output_values

Arguments:

  • output_level str - Output level of aggregated data like regional
  • output_values Dict - Mapping from admin name to value

Returns:

None

run

def run() -> None

Runs one aggregator given dataset information

Returns:

None

add_sources

def add_sources() -> None

There is no need to add any sources since the disaggregated values should already have sources

Returns:

None