Overview

  • Goal: develop a legislative data extraction and processing engine.
  • The client’s business uses legislative and regulatory information from federal and state governments to build a legislative search platform.
  • Scope includes the US federal government, all US states, and 20+ countries with their states/provinces.

Solution

The system comprises web scrapers that extract data from dynamic-content web pages and deliver the extracted data to a relational database.

Scrapers for different countries were delivered after thorough testing and are already deployed in production.
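
As an illustration of the scraping step, the following is a minimal Python sketch that turns an HTML table into a list of records. It is not the delivered scraper: the page markup, the field names (bill-id, title), and the names BillTableParser and extract_bills are assumptions made for this example. It also assumes the HTML has already been rendered; dynamic pages would typically need a rendering step first.

```python
from html.parser import HTMLParser

# Hypothetical markup: the structure of the real source pages is not
# described in this case study.
SAMPLE_PAGE = """
<table id="bills">
  <tr><td class="bill-id">HB 101</td><td class="title">An Act relating to water rights</td></tr>
  <tr><td class="bill-id">SB 22</td><td class="title">An Act relating to school funding</td></tr>
</table>
"""


class BillTableParser(HTMLParser):
    """Collects one dict per table row, keyed by each cell's class attribute."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._row = None    # dict for the row being read, or None outside <tr>
        self._field = None  # class of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        # Only text inside a classed <td> is kept; whitespace between tags is ignored.
        if self._row is not None and self._field:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.records.append(self._row)
            self._row = None


def extract_bills(html):
    """Parse already-rendered HTML and return the extracted rows as dicts."""
    parser = BillTableParser()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    for record in extract_bills(SAMPLE_PAGE):
        print(record)
```

Keeping the parser separate from the fetching logic means the same extraction code could be exercised against saved page snapshots during testing.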

  • Extraction: Data harvesting through web scraping of source web pages or from web APIs (where available)
  • Transformation: Normalizes inconsistent values, monitors and cleanses “bad” data, and filters and validates records.
  • Loading: Loads refined and processed data into a database or a data warehouse.
  • Quality Assurance: Automated testing of loaded data supplemented with manual QA for added assurance.
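
The transformation, loading, and quality assurance stages above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the production pipeline: the record fields, the accepted date formats, the bills table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the relational database so that the example is self-contained.

```python
import sqlite3
from datetime import datetime

# Hypothetical raw records, as a scraper might emit them.
RAW_RECORDS = [
    {"bill_id": " hb 101 ", "title": "An Act relating to water rights", "introduced": "01/15/2023"},
    {"bill_id": "SB 22", "title": "  An Act relating to school funding ", "introduced": "2023-02-03"},
    {"bill_id": "", "title": "Row with no identifier", "introduced": "2023-02-04"},
    {"bill_id": "HB 7", "title": "Row with a bad date", "introduced": "sometime in March"},
]

# Assumed input date formats; real sources would likely need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(raw):
    """Normalize each record; return (clean_rows, rejected_records)."""
    clean, rejected = [], []
    for rec in raw:
        bill_id = " ".join(rec.get("bill_id", "").upper().split())
        title = " ".join(rec.get("title", "").split())
        introduced = parse_date(rec.get("introduced", ""))
        if bill_id and title and introduced:
            clean.append((bill_id, title, introduced))
        else:
            rejected.append(rec)  # kept aside for manual QA
    return clean, rejected


def load(conn, rows):
    """Insert clean rows, replacing any earlier row with the same bill_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills ("
        "bill_id TEXT PRIMARY KEY, title TEXT NOT NULL, introduced TEXT NOT NULL)"
    )
    # INSERT OR REPLACE is SQLite syntax; another database would use its own upsert.
    conn.executemany("INSERT OR REPLACE INTO bills VALUES (?, ?, ?)", rows)
    conn.commit()


def qa_check(conn, expected_count):
    """Automated check: row count matches and no blank key fields were loaded."""
    count = conn.execute("SELECT COUNT(*) FROM bills").fetchone()[0]
    blanks = conn.execute(
        "SELECT COUNT(*) FROM bills WHERE bill_id = '' OR title = ''"
    ).fetchone()[0]
    return count == expected_count and blanks == 0


def run_pipeline(raw):
    """Run transform, load, and QA; return (connection, clean, rejected, qa_ok)."""
    conn = sqlite3.connect(":memory:")
    clean, rejected = transform(raw)
    load(conn, clean)
    return conn, clean, rejected, qa_check(conn, len(clean))


if __name__ == "__main__":
    _, clean, rejected, ok = run_pipeline(RAW_RECORDS)
    print(len(clean), "loaded,", len(rejected), "rejected, QA passed:", ok)
```

Rejected records are returned rather than silently dropped, which leaves room for the manual QA pass mentioned above.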

Tech Stack

  • Big Data/Hadoop, Hive, Impala, Oozie, Python, SQL, Tableau BI, UI/UX on ReactJS, Java Spring Boot, CI/CD (TeamCity/Jenkins/Hudson)
  • Core technologies: Python, PostgreSQL, RabbitMQ, Apache Thrift