Overview
- Develop a legislative data extraction and processing engine.
- The client's business uses legislative and regulatory information from federal and state governments to build a legislative search platform.
- Scope includes US federal, all US states, and 20+ countries with their states/provinces.
Solution
The system comprises web scrapers that extract data from dynamic-content web pages and deliver it to a relational database.
Scrapers for different countries were delivered after thorough testing and are already deployed in production.
- Extraction: Data harvesting through web scraping of source web pages or through web APIs (where available).
- Transformation: Normalizing inconsistent values, monitoring and cleansing “bad” data, and filtering and validating records.
- Loading: Loading refined and processed data into a database or data warehouse.
- Quality Assurance: Automated testing of loaded data supplemented with manual QA for added assurance.
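The four stages above can be illustrated with a minimal sketch. It is not the delivered implementation: the sample HTML, the `bills` table, its columns, and all function names are hypothetical. It uses Python's standard-library `sqlite3` as a stand-in for PostgreSQL and a static HTML string in place of a live dynamic page.

```python
import sqlite3
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch and render the source page.
SAMPLE_HTML = """
<table>
  <tr><td>HB 101</td><td> Clean Water Act Amendments </td><td>01/15/2024</td></tr>
  <tr><td>SB 7</td><td>School Funding</td><td>2024-02-30</td></tr>
  <tr><td></td><td>Untitled</td><td>03/01/2024</td></tr>
</table>
"""


class BillTableParser(HTMLParser):
    """Extraction: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract(html):
    parser = BillTableParser()
    parser.feed(html)
    return parser.rows


def parse_date(text):
    """Normalize the date formats assumed for this sketch to ISO 8601."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable or impossible date


def transform(rows):
    """Transformation: trim values, normalize dates, set aside bad records."""
    clean, rejected = [], []
    for row in rows:
        if len(row) != 3:
            rejected.append(row)
            continue
        bill_id, title, raw_date = (cell.strip() for cell in row)
        iso_date = parse_date(raw_date)
        if bill_id and title and iso_date:
            clean.append((bill_id, title, iso_date))
        else:
            rejected.append(row)
    return clean, rejected


def load(conn, records):
    """Loading: upsert the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills ("
        "bill_id TEXT PRIMARY KEY, title TEXT NOT NULL, introduced TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO bills VALUES (?, ?, ?)", records)
    conn.commit()


def quality_check(conn, expected):
    """Quality assurance: automated checks on what was actually loaded."""
    (count,) = conn.execute("SELECT COUNT(*) FROM bills").fetchone()
    (bad,) = conn.execute(
        "SELECT COUNT(*) FROM bills "
        "WHERE title = '' OR introduced NOT LIKE '____-__-__'"
    ).fetchone()
    return count == expected and bad == 0
```

In this sketch, rejected rows are returned rather than silently dropped, so they can be routed to the manual QA step mentioned above.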
Tech Stack
- Big Data/Hadoop, Hive, Impala, Oozie, Python, SQL, Tableau BI, UI/UX on ReactJS, Java Spring Boot, CI/CD (TeamCity/Jenkins/Hudson)
- Core technologies: Python, PostgreSQL, RabbitMQ, Apache Thrift