
Edited by Christopher Scranton, Keller Scholl 1037 days ago
Data Liberation Project
Lyre C
  • Who's spending dark money in NH elections?
  • 1000 lobbyists. Who pays them?
  • Who pays how much in independent expenditures?
  • Need help pulling files off a website, scraping PDFs and handwritten data, and representing the data
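For the file-pulling need above, a minimal sketch of one possible starting point, using only the Python standard library. It assumes the PDFs are linked from a single index page with ordinary <a href> links, which has not been checked against the actual state website. The URL in the usage note and all function names are placeholders.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect the targets of <a> tags whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare only the path, so query strings and letter case do not matter.
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return absolute PDF URLs found in html, deduplicated, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.pdf_urls))


def download_all(index_url, dest_dir):
    """Download every PDF linked from index_url that is not already saved."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with urlopen(index_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    saved = []
    for url in find_pdf_links(html, index_url):
        target = dest / Path(urlparse(url).path).name
        if target.exists():  # skip files fetched on an earlier run
            continue
        with urlopen(url) as resp:
            target.write_bytes(resp.read())
        saved.append(target)
    return saved
```

Calling download_all("https://example.org/filings/index.html", "pdfs") with the real index URL substituted would fetch only files not yet present in the folder, so the same call can be repeated periodically. If the site spreads files across several pages or builds links with JavaScript, this approach would need to be extended.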
Christopher S
Project Goals:
  1. Secure downloads of the source PDF files from the state website (approx. 1,000 files) and aggregate them into a centralized online repository
     1. Identify the files and the file structure of the website
     2. Build a tool to automatically scrape the website, download the files, and deposit them individually in an online location
     3. Ensure that the scrape tool can be run periodically, potentially by a novice computer user ("1-click trigger"?)
  2. Process PDF data into machine-readable data
     1. Identify the types and numbers of (1) PDFs with printed answers (OCR potential) and (2) PDFs with handwritten answers (OCR challenged)
     2. Identify options for OCR tools that can process Type 1 "Printed Answers" PDFs
     3. Run OCR on all documents, check for errors, and identify initial counts of Type 1 vs. Type 2 PDFs
     4. For Type 2 "Handwritten Answers" PDFs, either
        1. identify robust handwriting-recognition tools and test them on samples, or
        2. collect Type 2 PDFs for crowd-sourced (min. double-redundant) human data entry
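The double-redundant data entry option in the goals above could be checked with something like the following sketch. It assumes each volunteer's transcription of a form arrives as a dictionary of string fields. The field names in the usage note are hypothetical, and the matching rule (ignore case and extra whitespace) is only one possible choice.

```python
def _norm(value):
    """Collapse runs of whitespace and ignore case for comparison."""
    return " ".join(value.split()).casefold()


def reconcile(entry_a, entry_b):
    """Compare two independent transcriptions of the same form.

    Returns (agreed, disputed). agreed maps each field to the first
    entry's value when both entries match after normalization.
    disputed maps each remaining field to the pair of raw values,
    so a third person can review it.
    """
    agreed, disputed = {}, {}
    for field in sorted(set(entry_a) | set(entry_b)):
        a = entry_a.get(field)
        b = entry_b.get(field)
        if a is not None and b is not None and _norm(a) == _norm(b):
            agreed[field] = a.strip()
        else:
            disputed[field] = (a, b)
    return agreed, disputed
```

For example, reconcile({"lobbyist": "Jane Doe", "client": "Acme Corp"}, {"lobbyist": "jane  doe", "client": "Acme Corp."}) would accept the lobbyist field and flag the client field, because the trailing period makes the two values differ. A field that only one volunteer filled in is also flagged.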
Project Participants
  • Xanni Brown - xanni@opendemocracy.me
  • Scott
Keller S
  • Keller Scholl - Keller.scholl@gmail.com
