Position: PhD Candidate
Current Institution: University of California, Berkeley
Abstract: Helena: A Web Automation Language for End Users
Web data is revolutionizing the social sciences. Researchers envision a diverse range of studies facilitated by the unique properties of web data, including its scale, ecological validity, and timeliness. With the wide variety of web scripting libraries offered, programmers have access to increasing language support for collecting web data. However, these libraries are inaccessible to non-programmers, and empowering non-programmers to collect these datasets is a long-standing open problem. To democratize access to web data, we designed the Helena web automation language. Helena brings together the following key innovations, which together empower end users to write robust web scraping programs: (1) The Helena programming environment uses Programming by Demonstration (PBD), which makes scripts easy to write; the tool takes a single-shot learning approach, creating scripts based on recording a single interaction of the user with a set of webpages. Empirically, users can learn the tool and use it to write a robust large-scale scraping script in under 10 minutes, while programmers tackling the same task with the traditional Selenium library time out after an hour. (2) Helena’s adaptive replayer makes scripts robust to webpage redesigns and obfuscation, which enables longitudinal experiments. (3) Helena’s novel runtime can parallelize and distribute scraping programs for speedups over 50x, facilitating large-scale scraping. Our approach relied on novel insights into the web scraping domain but also on bringing new techniques to bear. By combining techniques from the Programming Languages community and the Human-Computer Interaction community, we arrived at a language design that meets real users’ needs.
Sarah Chasins is a PhD candidate at UC Berkeley, advised by Ras Bodik. Her research interests lie at the intersection of programming languages and human-computer interaction (HCI). She works on end-user programming, program synthesis, and programming language design. Much of her work is shaped by ongoing collaborations with social scientists from fields ranging from sociology to economics to public policy. She believes well-designed languages and programming environments can put complicated programming tasks within reach of people who consider themselves non-coders. She has received a National Science Foundation (NSF) graduate research fellowship and a first-place award in the Association for Computing Machinery (ACM) Student Research Competition.