In the second installment of ACM's "People of ACM" interview series, published February 28, 2013, David Patterson, founding director of the Parallel Computing Lab at UC Berkeley and former ACM president, shares his insight into the rapid expansion of big data across the computing field.
He describes his successes over 35 years as a researcher and professor at Berkeley as projects born in graduate classes whose ideas were later mirrored in commercial products, most notably:
- Reduced Instruction Set Computers (RISC)
- Redundant Array of Inexpensive Disks (RAID)
- Networks of Workstations (NOW)
Patterson discusses how his AMP Lab (Algorithms, Machines, and People) will address the demands of big data analytics in health care, and in cancer research in particular, by working at the intersection of machine learning, cloud computing, and crowdsourcing. As he notes, faster and more efficient software pipelines will be needed to handle the data stored in the proposed “Million Genome Warehouse,” which would hold the DNA signatures of millions of cancer patients, track how their tumors change over time, and record treatments and outcomes, in the hope of revealing critical cancer-fighting information.
Finally, Patterson advises prospective data analysis technologists to study statistics and machine learning alongside traditional courses in databases and operating systems, and to practice developing Software as a Service (SaaS) using Agile methods and productive frameworks such as Ruby on Rails or Python with Django.
The full interview is available at http://www.acm.org/membership/acm-bulletin-archive/february-28-2013-introducing-people-of-acm-david-patterson-1. The transcript follows:
In this second installment of “People of ACM,” we are featuring David Patterson.
David Patterson is the founding director of the Parallel Computing Laboratory (PAR Lab) at University of California, Berkeley, which addresses the multicore challenge to software and hardware. He founded the Reliable, Adaptive and Distributed Systems Laboratory (RAD Lab), which focuses on dependable computing systems designs. He led the design and implementation of RISC I, likely the first VLSI Reduced Instruction Set Computer.
A former ACM president, Patterson chaired ACM’s Special Interest Group in Computer Architecture (SIGARCH), and headed the Computing Research Association (CRA). He is a Fellow of ACM, IEEE, the Computer History Museum, and the American Association for the Advancement of Science, and a member of the National Academy of Engineering and the National Academy of Sciences. He received the Eckert-Mauchly Award from ACM and IEEE-CS, and ACM’s Distinguished Service and Karl V. Karlstrom Outstanding Educator Awards. He served on the Information Technology Advisory Committee for the US President (PITAC).
Patterson is a graduate of the University of California at Los Angeles (UCLA), where he earned his A.B., M.S., and Ph.D. degrees. He has consulted for Hewlett Packard (HP), Digital Equipment (now HP), Intel, Microsoft, and Sun Microsystems, and is on the technical advisory boards of several companies.
As a researcher, professor, and practitioner of computer science, how have these overlapping roles influenced both your career and the direction of computing technologies?
My research style is to identify critical questions for the IT industry and gather interdisciplinary groups of faculty and graduate students to answer them as part of a five-year project. The answer is typically embodied in demonstration systems that are later mirrored in commercial products. In addition, these projects train students who go on to successful careers.
When I look in the rear view mirror at my 35 years at Berkeley, I see some successes. My best-known projects were all born in Berkeley graduate classes:
- Reduced Instruction Set Computers (RISC): The R of the ARM processor stands for RISC. ARM is now the standard instruction set of Post PC devices, with nearly 9B ARM chips shipped last year vs. 0.3B x86 chips.
- Redundant Array of Inexpensive Disks (RAID): Virtually all storage systems offer some version of RAID today; RAID storage is a $25B business today.
- Networks of Workstations (NOW): NOW showed that Internet services were an excellent match to large sets of inexpensive computers connected over switched local area networks, offering low cost, scalability, and fault isolation. Today, these large clusters are the hardware foundation of search, video, and social networking.
The research shapes the teaching too. The RISC research led to the graduate textbook Computer Architecture: A Quantitative Approach and the undergraduate textbook Computer Organization and Design: The Hardware-Software Interface, both co-authored with John Hennessy of Stanford University.
How is your AMP Lab involved in addressing the challenges of Big Data research? What can we expect over the next decade in the development of Big Data research and its impact on cancer tumor genomics and other health care issues?
To quote our web site: working at the intersection of three massive trends (powerful machine learning, cloud computing, and crowdsourcing), the AMP Lab integrates Algorithms, Machines, and People to make sense of Big Data. We are creating a new generation of analytics tools to answer deep questions over dirty and heterogeneous data by extending and fusing machine learning, warehouse-scale computing, and human computation.
We validate these ideas on real-world problems, such as cancer genomics. Recently, biologists discovered that cancer is a genetic disease, caused primarily by mutations in our DNA. Changes to the DNA also cause the diversity within a cancer tumor that makes it so hard to eradicate completely. The cost of turning pieces of DNA into digital information has dropped a hundredfold in the last three years. It will soon cost just $1,000 per individual genome, which means we could soon afford to sequence the genomes of the millions of cancer patients.
We need to build fast, efficient software pipelines for genomic analysis to handle the upcoming tsunami of DNA data that will soon be flowing from these low-cost sequencing machines. Then we need a safe place to store the results. If we could create a warehouse that stores the DNA signatures of millions of cancer patients, tracks how the tumors change over time, and records both the treatments and the outcomes, we could create a gold mine of cancer fighting information. By participating, computer scientists can help ensure that such a “Million Genome Warehouse” is dependable, cost effective, and secure, and protects privacy.
We can’t yet know how many cancer patients the faster software pipelines and Million Genome Warehouses will help (it could be tens, hundreds, thousands, or millions each year), but the sooner we create these tools, the more lives we can save.
What advice would you give to budding technologists who are considering careers in computing in this burgeoning new era in data analysis?
Study statistics and machine learning along with traditional CS courses like databases and operating systems.
As Big Data will surely be in the cloud, practice developing for Software as a Service (SaaS) deployed in the cloud rather than the shrink-wrap software aimed at ground-bound PCs. Since Agile development is a perfect match for fast-changing SaaS apps, take a modern software engineering course to learn about Agile as well as productive programming environments for SaaS apps like Ruby on Rails or Python and Django.