UW CSE affiliate professor Bill Howe, associate director of the UW’s eScience Institute, was interviewed for a USA Today story on the recent Panama Papers leak. Howe explained how today’s data science tools, from low-cost cloud services to readily available translation and data mining software, made it possible to rapidly untangle and analyze the more than 11 million documents leaked from Panama-based law firm Mossack Fonseca.
From the article:
“Dealing with a data drop like the Panama Papers has gotten much easier in the past decade, with the advent of cheap storage, cloud computing, easy-to-use and often free data mining software and faster computers.
“The first order of business after exfiltrating the files from the law firm’s computer network would be to find somewhere to look at it.
“‘This has gotten much easier with cloud computing. I swipe my credit card and I have as many machines as I need and it’s not expensive, so if I need 500 machines to work on this, I can get them up and running in a weekend,’ said Howe.
“Many of the documents appear to have been images, so the next task is extracting the text, something that’s also become significantly easier with time.
“‘There are off-the-shelf optical character recognition tools you can use. And crucially, if you scale this out over lots of computers, you can do hundreds [of pages] at a time….Every step along the way definitely requires some technical skills, but there’s nothing there that’s requiring a Ph.D. in computer science, quite frankly,’ Howe said.”
Read more about how technology enabled a data dump of global proportions in the full USA Today article.