Hear Ye! » Transparent Text Symposium Liveblog


Mon, 21 Sep 09

Transparent Text Symposium Liveblog

5:33:53pm: Similar to his comparison of code versions, Fry did a study of the different editions of Darwin’s On the Origin of Species. It has some peripherally interesting uses, but a question was raised as to its usefulness to Darwin scholarship. That brings us to the end of the day’s talks.

5:20:56pm: Touching on Processing. I should learn that language; apparently the syntax is similar to Java. (There’s also a JavaScript port of it that uses HTML5’s canvas element to display output.)

Now showing a diff chart of the 40 versions of code for Processing (all on one screen).
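
To get a feel for what sits underneath a chart like that, here is a minimal sketch of diffing two versions of a file with Python's standard difflib. The function name changed_lines and the two tiny "versions" are made up for illustration; this is not how Fry built his chart.

```python
import difflib

def changed_lines(old, new):
    """Return (removed, added) lines between two versions of a text."""
    diff = list(difflib.ndiff(old.splitlines(), new.splitlines()))
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # "? " hint lines and unchanged lines are skipped here.
    removed = [line[2:] for line in diff if line.startswith("- ")]
    added = [line[2:] for line in diff if line.startswith("+ ")]
    return removed, added

# Hypothetical sample data: two versions of a two-line sketch.
v1 = "size(200, 200)\nline(0, 0, 10, 10)"
v2 = "size(400, 400)\nline(0, 0, 10, 10)"

removed, added = changed_lines(v1, v2)
print("removed:", removed)
print("added:  ", added)
```

Running this over each consecutive pair of the 40 versions would give the raw material for a chart; the hard part (laying it all out on one screen) is the visualization itself.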

5:18:36pm: He did a project for the World Economic Forum and wanted to show how the sessions at Davos were connected – showing word connections. It’s pretty and… well, just pretty.

5:13:23pm: Ben Fry is up! Topic is “Darwin vs Code”

5:07:31pm: Now looking at State legislation (bills, to be specific). Shows that bills in different states have duplication – probably because they were submitted by the same interest groups. Eg, MN bill repeats a lot of language from an AK bill. See, eg, a Google search for the Firearms Freedom Act. The lobby group has published model legislation. “Open source for legislation”? Quaint way of putting it. It’s just a template law. I wonder what other work they’re doing apart from looking at duplication? I wonder if any research scholarship would find this duplication data interesting?

Now a guy is talking about compressing legal agreements – something that Creative Commons has done (icons to serve as proxies for legal texts). Must talk with him afterwards.

5:01:38pm: Dataset is 1000 Terms of Service. Looking for repetition. They acknowledge that “boilerplate exists”. Vivid depiction of boilerplate – shows how terms of art in law work – “magic words” get reused. Stacked view of duplicated language is good for showing what’s not boilerplate and therefore what’s deal-specific. Interesting and useful. Now showing a graph of websites linked by degree of similarity in their TOS.
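
The speakers didn't say how they score similarity, so this is only a guess at one common approach: break each document into overlapping word n-grams ("shingles") and compare the sets with Jaccard similarity. The function names and the two sample clauses are my own invention.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word n-grams in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical sample clauses, not taken from any real Terms of Service.
tos_a = "We may terminate your account at any time without notice"
tos_b = "We may terminate your account at our sole discretion"

set_a, set_b = shingles(tos_a), shingles(tos_b)
shared = set_a & set_b   # candidate boilerplate
score = jaccard(set_a, set_b)
print(f"{len(shared)} shared shingles, similarity {score:.2f}")
```

The shared shingles are the boilerplate candidates, and whatever is left over in each document is the deal-specific part. Scoring every pair of sites this way would give edge weights for a similarity graph like the one shown.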

4:58:21pm: Now there’s an analysis of personals ads – shows prevalence of repetition between ads. Now for legal text. I’m waiting in anticipation.

4:53:23pm: Now showing building graphs of words (graphs of nodes in a network), showing relationships of words. For example, searching for (word) “and” (word) builds a graph of nodes connected by the word “and” in a novel (Pride and Prejudice in this example). Search for (word) “begat” (word) on the Bible produces a rough family tree.
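
A rough sketch of how a pattern search like this could produce a graph, assuming a plain regular expression over the text (the Many Eyes implementation was not shown). The function name pattern_graph is hypothetical; the sample sentence is from Matthew 1:2 in the King James Version.

```python
import re
from collections import defaultdict

def pattern_graph(text, connector="and"):
    """Map each word to the set of words that follow it via the connector."""
    # The lookahead captures the right-hand word without consuming it,
    # so chains such as "a and b and c" yield both a->b and b->c.
    pattern = re.compile(
        rf"\b(\w+)\s+{re.escape(connector)}\s+(?=(\w+))", re.IGNORECASE
    )
    graph = defaultdict(set)
    for left, right in pattern.findall(text):
        graph[left.lower()].add(right.lower())
    return graph

verse = "Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judas"
tree = pattern_graph(verse, connector="begat")
for parent, children in tree.items():
    print(parent, "->", sorted(children))
```

Each key is a node and each set holds its outgoing edges, which is enough to hand to a graph-drawing tool.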

4:52:04pm: Now showing a wordtree (which is kind of like a structured wordcloud) of HR 3200 (Healthcare Bill). Useful way of navigating the universe of phrases in a long doc. Now navigating through a personals ad wordtree – we quickly find out that “I am married” is most often followed by the word “but”, and second most popularly by “and”.
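
The core of a wordtree can be sketched as counting which word follows a given phrase. This shows only one level of the tree (a real wordtree repeats the step for each branch), and both the function name next_words and the sample ads are made up for illustration.

```python
import re
from collections import Counter

def next_words(text, phrase):
    """Count the words that immediately follow each occurrence of phrase."""
    words = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    k = len(target)
    followers = Counter()
    for i in range(len(words) - k):
        if words[i:i + k] == target:
            followers[words[i + k]] += 1
    return followers.most_common()

# Hypothetical sample text, not the dataset used in the talk.
ads = "I am married but lonely. I am married and happy. I am married but curious."
branches = next_words(ads, "I am married")
print(branches)
```

Calling next_words again with the phrase extended by each follower ("I am married but", and so on) grows the tree one level at a time.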

4:47:52pm: They just showed Google, and typed in “Why”. The suggested search queries that pop up are hilarious. Now showing a comparison of Google Suggest result sets, and then showing matches between sets. Eg, “is my son” & “is my daughter”, “is my husband” & “is my wife”, “democrats are” & “republicans are” etc. Very humorous!

4:42:50pm: Fernanda & Martin on Visualizing Text, Many Eyes project. What do people want to know when they see a lot of text? Fernanda says she was interested in repetition in text. Maybe as a way to connect large bodies of documents, or indicate levels of interest. They’re going to show a few demos of (experimental) visualization techniques.

4:37:32pm: David Small, MIT Media Lab Prof on Transparency & Focus. This presentation is… unorthodox and experimental. Good visualization techniques and graphical manipulation techniques for museum displays.

4:15:28pm: OK, I didn’t blog the session after lunch, although there were some very interesting presentations in it (Arie et al, I’ll send you an email about it later). Now starting the 3rd session.

12:25:47pm: Interesting point: people recruited to track news on specific politicians had a strong initial preference to track only the politician they had a personal interest in (leading to spotty coverage). However, after having a positive experience, they began to relax those preferences.

12:13:45pm: Amanda Michel from ProPublica is up now. Talking about how to best use open data, and about how to collect more data where data is limited. PP uses crowdsourcing to find/produce info. Going to run through some case studies.

12:04:36pm: Demo: looking for SEC documents about Madoff in the current DB of government PDFs. Example search query: “company: Google company: IBM” pulls out a BigTable paper. Also: industry_term:“hot gas” brings up a report on the Challenger accident. Now demoing CloudCrowd, which uses parallel processing to process raw PDFs. They’re going to run PDFs through OCR (very computationally intensive). CloudCrowd is open source and is still being debugged and gaining plugins. DC uses full-text searches as well as metadata searches.
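
For my own notes, a small sketch of how fielded queries like the two in the demo could be split into (field, value) pairs. The grammar here (field name, colon, then a bare word or a double-quoted phrase) is my assumption from what was on screen, and parse_query is a hypothetical helper, not DocumentCloud's parser.

```python
import re

# Assumed syntax: field, colon, optional space, then "quoted phrase" or bare word.
TERM = re.compile(r'(\w+):\s*(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a fielded query into a list of (field, value) pairs."""
    pairs = []
    for match in TERM.finditer(query):
        field, quoted, bare = match.groups()
        pairs.append((field, quoted if quoted is not None else bare))
    return pairs

print(parse_query("company: Google company: IBM"))
print(parse_query('industry_term:"hot gas"'))
```

Each pair could then be matched against the metadata attached to a document, separately from any full-text search.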

11:58:39am: Document Reader component presents documents in an easy-to-read format (can also use Docstoc, Scribd). Has an annotation tool which also cites the page number (click and drag interface) – a journalistic layer on top of a document. In the future, they want to be able to take annotations out of the viewer and make them portable – so you can see what people are saying about the portions of the text. Working with Aperture(?).

11:54:34am: DC is a consortium of about 20 organizations (most of which are unannounced). Combining documents from different orgs into one repository should allow people to find relationships between different types of docs. Eg, news items about a person or geographic location, or items relationally proximate to a person or location. Calais seems to have a globally unique identifier system – eg, “IBM”, “I.B.M.” and the IBM logo all point to one Calais URL. DC is looking for more members (info@documentcloud.org) – whether news orgs, academics, etc.
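
The identifier idea can be illustrated with a toy alias table: different spellings normalize to one key, which maps to one canonical URI. Everything here is hypothetical, including the example.org URI, which is a placeholder and not a real Calais identifier; Calais itself presumably does something far more sophisticated.

```python
# Hypothetical alias table; the URI is a made-up placeholder.
CANONICAL = {
    "ibm": "http://example.org/entity/ibm",
}

def normalize(name):
    """Lowercase a name and drop punctuation and spaces."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def resolve(name):
    """Return the canonical URI for a name, or None if it is unknown."""
    return CANONICAL.get(normalize(name))

for spelling in ("IBM", "I.B.M.", "i b m"):
    print(spelling, "->", resolve(spelling))
```

Once every mention resolves to the same URI, documents from different organizations that mention the same entity can be linked together.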

11:50:22am: Great example of inspiration for DC – NYTimes was breaking stories about inconsistent statements made by a politician, who wondered how they were pulling “needles out of haystacks”. The Times uses proprietary cataloging software (FAST ESP). DC’s cataloging software is OpenCalais (and someone from Calais will be speaking later in the conference). Calais automatically tags documents with metadata. Seems to be able to read in PDFs (including running OCR on them). In summary, DC is trying to build a layer of intuitively searchable metadata on top of the base documents.

Watching people trying to capture what a speaker is talking about through the #tt09 Twitter channel is painful.

11:42:25am: Aron Pilhofer (NYTimes) & Jeremy Ashkenas on DocumentCloud. DocumentCloud is Pilhofer’s “hobby”, arising out of a conversation he had in his boss’s office. DC turns unstructured text into structured data. It comprises several tools for analysis, processing, publishing and searching/cataloging. Analysis: Aim is to derive meaning from the library of documents which anyone can upload into the cloud. Processing: Technical challenges to processing a lot of documents quickly – uses parallel processing (MapReduce-type techniques?) to accomplish this. Publishing: making documents publicly accessible + presentation tools. Searching: by providing metadata. Focus today is on the first 3 components.
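
Since I guessed at MapReduce above, here is a minimal sketch of that pattern: a map step that processes each document independently and a reduce step that merges the partial results. It is a stand-in for illustration, not DocumentCloud's pipeline, and the function names and sample documents are made up. A thread pool keeps the example self-contained; real CPU-heavy work such as OCR would want separate processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_document(text):
    """Map step: count the words in one document."""
    return Counter(text.lower().split())

def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

def count_words(documents, workers=4):
    """Run the map step over all documents, then fold the results together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_document, documents))
    return reduce(reduce_counts, partials, Counter())

# Hypothetical sample documents.
docs = ["SEC filing about Madoff", "another SEC filing", "Madoff report"]
totals = count_words(docs)
print(totals.most_common(3))
```

Because each map call touches only its own document, the work splits cleanly across workers, which is what makes this shape attractive for large document sets.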

11:33:10am: Hah, someone’s making the exact point in the Q&A session going on now that Andreas made. Irrelevant sidenote: Ellen Miller kinda reminds me of Angela Petrelli from Heroes.

11:29:46am: Twitter channel at #tt09 is filled with mostly non-useful chatter about who’s speaking, their topic and possibly some form of glowing compliment.

Sunlight Foundation is showing interesting visualizations of Congressional transcripts and the frequency with which politicians use different words. Text visualizations tend to be just word counts or single word/word-pair comparisons. Lots of word clouds. I’m not sure that’s very useful, as you lose a lot of meaning. Also, why isn’t a “top 10” list of words (together with a frequency count) better than a word cloud? If I have a word in 92 pt font and another in 72 pt font, this doesn’t really convey any information about the relative frequency of those words. As Andreas mentioned, people are good at making things look pretty, but we need to take a long hard look at whether it is useful and more effective.
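
The “top 10” alternative is only a few lines of code. A minimal sketch using Python's collections.Counter; the function name top_words, the stopword set and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def top_words(text, n=10, stopwords=frozenset()):
    """Return the n most frequent words with their exact counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text, not a real transcript.
sample = "the budget and the deficit and the budget again"
ranking = top_words(sample, n=3, stopwords={"the", "and"})
for word, count in ranking:
    print(f"{word:<10} {count}")
```

Unlike font sizes in a cloud, the counts can be compared directly.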

Conference website.

11:23am (GMT -4.00) • Internet
