Hear Ye! Since 1998.

Archived Posts for September 2009

Please note: The posts on this page are at least 3 years old. Links may be broken, information may be out of date, and the views expressed in the posts may no longer be held.
Sep 09

A new season of TV

After watching the first episode of Season 9 of Smallville, it’s a mystery why this show is still on the air. The quality of writing has declined from mediocre to awful. What’s an even bigger mystery is why I still continue to watch it.

Not sure about Heroes. The first episode didn’t really grab my attention; the same themes seem to be getting replayed.

New season of Survivor is shaping up well. Burning a sock. What a prick, but what twisted genius. Certainly makes for interesting viewing. Amazing Race has started up too.

Entourage continues to impress, as does The Office. And Californication! It’s been too long. Now they just need to have a decent sci-fi show on air…

Sep 09

  stuloh The Informer!: 3/5. 'twas ok. The underlying true story doesn't seem all that remarkable when you put it next to Madoff's $65bn fraud.

  12:04am  •  Tweet
Sep 09

  stuloh Virgin America is now my favorite airline. The other domestic carriers don't even come close!

  1:48pm  •  Tweet

  stuloh District 9: 4.5/5. Loved it. Great scifi. I also like Seth Efriken ekkcents.

  6:29am  •  Tweet
Sep 09

  stuloh TT Symposium Day 2 - live blog: http://bit.ly/2w8Lla #tt09

  5:26am  •  Tweet

Transparent Text Symposium Liveblog – Day 2

1:43:10pm: A comment from the panel – Americans listen to podcasts because they don’t like to read. Might miss part of the picture – people read faster than they can speak, but you can’t multitask while reading, which might be the real reason some may prefer to listen.

12:33:45pm: Example 5: Analysis of the phrases used by newspapers. “Unborn child” is more likely to be used by conservative papers. The animation of word use over time in the defense context is pretty damn cool.

12:20:54pm: Kevin Quinn, Prof of Law at Berkeley, talking about visualizing political speech. Main thought: statistical models can be used to improve visual displays of data. They become increasingly useful for visualization as data size and complexity increases. This is especially true for textual data.

A statistical model is an assumption about how the text was generated – ie about the family of probability distributions that generated the observed data (typically indexed by certain parameters). There may be some background data which leads you to believe what the parameters may be.

Example: senate speech data from ’97 to ’04. Over 118k speeches. Speeches were stemmed, generating 4k unique stems. 74m stem observations. Looking at frequency of unigrams (single words). Agglomerative clustering generates 42 topics into which speeches were classified (eg, symbolic speech, international affairs, procedure, constitutional, etc). [but does it account for one speech having multiple topics?]
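[A toy sketch of that preprocessing step – stemming speeches and counting unigrams. This is my own illustration with a naive suffix-stripper, not their actual pipeline, which would use a real stemmer:]

```python
import re
from collections import Counter

def unigram_counts(speech):
    """Tokenize a speech into lowercase words, apply a (very) naive
    suffix-stripping stemmer, and count unigram frequencies."""
    words = re.findall(r"[a-z']+", speech.lower())
    stems = [re.sub(r"(ing|ed|es|s)$", "", w) if len(w) > 4 else w
             for w in words]
    return Counter(stems)

speech = "The senator voted on defense spending and defended the vote."
counts = unigram_counts(speech)
# "spending" and "defended" are reduced to the stems "spend" and "defend"
```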

Example 2: political positions of newspapers. Looked at Ed columns of newspapers to see their stance on SCOTUS opinions. Then they place the stances of justices on a spectrum together with newspapers. NYT is more liberal than all of the justices. But Thomas and Scalia are further right than any newspaper.

Example 3: Invocation of commonsense. More conservative papers are likely to invoke common sense when describing SCOTUS opinions.

Example 4: Show which newspapers include a selected quote from a case, with x-axis being the political position of the newspaper.

11:56:21am: Emily Calhoun from MAPLight.org. Nonprofit, funded by the Sunlight Foundation (among others). Berkeley-based. Small staff. They do a mashup of financial and voting data (sourced from Center for Responsive Politics, National Institute on Money in State Politics, LA City Ethics Commission, GovTrack.us, and CA Leg Counsel). They have a nice, clean graph embedding feature so you can show a graph you dynamically generated on your website.

They have an interest in analysing the substance of bills.

11:31:11am: To implement the crowdsourcing project – fairly low costs – 1 week of developer time, a couple of days for a designer. 450k+ documents available for processing. Entertaining case study.

11:26:43am: Now showing crowdsourcing the processing of MP receipts, claim/expense forms. I used to do this stuff when I was working part time as an undergrad – processing application forms, indexing and data entry. This is the same thing, except that anyone who feels like doing it can log on to the website and do it. There are data quality issues, but there are ways around that.

11:21:11am: Come on, call a spade a spade… it’s a graph of MP spending. Sure it’s a “data visualization” of it, but speak like the person on the street and call it a graph. Oh, it doesn’t sound sexy? Get over it. Using “graph” will also help you fit into Twitter’s character limit.

11:20:06am: Simon Rogers, The Guardian (UK). MPs’ expenses – how to crowdsource 400k documents.

Showing infographic of government spending broken down by department and function. Guardian created the Data Store, where they upload spreadsheets to Google Docs. It’s a collecting point for info. The Guardian manually copies data from paper sheets and turns it into graphs.

Case study about analysis of MP allowances claims scandal.

11:07:41am: Building Watson: Overview of the DeepQA Project. Training a computer to play Jeopardy. There’s a big challenge to parse natural language to understand what a question is asking.

8:58:21am: Color popularity on the web – differs by TLCC domains.

8:56:25am: 95k+ documents, 2GB of federal legislation taken (10-word snippets), generates 28GB of index = lots of repetition. Lots of overlapping language… no further analysis done.

8:47:55am: Semantic Super Computing: Finding the Big Picture in Text Documents (Daniel Gruhl, IBM Almaden Research Center in CA). Basic processing flow: PBs’ worth of documents, then create 10x as much metadata as there is in the source docs. Then index everything. Creating metadata dynamically: as more is generated, it may alter the metadata previously generated as more context is understood.

8:33:17am: Evaluation of clustering method is amazing. Looks to create an appreciable difference in clustering.

8:20:57am: Automated methods for conceptualising text. Computer-assisted conceptualization is arguably more effective. Using a classification based approach, specifically cluster analysis (simultaneously creating categories and assigning documents to categories). Humans can’t conduct cluster analysis manually very well.

Bell number = number of ways of partitioning n objects. Bell(2) = 2, Bell(3) = 5, Bell(5) = 52, Bell(100) ~ 10^28 * number of elementary particles in the universe. So, the goal of creating an optimal application-independent cluster analysis method is mathematically impossible.
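[The Bell numbers quoted above are easy to check with the Bell triangle recurrence – my own sketch:]

```python
def bell(n):
    """Bell number B(n): the number of ways to partition a set of n
    objects, computed via the Bell triangle recurrence."""
    row = [1]                       # triangle row for B(0)
    for _ in range(n):
        nxt = [row[-1]]             # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

# Matches the values above: Bell(2) = 2, Bell(3) = 5, Bell(5) = 52
```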

We could create a long list, but too hard to process and pick the best clustering unless we organize the list first. To organize, we develop a conceptual “geography” of clusterings.

First, code text as numbers. Second, apply all clustering methods we can find to the data within 15 mins. (This produces too many results for a person to understand.) Third, develop an application-independent distance metric between clusterings, a metric space of clusterings and a 2D projection. Fourth, local cluster ensemble, then create animated visualization, which then generates a smaller list which is comprehensible. We can then pick what’s best for us.
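[Step three – an application-independent distance between clusterings – could be something like Meila’s variation of information. A toy version, my own sketch, not necessarily the metric King’s group actually uses:]

```python
import math
from collections import Counter

def variation_of_information(a, b):
    """Meila's variation of information: a distance between two
    clusterings of the same n items, given as lists of cluster labels.
    VI = H(A) + H(B) - 2*I(A;B); zero iff the clusterings are
    identical up to relabelling."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    entropy = lambda c: -sum(v / n * math.log(v / n) for v in c.values())
    mutual = sum(v / n * math.log((v / n) / ((pa[x] / n) * (pb[y] / n)))
                 for (x, y), v in pab.items())
    return entropy(pa) + entropy(pb) - 2 * mutual

# Same partition under different labels -> distance ~0;
# crosscutting partitions -> positive distance.
```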

8:08:39am: And we’re back. Next up is “Quantitative Discovery of Qualitative Information : A General Purpose Document Clustering Methodology”. Gary King, from Harvard, is presenting.

7:43:23am: Break.

7:40:08am: Design case studies by thincdesign.

7:36:14am: Brain Movies. Shows brain activity when viewing movies.

7:29:31am: Unicode. There is no such thing as plain text, “avoiding mojibake”. Advice is to use Unicode (UTF-8 to be specific) encoding for text files. Telling us about character sets. Be explicit. [Eg, in HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />]
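[A quick illustration of mojibake and why being explicit about the charset matters – my own example, not from the talk:]

```python
# Mojibake happens when bytes written in one encoding are read in another.
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("latin-1")   # UTF-8 bytes misread as Latin-1
print(garbled)                           # cafÃ©

# Declaring the charset explicitly round-trips cleanly:
assert utf8_bytes.decode("utf-8") == text
```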

7:23:37am: Transparent Control Systems. Purpose of text transparency is to “provide daylight on governance systems. But governance systems are really control systems. If we want to effect change, it will be in the control systems, not the data.” [Law as Code Lessig reference?] Can break control systems down to make them easier to understand. Example using an engine piston.

7:17:50am: VUE (Visual Understanding Environment). Pulls out tags generated by OpenCalais into a Viewmap. Seasr integrated into VUE to add things (notes and metadata) to nodes. Can load XML, CSV, RSS files. Rapid visualization of structured data?

7:12:39am: Dido. David Karger at the MIT Haystack group. Text is easier to write than structured data; the goal is to incentivize people to write structured data. Contains in-page visualizations.

7:10:36am: Linked Open Data. Manipulating linked sets of data. “What was each senator’s vote on the economic stimulus along with the population and median income of the state he or she represents?” No easy way of doing this (not even Wolfram Alpha yet :). Showing Guanxi, which is used to discover connections. Showing Audrey, which is about discovering and sharing news stories (like Google Reader but with a stronger discovery/recommendation service via social networking component).

7:02:38am: Data Portraits. They think that presenting data in a picture is more interesting (may or may not be unintentionally trivializing their message here). 1. Showing Fernanda’s Themail project (email analysis program). Xobni has a similar feature (but Xobni’s not as pretty). 2. Lexigraphs, showing most used words in a person’s twitter stream + most recent words used – words are placed in outline of person as art. 3. Mycrocosm. People keep track of things themselves using graphs. Reminds me of a basic version of Feltron’s Annual Report. 4. Personas.

6:57:09am: We Meddle. Identifying the linear nature of Twitter as a limitation. Showing We Meddle interface – filtering system – bigger text for people “closer” to you. But can reverse if interested in seeing what people more distant from you are saying (since you probably speak to people closer to you in person more often).

6:52:26am: Twittermood. A display of US mood swings. Another Twitter analyzer. Showing NYT interactive infographic showing word prominence during the Superbowl. Trying to track US “happiness”. [Kind of presumptuous to label Twitter users as indicative of the US! But anyway.] Just flashed up an amusing graphic showing that people tend to be happy in Central Park. Uses Twitter’s gardenhose stream (~1.8m messages per day).

Conference Twitter stream still outrageous, why do I need 10 people telling me the same thing every 300 seconds?

6:47:55am: Bengali Intellectuals in the Age of Decolonization. Showing architecture of the project – SMIL file and gazetteer info gets fed into a database then output to pages. Slice and dice video clips. Concepts using Open Calais (tagging) and commenting + geographic and chronological searching. Lecture Capture system tie-in to generate transcripts. Ok, seems like a metadata system built on top of some video media. Tufts project.

6:43:44am: Picturing to Learn. Students drawing indicates their understanding/misunderstanding of scientific concepts (in the examples shown, chemistry concepts). Now tracking and tagging scans of these drawings. Looking for help.

6:38:48am: NY Times R&D. Emerging trends: web of documents is being transformed. Pages are being disaggregated into their component pieces of data and repackaged. What does it mean for NYT to be a media brand in a disaggregated world? Freebase. NYT likes to collaborate around visualization with other (design) organizations. Cloud computing: shifting from one multi-purpose computer to many multi-purpose computing devices. This means that data, content and text have to “get smarter” – they need to know how they’re being displayed, who’s watching, and the context in which they’re being displayed. Focus on device-independent media (for print and screen).

6:33:39am: Transparency and Exploratory Search. Something about search and metadata.

6:28:43am: nFluent. They’re an IBM research group dealing with translation. [I saw them demo this yesterday – it’s like Google Translate, but they are building up their translation engine in a slightly different way to Google – they are adamant it’s uniquely different, but I’m not so gung ho about that.] Their translation engine translates pages as they load in real time, and people can highlight incorrectly translated words and offer correct translations. Crowdsourcing approach. They have some interesting trends from analysing user contributions: 1% are “superusers” (contributing 30% of the data, by words translated), 65% are “refiners”, and 33% are consumers (normally 90% in other contexts). Gong didn’t go off, excellent!

6:23:22am: Day 2, Ignite-style presentations (5 minute presentations about what people/companies are doing). They’re using a gong to cut short presentations. My side comments in square brackets.

Web Ecology Project. Analysis of data on social networks (Twitter, Friendfeed, etc).

Transparent News Articles via Semantic Models. Parses news articles through a browser plug-in. User can highlight words to find out more about a topic. Eg, highlight “Oracle’s $11.1 bn acquisition of PeopleSoft” brings up a context menu where you can select, “What happened to this acquisition?” which then brings up, inline, a timeline of the relevant M&A activity. Clicking on the timeline loads up a relevant article. Uses an engine they developed called Brussels for parsing.

MassDOT Developers. Speaker from Mass Exec Office of Transportation. MassDOT brings together several Mass agencies. Now moving to open data. If you wanted to make a transport app [eg, an iPhone app] then you would have to scrape the data and then worry about IP issues. MassDOT is opening up their data (from various agencies) to third party developers. [Cityrail take note!] Within a month, three transport apps were developed. Eg, MassTransit, To a T, RMV Wait Times. These apps wouldn’t have been developed otherwise. [I have no idea why Cityrail wants to close their data, other than it might expose how late their trains always run.] Challenge is extracting data, such as GPS data for buses, from systems which weren’t built with this in mind (ie exporting to XML or other accessible formats). Figuring out when a bus arrives is “lifechanging”, especially when the weather in winter is -10 degrees… Fahrenheit.

Day 2.

Sep 09

  stuloh Liveblogging the TT symposium here: http://bit.ly/GA7au #tt09

  7:30am  •  Tweet

Transparent Text Symposium Liveblog

5:33:53pm: Similar to his code versions comparison, Fry did a study of the different editions of Darwin’s On the Origin of Species. Has some peripherally interesting uses, but a question was raised as to the usefulness to Darwin scholarship. That brings us to the end of the day’s talks.

5:20:56pm: Touching on Processing. I should learn that language; apparently the syntax is similar to Java. (There’s also a JavaScript port of it that uses HTML5’s canvas element to display output.)

Now showing a diff chart of the 40 versions of code for Processing (all on one screen).

5:18:36pm: He did a project for the World Economic Forum, and wanted to show how the sessions at Davos were connected – showing word connections. It’s pretty and… well, just pretty.

5:13:23pm: Ben Fry is up! Topic is “Darwin vs Code”

5:07:31pm: Now looking at State legislation (bills, to be specific). Shows that bills in different states have duplication – probably because they were submitted by the same interest groups. Eg, MN bill repeats a lot of language from an AK bill. See, eg, a Google search for the Firearms Freedom Act. The lobby group has published model legislation. “Open source for legislation”? Quaint way of putting it. It’s just a template law. I wonder what other work they’re doing apart from looking at duplication? I wonder if any research scholarship would find this duplication data interesting?

Now a guy is talking about compressing legal agreements – something that Creative Commons has done (icons to serve as proxies for legal texts). Must talk with him afterwards.

5:01:38pm: Dataset is 1000 Terms of Service. Looking for repetition. Acknowledges that “boilerplate exists”. Vivid depiction of boilerplate – shows how terms of art in law work – “magic words” get reused. Stacked view of duplicated language is good for showing what’s not boilerplate and therefore what’s deal specific. Interesting and useful. Now showing graph of websites linked by degree of similarity in their TOS.
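[One simple way to measure that kind of boilerplate overlap is Jaccard similarity over word shingles – my own sketch, not the presenter’s method:]

```python
def shingles(text, k=5):
    """Set of k-word shingles (overlapping word n-grams) in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def boilerplate_similarity(doc_a, doc_b, k=5):
    """Jaccard overlap of shingle sets: near 1.0 means the two
    agreements share mostly boilerplate language."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Linking websites whose TOS similarity exceeds some threshold would give the graph shown at the end.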

4:58:21pm: Now there’s an analysis of personals ads – shows prevalence of repetition between ads. Now for legal text. I’m waiting in anticipation.

4:53:23pm: Now showing building graphs of words (graphs of nodes in a network), showing relationships of words. For example, searching for (word) “and” (word) builds a graph of nodes connected by the word “and” in a novel (Pride and Prejudice in this example). Search for (word) “begat” (word) on the Bible produces a rough family tree.
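[The “(word) and (word)” search can be sketched with a regex – my own toy version:]

```python
import re
from collections import defaultdict

def and_graph(text):
    """Build an adjacency map of words joined by 'and', the kind of
    word graph described above (e.g. 'pride and prejudice' produces
    an edge between 'pride' and 'prejudice')."""
    edges = defaultdict(set)
    for left, right in re.findall(r"(\w+) and (\w+)", text.lower()):
        edges[left].add(right)
        edges[right].add(left)
    return edges

g = and_graph("Pride and Prejudice; sense and sensibility; war and peace")
```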

4:52:04pm: Now showing a wordtree (which is kind of like a structured wordcloud) of HR 3200 (Healthcare Bill). Useful way of navigating the universe of phrases in a long doc. Now navigating through a personals ad wordtree – we quickly find out that “I am married” is most often followed by the word “but”, and second most popularly by “and”.
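[A one-level slice of the word-tree idea – counting which word most often follows a phrase – is easy to sketch; my own toy example, not Many Eyes code:]

```python
import re
from collections import Counter

def continuations(corpus, phrase):
    """Count the words that immediately follow `phrase` in the corpus."""
    pattern = re.escape(phrase.lower()) + r"\s+(\w+)"
    return Counter(re.findall(pattern, corpus.lower()))

ads = "I am married but lonely. I am married but bored. I am married and happy."
print(continuations(ads, "I am married").most_common())
# [('but', 2), ('and', 1)]
```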

4:47:52pm: They just showed Google, and typed in “Why”. The suggested search queries that pop up are hilarious. Now showing comparison of google suggest results sets, and then showing matches between sets. Eg, “is my son” & “is my daughter”, “is my husband” & “is my wife”, “democrats are” & “republicans are” etc. Very humorous!

4:42:50pm: Fernanda & Martin on Visualizing Text, Many Eyes project. What do people want to know when they see a lot of text? Fernanda says she was interested in repetition in text. Maybe as a way to connect large bodies of documents, or indicate levels of interest. They’re going to show a few demos of (experimental) visualization techniques.

4:37:32pm: David Small, MIT Media Lab Prof on Transparency & Focus. This presentation is… unorthodox and experimental. Good visualization techniques and graphical manipulation techniques for museum displays.

4:15:28pm: Ok, I didn’t blog the session after lunch, although there were some very very interesting presentations in it (Arie et al, I’ll send you an email about it later). Now starting the 3rd session.

12:25:47pm: Interesting point: people sourced to track news on specific politicians had a strong initial preference to track only the politician they had a personal interest in (leading to spotty coverage). However, after having a positive experience, they began to relax those preferences.

12:13:45pm: Amanda Michel from ProPublica is up now. Talking about how to best use open data, and about how to collect more data where data is limited. PP uses crowdsourcing to find/produce info. Going to run through some case studies.

12:04:36pm: Demo: looking for SEC documents about Madoff from the current DB of government PDFs. Eg of search query: “company: Google company: IBM” pulls out a BigTable paper. Also: industry_term:”hot gas” brings up a report on the Challenger accident. Now demoing CloudCrowd, which uses parallel processing to process a raw PDF. They’re going to run PDFs through OCR (very computationally intensive). CloudCrowd is open source, and is picking up bug fixes and plugins. DC uses full-text searches as well as metadata searches.
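[The fielded query syntax quoted above – “company: Google”, industry_term:”hot gas” – can be parsed with a small regex. My own sketch of the syntax as quoted, not DocumentCloud’s actual parser:]

```python
import re

def parse_query(q):
    """Parse fielded search queries like 'company: Google company: IBM'
    or 'industry_term:"hot gas"' into a dict of field -> values."""
    fields = {}
    for key, quoted, bare in re.findall(r'(\w+):\s*(?:"([^"]+)"|(\S+))', q):
        fields.setdefault(key, []).append(quoted or bare)
    return fields
```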

11:58:39am: Document Reader component presents documents in an easy to read format (can also use Docstoc, Scribd). Has an annotation tool which also cites the page number (click and drag interface) – a journalistic layer on top of a document. In the future, they want to be able to take annotations out of the viewer and make them portable – so you can see what people are saying about portions of the text. Working with Aperture(?).

11:54:34am: DC is a consortium of about 20 organizations (most of which are unannounced). So combining documents from different orgs into one repository should allow people to find relationships between different types of docs. Eg, news items about a person, or geographic location, or relationally proximate to a person or location. Calais seems to have a globally unique identifier system – eg, “IBM”, “I.B.M.” and the IBM logo all point to one Calais URL. DC is looking for more members (info@documentcloud.org) – whether news orgs, or academics, etc.

11:50:22am: Great example of inspiration for DC – NYTimes was breaking stories about inconsistent statements made by a politician, who wondered how they were pulling “needles out of haystacks” – they use proprietary cataloging software (FastESP). DC’s cataloging software is OpenCalais (and someone from Calais will be speaking later in the conference). Calais automatically tags documents with metadata. Seems to be able to read in PDFs (including running OCR on them). In summary, DC is trying to build a layer of intuitively searchable metadata on top of the base documents.

Watching people trying to capture what a speaker is talking about through the #tt09 Twitter channel is painful.

11:42:25am: Aron Pilhofer (NYTimes) & Jeremy Ashkenas on DocumentCloud. DocumentCloud is Pilhofer’s “hobby”, arising out of a conversation he had in his boss’ office. DC turns unstructured text into structured data. It comprises several tools for analysis, processing, publishing and searching/cataloging. Analysis: Aim is to derive meaning from the library of documents which anyone can upload into the cloud. Processing: Technical challenges to processing a lot of documents quickly – uses parallel processing (Mapreduce-type techniques?) to accomplish this. Publishing: making documents publicly accessible + presentation tools. Searching: by providing metadata. Focus today is on first 3 components.
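[The map/reduce idea behind the processing step, in miniature – my own sketch, not DC’s code. Each map runs independently, so documents can be farmed out in parallel:]

```python
from collections import Counter
from functools import reduce

def map_phase(doc):
    """Map: turn one document into its word counts (independent per
    document, so this step parallelizes trivially)."""
    return Counter(doc.lower().split())

def reduce_phase(counts_list):
    """Reduce: merge the per-document counts into one combined index."""
    return reduce(lambda a, b: a + b, counts_list, Counter())

docs = ["the cloud", "the crowd processes the cloud"]
index = reduce_phase([map_phase(d) for d in docs])
```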

11:33:10am: Hah, in the Q&A session going on now, someone’s making the exact point that Andreas made. Irrelevant sidenote: Ellen Miller kinda reminds me of Angela Petrelli from Heroes.

11:29:46am: Twitter channel at #tt09 is filled with mostly non-useful chatter about who’s speaking, their topic and possibly some form of glowing compliment.

Sunlight Foundation is showing interesting visualizations of Congressional transcripts and the frequency with which politicians use different words. Text visualizations tend to be just word counts or single word/word-pair comparisons. Lots of word clouds. I’m not sure that’s very useful, as you lose a lot of meaning – and why isn’t a “top 10” list of words (together with a frequency count) better than a word cloud? If I have a word in 92 pt font and another in 72 pt font, this doesn’t really confer any information about the relative frequency of those words. As Andreas mentioned, people are good at making things look pretty, but we need to take a long hard look at whether it is useful and more effective.
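[The top-10 list with frequency counts I’m arguing for is a one-liner – my own sketch:]

```python
from collections import Counter

def top_words(transcript, n=10):
    """Return the n most frequent words with their counts -- arguably
    more informative about relative frequency than font size is."""
    return Counter(transcript.lower().split()).most_common(n)
```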

Conference website.

  11:23am (GMT -4.00)  •  Internet

  stuloh is at the Transparent Text Symposium in Cambridge. #tt09 #fb

  7:10am  •  Tweet
Sep 09

  stuloh It's a little chilly here.

  5:21am  •  Tweet
Sep 09

  stuloh Orange. Why have 5 threat levels when the darn switch is stuck on the 2d highest? It's like being at Defcon 2 in some terrorist cold war.

  9:05pm  •  Tweet

  stuloh is SFO-BOS.

  8:12pm  •  Tweet
Sep 09

  stuloh Why are there so many law students/lawyers/former lawyers on this season of Survivor??

  9:49pm  •  Tweet
