Idea #15: Beltway Analytics
I’m a few days late to dedicate today’s software idea to Independence Day, but in my opinion, it would still be a fitting tribute.
I’ve always been intensely interested in American politics. I read the news every day. I listen to NPR. I’ve even written letters to the editor of my newspaper and to my congressman.
But the government is so big, it’s difficult to keep track of what’s going on, and I think the public is very poorly informed.
For example, how much do you know about the senators from your state? How often do they vote according to party lines? How often do they sponsor their own legislation? What committees are they on? When they write legislation, does it usually pass? Do their bills attract a lot of unrelated amendments?
Those are the kinds of things that I’d like to know about my own representatives. But, for some reason, it’s very difficult to track down those kinds of statistics. Usually, data like that is only published by academics at universities. And even then, the data you can find is usually several years out of date.
I’d like to draw on my experience in the fields of computational linguistics and statistical analysis to develop a website with comprehensive information about what’s going on in our government, automatically updated and published in real time, and delivered in a format much like Google News.
Very few people (including most members of Congress) have the time to read all of the legislation introduced into the Congressional Record. Many bills are hundreds and hundreds of pages long. Nobody wants to read that.
But a computer program could easily parse the text (which is available in the public domain) and identify some of the interesting tidbits. For example, in a piece of proposed legislation about gun control, it would be very interesting to identify amendments dealing primarily with road construction.
It would also be interesting to know whether a Supreme Court opinion written yesterday contained paraphrased passages from the Bible, or from the transcripts of a United Nations session. Beyond the day-to-day events, I’d love to see a scatter-plot word-frequency chart comparing the distributions of words and phrases used during the official speeches of all American presidents. I’d also love to see which words and phrases were omitted by a particular news organization (like the Wall Street Journal) when reporting the same stories as other newspapers (like the New York Times).
I’d love to compare the text of a bill written last year (which failed) to the new version of the bill that passed this morning. How is the new version of the legislation different than the old version?
Which government officials are most likely to use canned phrases (like “death tax”, “cut and run”, “culture of corruption”, etc)? And when a politician uses those kinds of phrases, is he more likely to be on the winning side of a controversial bill than someone who refrains from using those kinds of catch phrases?
And although I consider myself to be a pretty liberal Democrat, I wouldn’t rig the system in any way. I’d just set up the algorithms and let them run. If the results weren’t pretty (for my own team), then they’d be published anyhow. I’d hope to attract people from both sides of the political spectrum to participate in the website, with forums and blogs to supplement the linguistic and statistical analysis (which would be central).
Market Analysis
I think a website like this would be very popular. With the possible exception of IT subject matter (“Python sux!! Ruby rulez!!”), politics is the most blogged-about topic on the internet. People get addicted. Easily.
Of all the ideas I’ve written about so far (and all of the ones I’ve conceived but haven’t yet written about), this is the only one where I’d feel totally comfortable relying entirely on advertising revenue to pay the bills and build a comfortable profit margin. My target is $250,000 in annual revenue. Assuming a pay-per-click revenue model and an average of five cents per click, that works out to about 5 million clicks per year, or roughly 100,000 clicks per week. And, assuming a click-through rate of 2%, I’d need about 5 million page views per week, which (ignoring the weekends) is about a million page views per day.
That’s a lot of page views, but it’s definitely doable for a popular site. Also, five cents is on the low end for per-click revenue.
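The arithmetic behind those traffic numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the dollar target, per-click revenue, and click-through rate come from the paragraph above, while the 52-week year, the 5-day week, and the function name `required_traffic` are assumptions made for illustration.

```python
# Back-of-the-envelope check of the ad revenue target.
# Inputs are taken from the post; the 52-week year and the 5-day
# traffic week ("ignoring the weekends") are simplifying assumptions.

TARGET_ANNUAL_REVENUE = 250_000  # dollars per year
REVENUE_PER_CLICK = 0.05         # dollars per ad click
CLICK_THROUGH_RATE = 0.02        # ad clicks per page view
WEEKS_PER_YEAR = 52
WEEKDAYS_PER_WEEK = 5


def required_traffic(target, per_click, ctr,
                     weeks=WEEKS_PER_YEAR, days=WEEKDAYS_PER_WEEK):
    """Return (clicks per week, page views per week, page views per weekday)."""
    clicks_per_week = target / per_click / weeks
    views_per_week = clicks_per_week / ctr
    views_per_day = views_per_week / days
    return clicks_per_week, views_per_week, views_per_day


clicks_per_week, views_per_week, views_per_day = required_traffic(
    TARGET_ANNUAL_REVENUE, REVENUE_PER_CLICK, CLICK_THROUGH_RATE
)

print(f"clicks per week:        {clicks_per_week:,.0f}")
print(f"page views per week:    {views_per_week:,.0f}")
print(f"page views per weekday: {views_per_day:,.0f}")
```

The results land a little under the rounded figures in the text (just under 100,000 clicks per week and just under a million page views per weekday), so the rounding errs on the cautious side.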
Technical Considerations
I’d only need to use a handful of NLP algorithms to provide a mountain of really useful information. Off the top of my head, I’d include:
- Statistically improbable phrases (aggregated at the sentence, paragraph, and document levels) for identifying thematic elements.
- Karp-Rabin analysis (using raw text, as well as morphologically normalized text), for detecting quoting and paraphrasing of documents.
- Clustering (probably using the k-means algorithm, though strictly speaking that is a partitional method rather than an agglomerative one), for identifying congressional voting blocs, as well as individual senators who vote within (or deviate from) those blocs.
- A simple neural network, to perform dimensionality reduction: either a back-propagation ANN or a self-organizing map (which is trained by competitive learning rather than back-propagation). These would be especially handy for creating visually interesting representations of data from n-dimensional feature vectors.
- A general-purpose pivoting mechanism, so that different types of analysis could operate on many different kinds of variables within the system.
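To make the Karp-Rabin item in the list above a bit more concrete, here is a minimal sketch of how a rolling hash could flag passages that two documents share. It is an illustration under assumptions, not a finished design: the function names, the window length `k`, the hash constants, and the two sample sentences are all made up for the example, and the normalization here only lowercases and strips punctuation (real morphological normalization, such as stemming, is left out).

```python
import re

BASE = 257            # polynomial base (assumed; larger than any character code used)
MOD = (1 << 61) - 1   # large prime modulus to keep hash collisions rare


def normalize(text):
    """Lowercase the text and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def rolling_hashes(text, k):
    """Yield (start, hash) for every k-character window of text.

    After the first window, each hash is derived from the previous one in
    constant time by dropping the leading character and appending the next.
    """
    if len(text) < k:
        return
    high = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(1, len(text) - k + 1):
        h = (h - ord(text[i - 1]) * high) % MOD
        h = (h * BASE + ord(text[i + k - 1])) % MOD
        yield i, h


def shared_passages(doc_a, doc_b, k=20):
    """Return the set of k-character windows that appear in both documents."""
    a, b = normalize(doc_a), normalize(doc_b)
    index = {}
    for start, h in rolling_hashes(a, k):
        index.setdefault(h, []).append(start)
    matches = set()
    for start, h in rolling_hashes(b, k):
        window = b[start:start + k]
        # Compare the actual text so that a hash collision never counts as a match.
        if any(a[s:s + k] == window for s in index.get(h, [])):
            matches.add(window)
    return matches


# Hypothetical sample texts, invented for this example.
opinion = "The court holds that no man can serve two masters, and so the contract fails."
source = "No man can serve two masters: for either he will hate the one, and love the other."

found = shared_passages(opinion, source, k=20)
for passage in sorted(found):
    print(repr(passage))
```

Overlapping windows that match could then be merged into longer quoted spans, and the same routine could be run a second time on stemmed text to catch looser paraphrases.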
Beyond those algorithms, I’d also need to write a few parsers, to consume documents, annotate them with metadata, and generate feature-vectors. The feature vectors would have to be pretty general-purpose, so that they could be fed into the different analysis algorithms.
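As one example of a general-purpose feature vector, a senator’s roll-call record could be encoded as a list of numbers and handed straight to the clustering step. The sketch below uses k-means (the algorithm named in the list above) in plain Python. Everything specific in it is an assumption for illustration: the senators and votes are invented, the encoding (+1 yea, -1 nay) is one choice among many, and seeding the centroids with the first `k` vectors is only there to keep the example deterministic.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, iterations=100):
    """Lloyd's k-means, seeded with the first k vectors (a simplification).

    Returns (assignment, centroids), where assignment[i] is the index of the
    cluster that vectors[i] belongs to.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_vector(members)
    return assignment, centroids


# Hypothetical roll-call data: one row per senator, one column per bill.
votes = {
    "Senator A": [1, 1, -1, -1, 1],
    "Senator B": [-1, -1, 1, 1, -1],
    "Senator C": [1, 1, -1, -1, -1],
    "Senator D": [-1, -1, 1, 1, 1],
    "Senator E": [1, 1, -1, 1, 1],
}

names = list(votes)
vectors = [votes[name] for name in names]
assignment, centroids = k_means(vectors, k=2)

for name, vector, bloc in zip(names, vectors, assignment):
    # Distance from the bloc's centroid is a rough measure of how often
    # a senator deviates from the bloc.
    deviation = squared_distance(vector, centroids[bloc])
    print(f"{name}: bloc {bloc}, deviation {deviation:.2f}")
```

Because the vectors are just lists of numbers, the same representation could feed the other analyses too, which is the point of keeping the feature vectors general-purpose.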
I’d also need a few spiders, for collecting new documents from known sources (House, Senate, White House, Supreme Court, FindLaw.com, etc) and for feeding those documents into the appropriate parsers.
Pros:
- Although it’d definitely be a tough six months’ worth of work, it’s in a field of my expertise, and I’d enjoy it very much.
- Financially, the numbers are pretty straightforward. I’d rely on ads (as well as some promotional products like t-shirts, bumper stickers, etc), but I don’t think the single revenue stream would be problematic.
- In addition to making good money doing interesting work, I’d also feel like I was doing something important. Contributing something to society. Not only would it be fun, it’d be a civics lesson!! That’d be a definite perk for me.
Cons:
- There’s just so much data out there. It would be difficult to gather enough of it to make a really compelling, and totally comprehensive, website. Without creating the impression that the data is comprehensive, I think it would lose a lot of value.
…
This is the fifteenth of 30 business ideas that I’ll be writing about over the course of 30 days. Some of them are only intended to generate enormous sums of money for their creator, while others might actually do something for the good of society (though the enormous sums of money would still be nice). One of these ideas will become a product over the next six months, and the foundation of my new software business.

July 7th, 2006 at 2:40 am
I like this idea a lot.
July 7th, 2006 at 4:27 am
Great idea! Back in the day there used to be Yahoo! Politics, where you could find every congressperson and see every bill they voted for and which direction, what their salary was, hometown and all kinds of other statistics in a very detailed way like their Finance site. Congress’s own site is terrible for finding this information. The Washington Post is the only site that does anything vaguely similar these days.
July 7th, 2006 at 6:16 am
Apart from just political stuff,
if you could come up with a kind of framework, where you just point the framework at a data source and some mapping information (to get some sense out of the data),
then these analytics could be applied to anything. But the idea is wonderful, it will certainly empower citizens.
July 7th, 2006 at 7:05 am
Great idea, but it does sound like a tough cookie. Parsing text and trying to get the computer to make intelligent sense out of it is no easy feat! You must be _really_ good to expect v1.0 to be done in 6 months! :)
Assuming you can pull it off, how do you plan to rise above the copycats? If you are thinking “this is so difficult that not very many people can do it”, think again. I agree it is perhaps more difficult than copying a blogsearch website, but still… If your only source of income is pageviews and ads, I think your idea analysis should address the copycat issue as well.
Good luck.
July 7th, 2006 at 8:54 am
The idea rocks. It’s really good but very hard to implement, I believe. I agree with anon, you must be extremely good to make a public beta in half a year. However, chances are this site will be popular.
By the way, you forgot one more advantage. It will be relatively easy to make a similar site for other countries, which means that you’ll be able to multiply the number of visitors several times at almost no cost.
July 7th, 2006 at 4:18 pm
How about Underbelt Analytics?
Simply tracking speeches of politicians over time ( bankers, CEOs etc. ) about the never-ending promises we hear every day, that just do not turn out the way they were promised to.
Example please?:
Gerhard Schröder’s view of unemployment in Germany in 1999 ( or so :-) during his campaign ( “If I do not cut the unemployment I do not deserve to be reelected…” ), compared to the sobering reality of his achievements and later campaigns for the same things – just like the Spiegel cover earlier this year…
Showing what strong words were used when… …just reflecting public opinion of the moment.
Showing that most of the politics is just averaging opinions and not really having an opinion about anything.
Showing the determination before the big day vs. wishy-washy excuses few years ( days? ) later…
Showing who is copying whose speeches…
Showing who is consistent and who not over larger period of time than a discussion on TV…
Showing who is falling back on proven phrases…
Showing who has read too many books…
Maybe you could negotiate to buy the http://www.weathercock.com domain… :-)
Collecting from respected sources, newspapers, public corpora, so that everybody can verify that between the words of 1990 and 2000 are galaxies in distance. Transforming speech to text and having TV card tuned into all major news networks following everything -they- say.
I guess your work on massively parallel processing comes in handy here!
Pros:
- you can always sell to the other party
- you can sell exclusive rights just for a certain campaign
- you can reaaaaally personalize this! the victim will love you! ( pity billg is leaving us soon, we might not be able to collect all the material retrospectively )
- you are not doing anything illegal by repeating their own words in seemingly random order in time…. :-)
- storage and bandwidth is getting cheaper day by day ( there will be a lot to collect )
- useful for any nation/language
- you can follow any given topic
- worst case you do not sell anything, you will be bought to shut up
- you will be helping to make the world a better place
Cons:
( in cons order ASC )
- you might tend to read abc of democracy periodically searching for hidden meanings you might have missed when in college
- you might never take part in elections again
- you might get even more depressive
- you might get shot
If I had your skills and knowledge I would have started yesterday. Unfortunately I am quite tired when I get home in the evening since I have to work on a CRUD app most of the day…
But… …if you get this going, I promise to do the usability testing for you! If I survive that, I will apply for the position of the sales lead.
Good luck!
July 13th, 2006 at 3:29 am
Hey, this reminds me of a site with a slightly similar approach in the UK (maybe the neural networks stuff is not as sophisticated as in your idea, but still): http://www.theyworkforyou.com/.
There’s a blog post on sitepoint.com with some background info and further links.
“… a web app that allows UK citizens to view their local politicians performance as well as providing an interface for searching debates that have taken place in the House of Commons (the UK’s main political “forum”).
It works by re-packaging information from other online sources, such as Hansard (the transcript of all debates in Parliament) into a form real people might actually be interested in.”