Real-Time Bidding, First and Second-Price Auctions, and Transparency

There has been some debate recently on AdExchanger about the benefits of first-price versus second-price auctions. Esco Strong, the Director of Display Marketplace Strategy for Microsoft Advertising, wrote an article that basically said second-price auctions didn’t work well for single-unit RTB auctions and we should get rid of them:

Comparatively, first-price auctions are competitions where there is no reduction in clearing price for the auction winner; instead, the winner simply acquires the good they have won by paying the price of their bid. The dynamics of this type of marketplace would become much more straightforward and predictable, enabling more parties to participate and experience stable results, as well as manage their businesses to a set of expectations that won’t require constant revision.

Then Jonathan Wolf, Chief Buying Officer for Criteo, wrote a response that disputed the claims made by Esco Strong:

While I am and remain a fan of Esco, and his piece was elegantly argued, I strongly disagree with it. As I see it, there are two options in building a long-term business: by pricing transparently, or by taking unfair advantage of your customers. Only the first seems sustainable to me.

He then went on to touch on some topics relating to first and second price auction mechanisms with some input from the Business Intelligence group at Criteo.

The problem with Esco Strong’s original article is that it shows a strong lack of understanding of some critical basics of auction theory and mechanism design. Things like the Bayes-Nash Equilibrium, the Revenue Equivalence Theorem, and the construction of optimal floor prices were not really mentioned for some reason.

So, what are you to do if you want to design an auction mechanism to participate in the RTB space? I’m glad you asked. The short answer is to use a sealed bid, second price auction with a revenue-optimal reserve price. This kind of auction is nothing new and has been around since Myerson’s crucial 1981 paper Optimal Auction Design.
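To give a feel for what a revenue-optimal reserve price does, here is a minimal sketch in Python. It assumes two risk-neutral bidders whose values are independent and uniform on [0, 1], and a seller who values the item at zero; none of these numbers come from the articles above. Under those assumptions Myerson's condition r = (1 - F(r)) / f(r) gives a reserve of 0.5, and the simulation simply checks a grid of reserves against that.

```python
import random


def revenue_with_reserve(draws, reserve):
    """Average second-price revenue when bids below `reserve` are rejected."""
    total = 0.0
    for values in draws:
        ordered = sorted(values)
        highest, second = ordered[-1], ordered[-2]
        if highest >= reserve:
            # The winner pays the larger of the runner-up bid and the reserve.
            total += max(second, reserve)
        # Otherwise nobody clears the reserve and the item goes unsold.
    return total / len(draws)


# Assumption: two bidders with values drawn uniformly from [0, 1].
rng = random.Random(0)
draws = [(rng.random(), rng.random()) for _ in range(100_000)]

# Reuse the same draws for every reserve so the comparison is fair.
reserves = [i / 10 for i in range(10)]
revenues = {r: revenue_with_reserve(draws, r) for r in reserves}
best_reserve = max(revenues, key=revenues.get)

for r in reserves:
    print(f"reserve {r:.1f}: average revenue {revenues[r]:.4f}")
print("best reserve on this grid:", best_reserve)
```

The reserve trades off two things: a higher floor raises the price when the runner-up bid is low, but it also means the item sometimes goes unsold. A different value distribution would move the optimum, so treat 0.5 as a property of this toy setup only.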

Revenue Equivalence Theorem
Let’s say you want to sell something and you need to know which auction mechanism will provide you with the most revenue. There is a result in Auction Theory dating back to Vickrey in 1961 (that Myerson generalized) that says if the auction meets the following criteria:

  1. The bidder with the highest signal/valuation/whatever wins the auction.
  2. The bidder with the lowest signal/valuation/whatever expects zero surplus (i.e., they win nothing).
  3. All the bidders are risk-neutral.
  4. All valuations are drawn independently from a strictly increasing and atomless distribution.

then your choice of auction mechanism will not have an impact on your expected revenue as the seller.
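A quick simulation makes the theorem less abstract. The sketch below is in Python and assumes four risk-neutral bidders with independent values that are uniform on [0, 1]. In that textbook setting the equilibrium first-price bid is v * (n - 1) / n, while second-price bidders bid their true value. These modelling choices are mine, not something taken from either article.

```python
import random


def simulate(num_bidders=4, num_auctions=100_000, seed=0):
    """Average seller revenue under first-price and second-price rules.

    Assumes values are i.i.d. uniform on [0, 1] and bidders are risk-neutral.
    """
    rng = random.Random(seed)
    first_total = 0.0
    second_total = 0.0
    for _ in range(num_auctions):
        values = sorted(rng.random() for _ in range(num_bidders))
        highest, second_highest = values[-1], values[-2]
        # First price: the winner pays their own (shaded) equilibrium bid.
        first_total += highest * (num_bidders - 1) / num_bidders
        # Second price: truthful bids, the winner pays the runner-up's value.
        second_total += second_highest
    return first_total / num_auctions, second_total / num_auctions


first_avg, second_avg = simulate()
print(f"first-price average revenue : {first_avg:.4f}")
print(f"second-price average revenue: {second_avg:.4f}")
print(f"closed form (n - 1) / (n + 1): {(4 - 1) / (4 + 1):.4f}")
```

Both averages should land close to the closed-form value, which is the theorem at work for this particular distribution. The simulation says nothing about what happens when bidders deviate from equilibrium play.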

Bid Shading and Revenue Volatility
So, why bother with any debate about first versus second-price auctions if the revenue to the seller is equivalent? Well, because there are some problems with using a first-price auction. First of all, bidders don’t have any reason to bid their true value and instead are motivated to engage in bid shading, which just means bidding less than the item is actually worth to you, ideally just enough to beat the next-highest bidder. This kind of bidding strategy leads to more volatility in revenue for the seller, even though the long-term revenues would be equivalent by the Revenue Equivalence Theorem. Most of the people in the Finance and Controlling departments don’t really like increased volatility in the revenue, so it’s not so cool for them or anyone else involved on the business side to have to deal with it. Additionally, if you are on the buy side then you have to do all kinds of trickery and magic in order to figure out what the true market price is. It’s a waste of time and energy and it doesn’t make anybody happy.

Price Discrimination
One thing that was touched on by Jonathan Wolf was the use of what he called dynamic pricing, which is just a different way of saying price discrimination. In theory, price discrimination comes into play when you have a market for goods, the goods cannot be transferred easily or at all after being sold, and there is either only one place to get the goods or a few limited sources. In this way it is possible to charge different buyers different prices for the same item. There are different kinds of price discrimination that we encounter in our everyday lives. For example, the concept of buying in bulk in order to pay a lower cost per unit is a standard example of so-called second-degree price discrimination. If you have a monopoly on certain publishers and you are entering that inventory into an RTB system and you also have really fantastic data analysis skills, then you can take advantage of monopolistic effects and do things like first-degree price discrimination, which basically means you charge each buyer exactly the highest price that they’re willing to pay. However, price discrimination may not be advisable as a long-term business strategy as the eventual result is that people would rather do business somewhere else than deal with your games. This leads to a discussion on . . .

Transparency and Accountability
Most of my experience comes from the financial services industry, and almost all of that was in doing consulting work for mutual funds and investment advisors in the US. This industry is heavily regulated (although it needs more if you ask me) and many of these regulations are designed to provide transparency and accountability. We will need the same two things in the RTB space as computational advertising evolves. Any company that is going to provide an RTB system for others to participate in should be as clear as possible about how their auction system works, the design of the mechanism, how any conflicts of interest are handled, and so on. A lot can be learned here from the transparency and accountability rules of the finance space. The more transparent you are with regard to your RTB system, the more comfortable people will be participating. When I was at the AdTrader Conference in Hamburg last month, one of the things I discussed with quite a few attendees was concern about and desire for transparency in RTB systems. I can tell you all that any sort of black box implementation just isn’t going to cut it. I spoke with people from all sides who were skeptical about participation in such an exchange. You will have to be open and transparent about your systems (to the extent possible) if you want people to buy in and be comfortable doing business with you. This is not some new groundbreaking business philosophy, but some companies will be tempted to take advantage of the newness and lack of understanding about RTB systems in order to try to make a quick buck.

Would you buy or sell stocks on an exchange that didn’t provide transparent pricing information? I didn’t think so. So why would you try to sell people on a black box RTB auction system?

Conclusion
So, the key takeaway here is that even though first-price and second-price auctions generate equivalent revenue for sellers, first-price auctions come with a lot of baggage that doesn’t make sense to deal with. If you’re a buyer, then first-price auctions are a pain because you have to employ strategies for true price determination/bid shading, and it just adds complication. Additionally, if you are designing an RTB system, transparency and accountability are the name of the game. Be honest about how things work and the peace of mind your participants have will be rewarded with loyalty.

Posted in Computational Advertising

MAC Addresses, UDIDs, and Privacy

There has been quite a bit of fuss in the days since Apple started rejecting apps that make use of the UDID. The deprecation was announced months ago, but the rejection started without warning and was a surprise to some. Firms that had been planning for the change typically already had multiple secondary solutions in place, many of which rely on using the Media Access Control (MAC) address from the wireless network interface controller (wireless NIC) on the device. There have since been complaints that this is just as much of a privacy problem as using the UDID that Apple banned access to (keep in mind, they still have access to it), but these complaints demonstrate a lack of understanding of what a MAC address is, why it exists, and most notably the fact that it is transmitted in a plainly readable form that can be viewed by every other device on any network to which you are connected.

Some background on MAC addresses.

As noted before, MAC stands for Media Access Control, and it is a special identifier that is assigned by the manufacturer to every NIC inside electronic devices. In theory, every NIC in every device has a unique MAC address assigned to it. This has been the case for decades, so there is nothing new or revolutionary going on here. If you have a laptop with an ethernet NIC (where you plug in a cable to get on a network) and a wireless NIC (so that you can access wireless networks), and a Bluetooth controller (for keyboards, mice, etc.) then your laptop will have three distinct MAC addresses, one for each NIC. These MAC addresses are used to route the information from your device to other devices or points on the network. The MAC address is somewhat like a return address written on a letter, and due to its importance for network communication, it isn’t going anywhere any time soon. Any time you send any information over the network, say to access a web page, check your email, or any other task, this information is broken up into tiny chunks, and each chunk contains information about where it came from and where it’s going. Maybe a more concrete example would better illustrate how this works.

Say you want to mail a big book to a friend. If you used the same method to send the book that your electronic device uses to surf the web, then you wouldn’t just send one big book in one big box. What your computer/iPhone/whatever does is split the book into stacks of, say, 20 pages. Then it puts each stack of pages into its own envelope and writes your friend’s address and the return address (yours) on it, so that your friend knows where the letter came from. It also writes a number on each envelope so that your friend can reassemble the pages inside the envelopes in the correct order and make sure that they received all the envelopes required to reconstruct the complete book. People are not shocked by the fact that anyone who happens to see one of the envelopes will also see the addressing information that you wrote on them.

This process works just the same when electronic devices communicate. The difference is that the address of the sender and recipient are not a name, street, city, and country. They are other identifiers, one of which is the MAC address of your NIC. In general these pieces of information can be seen by everyone who is on the same network as you, and in some cases even people outside your network. In other words, if you are surfing the web on your phone using a wireless network in a coffeeshop, then everyone else who is connected to that wireless network can see your MAC address, and you can see theirs too. It’s 100%, completely public. It is not encrypted, anonymized, or otherwise abstracted. The MAC address from your device is broadcast as clearly as if you wrote it on a huge piece of paper and held it above your head. Everyone could easily read it, and they’d think you were crazy if you yelled at them for doing so. They would probably already think you were crazy for writing a MAC address on a huge piece of paper and holding it over your head, but you get my point.
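The book-and-envelopes picture can be written down as a toy model in Python. This is only a sketch of the analogy: real networks spread addressing and sequencing across several protocol layers, and the `Envelope` type, the function names, and the two MAC addresses below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    sender: str      # return address (hypothetical MAC address)
    recipient: str   # destination address (hypothetical MAC address)
    sequence: int    # position of this stack of pages in the book
    total: int       # how many envelopes make up the whole book
    payload: bytes   # the pages themselves


def split_message(message, sender, recipient, chunk_size=20):
    """Cut a message into numbered, addressed envelopes."""
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    return [
        Envelope(sender, recipient, number, len(chunks), chunk)
        for number, chunk in enumerate(chunks)
    ]


def reassemble(envelopes):
    """Put the envelopes back in order and check that none are missing."""
    ordered = sorted(envelopes, key=lambda e: e.sequence)
    if not ordered or len(ordered) != ordered[0].total:
        raise ValueError("missing envelopes")
    return b"".join(e.payload for e in ordered)


book = b"It was the best of times, it was the worst of times, it was the age of wisdom."
envelopes = split_message(book, "02:00:00:00:00:01", "02:00:00:00:00:02")

# Envelopes may arrive in any order; the sequence numbers fix that.
assert reassemble(list(reversed(envelopes))) == book

# Anyone who handles an envelope can read the addresses written on it.
print("seen in transit:", envelopes[0].sender, "->", envelopes[0].recipient)
```

Notice that nothing in the model hides the sender field. The payload could be encrypted, but the addressing has to stay readable for delivery to work at all.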

The Privacy Problem

So then the question becomes, is making use of information someone has publicized actually a privacy problem? I am guessing that most people who are raising the privacy issue in relation to the usage of MAC addresses don’t realize that the MAC address is being constantly broadcast by their device any time they are doing anything on any network. User privacy is important, and that’s why I fully advocate that companies making use of the MAC address should anonymize it first. By doing so they are voluntarily protecting the user.

From the opposite perspective, there is no way to easily disable or change a MAC address. You can pretend to have a different one, called MAC Spoofing, but it isn’t always easy or possible to do so. In that sense, anyone who is using a device connected to some network doesn’t really have a choice in regards to the visibility of their MAC address. The MAC address exists and has for decades, it’s pretty much required in order for most devices to communicate, and there is nothing you can easily do to get rid of it. In that sense, using the MAC address as a device identifier poses the same problems as the UDID.

The big problem here is something that is very common when technology and privacy intersect. People don’t understand the details of how the technology works, and for the most part they don’t really care. This causes people to use features that probably aren’t good for their privacy even though they shouldn’t. This ignorance also causes them to freak out about things that don’t really have as much of an impact on their privacy, because they don’t understand the technology and this lack of understanding results in fear. They will go along happily using their iPhone, checking in with Foursquare, which is cross-posted to their Facebook (or they check in with Facebook directly), and they tweet pictures with Instagram, and they think nothing of doing all of this. In reality, they are giving away an amazing amount of information about who they are, where they are, and what they are doing.

For example, if you are checking in with Facebook or Foursquare then people know where you are and where you aren’t. If you have personal photos on your (likely public) Facebook profile then someone can easily just go to wherever you are (since you checked in), wait for you to come out (since they have your photo), and follow you around. They could follow you home, wait until you leave again, see that you have checked in at work, and proceed to steal all your possessions. People are voluntarily sharing all these things, this data that could be formed into a very real threat, and people seem to be perfectly comfortable with that. Why? You could make the argument that people are comfortable sharing so much information because they are choosing to do so, but that’s not really a good argument if the choice is not a well-informed one. I would advise people to visit the Electronic Frontier Foundation (EFF), which has a specific page set up for social network privacy and security. As is often the case, the privacy argument all comes down to user education. They never knew anything about MAC addresses, and now they’re unhappy that there is some unique token that is tied to their device. They just didn’t know that it’s been this way all along.

Conclusion

So using UDIDs is not possible anymore, and that isn’t a bad thing. Many companies have switched to alternative methods, including using an anonymized version of the MAC address. The MAC address is a necessary part of communication between networked devices and is readable by all devices on the same network (and always has been), so taking the step to anonymize it (which every company should) is actually a step up for user privacy.

You don’t get to drive a car without license plates, you don’t get to send letters without writing an address on them, and you don’t really get to connect electronic devices to a network without a MAC address. You do get to choose if you will drive, write letters, or surf the web, and in these situations the user still does have the choice to opt-out. Whether or not they feel the risks outweigh the benefits is something only they can decide, but that decision should be an informed one. In the meantime, companies making use of the MAC address should keep in mind user privacy concerns, should always hash the MAC address so they are not accessing the original, and should take all possible steps to facilitate privacy and transparency for users.

Posted in Sci-Tech

UDID is gone. So what?

The last week has seen quite a bit of commotion in the mobile world as Apple has started enforcing their long-awaited deprecation of the use of the UDID. Honestly, I’m not sure what all the fuss is about. This change was announced by Apple last summer, so everyone has had nearly a year to prepare for it. The general set of questions I’ve seen on the topic can be reduced to the following.

Why did Apple do this?

Most people would simply state that the reason is related to privacy concerns, but I think that’s the short and easy answer. Before we can go deeper into this question we need to consider how the UDID is typically used by app developers and mobile advertising firms.

The problem with using the UDID is that, as the name implies, it is an identifier that is unique to the device. In that sense you can think about it like a license plate number on a car. The industry best practice for dealing with the UDID is to anonymize it before it gets sent anywhere. So let’s just say at this point that if you are using the UDID and you aren’t anonymizing it first, you’re part of the problem. Having an anonymized UDID is basically akin to having some encoded version of a license plate number from a car. This encoding is one-way, so even if you have the encoded version of the UDID, that doesn’t mean that you can just reverse the process to obtain the UDID itself. This process is called hashing, and the MD5 and SHA-1 algorithms are the ones most commonly used for this.
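For readers who have not seen hashing in practice, here is a minimal sketch using Python's standard `hashlib` module and SHA-1, one of the two algorithms named above. The `anonymize` helper and the sample identifier are made up for illustration; real UDIDs are 40 hexadecimal characters, but this one belongs to no actual device.

```python
import hashlib


def anonymize(device_id: str) -> str:
    """Return a one-way SHA-1 digest of a device identifier."""
    # Normalize first so the same device always maps to the same token.
    normalized = device_id.strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical identifier, not a real UDID.
sample_udid = "2B6F0CC904D137BE2E1730235F5664094B831186"

token = anonymize(sample_udid)
print("anonymized token:", token)

# The same input always produces the same token, so a device can be
# re-recognized without the original identifier ever leaving the phone.
assert token == anonymize(sample_udid.lower())
```

The important property is that the token is stable but is not the identifier itself, which is exactly the encoded license plate from the analogy.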

So what does having this anonymized identifier get you as an app developer? Namely, it allows you to see how many people have downloaded your app, how it is being used, and various other kinds of analytics that you may be interested in. For example, perhaps you would like to know how often people have deleted and then re-installed your app. You can find this out by sending the anonymized UDID every time your application is launched. There are plenty of useful things you can measure using this identifier from an app developer’s perspective.

The other large group that makes use of the UDID is mobile advertising firms. One obvious use for the identifier is something like frequency capping, which is just an industry term for making sure you don’t see the same ad over and over again. I think everyone is happy not to be bothered by the same ad, especially if it isn’t even relevant. Another use for the UDID is for services like conversion tracking. Say you are an advertiser and you want to run advertisements for your new app. You also would like to know how often people are clicking on your ad, and how often people who click on your ad are actually downloading the app, and how many of those people are actually running the app, and so forth. This kind of information can be obtained using the UDID. The UDID can also be used to do so-called targeted advertising, which basically looks at the devices that fit into some group and makes some guesses about other devices. In other words, users who have app A and app B installed tend to be more likely to click on advertisements for app C. This allows for better targeting of ads and provides more efficient allocation of resources for advertisers and more relevant advertising for users.
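Frequency capping is the easiest of these uses to sketch. The Python below keeps an in-memory count per (anonymized device, ad) pair. The `FrequencyCap` class, its cap of three impressions, and the sample identifiers are all hypothetical; a real ad server would also need expiry windows and shared storage.

```python
from collections import defaultdict


class FrequencyCap:
    """Limit how many times one device is shown the same ad."""

    def __init__(self, max_impressions=3):
        self.max_impressions = max_impressions
        self._counts = defaultdict(int)

    def should_show(self, hashed_device_id, ad_id):
        """Return True and record the impression if the cap is not yet reached."""
        key = (hashed_device_id, ad_id)
        if self._counts[key] >= self.max_impressions:
            return False
        self._counts[key] += 1
        return True


cap = FrequencyCap(max_impressions=3)
device = "anonymized-device-token"  # stands in for a hashed identifier

# The first three requests are served; the fourth is suppressed.
print([cap.should_show(device, "ad-42") for _ in range(4)])

# A different device, or a different ad, has its own counter.
print(cap.should_show("another-device-token", "ad-42"))
```

The sketch only works because the device token is stable across requests, which is why this feature depends on some kind of per-device identifier.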

Now that the groundwork has been laid on UDIDs, how they’re used, and why they’re used, we can begin to have a discussion about privacy issues.

I cannot pretend to know what went on in the debates leading up to the decision to deprecate the UDID, but here is the basic idea as I see it. Many developers were not anonymizing the UDID before making use of it, and this is bad no matter how you slice it. Secondly, people tend to be uncomfortable with the idea of “being tracked” and don’t always have a firm concept of what that means. For example, if you know an anonymized version of my license plate number, and you can’t see original license plate numbers, then you really don’t have any useful information about me. If you see my anonymized license plate in different places on different days then you’ll be able to re-recognize me, but that doesn’t mean you know who I am or where I live. Lastly, the UDID is something that is hardware based and therefore can’t really be changed or deleted from the device. In other words, it’s too permanent. All that being said, Apple didn’t want to be seen as making it easy for people to “be tracked.” I don’t blame them for that, but deprecating the UDID doesn’t change anything, as we’ll see.

Potential Replacements

Okay, so apparently the gospel says that using the UDID is bad. What happens next then? Well there are quite a few proposed solutions, some of which are below in no particular order.

  • OpenUDID is a solution that came out last year and generates a unique token for the device that is stored in the UIPasteboard and consequently is available to all apps. In this way it still becomes a unique identifier for the device, which doesn’t solve any of the privacy problems mentioned earlier. There is some opt-out functionality present, but the fact remains that it’s still one identifier that goes back to one device.
  • ODIN-1 uses the Media Access Control (MAC) address from the wireless network chip inside the device in order to generate a unique token for the device. In this way it is really no different than using the UDID as it is still a hardware-based identification system and therefore is difficult to change.
  • SecureUDID is an effort by Crashlytics to produce their own UDID replacement. Apparently there is a bit of a scuffle between the SecureUDID people and the OpenUDID people over who contributed to which technology and when. In the end, the result is that you still have a token that you can use to differentiate between devices, but the problem is that the token is generated on a per-domain basis. In other words, a publisher of multiple apps will be able to use the same token in all their apps, but another publisher will not be able to see the same token. If you are an ad network and you process ad requests from multiple apps, then you will not be able to tell that two requests from different apps actually came from the same device except under very specific circumstances. In the end, this solution is a non-starter in cases where you need to be able to identify the same device across apps from different publishers.
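The difference between a device-wide token and a per-publisher token can be shown in a few lines of Python. This is an illustration of the general idea only and is not how SecureUDID is actually implemented. The `scoped_token` function, the device secret, and the publisher domains are all assumptions of mine.

```python
import hashlib
import hmac


def scoped_token(device_secret: bytes, publisher_domain: str) -> str:
    """Derive a token that is stable per device but differs per publisher."""
    return hmac.new(
        device_secret, publisher_domain.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Hypothetical per-device secret that never leaves the device.
secret = b"example-device-secret"

token_a1 = scoped_token(secret, "com.publisher-a")
token_a2 = scoped_token(secret, "com.publisher-a")
token_b = scoped_token(secret, "com.publisher-b")

# Apps from the same publisher see the same token for this device.
print("same publisher matches:", token_a1 == token_a2)

# An ad network receiving both tokens cannot tell they are one device.
print("different publishers match:", token_a1 == token_b)
```

This is why a per-domain scheme is friendlier to privacy and, at the same time, a non-starter for anyone who needs to recognize a device across apps from different publishers.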

There are other solutions out there, but the last one brings us to the key issue in the matter. The only way for the ecosystem to perpetuate is with one unique identifier per device, and that’s the point that is at odds with the privacy argument. The reason that there are so many apps available for iOS devices is that developing them can be a profitable endeavor. The app developer will try to make money by selling their app for some price, or by having some in-app advertising that generates revenue for them. This is why so many apps can be offered for free. If Apple started rejecting every app that was able to identify a device uniquely, people would be forced to develop for other platforms, and that would be horrible for Apple. Imagine having an iPhone or an iPad without any apps. So Apple has to walk a fine line now by addressing concerns of privacy advocates and also addressing concerns of those who make a living by developing apps on their platform.

What next?

I think most people in the industry will move towards using the anonymized MAC address. This is the easiest change to make and the one that is most likely to be done by most industry players. This also allows for a device-specific identifier and therefore all of the previous products and services that were using the UDID can convert to using the MAC address without issue. The problem with this approach is that it is fundamentally no different from using the UDID and therefore likely to receive push-back from Apple at some point. In the end there will (hopefully) be an industry-wide solution that provides the required level of identification on a per-device basis, but is also something that can be cleared by the user. I think that will be the best solution for everyone involved.

Additionally, mobile websites and app developers should take additional steps to clearly communicate to their users what kind of information is being collected, why it is being collected, how it will be handled, and so forth. Opt-out capability is also something that will need to become the rule rather than the exception. This doesn’t have to be anything complicated, and could be as simple as app developers displaying a Terms of Use window to the user when the app is run for the first time. These terms should be very clear on anything related to data collection, use, and storage, and allow the user to choose not to provide any data at all. At that point it will be up to the developer/publisher to decide if they want to allow the user to continue to use the app or if they want to make the use contingent on the ability to display ads. That’s how they’re making money from their trade.

Conclusion

So the UDID is gone now. So what? This was announced months ago, and in the end it doesn’t really change anything. The need to uniquely identify devices is greater than ever, and the number of apps in the iOS App Store is rapidly growing (over 500,000 at the moment). Apple won’t risk alienating their developers, who are an integral part of the reason that people buy iOS devices in the first place (There’s an App For That(tm)). The goal then is to be as transparent as possible and move toward solutions that allow users to opt out of data collection, even though there should be no personally-identifiable information being collected anyway. That will be the only solution to the problem of balancing the need to uniquely identify a device and also respect the privacy and desires of users. The downside may be that app developers will only allow their apps to be used by those that opt in since they need to put food on the table. Just like there is no free lunch, there are no free apps.

Disclaimer: The opinions presented here are my own and are not in any way endorsed by any other entity.

Posted in Sci-Tech

Installing GHC 7.4.1 and Cabal

I wanted to try out the newest version of GHC and decided it may be a good time to get rid of all the things installed by the Haskell Platform, and any associated libraries, and start from scratch. Installing GHC 7.4.1 was easy because there is a MacOS binary available, but that leaves you with no package management via Cabal. You can download the cabal-install package, but its instructions do not work for GHC 7.4 because the version of cabal-install that is on Hackage is too old to be used with the newest GHC. After much fiddling with various cabal configuration files to try and get around the maximum version caps on various packages (I hate this about Cabal/Hackage), I finally figured out that I needed the development version of cabal-install in order to get it to build properly. This code repository is freely available, but via darcs and not Git. Oh well. So after downloading cabal-install from the repository and building, everything will work without issue. Below are the precise steps I used.

Before we install the new GHC and Cabal, let’s remove all previous versions of GHC, the platform, and any cabal libraries and files with a short shell script. Be advised that this script deletes everything. It shows no mercy.

#!/bin/bash
set -x
sudo rm -rf /Library/Frameworks/GHC.framework
sudo rm -rf /Library/Frameworks/HaskellPlatform.framework
sudo rm -rf /Library/Haskell
rm -rf ~/.cabal
rm -rf ~/.ghc
rm -rf ~/Library/Haskell
find /usr/bin /usr/local/bin -type l | \
  xargs -If sh -c '/bin/echo -n f /; readlink f' | \
  egrep '//Library/(Haskell|Frameworks/(GHC|HaskellPlatform).framework)' | \
  cut -f 1 -d ' ' > /tmp/hs-bin-links
sudo rm -f `cat /tmp/hs-bin-links`

  • Download and install the appropriate GHC binary for your platform, which in my case was MacOS.
  • Install darcs.
  • Pull the current version of the cabal source from the repository with darcs get --lazy http://darcs.haskell.org/cabal/
  • cd into the directory into which you cloned cabal, make the bootstrap.sh file executable (chmod +x bootstrap.sh), and run it.
  • Make sure ~/.cabal/bin is in your $PATH.
  • If you want to use cabal-dev sandboxing, then run cabal install --force-reinstalls cabal-dev. If you leave off the --force-reinstalls flag you’ll get version mismatch complaints.

That’s it. Now you will have a full and clean install of GHC 7.4.1 and Cabal 1.14.

Posted in Math and Science

On “Big Data” and Spurious Correlations

I didn’t have time to mention it last week, but even though I am happy that the New York Times wrote an article on big data, I think the most interesting part was at the end:

Big Data has its perils, to be sure. With huge data sets and fine-grained measurement, statisticians and computer scientists note, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.”

Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions. It offers a high-tech twist on an old trick: I know the facts, now let’s find ’em. That is, says Rebecca Goldin, a mathematician at George Mason University, “one of the most pernicious uses of data.”

The warning was embedded on the last page of a 3-page article, a mere 3 short paragraphs from the end. I understand that the piece was designed to be rather lighthearted and to focus more on job opportunities that are present in such a growing field, but more needs to be said about this peril of analysing very large sets of data.

Humans already have a long list of cognitive biases, which I call brain failures, that come up in our daily lives. These brain failures have become increasingly problematic along with the increase in access to information. Humans love to find similarities between things, and those pattern recognition skills are one thing that has allowed us to survive this long. If Ug the caveman ate berries and then Ug got sick, you would assume that the berries were the cause and therefore avoid them, thus potentially saving you from illness and death. In this way, by natural selection we have evolved to become very sensitive to correlations that do not exist or exist but do not have an effect on the situation under consideration. This is especially apparent in the financial sector, where spurious relationships abound. On the surface it can be pretty obvious that the correlations are spurious, but that doesn’t stop people from demonstrating that they are supported by data. For example, consider the Super Bowl Indicator, which says that if an original NFL team wins the Super Bowl then stocks will rise in the coming year, and if not then they will fall. This already sounds pretty ridiculous, but consider that it also has a 79% accuracy rate. A perfect example of a spurious correlation.

There are other crazy correlative indicators of stock market prices, like the Sports Illustrated Swimsuit Issue Indicator, which says that the stock market will have above-average returns in years that an American model is on the cover of the Sports Illustrated Swimsuit issue.

The point is that as humans have the ability to amass and analyze larger and larger sets of data, they will increasingly discover correlations that are spurious, and data scientists or those who work with “Big Data” should be very aware of this problem. At this point the discipline is still occupied by those with a strong scientific and mathematical background who therefore already have some critical-thinking-based immunity to spurious correlations from training or past exposure, but as tools and techniques for data analysis become more accessible to the average person the problem of succumbing to spurious correlations will be more pronounced. I’m not scolding the New York Times for not putting that at the beginning of their article, but I think it would be good to put more emphasis on the careful analytical skills required in “Big Data” work.
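It is easy to manufacture a spurious correlation on demand. The Python sketch below generates one random "target" series and then searches a pile of equally random candidate series for the one that correlates best with it. Everything here is pure noise by construction; the series lengths, the candidate count, and the function names are my own choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def best_spurious_match(num_candidates=2000, num_points=20, seed=0):
    """Strongest absolute correlation found between unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_points)]
    best = 0.0
    for _ in range(num_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(num_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


for count in (10, 100, 2000):
    print(f"{count:5d} candidates -> best |r| = {best_spurious_match(count):.2f}")
```

The more candidate series you search, the stronger the best match looks, even though no series has anything to do with the target. That is the needle-shaped straw from the quote above: with enough haystack, something will always look like a needle.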


On Using PostgreSQL as a Datastore for R

Sometimes when working with large datasets, the files are too large to load into memory on a local machine, and it becomes convenient to load the data into a database like PostgreSQL and have R pull the data from that backend. This isn’t a particularly difficult thing to do, but there are a couple of small steps that are critical.

If you already have R and PostgreSQL installed from the standard packages available at their websites, then proceed. Otherwise, you can download them from the R Project for Statistical Computing and from the PostgreSQL website. When you install PostgreSQL, make sure that you install the JDBC database driver using the Stack Builder that launches after the PostgreSQL installation is complete.

Now that you have working R and PostgreSQL installations, you can install the RpgSQL library, which will allow R to interact with PostgreSQL.


install.packages('RpgSQL');

You must also complete a step that I missed the first time around but that is critical: telling R where it can find the JDBC jar file that was installed by the Stack Builder. To do so, add this line to the file ~/.Rprofile if you are using all the default installation options on Mac OS X (tested with 10.6.8):


options(RpgSQL.JAR = "/library/postgresql/pgjdbc/postgresql-9.0-801.jdbc4.jar")

If you want to create the file and add the required option line all at once, you can copy and paste the following command in your terminal:


echo "options(RpgSQL.JAR = \"/library/postgresql/pgjdbc/postgresql-9.0-801.jdbc4.jar\")" > ~/.Rprofile

Pay careful attention to the backslash characters ( \ ) that come before the quotation marks. They are required so that the line written to the file keeps its quotation marks. Also note that > replaces any existing ~/.Rprofile; use >> instead to append the line to a file you already have.

Now you can load the library and start querying against your PostgreSQL database right from R.


library(RpgSQL)

dbconnection <- dbConnect(pgSQL(), user = "postgres", password = "your_pgsql_password", dbname = "your_db_name", host = "localhost", port = "5432")

var1 <- dbGetQuery(dbconnection, "select * from tablename;")

Using this method, you can work with extremely large datasets in PostgreSQL that would be unmanageable if you tried to load them into memory. There are other solutions for working with such files in R, using things like memory-mapped files and the ff/bit libraries, so I encourage you to indulge any curiosities on the topic.


Don’t Ship What Doesn’t Work

If you are involved in workflow engineering, technology-related or otherwise, do everyone a favor and don’t pass something to the next stage if the current stage isn’t done properly. Shigeru Miyamoto, the creator of Mario and the Legend of Zelda, knows this. Take his words to heart:

“A late game is only late until it ships. A bad game is bad until the end of time.”
–Shigeru Miyamoto

If you pass a project on to the next stage of the workflow when it isn’t right, eventually it will have to come back and be repaired. Just save everyone the effort and make it correct right now.


Downloading Financial Data from Yahoo Finance using Ruby and Yahoo Query Language

Here we provide a quick overview of how you can obtain various pieces of financial data from Yahoo Finance.

We use the Ruby Programming Language for this example, but you can use any language you want. Additionally we use Yahoo Query Language, which is similar to SQL but allows you to obtain and manipulate data from web services rather than a database. Using Ruby and YQL we can easily obtain data that can be analysed further. In fact, this technology is used by Flux Financial in the data analysis and forecasting process.

Here is a very simple example:


#!/usr/bin/env ruby

require 'json'
require 'net/http'

def get_yql_data(ticker)

  url = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%3D%22#{ ticker }%22&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback="

  resp = Net::HTTP.get_response(URI.parse(url))
  result = JSON.parse(resp.body)

  return result

end

puts JSON.pretty_generate(get_yql_data("YHOO"))

I’ll go through this code briefly and explain what is going on.


#!/usr/bin/env ruby

require 'json'
require 'net/http'

The first line tells the shell to run the script with the Ruby interpreter. The two require lines load the json library, so that we can parse the data that comes back from Yahoo, and the net/http library, so that we can send and receive data to and from the Yahoo web server.


def get_yql_data(ticker)

  url = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%3D%22#{ ticker }%22&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback="

  resp = Net::HTTP.get_response(URI.parse(url))
  result = JSON.parse(resp.body)

  return result

end

There’s a lot going on in this section, so we’ll go through it line by line. The first line defines a function called get_yql_data, which takes ticker as an argument. In the next line we define a variable called url that holds the URL for the Yahoo Finance data we want to obtain. You’ll notice that there is a #{ ticker } in the url variable, and this is where the ticker we provided will be substituted. So in the case of IBM, the url variable will have IBM instead of #{ ticker }.
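
The percent-encoded URL is hard to read, so here is a hedged sketch of how the same kind of URL could be assembled from a readable YQL statement. The helper name build_yql_url is my own and not part of the original script. This version leaves out the empty callback parameter and encodes spaces as + rather than %20, on the assumption that the service accepts standard form encoding.

```ruby
require 'uri'

# Hypothetical helper: builds the request URL from a readable YQL statement
# instead of a hand-encoded string. URI.encode_www_form percent-encodes the
# quotes, equals sign, colons, and slashes for us.
def build_yql_url(ticker)
  yql = %(select * from yahoo.finance.quotes where symbol="#{ticker}")
  query = URI.encode_www_form(
    'q' => yql,
    'format' => 'json',
    'env' => 'store://datatables.org/alltableswithkeys'
  )
  "http://query.yahooapis.com/v1/public/yql?#{query}"
end

puts build_yql_url('IBM')
```

Letting the library do the encoding also means a ticker that contains special characters cannot break the URL.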

Now we have the url variable constructed properly, so we know what URL to request from Yahoo. The next line does a couple of things at once. First we take the url variable and parse it using the URI.parse method, which turns the string into a URI object. That object is then fed into Net::HTTP.get_response, which requests the URL we’ve constructed and gets the response from Yahoo. After we have the response, we have to get it into a format we can work with, so we use JSON.parse to convert the body of the response, which is obtained by resp.body (using the body method on the resp variable). Now we have a Ruby hash containing all of our information, which gets returned at the end of the function.


puts JSON.pretty_generate(get_yql_data("YHOO"))

In this section we have three things happening. First we call our get_yql_data function and supply the YHOO ticker. The result of that function (our data from Yahoo) then gets reformatted by JSON.pretty_generate so that it is more readable, and finally puts, well, puts the result on the screen.

Now the results we get from running this provide an enormous amount of information that can be used for all kinds of stock modeling and forecasting processes. This is what we get when we run the above script:


{
"query": {
"count": 1,
"created": "2011-07-30T14:53:09Z",
"lang": "en-US",
"results": {
"quote": {
"symbol": "YHOO",
"Ask": "13.88",
"AverageDailyVolume": "29985900",
"Bid": null,
"AskRealtime": "13.88",
"BidRealtime": "0.00",
"BookValue": "9.752",
"Change_PercentChange": "-0.40 - -2.96%",
"Change": "-0.40",
"Commission": null,
"ChangeRealtime": "-0.40",
"AfterHoursChangeRealtime": "N/A - N/A",
"DividendShare": "0.00",
"LastTradeDate": "7/29/2011",
"TradeDate": null,
"EarningsShare": "0.886",
"ErrorIndicationreturnedforsymbolchangedinvalid": null,
"EPSEstimateCurrentYear": "0.75",
"EPSEstimateNextYear": "0.87",
"EPSEstimateNextQuarter": "0.22",
"DaysLow": "13.04",
"DaysHigh": "14.07",
"YearLow": "12.94",
"YearHigh": "18.84",
"HoldingsGainPercent": "- - -",
"AnnualizedGain": null,
"HoldingsGain": null,
"HoldingsGainPercentRealtime": "N/A - N/A",
"HoldingsGainRealtime": null,
"MoreInfo": "cnsprmiIed",
"OrderBookRealtime": null,
"MarketCapitalization": "17.140B",
"MarketCapRealtime": null,
"EBITDA": "1.430B",
"ChangeFromYearLow": "+0.16",
"PercentChangeFromYearLow": "+1.24%",
"LastTradeRealtimeWithTime": "N/A - 13.10",
"ChangePercentRealtime": "N/A - -2.96%",
"ChangeFromYearHigh": "-5.74",
"PercebtChangeFromYearHigh": "-30.47%",
"LastTradeWithTime": "Jul 29 - 13.10",
"LastTradePriceOnly": "13.10",
"HighLimit": null,
"LowLimit": null,
"DaysRange": "13.04 - 14.07",
"DaysRangeRealtime": "N/A - N/A",
"FiftydayMovingAverage": "14.7353",
"TwoHundreddayMovingAverage": "16.1821",
"ChangeFromTwoHundreddayMovingAverage": "-3.0821",
"PercentChangeFromTwoHundreddayMovingAverage": "-19.05%",
"ChangeFromFiftydayMovingAverage": "-1.6353",
"PercentChangeFromFiftydayMovingAverage": "-11.10%",
"Name": "Yahoo! Inc.",
"Notes": null,
"Open": "13.85",
"PreviousClose": "13.50",
"PricePaid": null,
"ChangeinPercent": "-2.96%",
"PriceSales": "3.17",
"PriceBook": "1.38",
"ExDividendDate": null,
"PERatio": "15.24",
"DividendPayDate": null,
"PERatioRealtime": null,
"PEGRatio": "1.47",
"PriceEPSEstimateCurrentYear": "18.00",
"PriceEPSEstimateNextYear": "15.52",
"Symbol": "YHOO",
"SharesOwned": null,
"ShortRatio": "2.30",
"LastTradeTime": "4:00pm",
"TickerTrend": " ====== ",
"OneyrTargetPrice": "17.86",
"Volume": "67798408",
"HoldingsValue": null,
"HoldingsValueRealtime": null,
"YearRange": "12.94 - 18.84",
"DaysValueChange": "- - -2.96%",
"DaysValueChangeRealtime": "N/A - N/A",
"StockExchange": "NasdaqNM",
"DividendYield": null,
"PercentChange": "-2.96%"
}
}
}
}

This framework could be easily extended to obtain data for multiple stocks. From there, you can use your imagination regarding what to do with all the data.
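
As a rough sketch of that extension, the code below pulls a few fields out of the parsed hash for each ticker in a list. The names to_number, summarize_quote, and quotes_for are hypothetical, and the fetching step is passed in as a block so the sketch does not depend on the network; in practice the block would call get_yql_data.

```ruby
require 'json'

# Converts Yahoo's string values to numbers; returns nil for null or
# non-numeric values such as "N/A".
def to_number(text)
  Float(text)
rescue ArgumentError, TypeError
  nil
end

# Picks a few fields out of the hash returned by JSON.parse.
def summarize_quote(data)
  quote = data.dig('query', 'results', 'quote')
  return nil if quote.nil?

  {
    name: quote['Name'],
    last_trade: to_number(quote['LastTradePriceOnly']),
    pe_ratio: to_number(quote['PERatio'])
  }
end

# Calls the supplied block once per ticker and collects the summaries.
def quotes_for(tickers)
  tickers.each_with_object({}) do |ticker, summaries|
    summaries[ticker] = summarize_quote(yield(ticker))
  end
end

# Stand-in for a real response, trimmed from the output shown above.
sample = JSON.parse(
  '{"query":{"results":{"quote":{"Name":"Yahoo! Inc.",' \
  '"LastTradePriceOnly":"13.10","PERatio":"15.24"}}}}'
)

puts quotes_for(['YHOO']) { |_ticker| sample }
```

Replacing the block body with get_yql_data(ticker) would run the same summary against live responses for each ticker.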

Flux Financial uses a similar process to obtain data for thousands of stocks per day and feeds this data into other models in order to produce closing range forecasts.

If you are trying to obtain various kinds of data on stocks then hopefully you found this helpful, and it’s free to boot. Any questions can be asked in the comments or through our contact page.


The Dismal Jobs Report, Economic Recovery, and Normalcy Bias

This was originally posted at Flux Financial on 7/8/11.

Well, everyone is reacting with shock and amazement at the negative June jobs report, which showed that, contrary to the 125,000 jobs that even the most pessimistic economists expected to be added to the economy, the actual number was about 18,000. This shouldn’t be surprising, as this recovery will take a long time. In fact, it will take longer than most people expect, and that is due in part to something called normalcy bias.

Normalcy bias is a cognitive bias, or brain failure as I call them, that is of particular interest given the global financial crash that occurred in 2008 and the ensuing recovery. The normalcy bias is a mental state that people enter when faced with a disaster. It leads them to underestimate both the possibility that a disaster will occur and the possible effects of the disaster itself. The result is inadequate preparation for disasters, because people tend to believe that since such a disaster has never occurred before, it never will. In addition, it makes coping with the disaster exceedingly difficult once it starts. Lastly, and perhaps most importantly, the normalcy bias causes people to interpret warnings as optimistically as possible by using ambiguities to conclude that the situation is not as serious as it seems. Consider the statement from Bruce McCain, chief investment strategist at Key Private Bank, who said “It was obviously a shock, although in retrospect, I don’t think we should be inordinately surprised by the report considering the weakness in the second quarter.” This is a textbook example of optimistic interpretation of information due to normalcy bias.

Another good example of the normalcy bias was seen during Hurricane Katrina. Even after mandatory evacuation orders were given, and the situation in New Orleans had become dire, there were thousands of people who refused to evacuate the city. Many people in New Orleans lost their lives because of their inability to overcome the normalcy bias. They were simply convinced that everything would be fine.

Overcoming the normalcy bias can be difficult, since there is a fine line between identifying and planning for potential issues and having a constant doomsday attitude. In addition, many financial disasters are difficult to predict and therefore accepting and adapting to conditions is more important than identifying potential problems and planning for them.

In the case of the economic recovery and the jobs report, normalcy bias causes people to underestimate the severity of the issue, and overestimate the speed with which a recovery will take place. When the normalcy bias is too prevalent, we get situations like the one that unfolded today wherein people who think a recovery is in full swing are hit square in the face with the stark realities of the situation.

The recovery will take time. A long time. Don’t succumb to the normalcy bias and conclude that things will return to normal with surprising expediency. Do all you can to benefit from the recovery as it occurs. In the meantime, make sure you are investing in yourself, specifically your transferable skills, so that when the economy is in full swing again you will be marketable to many employers.


Today in History

On 22nd June, 1633, Galileo Galilei was found guilty of heresy (crimes against the church) for “holding as true the false doctrine taught by some that the sun is the center of the world.” In other words, the church was prosecuting him for saying that the Earth revolves around the Sun. For this he was required to “abjure, curse, and detest” those crimes and was sentenced to house arrest for the remainder of his life.
