Mining the Web for Indicative Data
I've spent the past day learning how to mine Yahoo Finance for its data, and I gotta hand it to Yahoo-- its website is quite mine-able. R has been my weapon of choice up to this point. The question has been this: is there some way to leverage the huge databases stored on the web, either explicitly (when you simply reference a pure data file in the proper format) or implicitly (when you gather the information by doing a raw scan of a page), outside of a pure quant setting? A variant of the results may prove helpful to pure quants too, but that admittedly wasn't the goal this time around.
R is an environment used predominantly for statistical computing and data visualization. It is remarkably robust and reminds me of how powerful and useful the open source movement is for guys like us. It's not even that I have a huge list of packages I can use on top of the already expansive set of functions base R can handle. It's that I can very easily pull up the code underlying those functions and look into the guts of sometimes quite complicated machinery.
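To make that concrete, here's roughly what that inspection looks like in an R session. The only assumption is that fBasics is installed and still ships the import functions discussed below.

```r
## Printing a function's source is as simple as typing its name without
## parentheses; getAnywhere() digs out functions a package doesn't export.
library(fBasics)               # assumed installed; home of the import functions
yahooImport                    # dumps the function's body to the console
getAnywhere("keystatsImport")  # finds it even if it isn't exported
```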
In this case, the ability to pop the trunk on R packages is what saves a couple of the functions in the fBasics package-- yahooImport and keystatsImport. yahooImport was created to pull price data on an arbitrary number of stocks over an arbitrary number of days from Yahoo Finance; keystatsImport was created to pull a stock's current key statistics in similar fashion. To put it frankly, they don't work properly anymore because Yahoo has changed the way it references information on its site. By having access to their code, it took only a few hours not just to fix the bugs but to improve on the functions. For instance, why limit oneself to pulling key statistics? Why not pull industry-related information? Why not gather more detailed financial statement information? Why not gather all this information in a scalable way, so that I can see how 1000 or more stocks co-evolve? The cool thing about these functions is that they essentially give you access to a huge library of information without needing to have that data on your computer at all. You just rely on the fact that Yahoo's data is stored in a structured format-- as long as that is true (and Yahoo doesn't run bots to kill your queries), all you really need is an internet connection and zap, you've got the data in a piece of software which can cut the crap out of it.
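For flavor, here's a minimal sketch of the price-import piece. The CSV endpoint is the one Yahoo exposed at the time of writing and is an assumption going forward-- if Yahoo reshuffles its URLs again (as it did to break the fBasics functions), the string-building is the part you'd have to patch.

```r
## Minimal sketch: build the Yahoo CSV query for one symbol, read it straight
## off the web, and return a date-ordered data frame. The URL format below is
## assumed from the era of this post and may no longer be valid.
yahooPrices <- function(symbol, from, to) {
  f <- as.POSIXlt(from)
  t <- as.POSIXlt(to)
  url <- paste("http://ichart.finance.yahoo.com/table.csv?s=", symbol,
               "&a=", f$mon, "&b=", f$mday, "&c=", f$year + 1900,   # months are 0-11
               "&d=", t$mon, "&e=", t$mday, "&f=", t$year + 1900,
               "&g=d&ignore=.csv", sep = "")
  x <- read.csv(url, stringsAsFactors = FALSE)
  x$Date <- as.Date(x$Date)
  x[order(x$Date), ]
}

## Scaling up is just a loop over tickers-- the data never has to live on disk.
symbols <- c("AAPL", "MSFT", "XOM")
prices  <- lapply(symbols, yahooPrices, from = "2007-01-01", to = "2007-12-31")
names(prices) <- symbols
```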
My only real issue with Yahoo is that it's only really good at providing you with snapshot statistics when you're dealing with anything but price. There's no way for me to extend the mining to look at the past 50 years, let alone the past 20 (or 10)-- I'd get killed by survivorship bias, the fact that indicative data is very likely to change over long time horizons, and the lack of more robustly historical financial statement data. This makes it harder for value investors to reap some of the potential benefits of Yahoo mining: there's no historical perspective. There may be value in analyzing pure price data in a fashion similar to this-- linking the daily price data for, say, 2000 stocks with their corresponding indicative data, leveraging some of the other features on the site, aggregating in interesting ways, maybe going intra-day... but at this point that's not really my game.
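If I ever did go that route, the linking step itself is nothing exotic-- a merge of the price panel against an indicative table keyed on ticker. The data frames below are toy stand-ins for what the actual scrapes would return, just to show the shape of it.

```r
## Toy stand-ins for scraped output: daily closes plus a one-row-per-ticker
## table of indicative fields. merge() keys them on Symbol, giving one row per
## stock-day with the indicative columns repeated alongside.
prices <- data.frame(Symbol = rep(c("AAPL", "XOM"), each = 2),
                     Date   = as.Date(c("2007-12-03", "2007-12-04",
                                        "2007-12-03", "2007-12-04")),
                     Close  = c(100, 101, 50, 49))
indicative <- data.frame(Symbol = c("AAPL", "XOM"),
                         Sector = c("Technology", "Energy"))
panel <- merge(prices, indicative, by = "Symbol")
head(panel)
```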
Time to learn how to navigate password-protected websites, I guess. If anyone has any advice on sites or potentially useful things to find out, I would love to hear it. I make it a habit to share my results with the people who contribute, so it won't be a one-way street.
-Dan