New Forum

Visit the new forum at http://godelsmarket.com/bb

Thursday, March 26, 2015

Setup Your First MySQL Stock Database

Today we’re going to setup a MySQL database, populate it with some data, and run a simple query. More complicated stuff to come, but this is a great start for those who have never used a relationship database management system (RDBMS) before.

(Note: Sign up for the new Gödel’s Market Newsletter if you find the following useful and want more articles on useful information, including interviews and research.)

Some may say that learning to use an RDBMS takes too much effort. But, depending on how you’re doing your data storage right now, an RDBMS could easily help you organize better (aren’t those CSV/TXT/Excel files piling up?), give you results faster and even give you greater insight into your data.

For this article we’re going to use MySQL. It is among the most popular RDBMS’ out there, and it is open source and free. Many sites use it, including Facebook, which serves over 1 billion people tons and tons of data.

Steps:


  1. You’ll want to download a copy at the following link:
    http://dev.mysql.com/downloads/mysql/ (make sure to select the download appropriate for the operating system you are using).
     
  2. Installation should be fairly automated. Just click next and make sure you don’t install any “Ask Jeeves” adware…Make sure you do a full install which should include the MySQL Workbench. This will be our method for working with the MySQL Server.
     
  3. Next, grab some sample data from Yahoo. Go to your favorite stock ticker, and select “Historical Data” on the left hand side. Then go down to the bottom of the page and click on “Download to Spreadsheet”. Here’s a link for MSFT: http://finance.yahoo.com/q/hp?s=MSFT+Historical+Prices This will download a file called “table.csv” which is exactly what you want. If you open it up you’ll see your stock’s data (date, open, high, low, etc).
     
  4. Now open up MySQL Workbench CE 6.* (I’m using 6.2, but they should all be very similar).  You should have a database called “local”. Double click on that. If you need to setup a database you’ll want to create one locally. There will be a “root” user which should be fine for our purposes at the moment. Assign a password and connect.
     
  5. Now that you’re connected to the database, take a look at the left side. You’ll see something called “Schemas”. We need to create a new one for stocks. You can either right click below the existing Schemas and select “Create Schema” or you can go up to the top bar and click the icon “Create a new schema in the connected server.”
     
  6. For your new schema, name it something like “stocks”. Use the default “Collation” and click “Apply”. Click “Apply” again on the creation script and again if needed. Then click “Finish” and you should have a new schema to the left.
     
  7. Click on the drop down arrow to the left of your new schema. Right click on “Tables” and choose “Create Table…” For table name choose something like “historicaldata”. Go down to Column Name and double click below it to add a new column. You’ll want a column for every row in your “table.csv” file. So one column for “Date”, “Open”, “High”, “Low”, “Close”, “Volume”, “Adj Close”.

    For Datatypes, you’ll want “DATETIME” for “Date”. For “Volume” you’ll want to use “INT”. For the others you’ll want to use “DECIMAL(10,3)”. You can use other data types, but these work well enough for what we’re doing.
     
  8. Now right click on your new table “historicaldata” under “Tables” within your newly created schema on the left hand side. You’ll want to choose “Select Rows”. This will bring up all the data you currently have in the table, which should be none. Go to the middle portion where you’ll see some icons following “Result Grid”. Go to the “Export/Import” tag, and click on the “Import records from an external file” icon.

    This will let you choose a “CSV” file for import. Before you choose “table.csv” open it up and remove the first line with the headers (i.e. Date, Open, High, etc etc). If you don’t, this will also get imported (there are ways around this, but I’m keeping this intro simple). Now you can choose  your “table.csv” file for import. It’ll take a little bit to import and then should load up the data on screen.
     
  9. Now you can run your first query. Go up to the icon with “SQL” and a little plus on it. It should be right below “File” at the top. This should open a new tab for SQL queries. Type the following and press “CTRL” + “ENTER” or click the “Execute” icon (a little lightning bolt).

    SELECT * FROM stocks.historicaldata

    This should return everything from your table. Not very useful, but it’s a start. Next try something like:

    SELECT * FROM stocks.historicaldata WHERE date > ‘2015-03-01’

    Try:

    SELECT * FROM stocks.historicaldata WHERE open > close

    I’m sure you can come up with some more on your own! There will definitely be more on this in the future.
     
These steps are only the very beginning. You can load multitudes of stock data into the table you’ve created (you may want to add a “Symbol” field to differentiate new stock symbols you add). There are many ways to analyze the data, and if you want you can even analyze it externally (say, in Python, C#, Java or even Excel).

I’ll be posting more on how to continue on from here. In the meantime, sign up for the Gödel’s Market Newsletter. Information not available on the blog is sent out through the newsletter. Click here to sign up!

Best of luck.

5 comments:

  1. let me know when you decide to show how to download all edgar data into a db program

    ReplyDelete
  2. Very promising indeed....was wondering whether you can address the below points going forward:
    -database update automation via R script: how to run a daily task to update prices in the database
    -database maintenance for stock corporate actions like dividends and stock splits: you may want to check when a corporate action is due and back-adjust the price series accordingly or fully redownload the entire databse from time to time in order to have the adjusted prices series available
    -cloud solutions like Google Cloud SQL: how to host and maintain a cloud solution

    ReplyDelete
  3. @Michael That sounds interesting and useful. Putting on my list of topics.

    @pcavatore Thank you for the suggestions. I will make it a point to address all of these topics whether through the main site or the newsletter. Thank you again!

    ReplyDelete
  4. Do you have any comments on how to best structure the database when you have multiple stocks and multiple indicators you want to follow? I have always been confused by this. Say I want to keep the monthly historical data for 10 different stocks and 2 different indicators (the RSI, and 12 month MA, plus the closing price). I see a couple of options:

    1. For each indicator, you would have a separate table. So you would have an RSI table with 10 columns(one for each ticker) and "n" rows where each row would be a date, and the same concept for the 12m MA. Then when you wanted to combine the data you would join the tables.

    2. You would just have one giant table and the columns would be for example "GOOG Close" , "GOOG RSI", "GOOG MA", etc. for each ticker, and then each row would be a different date. So if you were tracking 3 variables (close, RSI, MA) and 10 stocks you would have 30 columns.

    Is one layout better than the other? What if you had 100 different indicators...is the answer still the same? I am thinking of this is terms of needing to run a backtest at a later date using historical RSI and MA data (or other data).

    ReplyDelete
  5. @Elrond Like you mentioned, there are multiple ways to do this. For just storing data on multiple symbols I have stored all symbols in a single table with great success. You'll want to put an index on the symbol name/id and you'll probably want to play with indexes on other fields you typically use (e.g. date).

    Some would suggest using an integer id for the symbol name instead of a char/varchar/nvarchar. Less readable, but somewhat less resource intensive for data access.

    For multiple indicators, I think the first layout you mentioned is better. It would take up more space (due to extra date and symbol id columns), but it would be easier to manage.

    The second would be faster if you were going to compare RSI vs MA at a single datetime point. You wouldn't have to join saving you processing/resources. I think it's kind of ugly...but if you have a set of data that you want to run over again and and again with different parameters, you'll save some time by avoiding the joins.

    Really, database structure comes down to what you plan on using the database for :) I like keeping things easily accessible for general storage. If a backtest I start running would benefit from a different structure (without taking too much time to setup) I'll create that structure and setup the appropriate indexes. Otherwise, I just stick with a semi-normalized design.

    ReplyDelete