Scrape Wikipedia with Python to get the SMI constituents, then use Yahoo Finance to get their historical data

As part of the course "Certified Financial Data Scientist", I wanted to get the SMI constituents for further use.

To get the weightings and ticker symbols of the SMI constituents, the following worked like a charm:

First Step: Use Wikipedia to get the SMI constituents

import io

import pandas as pd
# remember to 'pip install wikipedia', as it isn't preinstalled in Anaconda, for example
import wikipedia as wp

# scrape the wiki page, specifically the 'Current constituents' section
html = wp.page("Swiss_Market_Index#Current_constituents").html()
# wrap the HTML in StringIO, as newer pandas versions deprecate passing a literal HTML string
tables = pd.read_html(io.StringIO(html))
try:
    df = tables[1]  # try the 2nd table first, as most pages start with a contents table
except IndexError:
    df = tables[0]
print(df.to_string())

# save the SMI constituents table from Wikipedia to a .csv file for further use
df.to_csv('SMI_constituents.csv', index=False)
print('saved SMI constituents to SMI_constituents.csv')
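
Picking the table by position is fragile, because the index shifts whenever the page layout changes. Below is a minimal sketch of selecting by column name instead. The helper `pick_table` is hypothetical, and it assumes the constituents table is the only one with a 'Ticker' column. The two small frames stand in for the list that `pd.read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_column):
    """Return the first DataFrame that has a column named required_column."""
    for table in tables:
        if required_column in table.columns:
            return table
    raise ValueError(f"no table with a '{required_column}' column found")

# placeholder tables, standing in for the result of pd.read_html(...)
tables = [
    pd.DataFrame({"Contents": ["History", "Rules"]}),
    pd.DataFrame({"Company": ["Example AG", "Sample SA"], "Ticker": ["EXA", "SMP"]}),
]

constituents = pick_table(tables, "Ticker")
print(constituents)
```

If no table matches, the helper raises a `ValueError` instead of silently returning the wrong table.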

Note: the weightings on the Wikipedia page do not add up to 100%; they come to 99.24%. Partners Group is listed with a weighting of 0%, and the ranks do not match the order of the weightings.
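
These checks can be run directly on the scraped table. Below is a minimal sketch, assuming numeric columns named 'Rank' and 'Weighting in %' (hypothetical names, adjust them to the actual headers). The rows are made-up placeholders, not the real SMI figures.

```python
import pandas as pd

# placeholder rows, not real SMI data; the column names are assumptions
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Weighting in %": [40.0, 25.0, 34.0, 0.0],
})

weights = df["Weighting in %"]

# 1) do the weightings add up to 100%?
total = weights.sum()
print(f"total weighting: {total:.2f}%")

# 2) which constituents are listed with a zero weighting?
zero_weight = df.loc[weights == 0, "Name"].tolist()
print("zero weighting:", zero_weight)

# 3) do the published ranks match the order of the weightings?
expected_rank = weights.rank(ascending=False, method="first").astype(int)
mismatched = df.loc[df["Rank"] != expected_rank, "Name"].tolist()
print("rank mismatches:", mismatched)
```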

Second Step: Use Yahoo Finance to get the historical quotes for our SMI constituents

# import the data science and utility libraries we need
import os
import pandas as pd
import pandas_datareader as dr

# create a data sub-directory inside the current directory - this is where we will put the SMI constituents data
data_directory = './data'
os.makedirs(data_directory, exist_ok=True)

# import the SMI constituents from our Wikipedia-scraped .csv file
smi_constituents = pd.read_csv('SMI_constituents.csv', encoding='utf-8')
print(smi_constituents['Ticker'])

# ticker symbols of the SMI constituents to be retrieved from Yahoo Finance
symbols = smi_constituents['Ticker']

# determine and print the number of ticker symbols
no_input_stocks = len(symbols)
print(no_input_stocks)

# period we want to get price data for
start_date = '2000-01-01'
end_date = '2021-02-19'

# retrieve the SMI index itself (^SSMI), keep only the adjusted closing price
# and rename that column after the index
stock_data_price = dr.data.DataReader('^SSMI', data_source='yahoo', start=start_date, end=end_date)
stock_data_price = stock_data_price[['Adj Close']].rename(columns={'Adj Close': '^SSMI'})

# iterate over distinct ticker symbols
for symbol in symbols:
    # the code appends '.SW' to each ticker to form the Yahoo Finance symbol
    yahoo_symbol = f'{symbol}.SW'
    print(f'Retrieving {yahoo_symbol}')

    # retrieve market data of current ticker symbol
    symbol_data = dr.data.DataReader(yahoo_symbol, data_source='yahoo', start=start_date, end=end_date)

    # collect the adjusted daily closing price of current ticker symbol
    stock_data_price[yahoo_symbol] = symbol_data['Adj Close']

# Inspect the first 10 rows of the retrieved SMI constituents daily adjusted closing price data:
print(stock_data_price.head(10))

# define the filename of the data to be saved
filename = 'smi_daily_closing.csv'

# save retrieved data to local data directory
stock_data_price.to_csv(os.path.join(data_directory, filename), sep=';', encoding='utf-8')
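
A single failed download aborts the loop above. Below is a minimal sketch of a more forgiving variant. The helper `collect_adj_close` and its `fetch` parameter are hypothetical, with `fetch` standing in for a call such as `dr.data.DataReader(...)`. The fake fetcher and its symbols are made up so the example runs without network access.

```python
import pandas as pd

def collect_adj_close(symbols, fetch):
    """Fetch each symbol, keep its 'Adj Close' column, and record failures."""
    prices = {}
    failed = []
    for symbol in symbols:
        try:
            prices[symbol] = fetch(symbol)["Adj Close"]
        except Exception as error:  # broad on purpose: a failure skips this symbol only
            print(f"skipping {symbol}: {error}")
            failed.append(symbol)
    # build the frame in one go; the series are aligned on their date index
    return pd.DataFrame(prices), failed

def fake_fetch(symbol):
    """Stand-in for the real download; fails for one made-up symbol."""
    if symbol == "BAD.SW":
        raise KeyError("no data returned")
    dates = pd.date_range("2021-01-04", periods=3, freq="D")
    return pd.DataFrame({"Adj Close": [1.0, 2.0, 3.0]}, index=dates)

prices, failed = collect_adj_close(["AAA.SW", "BAD.SW", "BBB.SW"], fake_fetch)
print(prices)
print("failed:", failed)
```

The list of failed symbols can then be retried or checked by hand.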

Screenshot from the command-line interface (Anaconda CMD). Note how Nestlé, even though it existed, has no data for a few years, which shows that Yahoo Finance has some data gaps. It also shows why inspecting the head and tail of the data is useful: it helps catch bugs and shows where the data needs cleaning.
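
Beyond eyeballing head and tail, the gaps can be counted per column. Below is a minimal sketch on a made-up frame (placeholder values and hypothetical ticker names) in which one column starts late and another has a hole in the middle.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-04", periods=5, freq="D")
prices = pd.DataFrame(
    {
        "^SSMI": [100.0, 101.0, 102.0, 103.0, 104.0],
        "AAA.SW": [np.nan, np.nan, 10.0, 11.0, 12.0],  # data starts late
        "BBB.SW": [20.0, np.nan, 22.0, 23.0, 24.0],    # hole in the middle
    },
    index=dates,
)

# number of missing values per column
missing = prices.isna().sum()
print(missing)

# first date with an actual value, per column
first_valid = {column: prices[column].first_valid_index() for column in prices.columns}
for column, date in first_valid.items():
    print(f"{column}: first value on {date.date()}")
```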

Screenshot of the .csv file we created, opened in OpenOffice

Alternative Second Step: Collect all available historical data from Yahoo and save it to individual files named after the ticker

# import the data science and utility libraries we need
import os
import pandas as pd
import pandas_datareader as dr

# create a data sub-directory inside the current directory - this is where we will put the SMI constituents data
data_directory = './data'
os.makedirs(data_directory, exist_ok=True)

# import the SMI constituents from our Wikipedia-scraped .csv file
smi_constituents = pd.read_csv('SMI_constituents.csv', encoding='utf-8')
print(smi_constituents['Ticker'])

# ticker symbols of the SMI constituents to be retrieved from Yahoo Finance
symbols = smi_constituents['Ticker']

# determine and print the number of ticker symbols
no_input_stocks = len(symbols)
print(no_input_stocks)

# period we want to get price data for
start_date = '2000-01-01'
end_date = '2021-02-19'

# iterate over distinct ticker symbols
for symbol in symbols:
    # the code appends '.SW' to each ticker to form the Yahoo Finance symbol
    yahoo_symbol = f'{symbol}.SW'
    print(f'Retrieving {yahoo_symbol}')

    # retrieve market data of current ticker symbol
    symbol_data = dr.data.DataReader(yahoo_symbol, data_source='yahoo', start=start_date, end=end_date)

    # save each ticker symbol to a separate file
    filename = f'{symbol}-data.csv'
    symbol_data.to_csv(os.path.join(data_directory, filename), sep=';', encoding='utf-8')
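
The per-ticker files can later be combined into one table of adjusted closing prices. Below is a minimal sketch, assuming the files follow the '-data.csv' naming and ';' separator used above. The helper `load_adj_close` is hypothetical, and the two files written here contain placeholder values so the example runs without network access.

```python
import glob
import os
import tempfile

import pandas as pd

def load_adj_close(directory):
    """Read every '<ticker>-data.csv' in directory into one frame of 'Adj Close' columns."""
    columns = {}
    for path in sorted(glob.glob(os.path.join(directory, "*-data.csv"))):
        # strip the '-data.csv' suffix to recover the ticker
        ticker = os.path.basename(path)[: -len("-data.csv")]
        data = pd.read_csv(path, sep=";", index_col=0, parse_dates=True)
        columns[ticker] = data["Adj Close"]
    return pd.DataFrame(columns)

# write two small placeholder files, then load them back
with tempfile.TemporaryDirectory() as directory:
    dates = pd.date_range("2021-01-04", periods=2, freq="D", name="Date")
    for ticker, values in {"AAA": [1.0, 2.0], "BBB": [3.0, 4.0]}.items():
        frame = pd.DataFrame({"Close": values, "Adj Close": values}, index=dates)
        frame.to_csv(os.path.join(directory, f"{ticker}-data.csv"), sep=";", encoding="utf-8")
    combined = load_adj_close(directory)

print(combined)
```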

Questions or Feedback?

Feel free to drop me a line at contact@zuberbuehler-associates.ch
