2 weeks of work

For the last 1-2 weeks (starting from Nov 2016?) I've been working on the MS-SQL database setup at work, and I spent a lot of time learning about databases.

I had to use a plethora of resources to understand and set up a data warehouse. Here are the classes I took in the meantime, mostly through lynda.com (since it's organized):

  • Foundation of Networking: Networking basics
  • Foundations of Programming: Databases
  • Implementing a Data Warehouse with Microsoft SQL 2012
  • Microsoft SQL Server 2016 Essential Training
  • MS Access 2016 essentials
  • Installing and Administrating Microsoft SQL Server 2016
  • Foundations of CS: Digital information
  • Foundations of CS: The Internet
  • Week 5 of Harvard's CS50

A lot of this information I already knew, but I decided to relearn it anyway just to make sure I understood it. I picked these courses after auditing some 200+ courses and finding the relevant information. Some courses I only finished partway, or I skimmed the relevant parts I cared about. I put all my notes into app.classeur.io, although some files got rather big (500 MB of RAM to load a page), something I need to work on later.

In addition to these resources, I posted questions on Stack Exchange and Reddit so I could understand these concepts better:

  • https://dba.stackexchange.com/questions/156214/query-results-from-a-table-from-a-csv-file-list
  • https://dba.stackexchange.com/questions/156069/what-do-vertical-lines-refer-to-in-database-diagram-view

Mostly, I followed the data warehouse models found on http://www.databaseanswers.org/ by Barry Williams.

The ones I picked specifically to model are these:

Understanding the data model came from here:

A synopsis of things I learned about databases, from my notes:

  • ETL - extract, transform, load. Basically, extracting data from one database, transforming it through SQL queries or by selecting specific fields, and loading it into another application
  • ACID - atomicity, consistency, isolation, durability. Describes the guarantees of a database transaction. ATOMICITY - if one part of a transaction fails, the whole thing fails. CONSISTENCY - constraints, triggers, and cascades stay in effect so the data remains valid after every change. ISOLATION - a transaction's changes are closed off from other applications until it commits. DURABILITY - committed data survives crashes and errors
  • CRUD - create, read, update, delete. The basic operations (and permissions) given to users on databases
  • OLAP - online analytical processing - part of BI (think Excel pivot tables). Basically, it takes data and tags it along different dimensions (a multidimensional database) to identify trends. Mostly used in large corporations, analysis, and BI http://searchdatamanagement.techtarget.com/definition/OLAP
  • OLTP - online transaction processing. Essentially, processing day-to-day data with queries against a normalized (3NF) schema
  • 1NF - no repeating values or repeating groups. 2NF - no non-key values that depend on only part of a composite key. 3NF - no non-key values that depend on other non-key values
  • EER vs ER diagrams. An ER diagram just shows the basic structure (table relationships), while an EER diagram also describes those relationships in more detail https://stackoverflow.com/questions/9906017/difference-between-er-diagram-and-eer-diagram
  • Referential integrity - every value in a foreign-key column must exist in the column it references in the other table
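The ETL bullet above can be sketched end-to-end. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for SQL Server; the table and column names are made up for the example:

```python
import sqlite3

# Hypothetical source and target databases (in-memory for illustration).
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

# Extract: a source table of raw product rows (prices stored as text).
src.execute("CREATE TABLE raw_products (model TEXT, list_price TEXT)")
src.executemany("INSERT INTO raw_products VALUES (?, ?)",
                [("AB-100", "19.99"), ("AB-200", "24.50")])

# Target table in the warehouse, with a proper numeric price column.
dst.execute("CREATE TABLE products (model TEXT PRIMARY KEY, list_price REAL)")

# Transform + Load: cast prices to numbers while copying across.
rows = src.execute("SELECT model, list_price FROM raw_products").fetchall()
dst.executemany("INSERT INTO products VALUES (?, ?)",
                [(model, float(price)) for model, price in rows])
dst.commit()

print(dst.execute("SELECT model, list_price FROM products ORDER BY model").fetchall())
# [('AB-100', 19.99), ('AB-200', 24.5)]
```

In a real MS-SQL setup the same pattern would run through SSIS or an ODBC connection, with the transform step doing much heavier cleanup than a float() cast, but the shape is the same.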
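Referential integrity is easy to see in action. A small sketch, again with sqlite3 standing in for SQL Server, using hypothetical manufacturer/product tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
con.execute("CREATE TABLE manufacturers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    model TEXT,
    manufacturer_id INTEGER REFERENCES manufacturers(id))""")

con.execute("INSERT INTO manufacturers VALUES (1, 'Acme')")
con.execute("INSERT INTO products VALUES (1, 'AB-100', 1)")  # valid reference

# A product pointing at a nonexistent manufacturer violates the constraint.
try:
    con.execute("INSERT INTO products VALUES (2, 'XZ-9', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("rejected:", rejected)  # rejected: True
```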

OLAP I will just use for pricing optimization (NAW.org book); OLTP is for managing data between all the systems (catalog, ecommerce), etc.

Current setup:

Progress

So far, this is what I've completed:

  • Initial designs of the RDBMS in MS Access
  • Integration with the data export software, and setup of the MS-SQL instance and software
  • Strategy for acquiring data from web competitors (web scraping)

Objective

My objective with the data warehouse is to improve the following processes at work:

  • Version control of data. Specs are all over the place, and our Excel catalog and work software cost time and money to fix and manage
  • Ability to quickly query results. By inputting a list of model #s and pulling out data like bar codes, I can quickly load that data into PoS software (and add inventory counts afterward)
  • OLTP. By downloading updated list prices from manufacturers, I can adjust prices around November and March, when companies raise their prices by ~5%, to match in ecommerce and Adobe InDesign so there's no net loss in profits
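The "query off a list of model #s" workflow above (also the subject of my first Stack Exchange question) boils down to a parameterized IN clause. A sketch with sqlite3 and an invented catalog table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE catalog (model TEXT PRIMARY KEY, barcode TEXT)")
con.executemany("INSERT INTO catalog VALUES (?, ?)",
                [("AB-100", "0001"), ("AB-200", "0002"), ("CD-300", "0003")])

# A list of model numbers, e.g. pasted from a CSV column.
wanted = ["AB-100", "CD-300"]

# Build one "?" placeholder per model so the values are passed safely.
placeholders = ",".join("?" * len(wanted))
rows = con.execute(
    f"SELECT model, barcode FROM catalog "
    f"WHERE model IN ({placeholders}) ORDER BY model",
    wanted).fetchall()
print(rows)  # [('AB-100', '0001'), ('CD-300', '0003')]
```

On MS-SQL the same idea is often done by bulk-loading the CSV list into a temp table and joining against it, but for short lists the IN clause is enough.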

Future

I intend to integrate as many data sources as possible, mostly on the free end.

There are 3 types of data sources:

  • Internal data (work software import)
  • Created data (pricing and markups)
  • External data (web scraping)

I want to focus on managing data and extracting external data in the future. Mostly, data for category landing pages (e.g. filter by manufacturer, voltage, etc.), since the work software does not organize this easily. I outsourced an advanced Excel formula, but even then, this one isn't good enough: =TRIM(IFERROR(MID(E146,FIND(",",E146,FIND("HP",E146)-13)+1,(FIND(",",E146,FIND("HP",E146))-FIND(",",E146,FIND("HP",E146)-13))-1),""))
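What that Excel formula is trying to do - pull out the comma-delimited field containing "HP" from a spec string - is much easier to express in a general-purpose language. A sketch in Python, with made-up spec strings:

```python
def hp_field(description):
    """Return the comma-delimited field containing 'HP', or '' if absent."""
    for field in description.split(","):
        if "HP" in field:
            return field.strip()
    return ""

print(hp_field("Motor, 1/2 HP, 115V, 1725 RPM"))  # 1/2 HP
print(hp_field("Fan blade, 18 in"))               # (empty string)
```

The same split-and-scan idea generalizes to voltage, RPM, and the other filter attributes, which is hard to do with nested FIND calls in a single cell formula.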

Also, the category layouts - I don't know how I will implement a relational-database design for these. Each category has different filters for searching, and different data in those filters. It doesn't change over time, since it's fixed.
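One common relational pattern for "each category has different filters" is an entity-attribute-value (EAV) layout: one row per (product, attribute) pair instead of one column per filter, so new filters never require schema changes. A sketch with sqlite3 and hypothetical attribute names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per (product, attribute) pair instead of one column per filter.
con.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, model TEXT);
CREATE TABLE attributes (
    product_id INTEGER REFERENCES products(id),
    name TEXT,       -- e.g. 'voltage', 'manufacturer'
    value TEXT);
""")
con.execute("INSERT INTO products VALUES (1, 'AB-100')")
con.executemany("INSERT INTO attributes VALUES (?, ?, ?)",
                [(1, "voltage", "115V"), (1, "manufacturer", "Acme")])

# A category landing page can then filter on any attribute the same way.
rows = con.execute("""
    SELECT p.model FROM products p
    JOIN attributes a ON a.product_id = p.id
    WHERE a.name = 'voltage' AND a.value = '115V'""").fetchall()
print(rows)  # [('AB-100',)]
```

The tradeoff is that queries get join-heavy, but since my filter sets are fixed per category, that cost is predictable.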

So my goals are:

  • Frameworks for scraping data
  • Building relational database warehouse further
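For the scraping goal, even the standard library is enough for a first pass. A minimal sketch with Python's html.parser, assuming a hypothetical competitor page that marks prices up with <span class="price">:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect text inside <span class="price"> tags (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

# Stand-in for a fetched page body.
html = '<div><span class="price">$19.99</span><span class="price">$24.50</span></div>'
p = PriceParser()
p.feed(html)
print(p.prices)  # ['$19.99', '$24.50']
```

A real framework (Scrapy, BeautifulSoup) would handle fetching, pagination, and messier markup, but the extracted rows would feed into the warehouse the same way as any other external source.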

Resources

  • None

Vincent Tang
