Week 2: Nuts and Bolts for Data Science

DSAN 5000: Data Science and Analytics

Class Sessions
Author

Prof. Jeff and Prof. James

Published

Thursday, September 5, 2024


Computer fundamentals

A little basic computer science is very useful for all STEM fields!

Motivation

Understanding how computers work is crucial for data scientists

  • Efficient Coding: Proficiency in computer architecture helps optimize code for faster processing & memory management.
  • Algorithm Design: A grasp of hardware aids in designing algorithms tailored to the computer’s capabilities.
  • Data Handling: Efficient data storage, retrieval, & manipulation improve performance with large datasets.
  • Resource usage: Knowledge of system resources enables optimal utilization & scalability of compute power.
  • Problem Solving: Understanding hardware enables better debugging & identifying performance bottlenecks.
  • Collaboration: Effective communication with IT teams & hardware experts enhances cross-functional projects.
  • Career Versatility: Understanding opens doors to diverse roles, from machine learning to system optimization.
  • Continuous Learning: As technology evolves, foundational computer knowledge helps adapt to new tools

Note: These skills become very important in DSAN-6000 (big data & cloud computing)

Hardware

Physical components of a computer

Computer form factors

Computers come in many shapes & sizes, but they’re all basically the same inside

Hardware components

  • Broadly speaking, the following are the fundamental components of all computers:

Computer hardware

  • Read over the following at home
  • In general, computers consist of several fundamental components, including
  • Central Processing Unit (CPU): Executes instructions, performs calculations, & manages tasks, acting as the computer’s brain.
  • Motherboard: Main circuit board connecting all components, providing communication & power distribution pathways.
  • Random Access Memory (RAM): Offers fast-access memory for active programs, enhancing multitasking & performance.
  • Storage Drives: Include Hard Disk Drives (HDDs) for high-capacity storage and Solid State Drives (SSDs) for faster data access.
  • Power Supply Unit (PSU): Converts and supplies power to components, ensuring stable operation.
  • Graphics Processing Unit (GPU): Handles graphical computations, vital for video rendering, gaming, and complex visuals.
  • Cooling System: Comprises fans, heat sinks, and sometimes liquid cooling to dissipate heat and prevent overheating.
  • Case/Chassis: Houses and protects components, facilitating airflow and accommodating expansion.
  • Input/Output Ports: Enable connection to external devices, such as USB, audio, video, and networking ports.
  • Optical Drive: Reads and writes optical discs like CDs, DVDs, or Blu-rays (optional in modern systems).
  • Expansion Slots: Allow adding extra components like graphics cards, sound cards, or network adapters.
  • Operating System: Software interface managing hardware resources, enabling software execution and user interaction.

Storage vs memory

  • Memory (RAM)
    • Short term data storage
    • FAST communication with CPU
    • Data vanishes when the computer is shut-off (short term memory)
  • Storage (hard-disk)
    • Long term data storage
    • SLOWER communication with CPU
    • Data exists even when the computer is shut-off (permanent storage)

File Systems

Where does data live?

  • All data lives in a file-system on a hard-disk somewhere. You CAN’T do data science without understanding file-systems!
  • A computer file system is a structured method for storing, organizing, & managing files.

Storage Types

  • HDD (older) and SSD (newer) are the current options for computer hard-disks.

Aside: Modern buried treasure

  • In 2012, James Howells threw away a hard drive during an office clear out
  • Bitcoin was less valuable in 2012, and he forgot there were Bitcoins stored on the disk.
  • In 2022, the Bitcoin on the disk was worth an estimated 184 million dollars.
  • Howells plans to spend millions digging up a Newport landfill to find the lost hard drive.

Source: https://www.bbc.com/news/uk-wales-62381682

Overview

  • Paths & file-system familiarity is essential for accessing & moving data from servers
  • The file system is composed of directories (folders), programs, and files
    • The files contain data OR instructions for program creation
    • Files, programs, & folders have associated permissions to control user access

Directory tree

  • Files, folders, & executables are organized in a hierarchical directory tree
  • The base of the tree is called the root directory
  • The root folder is denoted by / on Unix machines; on Windows each drive has its own root, e.g. C:\

Linux directory tree

  • The following diagram shows the directory tree of a Linux computer

Paths

  • Paths are “addresses”, they let users navigate the file-system to locate files & folders
  • Paths can be either relative, i.e. a location relative to the current folder, OR absolute, i.e. a location relative to the root directory
  • The current working directory (CWD) is where you currently “are” in the tree.
  • On Unix, the CWD is denoted ./ & one level up (the parent directory, closer to the root) is denoted ../
  • The slashes are reversed on Windows \, but otherwise the concept is the same
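
The path ideas above can be tried from Python with the standard pathlib module. This is only a minimal sketch: the folder names (/home/student/project, data/raw.csv) are hypothetical, and the paths are treated as pure strings, so nothing has to exist on disk.

```python
from pathlib import PurePosixPath

# Hypothetical current working directory (CWD) on a Unix machine.
cwd = PurePosixPath("/home/student/project")

# A relative path is interpreted starting from the CWD.
relative = PurePosixPath("data/raw.csv")
print(relative.is_absolute())   # False

# Joining the CWD and the relative path gives an absolute path from the root.
absolute = cwd / relative
print(absolute)                 # /home/student/project/data/raw.csv
print(absolute.is_absolute())   # True

# ".." refers to the parent directory, i.e. one step closer to the root.
print(cwd.parent)               # /home/student
print(cwd.parent.parent)        # /home
```

On a real machine, Path.cwd() from the same module returns the actual current working directory.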

File permissions

  • System administrators control how much access different users have within the file-structure.
    • Access is based on file permissions associated with a user’s Login ID
    • Computers keep a database of which user owns each file, & which users have permission to view, edit, & execute EACH file, folder, or program.
  • Understanding basic data security is a fundamental skill in most modern careers … you don’t want to be the careless person that leaves a software vulnerability and gets your company hacked

Unix file permissions

  • Unix file permission codes are numeric representations (octal) for read, write, execute permissions, assigned to owners, groups, and others, regulating file access and security.
  • If a user has authority, they can change file permissions with the chmod command
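
To see how an octal code maps to read/write/execute permissions, the sketch below decodes a few codes with Python's standard stat module. The helper name describe_mode is made up for this illustration; the output strings use the same layout that ls -l prints (owner, then group, then others).

```python
import stat


def describe_mode(octal_code: str) -> str:
    """Turn an octal code such as '644' into the rwx string shown by `ls -l`."""
    mode = int(octal_code, 8)  # '644' -> 0o644
    # stat.filemode also expects a file-type bit; S_IFREG marks a regular file.
    return stat.filemode(stat.S_IFREG | mode)


for code in ("644", "755", "600"):
    print(code, describe_mode(code))
# 644 -rw-r--r--
# 755 -rwxr-xr-x
# 600 -rw-------
```

Each octal digit is a sum of read (4), write (2), and execute (1), so 6 = read + write and 5 = read + execute.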

Common file permissions codes

  • The following are common permission options.
  • NOTE: For websites, files are usually 644 and folders 755
    • This could be set with chmod 644 my_file.html
    • You can set the permissions of all website files with the following Linux commands
      • for i in $(find _site -type f); do chmod 644 $i; done
      • for i in $(find _site -type d); do chmod 755 $i; done

Super-users

  • Super-users have total control over the file-system, can view, edit, or execute anything.
    • Super-user is synonymous with root user: there are no restrictions on your power over the computer
  • Usually you are NOT a super-user and you need to coordinate with system administrators, who have super-user status, to set up and control access

“With great power comes great responsibility”
   - The Spider-Man’s Uncle

Linux command line

A brief introduction.

What is Linux?

  • Linux describes a family of operating systems (OS), similar to Windows or MacOS
  • The key difference is that Linux is a FREE and open-source operating system.
  • It has a Unix-like OS kernel originally created by Linus Torvalds in 1991.
  • It forms the core of various Linux-based operating systems (distributions) such as Ubuntu, CentOS, RedHat, Fedora, and more.
  • Linux is known for its stability, security, and flexibility.
  • Almost all of the world’s super-computers are Linux machines
  • Web-servers & AWS virtual machines are also often Linux (e.g. GU domains)

Linux key features (optional)

  • Linux offers a flexible and powerful platform for various computing needs, from personal use to enterprise-level systems.
  • Open Source: Linux’s source code is freely available, allowing users to modify, distribute, and contribute to its development.
  • Kernel: The Linux kernel serves as the core of the operating system, managing hardware resources, memory, and system processes.
  • Multiuser and Multitasking: Linux supports multiple users and concurrent tasks, enhancing efficiency.
  • Security: Linux’s design and permissions system offer robust security features, minimizing vulnerabilities.
  • Variety of Distributions: Different Linux distributions cater to diverse needs, from server systems to desktop environments.
  • Command Line Interface: Linux offers a powerful command line interface (CLI) for system management and administration.
  • Graphical User Interface: Most Linux distributions include GUI options, making it user-friendly for various users.
  • Software Repositories: Distributions provide software repositories for easy installation and updates of applications.
  • Networking: Linux is widely used for networking, powering servers, routers, and other network devices.
  • Customization: Users can customize various aspects of their Linux environment, adapting it to their preferences.
  • Server and Cloud Usage: Linux is a popular choice for web servers, cloud computing, & containerization platforms like Docker.
  • Community and Support: The Linux community provides extensive support, forums, and documentation resources.

Why learn the Linux command line?

  • Useful line on your resume
  • Intuitive framework and tool-set for computational sciences
  • Better understanding of system and network administration
  • Almost all of the world’s super-computers are Linux machines
  • Web-servers and AWS virtual machines are often Linux
  • More intuitive interfacing with hardware and software
  • Smoothly interact with GitHub without using a web browser or GUI
  • Smoothly switch between environments with Conda
  • Can “get inside” other computers via the ssh command

Example: Can “get inside” other computers via the ssh command

Interacting with the file-system

  • Option-1: Interact with the file system via a GUI (graphical user interface)
  • Option-2: Interact via a command line interface (CLI)
  • IMPORTANT: The Unix command line is actually more like a computer scripting language (e.g. Python), known as shell scripting or bash. It has all of the familiar coding constructs (for-loops, while loops, if/then statements, … etc)
  • Hidden files: Files & folders that start with . are hidden from the GUI interface (e.g. ~/.bash_profile)

Command line access options

  • Mac & Linux: MacOS is very similar to Linux, both have a built-in Unix CLI.
  • Windows terminal options:
    • Command prompt: A text-based interface to execute commands and perform tasks
      • NOT a Unix CLI, closer to MS-DOS, completely different command structure
    • Windows powershell: Windows PowerShell is an advanced command-line shell and scripting language for automation and system management.
      • NOT a Unix command line, but more “Unix-like” than command prompt
    • Anaconda powershell: Quasi Unix command structure but still quite different
    • Windows subsystem for Linux (WSL): (highly recommended)
      • True Linux experience from within Windows, more on this later

GU domains: Command line access

  • The GU domains web-servers are Linux, you can “get inside” the servers via a browser
    • You can also ssh inside from your laptop (more on this next week)
  • Note that you are NOT inside your laptop here!! You are inside the GU-domains server, which is just a REMOTE computer located somewhere else in the world (e.g. California or China).

Linux commands & variables

  • Everything we discuss on the coming slides applies to (1) Linux CLI, (2) WSL in Windows, (3) the MacOS CLI (although minor differences do exist)
  • Linux Commands
    • A Linux command generally follows the following structure:
    • command [options/flags] [arguments]
    • Command: The primary action or task that you want the command to perform.
    • Options/Flags: These are preceded by a hyphen - or double hyphen -- and modify the behavior of the command. They are usually optional.
    • Arguments: Targets or inputs for the command (files, directories, text, etc).
  • Linux Variables
    • Variables are typically denoted using uppercase letters & underscores, e.g. MY_VARIABLE. Values are assigned with variable_name=value (no spaces around the =)
    • Use $ before the variable name to access its value, e.g. $MY_VARIABLE.
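
The command structure and the $ lookup described above can be mimicked in Python. This is a simplified sketch: split_command is a made-up helper, and it treats every token starting with a hyphen as an option, so it would misclassify the value of a flag such as the 4 in tail -n 4.

```python
import os
import shlex


def split_command(line: str) -> dict:
    """Split a command line into command, options/flags, and arguments."""
    tokens = shlex.split(line)  # splits on spaces, respecting quotes
    command, rest = tokens[0], tokens[1:]
    options = [t for t in rest if t.startswith("-")]
    arguments = [t for t in rest if not t.startswith("-")]
    return {"command": command, "options": options, "arguments": arguments}


print(split_command("ls -l --all /path/to/directory"))

# Environment variables are visible to Python through os.environ.
os.environ["MY_VARIABLE"] = "hello"
# expandvars substitutes $NAME, much like the shell does.
print(os.path.expandvars("value is $MY_VARIABLE"))  # value is hello
```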

Command example: ls

For example, let’s take the ls command and describe its structure with flags:

ls [options/flags] [arguments]

  • Command: ls stands for “list” and is used to list files and directories.
  • Options/Flags:
    • -l: Display detailed information about files (long format).
    • -a or --all: List all files, including hidden ones.
    • -h or --human-readable: Display file sizes in a human-readable format.
  • Arguments: These would be the directories or files you want to list.
    • For instance, ls -l /path/to/directory.
  • You can use multiple options and arguments with a command to customize its behavior. Always refer to the command’s manual or help documentation (usually accessible with man command or command --help) to understand all available options and how they affect the command’s behavior.

Viewing file content

  • more index.html: View the contents of index.html using the more command.
  • more page2.html: View the contents of “page2.html”.
  • less index.html: View the contents using the less command (press q to exit).
  • head index.html: Display the beginning lines of index.html.
  • tail index.html: Display the last lines of index.html.
  • tail -n 4 index.html: Display the last 4 lines of index.html.
  • grep 'Hello' index.html: Search for the string “Hello” in index.html.
  • Aside: good practice \(\rightarrow\) avoid using spaces in folder names and file names
    • My Folder \(\rightarrow\) My-Folder  OR  my_folder
    • Spaces require an escape symbol \ when writing the path My\ Folder
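
For comparison, here is a rough Python sketch of what head, tail, and grep do. The function names simply mirror the commands; this grep only matches a literal substring, whereas the real command also understands regular expressions. The demo file is a throwaway created in a temporary folder.

```python
import tempfile
from pathlib import Path


def head(path, n=10):
    """Return the first n lines of a file, like `head -n N FILE`."""
    return Path(path).read_text().splitlines()[:n]


def tail(path, n=10):
    """Return the last n lines of a file, like `tail -n N FILE`."""
    lines = Path(path).read_text().splitlines()
    return lines[-n:] if n > 0 else []


def grep(pattern, path):
    """Return the lines that contain the literal string `pattern`."""
    return [line for line in Path(path).read_text().splitlines() if pattern in line]


# Demo on a small throwaway file standing in for index.html.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "index.html"
    demo.write_text("\n".join(f"line {i}" for i in range(1, 8)) + "\nHello world\n")
    print(head(demo, 2))         # ['line 1', 'line 2']
    print(tail(demo, 4))         # ['line 5', 'line 6', 'line 7', 'Hello world']
    print(grep("Hello", demo))   # ['Hello world']
```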

Changing the filesystem

  • mkdir: Make directory \(\rightarrow\) Creates a new directory. (e.g. mkdir my_folder)
  • rm: Remove files or directories \(\rightarrow\) Deletes files and folders.
    • WARNING: Be CAREFUL with rm, it’s irreversible (deletes file permanently)
    • RECOMMENDATION: (1) ALWAYS work in a folder that is automatically backed up to the cloud (e.g. Dropbox) (2) Push changes to GitHub regularly (secondary backup).
    • rm my_file: deletes file called my_file
    • rm -rf my_folder: deletes folder called my_folder and everything inside it (requires the recursive -r flag; -f skips confirmation prompts)
  • cp: Copy files or directories \(\rightarrow\) Duplicates files and folders.
  • mv: Move or rename files/directories \(\rightarrow\) Used for both moving and renaming.
  • cp ../index.html ./page3.html: Copy index.html from the parent directory (one level closer to root) into the current directory, naming the copy “page3.html”.
  • cp -r folder_1 folder_2: Make a copy of a folder (requires recursive -r flag)
  • > page2.html: Create a blank file named “page2.html”.
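
The same file-system changes can be made from Python with pathlib and shutil. The sketch below works inside a temporary directory so nothing important can be deleted; the file and folder names are only illustrative, and filesystem_demo is a made-up wrapper.

```python
import shutil
import tempfile
from pathlib import Path


def filesystem_demo():
    """Run the Python counterparts of mkdir, >, cp, mv, cp -r, rm, and rm -rf."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        folder = root / "my_folder"
        folder.mkdir()                                    # mkdir my_folder
        (folder / "page2.html").touch()                   # > page2.html
        shutil.copy(folder / "page2.html",
                    folder / "page3.html")                # cp page2.html page3.html
        (folder / "page3.html").rename(
            folder / "renamed.html")                      # mv page3.html renamed.html
        shutil.copytree(folder, root / "folder_copy")     # cp -r my_folder folder_copy
        (folder / "page2.html").unlink()                  # rm page2.html
        shutil.rmtree(root / "folder_copy")               # rm -rf folder_copy
        # Report what is left under the temporary root.
        return sorted(p.name for p in root.rglob("*"))


print(filesystem_demo())   # ['my_folder', 'renamed.html']
```

As with rm, unlink and rmtree delete permanently, so the same caution applies.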

Shell (bash) scripts

  • The command line is a scripting language, similar to Python!!!
  • In a shell script, you can place multiple Linux commands into a file to run sequentially
    • These are called shell (.sh) or bash scripts
    • Similar to Python (.py), but with Linux commands instead of Python commands
    • You need to change the permissions to make the script executable: chmod a+x my_script.sh
    • To run the script you use ./my_script.sh from within the relevant folder
  • Example: Simple example of a shell script
  • Be careful: This is advanced content, you should only create very simple scripts, unless you know what you are doing.
    • In particular, we highly recommend NOT USING the rm command in a shell script

Additional important commands (optional)

  • These commands are foundational for navigating, managing files, and interacting with a Linux system effectively.
  • touch: Create empty files or update timestamps \(\rightarrow\) Creates new empty files or modifies timestamps.
  • cat: Concatenate and display file contents \(\rightarrow\) Displays the content of a file in the terminal.
  • nano/vi: Text editors \(\rightarrow\) nano is user-friendly, vi is powerful but has a steeper learning curve.
  • echo: Print text to the terminal or a file \(\rightarrow\) Displays text or variables in the terminal.
  • grep: Search for text patterns in files \(\rightarrow\) Searches for specific text patterns in files.
  • chmod: Change file permissions \(\rightarrow\) Modifies access permissions for files and directories.
  • chown: Change file ownership \(\rightarrow\) Changes the owner of files and directories.
  • ps: Process status \(\rightarrow\) Lists running processes.
  • top/htop: Monitor system resources.
    • top provides real-time process monitoring and htop is a more user-friendly alternative.
  • df: Disk space usage \(\rightarrow\) Shows available disk space on filesystems.
  • du: Disk usage of files and directories \(\rightarrow\) Displays the space used by specific files or directories.
  • wget/curl: Download files from the web \(\rightarrow\) wget and curl can download files from URLs.
  • tar: Compress and extract files \(\rightarrow\) Used for archiving and compressing files and directories.
  • ssh: Secure Shell \(\rightarrow\) Connects to remote servers securely.
  • sudo: Superuser do \(\rightarrow\) Executes commands with superuser privileges.
  • history: Show a history of commands entered in the terminal.

Aside: Command line editors (optional)

  • You may find yourself inside a server without GUI access \(\rightarrow\) use a command line editor
  • Nano is a popular command line editor for coding from the command line
    • e.g. nano index.html
  • Other popular options include emacs and vim (not recommended)

Additional reading (optional)

  • If you want to learn more, the following are popular books on the topic

HTML / CSS / JS

Motivation

Due to the internet, media consumption has changed dramatically over the last 30 years.

Traditional media

  • Journals, Magazines, academic articles, Billboards
  • Inherently Passive consumption of static content

Modern media

  • Web pages, videos, digital paper, electronic billboards
  • Inherently Active consumption of content with increased user engagement
  • Allows for data updates, modifications, interaction, animation, real-time visualization
  • Allows personalized, customizable data-driven visualization

The internet

  • The four key ingredients of the Web:
    • URLs: Uniform Resource Locators, page addresses to link to pages
    • HTML: HyperText Markup Language to write web pages
      • Pages are written in HTML (HyperText Markup Language), with CSS for styling options, and JavaScript for interactivity
      • HTML has an easy way to link to another page with a special anchor tag (<a>).
      • <a href="http://npr.org/">news</a> creates a link for the anchor text “news,” which will cause the browser to fetch the HTML for the page
    • CSS & JavaScript: Formatting and scripting of web content
    • HTTP: HyperText Transfer Protocol for web clients and servers to communicate
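
As a small illustration of what a URL contains, the sketch below splits the address from the anchor-tag example into its parts using Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit

# The URL from the anchor tag example above.
parts = urlsplit("http://npr.org/")

print(parts.scheme)   # http     -> the protocol used to fetch the page
print(parts.netloc)   # npr.org  -> the server that hosts the page
print(parts.path)     # /        -> the location of the page on that server
```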

Web communication

  • The server is the computer where the content “lives”, on some hard-drive
  • Any program using the HTTP protocol to request content from Web servers is a client.
  • The browser is specialized software for rendering HTML, CSS, and JS content.
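
The client/server exchange can be sketched on a single machine with Python's standard library. Here a tiny server thread plays the role of the remote computer and an HTTP client requests a page from it. The names PageHandler and fetch_from_local_server are made up for this example, and the server listens on 127.0.0.1 (your own computer) on whatever free port the operating system assigns.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<h1>Welcome to my Site!</h1>"


class PageHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet instead of logging each request


def fetch_from_local_server():
    """Start a local server, request one page as a client, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: send an HTTP GET request and read the response.
        connection = HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        status, body = response.status, response.read()
        connection.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_from_local_server()
print(status, body)
```

A browser does the same thing as the client half of this sketch, and then renders the returned HTML instead of printing it.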

Client side vs. server side

  • Scripts associated with a website can run in one of two places
    • Client side, also called the front-end
      • e.g. Your laptop
    • Server side, also called the back-end
      • e.g. The GU-domains server
    • Full-stack=Front-end+Back-end
  • The DSAN program is NOT a “web development” or “software engineering” program; however, many of the skills overlap.
  • It is useful to understand HTML/CSS/JS at an intermediate level, which can increase your marketability

Webpages

  • The core technologies fundamental to all websites are HTML, CSS, and JavaScript (JS).
  • HTML: The markup language used to structure web content, e.g. paragraphs, headings, and data tables, or embedding images and videos in the page.
  • CSS: The language of style rules for customizing our HTML content, e.g. setting background colors, fonts, and laying out content.
  • JavaScript: Scripting language that enables programmatic modification of content, control multimedia, animate images, and pretty much everything else.

Source: https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/What_is_JavaScript

JavaScript

  • Many formats do not allow dynamic (interactive) content (e.g. png, jpeg, etc), however, html can be dynamically and programmatically updated
  • This modification is done via JavaScript (JS), which dramatically expands the functionality of an HTML page.
  • JavaScript runs in the browser, typically as or after the webpage loads, and facilitates interactivity.
  • It enables almost all of the advanced visualization libraries that we will discuss later

We won’t cover much JavaScript in the DSAN program, but will discuss it more in DSAN-5200 in the context of interactive data visualization

Front-end Dev Tools

HTML and DOM

Document object model (DOM)

HTML elements

  • Fundamental HTML building block
  • Start tag, content, end tag

HTML attributes

  • HTML attributes are added to the opening tag of an element to change the element’s default behavior.
  • Here we are modifying the \(<p>\) (paragraph) element with a unique identifier id attribute and changing the text-color using the style attribute.

HTML structure

  • An HTML document is a hierarchical tree-like collection of many HTML elements
  • HTML elements (objects) can have parents, grandparents, siblings, children, grandchildren, etc.

Document object model:

  • What is it? The Document Object Model (DOM) is a cross-platform and language-independent interface. It treats an XML or HTML document as a tree structure, where each node is an object, representing a part of the document. source
  • The DOM represents a document as a logical tree, this concept facilitates programmatic access and modification of the tree (add/modify/remove)
  • When an HTML page is loaded by a browser, it is converted to a hierarchical structure
  • HTML tags are converted into objects in the DOM within the parent-child hierarchy
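
To make the tree idea concrete, the sketch below uses Python's standard html.parser module to record each element together with its depth in the hierarchy. It is not a real DOM implementation: OutlineParser and dom_outline are made-up names, the page is a small hypothetical example with an id and style attribute on the paragraph, and the depth tracking assumes every start tag has a matching end tag.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head>
    <title>My Cool Webpage</title>
</head>
<body>
    <h1>Welcome to my Site!</h1>
    <p id="intro" style="color:blue">I hope you enjoy the content.</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Record (depth, tag, attributes) for every element start tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1  # children of this element sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag returns to the parent's level


def dom_outline(html_text):
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.outline


# Print the tree with indentation showing parent/child relationships.
for depth, tag, attrs in dom_outline(PAGE):
    print("  " * depth + tag, attrs if attrs else "")
```

In the printed outline, head and body appear as children of html, and title, h1, and p are grandchildren of html.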

Lab Time!

Getting HTML onto the Internet

index.html
<!DOCTYPE html>
<html>
<head>
    <title>My Cool Webpage</title>
</head>
<body>
    <h1>Welcome to my Site!</h1>
    <p>I hope you enjoy all the amazing content in here.</p>
</body>
</html>

Getting Quarto onto the Internet

index.qmd
---
title: "My First Quarto Page!"
author: "DSAN Student"
format:
  html:
    df-print: kable
---

Hello welcome to my new Quarto webpage!

Python Coding Fundamentals

Types of Languages

  • Compiled
  • Interpreted

Primitive Types

  • Boolean (True or False)
  • Numbers (Integers, Decimals)
  • Strings
  • None
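
A quick way to check these types in Python is the built-in type function. In this small sketch, the slide's "decimals" correspond to Python's float type.

```python
# One example value for each primitive type listed above.
values = [True, 42, 3.14, "hello", None]

for value in values:
    print(repr(value), "->", type(value).__name__)
# True -> bool
# 42 -> int
# 3.14 -> float
# 'hello' -> str
# None -> NoneType
```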

Stack and Heap

Let’s look at what happens, in the computer’s memory, when we run the following code:

Code
import datetime
import pandas as pd
country_df = pd.read_csv("assets/country_pop.csv")
pop_col = country_df['pop']
num_rows = len(country_df)
filled = bool(country_df.notna().all().all())  # True only if no cell is missing
alg_row = country_df.loc[country_df['name'] == "Algeria"]
num_cols = len(country_df.columns)
username = "Jeff"
cur_date = datetime.datetime.now()
i = 0
j = None
z = 314
country_df
name pop
0 Albania 2.8
1 Algeria 44.2
2 Angola 34.5
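
One practical consequence of this memory picture is that a Python variable is a name that refers to an object, so two names can refer to the same object. The sketch below is a simplified illustration with made-up values, not a full account of how Python manages memory.

```python
# The list object is created once; `scores` is a name that refers to it.
scores = [2.8, 44.2, 34.5]
alias = scores                # a second name for the SAME object
print(alias is scores)        # True

alias.append(99.0)            # changing the object through one name...
print(scores)                 # ...is visible through the other: [2.8, 44.2, 34.5, 99.0]

separate = list(scores)       # an explicit copy is a different object
print(separate is scores)     # False
separate.append(0.0)
print(len(scores), len(separate))   # 4 5

# Integers are immutable, so += rebinds j to a new object and i is unchanged.
i = 0
j = i
j += 1
print(i, j)                   # 0 1
```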

Algorithmic Thinking

  • What are the inputs?
  • What are the outputs?
  • Standard cases vs. edge cases
  • Adversarial development: brainstorm all of the ways an evil hacker might break your code!

Example: Finding An Item Within A List

  • Seems straightforward, right? Given a list l, and a value v, return the index of l which contains v
  • Corner cases galore…
  • What if l contains v more than once? What if it doesn’t contain v at all? What if l is None? What if v is None? What if l isn’t a list at all? What if v is itself a list?
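
One possible set of answers to these questions is sketched below. The choices are assumptions made for illustration, not the only reasonable ones: return the first match when v appears more than once, return None when v is absent, and raise an error when l is not a list. The name find_index is made up.

```python
def find_index(l, v):
    """Return the index of the first element of l equal to v, or None if absent."""
    if not isinstance(l, list):
        # Covers l being None, a string, a number, etc.
        raise TypeError(f"expected a list, got {type(l).__name__}")
    for index, item in enumerate(l):
        if item == v:
            return index
    return None


print(find_index([3, 1, 3], 3))        # 0 (first of two matches)
print(find_index([1, 2], 5))           # None (not found)
print(find_index([1, None], None))     # 1 (v may itself be None)
print(find_index([[1], [2]], [2]))     # 1 (v may itself be a list)
```

Returning None for "not found" differs from the built-in list.index method, which raises a ValueError in that case.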

Python: #1 Sanity-Preserving Tip!

  • (For our purposes) the answer to “what is Python?” is: an executable file that runs .py files!
    • e.g., we can run python mycode.py in Terminal/PowerShell
  • Everything else: pip, Jupyter, Pandas, etc., is an add-on to this basic functionality!
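
The "Python is just an executable" idea can be checked from inside Python itself. In this sketch, sys.executable gives the location of the interpreter that is currently running, and subprocess launches that same executable again to run a one-line program, much like typing a python command in the terminal.

```python
import subprocess
import sys

# Full path of the Python executable running this code (varies by machine).
print(sys.executable)

# Launch the same executable again; -c runs the code given as a string.
result = subprocess.run(
    [sys.executable, "-c", "print(1 + 1)"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())   # 2
```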

Code Blocks via Indentation

for i in range(5):
    print(i)
0
1
2
3
4
for i in range(5):
print(i)
  Cell In[3], line 2
    print(i)
    ^
IndentationError: expected an indented block after 'for' statement on line 1