DSAN 5000: Data Science and Analytics
Thursday, September 19, 2024
Command | What It Does |
---|---|
git clone | Downloads a repo from the web to our local computer |
git init | Creates a new, blank Git repository on our local computer (configuration/change-tracking stored in .git subfolder) |
git add | Stages a file(s): Git will now track changes in this file(s) |
git reset | Undoes a git add |
git status | Shows currently staged files and their status (created, modified, deleted) |
git commit -m "message" | “Saves” the current version of all staged files, ready to be pushed to a backup dir or remote server like GitHub |
git push | Transmits local commits to remote server |
git pull | Downloads commits from remote server to local computer |
git merge | Merges remote versions of files with local versions |
Difficulty | How is data loaded? | Solution | Example |
---|---|---|---|
😊 Easy | Data in HTML source | “View Source” | |
😐 Medium | Data loaded dynamically via API | “View Source”, find API call, scrape programmatically | |
😳 Hard | Data loaded dynamically [internally] via web framework | Use Selenium | |
id | name | email |
---|---|---|
0 | K. Desbrow | kd9@dailymail.com |
1 | D. Minall | dminall1@wired.com |
2 | C. Knight | ck2@microsoft.com |
3 | M. McCaffrey | mccaf4@nhs.uk |
year | month | points |
---|---|---|
2023 | Jan | 65 |
2023 | Feb | |
2023 | Mar | 42 |
2023 | Apr | 11 |
id | date | rating | num_rides |
---|---|---|---|
0 | 2023-01 | 0.75 | 45 |
0 | 2023-02 | 0.89 | 63 |
0 | 2023-03 | 0.97 | 7 |
1 | 2023-06 | 0.07 | 10 |
Source | Target | Weight |
---|---|---|
IGF2 | IGF1R | 1 |
IGF1R | TP53 | 2 |
TP53 | EGFR | 0.5 |
id | name | friends |
---|---|---|
1 | Purna | [2,3,4] |
2 | Jeff | [1,3,4,5,6] |
3 | James | [1,2,4,6] |
4 | Britt | [1,2,3] |
5 | Dr. Fauci | [2,6] |
6 | Pitbull | [2,5] |
Long story short…
user_id | name |
---|---|
1 | Purna |
2 | Jeff |
3 | James |
4 | Britt |
5 | Dr. Fauci |
6 | Pitbull |
id | friend_1 | friend_2 | id | friend_1 | friend_2 |
---|---|---|---|---|---|
1 | 1 | 2 | 6 | 2 | 5 |
2 | 1 | 3 | 7 | 2 | 6 |
3 | 1 | 4 | 8 | 3 | 4 |
4 | 2 | 3 | 9 | 3 | 6 |
5 | 2 | 4 | 10 | 5 | 6 |
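The step from the single table with a friends list per row to the two tables above (one row per user, one row per friendship) can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the `friends` dictionary simply retypes the lists from the earlier table, and the variable names are assumptions.

```python
# Friend lists retyped from the id | name | friends table (names here are illustrative)
friends = {
    1: [2, 3, 4],
    2: [1, 3, 4, 5, 6],
    3: [1, 2, 4, 6],
    4: [1, 2, 3],
    5: [2, 6],
    6: [2, 5],
}

# Store each friendship once: sort the two ids within a pair, then deduplicate with a set
pairs = sorted({tuple(sorted((user, friend)))
                for user, friend_list in friends.items()
                for friend in friend_list})

# Number the pairs to build the friendships table
friendships = [
    {"id": i, "friend_1": a, "friend_2": b}
    for i, (a, b) in enumerate(pairs, start=1)
]

for row in friendships:
    print(row["id"], row["friend_1"], row["friend_2"])
```

Storing each pair once also absorbs one-sided entries: user 3 lists 6 as a friend while user 6 does not list 3, yet the pair (3, 6) appears exactly once in the result.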
.csv: Comma-Separated Values
.tsv: Tab-Separated Values
.json: JavaScript Object Notation
.xls/.xlsx: Excel format
.dta: Stata format
.yml: More human-readable alternative to JSON
.csv / .tsv 👍
my_data.csv →
Python: pd.read_csv() (from Pandas library)
R: read_csv() (from readr library)
.json
Python: json (built-in library, import json)
R: jsonlite (install.packages("jsonlite"))
(… .json file won’t load)
Covered in | Difficulty | How is data loaded? | Solution | Example |
---|---|---|---|---|
This section → | 😊 Easy | Data in HTML source | “View Source” | |
Next section → | 😐 Medium | Data loaded dynamically via API | “View Source”, find API call, scrape programmatically | |
Future weeks → | 😳 Hard | Data loaded dynamically [internally] via web framework | Use Selenium | |
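Before moving on to scraping, here is a small sketch of loading the two most common formats listed above. The slides name `pd.read_csv()` for CSV files; to keep the example dependency-free, it swaps in Python's built-in `csv` module instead, alongside the built-in `json` module. The inline strings are stand-ins for files on disk and reuse rows from the tables shown earlier.

```python
import csv
import io
import json

# Stand-in for a my_data.csv file (rows taken from the year/month/points table)
CSV_TEXT = """year,month,points
2023,Jan,65
2023,Feb,
2023,Mar,42
2023,Apr,11
"""

# DictReader yields one dict per row, keyed by the header line
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# The csv module reads every value as a string, so convert explicitly
# and treat the empty Feb cell as a missing value
points = [int(row["points"]) if row["points"] else None for row in rows]
print(points)

# Stand-in for a .json file (first two rows of the id/name/friends table)
JSON_TEXT = """[
  {"id": 1, "name": "Purna", "friends": [2, 3, 4]},
  {"id": 2, "name": "Jeff", "friends": [1, 3, 4, 5, 6]}
]"""

# json.loads keeps nested structure: each "friends" value becomes a Python list
users = json.loads(JSON_TEXT)
print(users[1]["name"], users[1]["friends"])
```

The contrast is the point: CSV is flat and untyped, while JSON preserves numbers and nested lists without extra conversion.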
requests and BeautifulSoup
requests Documentation | BeautifulSoup Documentation
import requests
from bs4 import BeautifulSoup

# Perform request
response = requests.get("https://en.wikipedia.org/wiki/Data_science")
# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Collect every <h2> element, then extract the text of each one
all_header_elts = soup.find_all("h2")
all_header_text = [elt.text for elt in all_header_elts]
#section_headers = [h.find("span", {'class': 'mw-headline'}).text for h in all_header_elts[1:]]
print("\n".join(all_header_text))
Contents
Foundations
Etymology
Data science and data analysis
Data Science as an Academic Discipline
Cloud computing for data science
Ethical consideration in data science
See also
References
find_all() is the key function for scraping!
find_all() can instantly parse this structure into a Python list
X1 | X2 | X3 |
---|---|---|
1 | 3 | 5 |
2 | 4 | 6 |
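As a rough sketch of how `find_all()` turns repeated HTML structure into Python lists, the example below parses a small table with the same values as the one above. The HTML string is a hypothetical stand-in written for this example (it is not the page from the slides), and the code assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML source for the X1/X2/X3 table
html = """
<table>
  <tr><th>X1</th><th>X2</th><th>X3</th></tr>
  <tr><td>1</td><td>3</td><td>5</td></tr>
  <tr><td>2</td><td>4</td><td>6</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells: one find_all() call returns every <th> element
header = [th.text for th in soup.find_all("th")]

# Data rows: skip the header row, then collect the <td> cells of each remaining <tr>
rows = [
    [int(td.text) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")[1:]
]

print(header)
print(rows)
```

Calling `find_all()` on an element (here each `tr`) restricts the search to that element's contents, which is what makes the nested row/cell pattern work.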
Application Programming Interfaces: developer-facing part of data pipeline/service. Hides unnecessary details:
Example | Care about 🧐 | Don’t care about 🙅♂️ |
---|---|---|
Electrical outlet | Electricity | Details of Alternating/Direct Currents |
Water fountain | Water | Details of how it’s pumped into the fountain |
Car | Accelerate, brake, reverse | Details of combustion engine |
Exposes endpoints for use by developers, without requiring them to know the nuts and bolts of your pipeline/service:
Example | Endpoint | Not Exposed |
---|---|---|
Electrical outlet | Socket | Internal wiring |
Water fountain | Aerator | Water pump |
Car | Pedals, Steering wheel, etc. | Engine |
https://newton.vercel.app/api/v2/ + factor + "x^2 - 1" → https://newton.vercel.app/api/v2/factor/x^2-1
Operation | API Endpoint | Result |
---|---|---|
Simplify | /simplify/2^2+2(2) | 8 |
Factor | /factor/x^2 + 2x | x (x + 2) |
Derive | /derive/x^2+2x | 2 x + 2 |
Integrate | /integrate/x^2+2x | 1/3 x^3 + x^2 + C |
Find 0’s | /zeroes/x^2+2x | [-2, 0] |
Find Tangent | /tangent/2\|x^3 | 12 x + -16 |
Area Under Curve | /area/2:4\|x^3 | 60 |
Cosine | /cos/pi | -1 |
Sine | /sin/0 | 0 |
Tangent | /tan/0 | 0 |
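The endpoint pattern in the table (base URL, then operation, then expression) can be sketched with Python's standard library. The helper names `newton_url` and `newton_query` are made up for this example, and the assumption that the service replies with JSON is mine rather than something stated above. Special characters such as `^` and `|` are percent-encoded, which is the standard URL-safe spelling of the same path.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://newton.vercel.app/api/v2/"

def newton_url(operation, expression):
    """Build the endpoint URL: base + operation + URL-encoded expression."""
    cleaned = expression.replace(" ", "")  # "x^2 - 1" -> "x^2-1", as in the slides
    return f"{BASE_URL}{operation}/{quote(cleaned, safe='')}"

def newton_query(operation, expression):
    """Send the request and decode the reply (assumed to be JSON).

    Needs network access, so it is defined here but not called.
    """
    with urllib.request.urlopen(newton_url(operation, expression)) as response:
        return json.load(response)

print(newton_url("factor", "x^2 - 1"))
print(newton_url("tangent", "2|x^3"))
```

Separating URL construction from the network call keeps the first part easy to check by eye before any request is sent.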
PyGithub
PyGithub Installation
Install using the following terminal/shell command [Documentation]
pip install PyGithub
PyGithub can handle authentication for you. Example: this private repo in my account does not show up unless the request is authenticated (via a Personal Access Token).
httr2 and xml2
httr2 Documentation | xml2 Documentation
[1] "Contents"
[2] "Foundations"
[3] "Etymology"
[4] "Data science and data analysis"
[5] "Data Science as an Academic Discipline"
[6] "Cloud computing for data science"
[7] "Ethical consideration in data science"
[8] "See also"
[9] "References"
'//h2' is an XPath selector
'//div' matches all <div> elements in the document
'//div//img' matches <img> elements which are descendants of <div> elements
'//p[@id="page-content"]' matches <p> elements with id page-content
Matching classes is a bit trickier:
'//img[contains(concat(" ", normalize-space(@class), " "), " foot ")]' matches <img> elements with foot as one of their classes
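The selectors above can be tried out in Python with the built-in `xml.etree.ElementTree` module, which supports a limited subset of XPath. This is a sketch under stated assumptions: the page below is a hypothetical, well-formed stand-in for `mypage.html`, paths are written relative to the root (`.//div` rather than `//div`), and since ElementTree has no `contains()` function, the class test is done in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for mypage.html (well-formed so the XML parser accepts it)
PAGE = """
<html>
  <body>
    <h2>First header</h2>
    <div><p id="page-content">Hello</p><img src="a.png" class="foot big"/></div>
    <div><span><img src="b.png" class="footer"/></span></div>
    <img src="c.png" class="foot"/>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# All <div> elements anywhere in the document
divs = root.findall(".//div")

# <img> elements nested at any depth inside a <div>
imgs_in_divs = [img.get("src") for img in root.findall(".//div//img")]

# <p> elements whose id attribute equals "page-content" (note the @)
content = [p.text for p in root.findall(".//p[@id='page-content']")]

# Class matching: split the class attribute on whitespace and test membership,
# so "foot big" matches but "footer" does not
foot_imgs = [img.get("src") for img in root.iter("img")
             if "foot" in img.get("class", "").split()]

print(len(divs), imgs_in_divs, content, foot_imgs)
```

The image `b.png` sits inside a `<span>` inside a `<div>` and is still matched by `.//div//img`, which shows that `//` selects descendants at any depth rather than direct children only.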
https://newton.vercel.app/api/v2/ + factor + "x^2 - 1" → https://newton.vercel.app/api/v2/factor/x^2-1
GH
The GH library for R can handle this authentication process for you. For example, this private repo in my account does not show up if requested anonymously, but does show up if requested using GH with a Personal Access Token:
private-repo-test: Private repo example for DSAN5000
GITHUB_TOKEN containing my Personal Access Token, which GH then uses to make authenticated requests
~/.zprofile file: export GITHUB_TOKEN="<token goes here>"
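The same pattern of keeping the token in a `GITHUB_TOKEN` environment variable rather than in the code can be sketched in Python with the standard library alone. This is an illustration rather than the GH or PyGithub implementation: the function name `build_repo_request` is made up, and the request is only built, not sent, so the sketch runs without a token or network access.

```python
import os
import urllib.request

# GitHub REST API endpoint listing repos visible to the authenticated user
API_URL = "https://api.github.com/user/repos"

def build_repo_request(token=None):
    """Build (but do not send) a GitHub API request.

    If no token is passed, fall back to the GITHUB_TOKEN environment variable.
    """
    if token is None:
        token = os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # A Personal Access Token is sent in the Authorization header
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(API_URL, headers=headers)

request = build_repo_request()
print("Authenticated:", request.has_header("Authorization"))
# To send it: urllib.request.urlopen(request), then decode the JSON response body
```

Because the token is read from the environment at call time, the script itself can be shared or committed without exposing it.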
DSAN 5000 W04: Data Formats and APIs