DSAN 5000: Data Science and Analytics
Thursday, September 26, 2024
requests and BeautifulSoup
What I Want To Do | Python Code I Can Use |
---|---|
Send an HTTP GET request | response = requests.get(url) |
Send an HTTP POST request | response = requests.post(url, post_data) |
Get just the HTML code returned by the request (the response body as a string: no HTTP headers, no JS-rendered content) | html_str = response.text |
Parse a string containing HTML code | soup = BeautifulSoup(html_str, 'html.parser') |
Get contents of all <xyz> tags in the parsed HTML | xyz_elts = soup.find_all('xyz') |
Get contents of the first <xyz> tag in the parsed HTML | xyz_elt = soup.find('xyz') |
Get just the text (without formatting or tag info) contained in a given element | xyz_elt.text |
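A minimal sketch of the parsing steps in the table above. To keep it runnable offline, `html_str` is a hypothetical hard-coded page rather than the result of a real request; in practice it would come from `requests.get(url).text`. The tag names and contents are illustrative assumptions only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text. With a live page this would be:
#   response = requests.get(url); html_str = response.text
html_str = """
<html>
  <body>
    <h1>Example Page</h1>
    <p>First <b>paragraph</b></p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# Parse the string containing HTML code
soup = BeautifulSoup(html_str, "html.parser")

# All <p> elements, and just the first one
p_elts = soup.find_all("p")
first_p = soup.find("p")

# .text drops the tag info (including the nested <b>) and keeps only the text
first_p_text = first_p.text
all_p_text = [p.text for p in p_elts]

print(first_p_text)
print(all_p_text)
```

Note that `find_all()` returns a list-like collection of elements while `find()` returns a single element (or `None` if nothing matches), so `.text` is called per element.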
Application | Should Be Endpoints | Shouldn’t Be Endpoints |
---|---|---|
Voting Machine | cast_vote() (End User), get_vote_totals() (Admin) | get_vote(name), get_previous_vote() |
Gaming Platform | get_points() (Anyone), add_points(), remove_points() (Game Companies) | set_points() |
Thermometer | view_temperature() | release_mercury() |
Canvas App for Georgetown | view_grades() (different for Students and Teachers) | SQL Statement for Storing and Retrieving Grades in Georgetown DB |
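One way to picture the Gaming Platform row is a class whose public methods are the endpoints and whose internal state is never exposed. This is only a sketch: the class name `PointsLedger`, the player name, and the choice to clamp balances at zero are all assumptions made for illustration, and caller permissions (Anyone vs. Game Companies) are not modeled.

```python
class PointsLedger:
    """Hypothetical gaming-platform backend that exposes only the safe endpoints."""

    def __init__(self):
        # Internal state: callers never read or write this dict directly
        self._points = {}

    def get_points(self, player):
        """Endpoint: read a player's current balance (0 if unknown)."""
        return self._points.get(player, 0)

    def add_points(self, player, amount):
        """Endpoint: increase a balance by a positive amount."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._points[player] = self.get_points(player) + amount

    def remove_points(self, player, amount):
        """Endpoint: decrease a balance by a positive amount (assumed floor of 0)."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._points[player] = max(0, self.get_points(player) - amount)

    # Deliberately no set_points(): overwriting a balance is not an endpoint


ledger = PointsLedger()
ledger.add_points("alice", 10)
ledger.remove_points("alice", 3)
print(ledger.get_points("alice"))
```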
Key Principle: CRUD (Create, Read, Update, Delete)
index | var_1 | var_2 | var_3 |
---|---|---|---|
A | val_A1 | val_A2 | val_A3 |
B | val_B1 | val_B2 | val_B3 |
C | val_C1 | val_C2 | val_C3 |
RegEx | [A-Za-z0-9]+ | @ | [A-Za-z0-9.-]+ | \. | (com\|org\|edu) | Result |
---|---|---|---|---|---|---|
String A | jj1088 | @ | georgetown | . | edu | Accept ✅ |
String B | spammer | @ | fakesite!! | . | coolio | Reject ❌ |
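The pieces of the table above can be joined into one pattern and checked with Python's built-in `re` module. This is a sketch rather than a production email validator; the helper name `is_valid_email` is an assumption, and `fullmatch` is used so the whole string has to fit the pattern.

```python
import re

# The table's pieces joined left to right:
# [A-Za-z0-9]+  @  [A-Za-z0-9.-]+  \.  (com|org|edu)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9]+@[A-Za-z0-9.-]+\.(com|org|edu)")


def is_valid_email(candidate: str) -> bool:
    """Accept only if the ENTIRE string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


string_a_ok = is_valid_email("jj1088@georgetown.edu")       # String A
string_b_ok = is_valid_email("spammer@fakesite!!.coolio")   # String B

print(string_a_ok, string_b_ok)
```

String B fails twice over: `!` is not in `[A-Za-z0-9.-]`, and `coolio` is not one of the three allowed endings.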
z: Match lowercase z, a single time
zz: Match two lowercase zs in a row
z{n}: Match n lowercase zs in a row
[abc]: Match a, b, or c, a single time
[A-Z]: Match one uppercase letter
[0-9]: Match one numeric digit
[A-Za-z0-9]: Match a single alphanumeric character
[A-Za-z0-9]{n}: Match n alphanumeric characters
z*: Match lowercase z zero or more times
z+: Match lowercase z one or more times
z?: Match zero or one lowercase zs
String | z* | z+ | z? | z{3} |
---|---|---|---|---|
"" | ✅ | | ✅ | |
"z" | ✅ | ✅ | ✅ | |
"zzz" | ✅ | ✅ | | ✅ |
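The quantifier behavior can be checked directly with `re.fullmatch`, which counts a string as matched only when the pattern covers all of it. A small sketch; the variable name `match_table` is just for illustration.

```python
import re

patterns = ["z*", "z+", "z?", "z{3}"]
strings = ["", "z", "zzz"]

# match_table[string][pattern] is True when the pattern matches the whole string
match_table = {
    s: {p: re.fullmatch(p, s) is not None for p in patterns}
    for s in strings
}

for s, row in match_table.items():
    marks = " | ".join("yes" if row[p] else "no " for p in patterns)
    print(f'{s!r:>6} | {marks}')
```

With `re.search` instead of `re.fullmatch` the results would differ, since for example `z?` can match an empty stretch inside any string.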
RegEx | [(]? | [0-9]{3} | [)]? | [ -] | [0-9]{3}-[0-9]{4} | Result |
---|---|---|---|---|---|---|
"202-687-1587" | \(\varepsilon\) | 202 | \(\varepsilon\) | - | 687-1587 | Accept ✅ |
"(202) 687-1587" | ( | 202 | ) | (space) | 687-1587 | Accept ✅ |
"2020687-1587" | \(\varepsilon\) | 202 | \(\varepsilon\) | 0 | 687-1587 | Reject ❌ |
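The same three strings can be run through the pattern in Python. A sketch using the built-in `re` module; the helper name `is_valid_phone` is an assumption for illustration.

```python
import re

# Pieces from the table, joined left to right:
# [(]?  [0-9]{3}  [)]?  [ -]  [0-9]{3}-[0-9]{4}
PHONE_PATTERN = re.compile(r"[(]?[0-9]{3}[)]?[ -][0-9]{3}-[0-9]{4}")


def is_valid_phone(candidate: str) -> bool:
    """Accept only if the ENTIRE string matches the pattern."""
    return PHONE_PATTERN.fullmatch(candidate) is not None


results = {
    s: is_valid_phone(s)
    for s in ["202-687-1587", "(202) 687-1587", "2020687-1587"]
}
print(results)
```

Because the two parentheses are optional independently of each other, this pattern would also accept an unbalanced string such as `"(202-687-1587"`.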
| | Var 1 | Var 2 |
|---|---|---|
| Obs 1 | Val 1 | Val 2 |
| Obs 2 | Val 3 | Val 4 |
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
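With one row per observation and one column per variable, operations on the table above become one-liners. A sketch assuming pandas is available; the derived column `rate_per_10k` and the per-year total are illustrative choices, not part of the original table.

```python
import pandas as pd

# The table above, entered as-is: each row is one (country, year) observation
tidy_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"],
    "year": [1999, 2000, 1999, 2000, 1999, 2000],
    "cases": [745, 2666, 37737, 80488, 212258, 213766],
    "population": [19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583],
})

# New variable = new column, computed row by row from existing columns
tidy_df["rate_per_10k"] = tidy_df["cases"] / tidy_df["population"] * 10_000

# Aggregating over observations: total cases per year
cases_by_year = tidy_df.groupby("year")["cases"].sum()

print(tidy_df)
print(cases_by_year)
```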
Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It (Denny and Spirling 2018)
doc_id | text | texts | Kékkek | voice |
---|---|---|---|---|
0 | 0 | 6 | 0 | 1 |
1 | 0 | 0 | 3 | 1 |
2 | 6 | 0 | 0 | 0 |
doc_id | text | kekkek | voice |
---|---|---|---|
0 | 6 | 0 | 1 |
1 | 0 | 3 | 1 |
2 | 6 | 0 | 0 |
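A sketch of how preprocessing choices change a document-term matrix, using only the standard library. The three documents are hypothetical, built solely so their word counts line up with the two tables above. The preprocessing steps (lowercasing, accent stripping, and dropping a trailing "s" as a crude stand-in for a real stemmer) are assumptions about what produced the second table.

```python
import unicodedata
from collections import Counter

# Hypothetical documents, constructed only to reproduce the counts in the tables
docs = [
    "texts " * 6 + "voice",
    "Kékkek " * 3 + "voice",
    "text " * 6,
]


def strip_accents(token):
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(token):
    """Lowercase, strip accents, and apply a naive plural-stripping rule."""
    token = strip_accents(token.lower())
    # Crude stand-in for a stemmer: drop one trailing "s" from longer tokens
    if len(token) > 3 and token.endswith("s"):
        token = token[:-1]
    return token


# One Counter per document = one row of the document-term matrix
raw_counts = [Counter(doc.split()) for doc in docs]
clean_counts = [Counter(preprocess(tok) for tok in doc.split()) for doc in docs]

raw_vocab = sorted(set().union(*raw_counts))
clean_vocab = sorted(set().union(*clean_counts))

print(raw_vocab)
print(clean_vocab)
for doc_id, counts in enumerate(clean_counts):
    print(doc_id, [counts[term] for term in clean_vocab])
```

Before preprocessing, documents 0 and 2 share no terms; afterwards both have a count of 6 for `text`, which is the kind of shift in apparent similarity that makes these choices consequential.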
DSAN 5000 W05: Data Cleaning