DSAN 5000: Data Science and Analytics
Thursday, September 26, 2024
| node_id | label |
|---|---|
| 1 | Bulbasaur |
| 2 | Ivysaur |
| \(\vdots\) | \(\vdots\) |
| 9 | Blastoise |
| edge_id | source | target | weight |
|---|---|---|---|
| 0 | 1 | 2 | 16 |
| 1 | 2 | 3 | 32 |
| 2 | 4 | 5 | 16 |
| 3 | 5 | 6 | 36 |
| 4 | 7 | 8 | 16 |
| 5 | 8 | 9 | 36 |
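As a rough illustration of how the node and edge tables fit together, the sketch below stores the edge rows as tuples and builds an adjacency map keyed by source node. The names `edges`, `labels`, `adjacency`, and `describe` are hypothetical, and only the three node labels visible in the node table are included; the elided rows are left out rather than guessed.

```python
from collections import defaultdict

# Edge rows copied from the edge table: (edge_id, source, target, weight)
edges = [
    (0, 1, 2, 16),
    (1, 2, 3, 32),
    (2, 4, 5, 16),
    (3, 5, 6, 36),
    (4, 7, 8, 16),
    (5, 8, 9, 36),
]

# Only the labels shown in the node table; elided rows are not filled in
labels = {1: "Bulbasaur", 2: "Ivysaur", 9: "Blastoise"}

# Map each source node to its outgoing (target, weight) pairs
adjacency = defaultdict(list)
for _edge_id, source, target, weight in edges:
    adjacency[source].append((target, weight))


def describe(node_id):
    """Return the node's label if it is known, else a generic placeholder."""
    return labels.get(node_id, f"node {node_id}")


for source, neighbors in adjacency.items():
    for target, weight in neighbors:
        print(f"{describe(source)} -> {describe(target)} (weight {weight})")
```

Keeping labels in one structure and edges in another mirrors the two-table layout: the edge rows refer to nodes only by `node_id`, so a label can change without touching any edge.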
requests and BeautifulSoup

| What I Want To Do | Python Code I Can Use |
|---|---|
| Send an HTTP GET request | `response = requests.get(url)` |
| Send an HTTP POST request | `response = requests.post(url, post_data)` |
| Get just the plain HTML code (excluding headers, JS) returned by the request | `html_str = response.text` |
| Parse a string containing HTML code | `soup = BeautifulSoup(html_str, 'html.parser')` |
| Get contents of all `<xyz>` tags in the parsed HTML | `xyz_elts = soup.find_all('xyz')` |
| Get contents of the first `<xyz>` tag in the parsed HTML | `xyz_elt = soup.find('xyz')` |
| Get just the text (without formatting or tag info) contained in a given element | `xyz_elt.text` |
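The parsing calls in the table above can be tried without any network access. The sketch below is a minimal example that assumes the `beautifulsoup4` package is installed; the HTML string is made up for illustration and stands in for what `response.text` would return after a real `requests.get(url)` call.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for response.text; in practice this string would come
# from requests.get(url).text for some real URL
html_str = """
<html><body>
  <h1>Course Page</h1>
  <p>First paragraph</p>
  <p>Second <b>paragraph</b></p>
</body></html>
"""

soup = BeautifulSoup(html_str, "html.parser")

# find_all returns every matching tag; find returns only the first one
p_elts = soup.find_all("p")
first_p = soup.find("p")

# .text drops the tags (including the nested <b>) and keeps the text
p_texts = [p.text for p in p_elts]
print(p_texts)
print(first_p.text)
```

Note that `find` returns `None` when no tag matches, so code that scrapes unfamiliar pages usually checks the result before reading `.text`.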
| Application | Should Be Endpoints | Shouldn’t Be Endpoints |
|---|---|---|
| Voting Machine | `cast_vote()` (End User), `get_vote_totals()` (Admin) | `get_vote(name)`, `get_previous_vote()` |
| Gaming Platform | `get_points()` (Anyone), `add_points()`, `remove_points()` (Game Companies) | `set_points()` |
| Thermometer | `view_temperature()` | `release_mercury()` |
| Canvas App for Georgetown | `view_grades()` (different for Students and Teachers) | SQL Statement for Storing and Retrieving Grades in Georgetown DB |
get_users and get_friends) to derive answer to “Is User 5 friends with Pitbull?”
Key Principle: CRUD

| index | var_1 | var_2 | var_3 |
|---|---|---|---|
| A | val_A1 | val_A2 | val_A3 |
| B | val_B1 | val_B2 | val_B3 |
| C | val_C1 | val_C2 | val_C3 |
| RegEx | `[A-Za-z0-9]+` | `@` | `[A-Za-z0-9.-]+` | `\.` | `(com\|org\|edu)` | Result |
|---|---|---|---|---|---|---|
| String A | jj1088 | @ | georgetown | . | edu | Accept ✅ |
| String B | spammer | @ | fakesite!! | . | coolio | Reject ❌ |
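To check the table above by machine, the five pieces can be joined into a single pattern and applied with Python's built-in `re` module. This is a minimal sketch: the names `EMAIL_PATTERN` and `is_accepted` are hypothetical, and `fullmatch` is used on the assumption that the whole string, not just a substring, has to fit the pattern.

```python
import re

# The five pieces from the table, joined left to right into one pattern
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9]+@[A-Za-z0-9.-]+\.(com|org|edu)")


def is_accepted(candidate):
    """Return True only if the entire string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


for candidate in ["jj1088@georgetown.edu", "spammer@fakesite!!.coolio"]:
    verdict = "Accept" if is_accepted(candidate) else "Reject"
    print(f"{candidate}: {verdict}")
```

String B fails for two independent reasons: `!` is not in the `[A-Za-z0-9.-]` character class, and `coolio` is not one of the three allowed endings.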
- `z`: Match lowercase `z`, a single time
- `zz`: Match two lowercase `z`s in a row
- `z{n}`: Match n lowercase `z`s in a row
- `[abc]`: Match `a`, `b`, or `c`, a single time
- `[A-Z]`: Match one uppercase letter
- `[0-9]`: Match one numeric digit
- `[A-Za-z0-9]`: Match a single alphanumeric character
- `[A-Za-z0-9]{n}`: Match n alphanumeric characters
- `z*`: Match lowercase `z` zero or more times
- `z+`: Match lowercase `z` one or more times
- `z?`: Match zero or one lowercase `z`s

| | `z*` | `z+` | `z?` | `z{3}` |
|---|---|---|---|---|
| `""` | ✅ | | ✅ | |
| `"z"` | ✅ | ✅ | ✅ | |
| `"zzz"` | ✅ | ✅ | | ✅ |
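The quantifier table above can be regenerated with a few lines of Python. The sketch assumes that a check mark means the pattern matches the entire string, which is what `re.fullmatch` tests; the variable names are hypothetical.

```python
import re

patterns = ["z*", "z+", "z?", "z{3}"]
strings = ["", "z", "zzz"]

# results[string][pattern] is True when the pattern matches the whole string
results = {
    s: {p: re.fullmatch(p, s) is not None for p in patterns}
    for s in strings
}

for s, row in results.items():
    marks = ", ".join(f"{p}: {'yes' if row[p] else 'no'}" for p in patterns)
    print(f"{s!r:>6} -> {marks}")
```

Swapping `re.fullmatch` for `re.search` would change the picture: `z?` would then report a match inside `"zzz"`, because `search` accepts a match anywhere in the string.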
| RegEx | `[(]?` | `[0-9]{3}` | `[)]?` | `[ -]` | `[0-9]{3}-[0-9]{4}` | Result |
|---|---|---|---|---|---|---|
| `"202-687-1587"` | \(\varepsilon\) | 202 | \(\varepsilon\) | - | 687-1587 | Accept ✅ |
| `"(202) 687-1587"` | ( | 202 | ) | (space) | 687-1587 | Accept ✅ |
| `"2020687-1587"` | \(\varepsilon\) | 202 | \(\varepsilon\) | 0 | 687-1587 | Reject ❌ |
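The column-by-column breakdown above maps naturally onto named capture groups. In the sketch below, each column of the table becomes one group, so a successful match can be split into the same pieces; the group names (`open`, `area`, `close`, `sep`, `rest`) and the helper `split_phone` are hypothetical labels chosen for illustration.

```python
import re

# One named group per column of the table, in the same left-to-right order
PHONE_PATTERN = re.compile(
    r"(?P<open>[(]?)"
    r"(?P<area>[0-9]{3})"
    r"(?P<close>[)]?)"
    r"(?P<sep>[ -])"
    r"(?P<rest>[0-9]{3}-[0-9]{4})"
)


def split_phone(candidate):
    """Return the matched pieces as a dict, or None if the string is rejected."""
    match = PHONE_PATTERN.fullmatch(candidate)
    return match.groupdict() if match else None


for candidate in ["202-687-1587", "(202) 687-1587", "2020687-1587"]:
    print(repr(candidate), "->", split_phone(candidate))
```

An empty string in the `open` or `close` group corresponds to \(\varepsilon\) in the table. Because the two parentheses are optional independently, this pattern would also accept an unbalanced input such as `"(202-687-1587"`.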
| | Var 1 | Var 2 |
|---|---|---|
| Obs 1 | Val 1 | Val 2 |
| Obs 2 | Val 3 | Val 4 |
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It (Denny and Spirling 2018)




| doc_id | text | texts | Kékkek | voice |
|---|---|---|---|---|
| 0 | 0 | 6 | 0 | 1 |
| 1 | 0 | 0 | 3 | 1 |
| 2 | 6 | 0 | 0 | 0 |
| doc_id | text | kekkek | voice |
|---|---|---|---|
| 0 | 6 | 0 | 1 |
| 1 | 0 | 3 | 1 |
| 2 | 6 | 0 | 0 |
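The two document-term tables above differ only in how tokens were normalized before counting. The sketch below reproduces that effect with the standard library. Several things here are assumptions rather than anything stated in the slides: the three documents are made up so that their counts match the first table, and the preprocessing steps (lowercasing, accent removal, and a crude trailing-`s` rule standing in for a real stemmer) are one plausible way to get from the first table to the second.

```python
import unicodedata
from collections import Counter

# Hypothetical documents chosen to reproduce the counts in the first table
docs = [
    "texts " * 6 + "voice",
    "Kékkek " * 3 + "voice",
    "text " * 6,
]


def strip_accents(token):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(token):
    """Lowercase, remove accents, and crudely strip a trailing plural 's'."""
    token = strip_accents(token.lower())
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]  # naive stand-in for a real stemmer
    return token


def term_counts(doc, transform=lambda tok: tok):
    """Count whitespace-separated tokens after applying `transform`."""
    return Counter(transform(tok) for tok in doc.split())


raw_dtm = [term_counts(doc) for doc in docs]
clean_dtm = [term_counts(doc, preprocess) for doc in docs]

for doc_id, (raw, clean) in enumerate(zip(raw_dtm, clean_dtm)):
    print(doc_id, dict(raw), "->", dict(clean))
```

After preprocessing, documents 0 and 2 have identical `text` counts even though they shared no column in the raw matrix, which is the kind of change that can alter downstream unsupervised results.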
DSAN 5000 W05: Data Cleaning