Syllabus¶
This is a course overview; it does not list every detail of what will be covered in a given week. Most topics, such as web scraping and natural language processing, carry roughly equal weight and sophistication, so their order may be switched around. Other topics, like understanding the design of programming languages, will come up every week (or every day), though some weeks will emphasize them more than others.
- Week 1: Hello COMM 113/213
- Week 2: Python programming and the command-line
- Week 3: Design and debugging and data
- Week 4: More with APIs and building programs
- Week 5: The Web
- Week 6: Text as Data
- Week 7: More Text as Data
- Week 8: More Text as Data and Data Analysis
- Week 9: Data Analysis with Pandas (continued)
Week 1: Hello COMM 113/213¶
Monday: Hello, Journalism and Computation¶
For the first day, a mix of overview of concepts, course logistics, and an installation party.
Assignments¶
None, except to send me a response to a Google Form (about operating system, etc.)
Hands-on¶
- Download and install Anaconda for Python 3.6 and make sure python --version works in your Terminal.
- Download and install the VS Code editor: https://code.visualstudio.com/
- Sign up for accounts on:
- DarkSky API
- A new Google account specifically for usage in this class (e.g. dannguyencompciv2017@gmail.com)
- A new Google Chrome profile for this account (keep your personal and school accounts separate)
Lecture¶
- Welcome to COMM 113/213
- Who am I?
- https://www.propublica.org/people/dan-nguyen
- Dollars for Docs
- Course logistics
- What is programming?
- What are the hardest problems in computer science?
Wednesday: Hello, Computers and Programming¶
Hands-on agenda¶
- Signing up for GitHub, installing Git
- Setting up the VS Code editor
- Setting up the Terminal/Powershell
Assignments¶
- Readings
- (all of which we will be reviewing, practicing next week):
- From Automate the Boring Stuff, chapters 1-5
Week 2: Python programming and the command-line¶
Wednesday agenda: Do one thing and do it well¶
- How to use your text editor and your command-line (keep it minimal!)
- Use your Tab key, constantly
- Why do people use the command-line?
- curl
- youtube-dl
- soundscrape
- Moving between Python scripts and iPython
Trying out homework:
Python programming fundamentals¶
- Introduction to the Python syntax and language features
- Simple data types (numbers, strings, booleans)
- Sequences and other collections (lists, tuples, ranges, dictionaries)
- Dot-notation and the concept of “everything is an object”
- Functions
- Imports and code re-use
- Loops
- Conditional branches
- How to execute Python programs
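As a rough preview of how several of these topics fit together in one short script, here is a minimal sketch. The function name count_long_words and the sample headlines are invented for illustration and do not come from any course assignment.

```python
# A tiny tour of the fundamentals listed above.
# NOTE: count_long_words and the sample headlines are made up for illustration.
from collections import Counter  # imports and code re-use


def count_long_words(text, min_length=5):
    """Return a dict mapping each lowercase word of at least min_length letters to its count."""
    counts = Counter()
    for word in text.split():                  # loop over a list of strings
        cleaned = word.strip(".,!?").lower()   # dot-notation: strings are objects with methods
        if len(cleaned) >= min_length:         # conditional branch
            counts[cleaned] += 1
    return dict(counts)


headlines = ("Stanford opens new library", "Library hours extended")  # a tuple of strings

for number, headline in zip(range(1, 3), headlines):  # a range paired with the tuple
    print(number, headline, count_long_words(headline))
```

Saving this as a file (say, fundamentals.py, a hypothetical name) and running python fundamentals.py in the Terminal is one way to execute it; pasting it into iPython is another.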
Week 3: Design and debugging and data¶
Assignments¶
I’m in the middle of porting/adding assignments to this repo:
https://github.com/compciv/homeworkhome
You can see a posting for the assignment:
It contains a skeleton file to build off from, as well as a test suite that you can run on your own in your system shell. This will be the workflow moving forward so you can know even before you turn anything in whether your code passes spec.
More info at the README:
https://github.com/compciv/homeworkhome/tree/master/stanford_headlinez_souped#test-suite
Readings¶
- Background material on using BeautifulSoup for HTML parsing
- Automate: Working with CSV files and JSON Data
- Hello Data De-Serialization with JSON and CSV – there will be a homework similar to this – just quick functions to write to give you practice/confidence in dealing with Python data structures/sequences. Just have to update the data samples for this year…
- PEP 8 Style Guide - the Python style guide is pretty lengthy and just about impossible to memorize. However, it’s helpful to go through it once (and a few more times). It not only provides great practical tips on Python programming, but it also gives you insight into how much code-writing is influenced by humane design/considerations.
- How to write a spelling corrector
- How to Break News While You Sleep
- How a California Earthquake Becomes the News
- Did a Human or a Computer Write This?
Lecture topics¶
APIs
PEP 8 - The Style Guide for Python
PEP 20 - The Zen of Python
Error handling and debugging
Data fundamentals (to be covered more next week)
- What is data?
- What is binary?
- Why CSV/JSON/XML?
- Deserializing and serializing data with Python
- Reading and writing data with files
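To make "serializing" and "deserializing" concrete, here is a minimal sketch using Python's standard json and csv modules. The sample records are made up, and io.StringIO stands in for a real file so that the example touches nothing on disk.

```python
import csv
import io
import json

# Made-up sample records, for illustration only
records = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# Serialize: Python objects -> JSON text
json_text = json.dumps(records)
# Deserialize: JSON text -> Python objects
restored = json.loads(json_text)
print(restored == records)  # True: JSON keeps numbers as numbers

# Serialize the same records as CSV text.
# io.StringIO behaves like a text file opened with open().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Deserialize the CSV text: every value comes back as a string
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["age"], type(rows[0]["age"]))
```

One thing the sketch is meant to show: JSON remembers that 30 is a number, while CSV hands everything back as text, so converting types is up to you.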
Monday lecture: Basically, catchup¶
- More hands-on Python and computing practice
- What are curl and youtube-dl? (command-line fun!)
- Walk through the new assignment: stanford_headlinez / Week 3 / due 2018-01-23 23:59
- Catching up on system setup, GitHub, etc.
Readings for next lecture¶
Don’t worry about fully understanding the content, just get your feet wet. Please at least do a skim – understanding the Internet and the web stack and scraping, etc. is something we have to do iteratively, even if it means slamming a bunch of knowledge early on (and reviewing it later!).
No quizzes, though you should take the Zen of Python to heart…
About the web:
- https://http.cat/
- https://github.com/alex/what-happens-when
- http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/
- http://www.compjour.org/tutorials/intro-to-the-web-inspector/
- http://www.compjour.org/tutorials/watching-traffic-network-panel/
Python stuff:
- https://www.python.org/dev/peps/pep-0020/
- http://2017.compciv.org/guide/exercises/python/hello-data-serialization.html
You don’t have to do these exercises (yet), but do the reading:
Week 4: More with APIs and building programs¶
For Wednesday¶
No homework due.
Readings¶
The Follower Factory: https://www.nytimes.com/interactive/2018/01/27/technology/social-media-bots.html
This story just came out this weekend in the New York Times. It’s pretty long – probably the longest mainstream media story I’ve ever read on the topic of programmatic bots – and it’s perhaps the best. Well-written, full of fantastic interactive graphics, and chock-full of details about how bot-makers try to fool people.
csv - reading and writing delimited text data: http://2017.compciv.org/guide/topics/python-standard-library/csv.html
This is a primer on Python’s csv library, which we will be using in Wednesday’s class.
Brief history of the CSV file http://blog.sqlizer.io/posts/csv-history/
Basically what the title says – pretty short article with some optional links to follow about the plain ol CSV text format.
Creating URL query strings in Python http://www.compciv.org/guides/python/how-tos/creating-proper-url-query-strings/
Most people know very little about what a URL actually is, and I assume even fewer know that the URL specification includes a syntax for serializing key-value pairs. It’s easier demonstrated than explained.
For Google Search, here’s a URL in which the “query” term is set to stanford: https://google.com/search?q=stanford
And here’s a URL in which Google is instructed to return only French-language results: https://www.google.com/search?q=stanford&lr=lang_fr
Can you guess the pattern/syntax/delimiters for specifying key-value pairs in the URL? We’ll be seeing a bunch of examples on Wednesday.
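One way to check your guess is to let Python build the query string for you. This is a minimal sketch using the standard library's urllib.parse module with the two Google parameters shown above; it only builds and takes apart the URL and never contacts Google.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://www.google.com/search"
params = {"q": "stanford", "lr": "lang_fr"}

# urlencode joins each key and value with "=" and separates the pairs with "&"
query_string = urlencode(params)
url = base_url + "?" + query_string
print(url)  # https://www.google.com/search?q=stanford&lr=lang_fr

# Going the other way: pull the key-value pairs back out of a URL
parsed = parse_qs(urlsplit(url).query)
print(parsed)  # {'q': ['stanford'], 'lr': ['lang_fr']}

# Characters that are not allowed in a URL, such as spaces, get escaped
print(urlencode({"q": "stanford university"}))  # q=stanford+university
```

Letting urlencode do the escaping is usually safer than gluing strings together by hand, since search terms often contain spaces or punctuation.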
Wednesday lecture¶
Walk through this exercise of reusing someone else’s Haversine code for our own purposes (ignore the rest of the earthquake stuff):
https://github.com/compciv/project-stanford-quakebot/tree/master/steps/calc_geo_distance
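For reference, the Haversine formula estimates the great-circle distance between two points on a sphere. The sketch below is a generic version of the formula, not the code from the linked repo; the function name, the argument order, and the Earth radius of 6371 km are assumptions made for this example.

```python
from math import asin, cos, pi, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; other sources use slightly different values


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    # The trigonometric functions expect radians, not degrees
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# Sanity check: a quarter of the way around the equator
# should be a quarter of the circumference, 2 * pi * R / 4
print(haversine_km(0, 0, 90, 0))
print(pi * EARTH_RADIUS_KM / 2)
```

Checking a function against a case whose answer you can work out by hand, as in the last two lines, is a habit worth building before reusing anyone's code.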
But we’ll start out by trying to set things up via the command-line, as that will be the practice going forward (practice hitting Tab!)
In-class exploration: The NYT’s Follower Factory, why fake followers are being bought, and how hard is it really to detect them?
https://github.com/compciv/project-compciv-twitterfakes
Coding instructions here:
Understanding application program interfaces¶
- Why programs depend on APIs
- The purpose and motives of creating or not creating a public API
- How to read API documentation
Week 5: The Web¶
- Understanding HTTP
- HTML and Web page design
- Web scraping
Wednesday lecture (draft notes)¶
- Fake tweets are easy
- Understanding HTML via the web inspector
- How does Google Translate do on idioms, homophones, etc? How does it know what it currently knows?
- What does the Translate API look like on a Cloud account?
- Sample Python script that uses the Translate API
- Another example of an app: https://translate.kmh.zone/
- Why use an API over a dataset?
Monday lecture (more API and web-scraping stuff)¶
- Marshall Project’s “The Next to Die”: https://www.themarshallproject.org/next-to-die
- The Marshall Project teams up with local news outlets to track executions across America: http://www.niemanlab.org/2015/09/the-marshall-project-teams-up-with-local-news-outlets-to-track-executions-across-america/
- ProPublica API example: https://gist.github.com/dannguyen/04d6cceffb6351ad24f24869ee338a1c
- Understanding the earthquakes API: https://earthquake.usgs.gov/fdsnws/event/1/
- Creating URL query strings in Python: http://www.compciv.org/guides/python/how-tos/creating-proper-url-query-strings/
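As a small illustration of reading API documentation and turning it into a request, the sketch below builds a query URL for the USGS earthquakes API linked above. The helper name make_quake_query_url and the example dates are made up; the parameter names are taken from my reading of the USGS documentation, so double-check them there. The code only constructs the URL and does not fetch anything.

```python
from urllib.parse import urlencode

# Query endpoint described at https://earthquake.usgs.gov/fdsnws/event/1/
USGS_QUERY_ENDPOINT = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def make_quake_query_url(start_date, end_date, min_magnitude):
    """Build (but do not fetch) a URL asking for earthquakes as GeoJSON.

    make_quake_query_url is a hypothetical helper written for this example.
    """
    params = {
        "format": "geojson",
        "starttime": start_date,
        "endtime": end_date,
        "minmagnitude": min_magnitude,
    }
    return USGS_QUERY_ENDPOINT + "?" + urlencode(params)


print(make_quake_query_url("2018-01-01", "2018-01-31", 5))
```

Pasting the printed URL into a browser is a quick way to see what the API sends back before writing any code that downloads it.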
Readings for Wednesday¶
(Google Translate is one of several APIs)
Google’s AI translation system is approaching human-level accuracy
The Shallowness of Google Translate
- CompCiv: Sorting Python collections with the sorted method
- Automate Chapter 8: Reading and Writing Files: https://automatetheboringstuff.com/chapter8/
- Automate Chapter 9: Organizing Files: https://automatetheboringstuff.com/chapter9/
- Past assignment (not homework): Web-scraping the Texas Executed Offenders List
Week 7: More Text as Data¶
In-class coding¶
An example web-scraping project/question – How many current Congressmembers attended Stanford University?
Relevant resources¶
- unitedstates/congress-legislators repo: https://github.com/unitedstates/congress-legislators
- Current legislators as a CSV: https://theunitedstates.io/congress-legislators/legislators-current.csv
- unitedstates/images repo: https://github.com/unitedstates/images
- Sen. Feinstein’s Bioguide page: http://bioguide.congress.gov/scripts/biodisplay.pl?index=f000062
- Her Congress.gov page: https://www.congress.gov/member/dianne-feinstein/F000062
- On ProPublica’s Represent app: https://projects.propublica.org/represent/members/F000062
- Sample web app about Congress and Best Colleges: http://beta-congress-colleges-fun.s3-website-us-east-1.amazonaws.com/
- US News: Top 10 Colleges for Members of Congress: https://www.google.com/search?q=congress+who+attended+collees&oq=congress+who+attended+collees&aqs=chrome..69i57.4619j0j1&sourceid=chrome&ie=UTF-8
- WaPo: Where the Senate went to college – in one map https://www.washingtonpost.com/news/the-fix/wp/2015/01/30/where-the-senate-went-to-college-in-one-map/?utm_term=.da5f87406530
- HuffPo: Colleges That Produced The Most Members Of Congress: https://www.huffingtonpost.com/2014/02/19/colleges-members-of-congress-alumni_n_4818357.html
Code examples¶
How to read/write files in Python: http://www.compciv.org/guides/python/fileio/open-and-write-files/
Sample code for download_and_save(): https://github.com/compciv/homeworkhome/blob/master/txdeathrow_scraper/starter/setup_hw.py#L89
Tasks¶
- How to test if a filepath exists
- How to write text/bytes to a file
- How to open and read text/bytes from a file
- How to manage data from a CSV
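The four tasks above can be sketched with the standard library alone. This is a minimal example rather than the course's reference solution: the file names and the CSV contents are made up, and everything happens in a temporary folder so nothing on your system is overwritten.

```python
import csv
import tempfile
from pathlib import Path

# Work in a throwaway directory that is deleted when the block ends
with tempfile.TemporaryDirectory() as tmpdir:
    text_path = Path(tmpdir) / "hello.txt"

    # 1. Test if a filepath exists
    existed_before = text_path.exists()

    # 2. Write text to a file, then 3. open and read it back
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("hello world\n")
    with open(text_path, "r", encoding="utf-8") as f:
        text = f.read()
    exists_after = text_path.exists()

    # Bytes work the same way, with "b" added to the mode
    bytes_path = Path(tmpdir) / "hello.bin"
    with open(bytes_path, "wb") as f:
        f.write(b"\x00\x01\x02")
    with open(bytes_path, "rb") as f:
        raw = f.read()

    # 4. Manage data from a CSV: write two rows, then read them as dictionaries
    csv_path = Path(tmpdir) / "people.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "state"])
        writer.writerow(["Example Person", "CA"])
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(existed_before, exists_after)
print(repr(text), raw)
print(rows[0]["name"], rows[0]["state"])
```

Checking whether a file already exists before downloading it again is a common courtesy in scraping, since it avoids hitting the same server repeatedly.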
Questions to ask¶
- What is the simplest computational way to figure out whether someone attended Stanford or not?
- How about graduated? How about attended/graduated from any other institution? Or served in the armed forces?
- Once figuring out this problem for just Senators, how hard is it to solve for Representatives? How about all Congress legislators in all of history?
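As a starting point for the first question, probably the simplest computational check is whether the word "Stanford" appears anywhere in a legislator's biography text. The sketch below uses invented biography snippets (not real legislators) and a hypothetical helper named mentions_stanford, and it shows why the simplest check is not always the right one.

```python
# Invented biography snippets, for illustration only; real text would come
# from a source such as the Bioguide pages listed above.
sample_bios = {
    "Legislator A": "graduated from Stanford University, 1955; lawyer",
    "Legislator B": "B.A., Example State College, 1980; born in Stanford, Ky.",
    "Legislator C": "attended public schools; farmer",
}


def mentions_stanford(bio_text):
    """Naive check: does the word 'stanford' appear anywhere in the text?"""
    return "stanford" in bio_text.lower()


for name, bio in sample_bios.items():
    print(name, mentions_stanford(bio))
```

Legislator B is flagged even though the text only mentions a town named Stanford, which is the kind of false positive that makes "attended" versus "graduated" versus "was born in" harder to separate than it first looks.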
Assignments¶
Due Feb. 21 Wednesday, 11:59 PM:
Due Feb 27, Tuesday, 11:59 PM:
- (several assignments dealing with CSV/JSON data TK)
- answers to txdeathrow_check
- answers to sortsequences
- answers to stanford_headlinez_souped
- JSON and CSV
- Why Unicode
- Why Emojis
- Regular expressions
- Creating structured data from noise
Week 8: More Text as Data and Data Analysis¶
Wednesday¶
Readings¶
Quasi-homework: Create a folder in your homework folder, named solid-serialization-skills, and do the following data serialization exercises:
- http://2017.compciv.org/syllabus/assignments/homework/serials/usa-gov-analytics.html
- http://2017.compciv.org/syllabus/assignments/homework/serials/just-trump-tweets-csv.html
- http://2017.compciv.org/syllabus/assignments/homework/serials/trump-tweets-json.html
I won’t be testing you on correctness (some of the pages even have the answers!), but going forward, I’ll expect that you know what they cover. So I would actually try to solve them as if they were homework, and even if you start out with copy-pasting my answers, you commit to rewriting the code in your own style.
Try out visualization with Matplotlib¶
- Matplotlib Tutorial: Python Plotting https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python
- Labri Tutorial of Matplotlib: https://www.labri.fr/perso/nrougier/teaching/matplotlib/
- How to make beautiful data visualizations in Python with matplotlib (also discusses pandas, which is an R-like library) http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/
Monday¶
Understanding regular expressions
- http://2017.compciv.org/guide/topics/regular-expressions/regex-early-overview.html
- Understanding regex with unstructured data (all unstructured data is noise)
- Example: Shakespeare’s works https://compciv.github.io/stash/matty.shakespeare.zip
- Example: Hillary Clinton’s emails
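To give a flavor of pulling structure out of unstructured text, here is a minimal sketch using Python's re module. The email-style text is made up for illustration, and both patterns are deliberately loose; real-world dates and email addresses come in many more shapes than these patterns cover.

```python
import re

# Made-up email-style text, for illustration only
noisy_text = """
From: alice@example.com
Sent: 2015-03-02
To: bob@example.org
Subject: Meeting moved to 2015-03-09
"""

# Dates written as YYYY-MM-DD; the parentheses capture year, month, and day
date_pattern = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# A deliberately loose pattern for email addresses
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# With capture groups, findall returns one tuple per match
dates = date_pattern.findall(noisy_text)
emails = email_pattern.findall(noisy_text)

# Turn the raw matches into structured records
date_records = [{"year": int(y), "month": int(m), "day": int(d)} for y, m, d in dates]

print(emails)
print(date_records)
```

The same two steps, describing a pattern and then collecting every match, apply whether the source is a play, an email dump, or a scraped web page.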
I didn’t post this on Thursday but I had meant for this to be the readings for this week:
Readings for Wednesday:¶
About Text:
- Always Bet on Text: https://graydon2.dreamwidth.org/193447.html
- Unicode: A story of corruption, connection, and smiling poo https://medium.com/@maggieshafer/unicode-a-story-of-corruption-connection-and-smiling-poo-598295e4af9d
About Regular Expressions:
- Here’s what ICT should really teach kids: how to do regular expressions https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions
- A Quick Intro (in general): http://2017.compciv.org/guide/topics/regular-expressions/regex-early-overview.html
- DataCamp Python Regular Expression Tutorial: https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- Interactive regex explorer https://regexone.com/
- From Automate the Boring Stuff: https://automatetheboringstuff.com/chapter7/
Re-read, review, and try out Peter Norvig’s spellcheck in Python: https://norvig.com/spell-correct.html
Week 9: Data Analysis with Pandas (continued)¶
Wednesday¶
We will spend time going back to more advanced web scraping and understanding POST requests and forms.
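To show what makes a POST request different from the GET requests used so far, here is a minimal sketch. It uses only the standard library (not the Requests package from the readings) so that it runs without extra installs, and it builds the request without sending it. The URL and form field names are hypothetical placeholders, not those of any real form.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form endpoint and field names, for illustration only
form_url = "https://example.com/search"
form_fields = {"first_name": "Jane", "last_name": "Doe"}

# Form data is URL-encoded like a query string, but it travels in the
# request body (as bytes) instead of being appended to the URL
body = urlencode(form_fields).encode("utf-8")
request = Request(form_url, data=body)  # supplying data makes urllib use POST

print(request.get_method())  # POST
print(request.full_url)      # the URL itself carries no form fields
print(request.data)          # b'first_name=Jane&last_name=Doe'
```

Actually submitting the form would mean passing the request to urllib.request.urlopen, and many real forms also expect cookies or hidden fields, which is where the readings on Requests and HTTP cookies come in.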
Answers for the txdeathrow_scraper assignment
Readings¶
The disclosure form to fill out: https://www.ethics.senate.gov/public/index.cfm/files/serve?File_id=e7694798-2483-4354-a707-01d4f4da08c2
Information about U.S. Senate EIGA requirements: https://www.ethics.senate.gov/public/index.cfm/financialdisclosure?p=overview
Requests Advanced Usage: http://docs.python-requests.org/en/master/user/advanced/
About HTTP cookies: https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies
ProPublica’s Trump Administration Financial Disclosures (manual collection of executive branch disclosures): https://projects.propublica.org/graphics/trump-disclosures
Monday¶
Readings and in-class¶
We’ll use Ben Welsh’s/California Civic Data Coalition’s “First Python Notebook” lesson. We can skip a few chapters (mostly about the setup process – we don’t need to use virtualenv), but try to get through all of these:
About the Senate disclosure site¶
- https://www.opensecrets.org/news/2014/05/most-senators-file-financial-disclosures-electronically-sort-of/
- Example news story: http://www.vnews.com/SANDERS-MISSES-ANOTHER-FINANCIAL-REPORTING-DEADLINE-10072890
- Look up how data is reflected on the Senate website: https://efdsearch.senate.gov/
- Check out the OpenSecrets project: https://www.opensecrets.org/pfds../
About AI/algorithms in general¶
- Asking the right questions about AI by Yonatan Zunger: https://medium.com/@yonatanzunger/asking-the-right-questions-about-ai-7ed2d9820c48