A utility to parse the CSV files that can be exported from Good Reads, allowing for statistics generation, data transformation, and possible data enrichment in future versions.
I love to read, and I love to rate, review and discuss books. Good Reads has been an excellent place for me to do that. Good Reads offers some nice statistics and graphs to look at your reading, but it didn't go deep enough for me.
Good Reads does offer an API into their data, but due to security it looked to be a real hassle to gain access to it, even for my own account's data, let alone anyone else's.
However, they do allow you to export your data to CSV. Initially I tried to import the CSV into Google Sheets to try to enrich my data and maybe generate some graphs. I found this tedious and slow.
Since I'm a Java developer by trade, I decided to take a programmatic approach. I developed this application for my own purposes, but I was encouraged by some other folks to share my code in case they wanted to use it and/or contribute to it, so here we are. It is very customized to my own purposes, but I tried to write it in a way that it could be useable for others, or modified quickly/easily to be.
The basic steps are this:
- Export Data From Good Reads to CSV
- Parse Data into Book objects
- (Optional) Enrich the Data
- Run 1 or more reports against your books
This is largely all accomplished by a little command line application I created called BookParsing.
There are a lot of things I'd love to add to the application, but since I'm working on my own, I'm a bit limited in how much time I can put into it, as well as by the limitations of my abilities. Here are some of the things I'd like to add given time and/or skill:
- Export Data: Since the application has the ability to enrich the data, it may be desirable to export the data back out for use elsewhere (say Google Sheets). What format(s) the data should take will depend on what the export will be used for. Currently I have no plans to do this until there is a good use case on my part. It may be desirable to export enriched books and/or reports about books.
- Cross-reference Data Enrichment: It might be useful if we can use one or more additional data sources to further enrich our data. For example: ISBN Database API perhaps. I need to investigate what (if any) additional data can be obtained/added to our books. This looks to require an account, and I'm not sure how I'd incorporate that without requiring the user to include their API key.
- Database backend: Right now every time the application runs it must read in, parse and enrich the data. This process is really quick, but it requires all of the enrichment to be programmatic, since all enrichment is lost as soon as the program completes. Saving the data out to a database (likely an RDBMS such as SQLite or MySQL) would allow any enrichment to be preserved for future runs.
- GUI Front End: I hate GUI. I haven't done anything GUI in nearly 8 years, and most of my work was using Java Swing. Some kind of web/JavaScript front end is probably more fashionable at this point, but well out of my skillset. It'd be nice to have a file chooser to load data, mechanisms to manually enrich data (correct mistakes, fill in missing data, etc.), and a way to generate pretty graphs. The data enrichment stuff probably requires the database backend as a prerequisite, because it would be frustrating to spend a lot of time fixing your data only to lose that work between runs. However, a file chooser for the input CSV and the graphing capabilities could be pretty useful with the application as it currently exists. These features would transform the application from something for personal use by me/other Java developers into something that could be run by any Good Reads user. If you're a GUI developer interested in helping with this, please let me know.
Through a web browser, go to the Good Reads Import/Export page and click the Export button at the top right corner. This will (after some period of time) offer you a link to download a CSV of your books.
As mentioned above, I created a basic command-line application called BookParsing. It is currently hard coded to read in the CSV included as part of the project and run whatever reports I was last messing around with. A better main application will be necessary for use by any non-Java developer.
This part is very basic. I created an interface, ParsingService, which for a given input returns an output that is a type of Book.
The only implementation is the GoodReadsParsingService, which is customized CSV parsing based on the known CSV column order at the time. If/when that format changes, that implementation will need to be updated. This makes the code a bit brittle, but at least it's encapsulated in one place. I documented the expected format, which I figured out mostly through trial and error.
I'm using an open source Java CSV parser called OpenCSV.
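To give a feel for that shape, here is a minimal sketch. The generic signature, the `parse` method name, the `Book` fields and the column positions are all assumptions made up for this example, not the project's actual code. It also takes rows that have already been split into columns, so it does not show OpenCSV itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal Book; the real domain object has many more fields.
class Book {
    final String title;
    final String author;
    final int rating;

    Book(String title, String author, int rating) {
        this.title = title;
        this.author = author;
        this.rating = rating;
    }
}

// Assumed shape: turn some input into a list of Book (or a subtype of Book).
interface ParsingService<I, B extends Book> {
    List<B> parse(I input);
}

// Toy implementation that relies on a fixed column order, which is what
// makes this style of parsing brittle when the export format changes.
class FixedOrderParsingService implements ParsingService<List<String[]>, Book> {
    // Column positions are invented for this sketch, not the real export layout.
    private static final int TITLE = 0;
    private static final int AUTHOR = 1;
    private static final int RATING = 2;

    @Override
    public List<Book> parse(List<String[]> rows) {
        List<Book> books = new ArrayList<>();
        // Start at 1 to skip the header row.
        for (int i = 1; i < rows.size(); i++) {
            String[] row = rows.get(i);
            int rating = Integer.parseInt(row[RATING].trim());
            books.add(new Book(row[TITLE], row[AUTHOR], rating));
        }
        return books;
    }
}

class ParsingDemo {
    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Title", "Author", "My Rating"},
            new String[] {"Dune", "Frank Herbert", "5"});
        List<Book> books = new FixedOrderParsingService().parse(rows);
        System.out.println(books.get(0).title + " by " + books.get(0).author);
    }
}
```

Keeping the column constants in one class is the point of the sketch: a format change only touches that one implementation.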
The parsing also attempts to enrich the data based on how I shelve my data. Ideally this should be moved into the Enrichment service layer and done post-parsing, but some of the enrichment relies on the raw values from the CSV that would need to be passed to the various enrichers in addition to the book being enriched. More on this below.
This is probably the weakest feature so far. It is entirely data specific. In particular, it's tied to how I shelve my books on Good Reads. In theory anyone can shelve their data in the same way, but that's a lot of work and a high barrier to entry for using the application.
I created BookEnricher interface, with an AbstractBookEnricher for any shared functionality. I've created the following implementations:
- BacklogBookEnricher - Tracks which books you previously owned but hadn't read yet.
- ContributorGenderEnricher - Sets the author's gender based on the shelves a book is on.
- GenreEnricher - Sets the genre of a book based on the shelves a book is on.
- GraphicNovelEnricher - Sets the Format to Graphic Novel based on the shelves a book is on.
- ReadHistoryEnricher - Sets the years a book was read based on the shelves a book is on.
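The shelf-driven enrichers above all follow roughly the same pattern: look at the shelves a book is on and set a field if a known shelf is present. Here is a minimal sketch of that idea. The `enrich` method signature, the `Book` fields and the `ShelfFormatEnricher` class are hypothetical names for illustration, not the project's real code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical minimal Book with only what this sketch needs.
class Book {
    final Set<String> shelves = new HashSet<>();
    String format = "Unknown";
}

// Assumed shape of the enricher contract: mutate the book in place.
interface BookEnricher {
    void enrich(Book book);
}

// Sets the format when the book sits on any of the trigger shelves,
// and leaves the book untouched otherwise.
class ShelfFormatEnricher implements BookEnricher {
    private final String format;
    private final Set<String> triggerShelves;

    ShelfFormatEnricher(String format, String... triggerShelves) {
        this.format = format;
        this.triggerShelves = new HashSet<>(Arrays.asList(triggerShelves));
    }

    @Override
    public void enrich(Book book) {
        for (String shelf : book.shelves) {
            if (triggerShelves.contains(shelf)) {
                book.format = format;
                return;
            }
        }
    }
}

class EnricherDemo {
    public static void main(String[] args) {
        Book book = new Book();
        book.shelves.add("manga");
        new ShelfFormatEnricher("Graphic Novel", "graphic-novel", "manga").enrich(book);
        System.out.println(book.format);
    }
}
```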
This is a list of shelves that can be added to data to take advantage of existing data enrichment code. It is likely not complete.
- author-female - Adding a book to this shelf in combination with the ContributorGenderEnricher will set the author's Gender to Female
- author-male - Adding a book to this shelf in combination with the ContributorGenderEnricher will set the author's Gender to Male
- author-non_binary - Adding a book to this shelf in combination with the ContributorGenderEnricher will set the author's Gender to Non-Binary
- genre shelves - Set the book genre if one of the following is present.
  - fantasy
  - historical
  - horror
  - humor
  - mystery
  - nonfiction
  - non-fiction
  - romance
  - steampunk
  - science-fiction
  - scifi
  - sci-fi
  - thriller
- graphic-novel - Adding a book to this shelf in combination with the GraphicNovelEnricher will set the Format to Graphic Novel
- manga - Adding a book to this shelf in combination with the GraphicNovelEnricher will set the Format to Graphic Novel
- own-backlog - Adding a book to this shelf in combination with the BacklogBookEnricher will track how many books you read from your hoard.
- read-YYYY (ex: read-2016) - Adding a book to these shelves in combination with the ReadHistoryEnricher lets re-read books carry multiple read years beyond the one determined from the read date field.
- read-before-goodreads - the ReadHistoryEnricher also supports tracking books you first read before tracking your data on goodreads. This is helpful when you don't know what year you first read a book.
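As an illustration of the read-YYYY convention, the sketch below pulls years out of shelf names with a regular expression and merges them with the year from the read date field. The class and method names are hypothetical, and this is only one way the idea could be implemented, not necessarily how the ReadHistoryEnricher does it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReadYearShelves {
    // Matches shelves such as "read-2016"; "read-before-goodreads" does not match.
    private static final Pattern READ_YEAR = Pattern.compile("read-(\\d{4})");

    // Collects every year named by a read-YYYY shelf, plus the year from the
    // read date field if there is one (pass null when the date is missing).
    static SortedSet<Integer> yearsRead(List<String> shelves, Integer readDateYear) {
        SortedSet<Integer> years = new TreeSet<>();
        if (readDateYear != null) {
            years.add(readDateYear);
        }
        for (String shelf : shelves) {
            Matcher m = READ_YEAR.matcher(shelf);
            if (m.matches()) {
                years.add(Integer.parseInt(m.group(1)));
            }
        }
        return years;
    }

    public static void main(String[] args) {
        List<String> shelves = Arrays.asList("fantasy", "read-2016", "read-before-goodreads");
        System.out.println(yearsRead(shelves, 2019)); // prints [2016, 2019]
    }
}
```

Using a sorted set means a year that appears both on a shelf and in the read date field is only counted once.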
Once your data has been parsed/enriched, you want to do something with it. I've created some basic interfaces, domain objects and services for the purposes of "Reporting". In other words, spit out something that I find useful about my books. It was written in such a way to hopefully allow someone else to come along and extend/implement their own Reports and services to spit out things useful to them as well.
The core of the reporting are the domain objects that add one or more Book objects to a Report.
- AbstractBookCountsYearToYearReport - Shared functionality to generate a report from one of the Map<?,Integer> count maps generated by a YearlyReport.
- AbstractMultipleYearReport - Shared functionality for creating a report comparing books across more than one year.
- AbstractReport - Shared Report functionality.
- AbstractYearToYearReport - Shared functionality for creating a tabular report comparing 2 or more YearlyReports.
- BookFormatYearToYearReport - Report comparing the BookFormats read between 2 or more years.
- BookGenreYearToYearReport - Report comparing the BookGenres read between 2 or more years.
- BookRatingsYearToYearReport - Report comparing the ratings given to books between 2 or more years.
- DecadeYearToYearReport - Report comparing the year of publication between 2 or more years of reading.
- GenderReport - Report about a specific Gender
- GenderYearToYearReport - Report comparing the Gender of the authors of books read between 2 or more years.
- GenreReport - Report about a specific Genre
- ReadingQuantityYearToYearReport - Report comparing the number of books read/listened to between 2 or more years.
- YearlyReport - Report about a specific year.
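The year-to-year reports above boil down to lining up per-year count maps side by side. A minimal sketch of that tabulation follows. The `YearToYearCounts` class, its `render` method and the tab-separated layout are assumptions for illustration, not the project's actual report classes or output format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

class YearToYearCounts {
    // Builds one text row per key (for example a genre), with one count
    // column per year. countsByYear maps a year to that year's count map.
    static String render(Map<Integer, Map<String, Integer>> countsByYear) {
        // Union of keys across all years, sorted for stable output.
        TreeSet<String> keys = new TreeSet<>();
        for (Map<String, Integer> counts : countsByYear.values()) {
            keys.addAll(counts.keySet());
        }
        StringBuilder out = new StringBuilder("key");
        for (Integer year : countsByYear.keySet()) {
            out.append('\t').append(year);
        }
        out.append('\n');
        for (String key : keys) {
            out.append(key);
            for (Map<String, Integer> counts : countsByYear.values()) {
                // A key missing from a year counts as zero.
                out.append('\t').append(counts.getOrDefault(key, 0));
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> y2015 = new LinkedHashMap<>();
        y2015.put("fantasy", 12);
        y2015.put("horror", 3);
        Map<String, Integer> y2016 = new LinkedHashMap<>();
        y2016.put("fantasy", 9);
        y2016.put("mystery", 4);
        // LinkedHashMap keeps the year columns in insertion order.
        Map<Integer, Map<String, Integer>> countsByYear = new LinkedHashMap<>();
        countsByYear.put(2015, y2015);
        countsByYear.put(2016, y2016);
        System.out.print(render(countsByYear));
    }
}
```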
So far there is one service interface: ReportService with an abstract implementation (AbstractReportService) for shared code.
I've created the following concrete implementations so far:
- AuthorCountsReportService - Reports your top N read Authors
- GenderReportService - Creates a GenderReport for each Gender you've read and outputs a genre-by-genre breakdown to allow you to evaluate your reading habits from an author gender perspective.
- GenreReportService - Creates a GenreReport for each Genre you've read and outputs a genre-by-genre breakdown to allow you to evaluate your reading habits from a genre perspective.
- YearlyReportService - Create a YearlyReport for the year(s) specified and output the results. Additionally all the generated YearlyReports will in turn generate several Year-to-Year reports.
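To show the kind of work a report service does, here is a minimal sketch of a "top N authors" count in the spirit of the AuthorCountsReportService. The `TopAuthors` class, the method signature and the alphabetical tie-break are assumptions for this example, not the real implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopAuthors {
    // Returns up to n "author: count" lines, most-read first. Ties are
    // broken alphabetically so the output is deterministic.
    static List<String> topAuthors(List<String> authorPerBook, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String author : authorPerBook) {
            counts.merge(author, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            Map.Entry<String, Integer> entry = entries.get(i);
            result.add(entry.getKey() + ": " + entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // One entry per book read, holding that book's author.
        List<String> authors = Arrays.asList(
            "Le Guin", "Pratchett", "Le Guin", "Butler", "Pratchett", "Le Guin");
        for (String line : topAuthors(authors, 2)) {
            System.out.println(line);
        }
    }
}
```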
This Read Me continues to be a work in progress. Hopefully it gives a decent overview of what's available. It should ideally be updated as new functionality/services/reports are added.