Ethics of this research set #20

Open
robrwo opened this issue May 17, 2024 · 0 comments

robrwo commented May 17, 2024

An organisation that I work for has been having problems with robots requesting images from its website for AI training. We managed to contact one of the people operating the robots, who said they were using this dataset and claimed that they had the right to use these images because https://github.com/google-research-datasets/conceptual-captions/blob/master/LICENSE says "The dataset may be freely used for any purpose".

The problem is that you are publishing a dataset of non-Google URLs:

  1. Google has no control over the hosted images, which may be changed, removed, or blocked (e.g. "Some image links unavailable?" #17).

  2. Google is not paying the hosting costs of these images. Organisations have to pay for bandwidth, CPU time, or even the number of requests.

    So every time a user of this dataset requests an image, somebody else pays for it. (This is an incentive to block or remove the images; see no. 1.)

  3. These images were added without the consent of the organisations that have to pay the costs of hosting them (see no. 2).

  4. The images were added without the consent of the copyright holders (who may be different from the server hosts).

  5. This dataset was created before 2018, before concerns about the use of images for AI training were common, and before protocols to disallow use of web-hosted media for machine learning existed.

  6. Many of the images are hosted by stock photo agencies, and may not be licensed for machine-learning use. The agencies may also regard the captions (which require human effort to write) as part of their intellectual property.

  7. Many of the images are on news websites, and were licensed from stock photo agencies, so may not be licensed for machine-learning use.

  8. Many of the photos are hosted outside of the USA, by organisations which are not based in the USA, so US "fair use" copyright exceptions do not apply.

It would have been ethical for Google to license copies of the images, and then host them as part of the dataset (but still publish the URLs where they originally came from).
